CVPR 2023 papers



topic-1

Topic words: video, motion, temporal, videos, frame, action, frames, visual

MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Action Recognition
Wang, Xiang and Zhang, Shiwei and Qing, Zhiwu and Gao, Changxin and Zhang, Yingya and Zhao, Deli and Sang, Nong



Research question: Existing few-shot action recognition methods achieve good performance by performing frame-level matching on learned video features, but they generally suffer from two limitations: (i) the matching between local frames is often inaccurate because nothing guides long-range temporal perception; (ii) explicit motion learning is usually ignored, leading to partial information loss.
Motivation: To address these issues, we develop Motion-augmented Long-short Contrastive Learning (MoLo), which contains two key components: a long-short contrastive objective and a motion autodecoder.
Method: Specifically, the long-short contrastive objective endows local frame features with long-form temporal awareness by maximizing their agreement with the global token of videos belonging to the same class. The motion autodecoder is a lightweight architecture that reconstructs pixel motions from differential features, explicitly embedding motion dynamics into the network.
Results: In this way, MoLo can simultaneously learn long-range temporal context and motion cues for comprehensive few-shot matching. We evaluate MoLo on five standard benchmarks, and the results show that it outperforms recent advanced methods.

Current state-of-the-art approaches for few-shot action recognition achieve promising performance by conducting frame-level matching on learned visual features. However, they generally suffer from two limitations: i) the matching procedure between local frames tends to be inaccurate due to the lack of guidance to force long-range temporal perception; ii) explicit motion learning is usually ignored, leading to partial information loss. To address these issues, we develop a Motion-augmented Long-short Contrastive Learning (MoLo) method that contains two crucial components, including a long-short contrastive objective and a motion autodecoder. Specifically, the long-short contrastive objective is to endow local frame features with long-form temporal awareness by maximizing their agreement with the global token of videos belonging to the same class. The motion autodecoder is a lightweight architecture to reconstruct pixel motions from the differential features, which explicitly embeds the network with motion dynamics. By this means, MoLo can simultaneously learn long-range temporal context and motion cues for comprehensive few-shot matching. To demonstrate the effectiveness, we evaluate MoLo on five standard benchmarks, and the results show that MoLo favorably outperforms recent advanced methods. The source code is available at https://github.com/alibaba-mmai-research/MoLo.
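The long-short contrastive objective described above can be sketched as an InfoNCE-style loss that pulls a video's global token toward frame features of same-class videos. Everything below (function name, feature shapes, temperature value) is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def long_short_contrastive_loss(global_token, frame_feats, frame_labels,
                                anchor_label, tau=0.07):
    """InfoNCE-style sketch of MoLo's long-short objective: maximize agreement
    between a video's global token and local frame features of videos from the
    same class.  Shapes: global_token (D,), frame_feats (N, D), frame_labels (N,)."""
    g = global_token / np.linalg.norm(global_token)
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    logits = f @ g / tau                             # frame-to-global similarities
    m = logits.max()                                 # numerically stable log-softmax
    log_prob = logits - m - np.log(np.exp(logits - m).sum())
    positives = frame_labels == anchor_label         # frames of same-class videos
    return -log_prob[positives].mean()
```

Minimizing this loss drives same-class frame features to align with the global token, which is one way to give local features long-form temporal awareness.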

Video Event Restoration Based on Keyframes for Video Anomaly Detection
Yang, Zhiwei and Liu, Jing and Wu, Zhaoyang and Wu, Peng and Liu, Xiaotao



Research question: Video anomaly detection (VAD) is a significant computer vision problem.
Motivation: Existing deep-neural-network-based VAD methods mostly follow frame reconstruction or frame prediction, but they lack the mining and learning of higher-level visual features and temporal context relationships in videos, which limits further performance gains of both routes.
Method: Inspired by video codec theory, we introduce a brand-new VAD paradigm to break through these limitations. First, we propose a new task of video event restoration based on keyframes: the network is encouraged to infer the missing frames from video keyframes so as to restore the video event, which more effectively drives it to mine and learn potential higher-level visual features and comprehensive temporal context relationships. To this end, we propose a novel U-shaped Swin Transformer network with dual skip connections (USTN-DSC) for video event restoration, in which a cross-attention and a temporal upsampling residual skip connection are introduced to further help restore complex static and dynamic motion object features. In addition, we propose a simple yet effective adjacent frame difference loss to constrain the motion consistency of the video sequence.
Results: Extensive experiments on benchmarks show that USTN-DSC outperforms most existing methods, validating the effectiveness of our approach.

Video anomaly detection (VAD) is a significant computer vision problem. Existing deep neural network (DNN) based VAD methods mostly follow the route of frame reconstruction or frame prediction. However, the lack of mining and learning of higher-level visual features and temporal context relationships in videos limits the further performance of these two approaches. Inspired by video codec theory, we introduce a brand-new VAD paradigm to break through these limitations: First, we propose a new task of video event restoration based on keyframes. Encouraging DNN to infer missing multiple frames based on video keyframes so as to restore a video event, which can more effectively motivate DNN to mine and learn potential higher-level visual features and comprehensive temporal context relationships in the video. To this end, we propose a novel U-shaped Swin Transformer Network with Dual Skip Connections (USTN-DSC) for video event restoration, where a cross-attention and a temporal upsampling residual skip connection are introduced to further assist in restoring complex static and dynamic motion object features in the video. In addition, we propose a simple and effective adjacent frame difference loss to constrain the motion consistency of the video sequence. Extensive experiments on benchmarks demonstrate that USTN-DSC outperforms most existing methods, validating the effectiveness of our method.
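The adjacent frame difference loss above has a natural minimal form: the motion between neighboring restored frames should match the motion in the ground truth. The L1 norm and the grayscale `(T, H, W)` layout below are assumptions about the exact formulation:

```python
import numpy as np

def adjacent_frame_difference_loss(restored, target):
    """Motion-consistency sketch: differences between neighboring restored
    frames should match those of the ground-truth sequence.
    Shapes: restored, target both (T, H, W)."""
    restored_diff = restored[1:] - restored[:-1]   # per-step motion of restoration
    target_diff = target[1:] - target[:-1]         # per-step motion of ground truth
    return np.abs(restored_diff - target_diff).mean()
```

A restoration that reproduces the event's motion exactly incurs zero loss even if it has a constant brightness offset, which is why this term is used alongside, not instead of, a per-frame reconstruction loss.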

Query-Dependent Video Representation for Moment Retrieval and Highlight Detection
Moon, WonJun and Hyun, Sangeek and Park, SangUk and Park, Dongchan and Heo, Jae-Pil



Research question: Video moment retrieval and highlight detection (MR/HD) are drawing attention as the demand for video understanding grows sharply; the key objective is to localize moments for a given text query and estimate clip-wise accordance levels, i.e., saliency scores.
Motivation: Although recent transformer-based models have brought some advances, we find that they do not fully exploit the information of a given query; for example, the relevance between the text query and the video content is sometimes neglected when predicting a moment and its saliency.
Method: We introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD. Our encoding module starts with cross-attention layers that explicitly inject the context of the text query into the video representation. Then, to strengthen the model's use of query information, we manipulate video-query pairs to produce irrelevant pairs; such negative (irrelevant) pairs are trained to yield low saliency scores, which in turn encourages the model to estimate precise accordance between query and video. Finally, we present an input-adaptive saliency predictor that adaptively defines the criterion of saliency scores for each given video-query pair.
Results: Our extensive studies verify the importance of building query-dependent representations for MR/HD. Specifically, QD-DETR outperforms state-of-the-art methods on the QVHighlights, TVSum, and Charades-STA datasets.

Recently, video moment retrieval and highlight detection (MR/HD) are being spotlighted as the demand for video understanding is drastically increased. The key objective of MR/HD is to localize the moment and estimate clip-wise accordance level, i.e., saliency score, to the given text query. Although the recent transformer-based models brought some advances, we found that these methods do not fully exploit the information of a given query. For example, the relevance between text query and video contents is sometimes neglected when predicting the moment and its saliency. To tackle this issue, we introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD. As we observe the insignificant role of a given query in transformer architectures, our encoding module starts with cross-attention layers to explicitly inject the context of text query into video representation. Then, to enhance the model's capability of exploiting the query information, we manipulate the video-query pairs to produce irrelevant pairs. Such negative (irrelevant) video-query pairs are trained to yield low saliency scores, which in turn, encourages the model to estimate precise accordance between query-video pairs. Lastly, we present an input-adaptive saliency predictor which adaptively defines the criterion of saliency scores for the given video-query pairs. Our extensive studies verify the importance of building the query-dependent representation for MR/HD. Specifically, QD-DETR outperforms state-of-the-art methods on QVHighlights, TVSum, and Charades-STA datasets. Codes are available at github.com/wjun0830/QD-DETR.
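The negative-pair idea can be sketched in a few lines: pair each video with another sample's query and train relevant pairs to score above irrelevant ones. The rolling pairing scheme and the hinge loss below are one simple instantiation, not necessarily the paper's exact manipulation or objective:

```python
import numpy as np

def make_irrelevant_pairs(query_feats):
    """Build negative (irrelevant) video-query pairs by giving each video the
    query of another sample in the batch (one simple pairing scheme)."""
    return np.roll(query_feats, shift=1, axis=0)

def saliency_margin_loss(pos_scores, neg_scores, margin=1.0):
    """Hinge-style sketch of the constraint that relevant pairs should score
    higher than irrelevant ones (the concrete loss form is an assumption)."""
    return np.maximum(0.0, margin - pos_scores[:, None] + neg_scores[None, :]).mean()
```

When relevant pairs already score well above irrelevant ones, the hinge saturates at zero, so the loss only pushes on pairs the model still confuses.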

Adaptive Global Decay Process for Event Cameras
Nunes, Urbano Miguel and Benosman, Ryad and Ieng, Sio-Hoi



Research question: In virtually all event-based vision problems, there is the need to select the most recent events, which are assumed to carry the most relevant information content.
Motivation: Each existing strategy suffers from at least one major limitation, so a new decay process for event cameras is proposed that adapts to the global scene dynamics and whose latency is on the order of nanoseconds.
Method: An adaptive quantity that encodes the global scene dynamics, called the event activity, is constructed.
Results: The method is evaluated on several event-based vision problems and datasets, consistently improving the performance of the corresponding baseline methods.

In virtually all event-based vision problems, there is the need to select the most recent events, which are assumed to carry the most relevant information content. To achieve this, at least one of three main strategies is applied, namely: 1) constant temporal decay or fixed time window, 2) constant number of events, and 3) flow-based lifetime of events. However, these strategies suffer from at least one major limitation each. We instead propose a novel decay process for event cameras that adapts to the global scene dynamics and whose latency is in the order of nanoseconds. The main idea is to construct an adaptive quantity that encodes the global scene dynamics, denoted by event activity. The proposed method is evaluated in several event-based vision problems and datasets, consistently improving the corresponding baseline methods' performance. We thus believe it can have a significant widespread impact on event-based research. Code available: https://github.com/neuromorphic-paris/event_batch.
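A toy version of an "event activity" quantity helps make the idea concrete: a running value that decays between events and is bumped by each incoming event, so busy scenes accumulate a high value. This sketch uses a fixed decay constant `tau`; the paper's contribution is precisely to make the decay adapt to the scene dynamics, which this simplification does not reproduce:

```python
import math

def event_activity_trace(timestamps, tau=1e-3):
    """Running event activity: exponential decay over each inter-event gap,
    plus a unit bump per event.  Dense event streams (fast scenes) keep the
    activity high; sparse streams let it fall back toward one."""
    activity, t_prev, trace = 0.0, None, []
    for t in timestamps:
        if t_prev is not None:
            activity *= math.exp(-(t - t_prev) / tau)  # decay over the gap
        activity += 1.0                                # this event's contribution
        trace.append(activity)
        t_prev = t
    return trace
```

Because the update touches only the latest event and a scalar state, its cost per event is constant, which is consistent with the nanosecond-scale latency claimed above.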

ScanDMM: A Deep Markov Model of Scanpath Prediction for 360deg Images
Sui, Xiangjie and Fang, Yuming and Zhu, Hanwei and Wang, Shiqi and Wang, Zhou



Research question: This paper addresses scanpath prediction for 360-degree images, i.e., producing dynamic gaze behaviors based on the human visual perception mechanism.
Motivation: Existing scanpath prediction methods for 360-degree images do not fully account for the time dependency when predicting human scanpaths, resulting in inferior performance and poor generalizability.
Method: This paper proposes ScanDMM, a novel deep Markov model architecture for 360-degree scanpath prediction. We design a semantics-guided transition function to learn the nonlinear dynamics of the time-dependent attentional landscape, and propose a state initialization strategy that accounts for the starting point of viewing, enabling the model to learn the dynamics with the correct "launcher".
Results: Experiments show that the model achieves state-of-the-art performance on four 360-degree image databases, and its generalizability is demonstrated by applying the scanpath prediction model to other visual tasks such as saliency detection and image quality assessment, which is expected to provide insights into these fields.

Scanpath prediction for 360deg images aims to produce dynamic gaze behaviors based on the human visual perception mechanism. Most existing scanpath prediction methods for 360deg images do not give a complete treatment of the time-dependency when predicting human scanpath, resulting in inferior performance and poor generalizability. In this paper, we present a scanpath prediction method for 360deg images by designing a novel Deep Markov Model (DMM) architecture, namely ScanDMM. We propose a semantics-guided transition function to learn the nonlinear dynamics of time-dependent attentional landscape. Moreover, a state initialization strategy is proposed by considering the starting point of viewing, enabling the model to learn the dynamics with the correct "launcher". We further demonstrate that our model achieves state-of-the-art performance on four 360deg image databases, and exhibit its generalizability by presenting two applications of applying scanpath prediction models to other visual tasks - saliency detection and image quality assessment, expecting to provide profound insights into these fields.
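The deep Markov model structure above can be illustrated with a toy rollout: a latent state evolves through a transition function with Gaussian noise, and each state emits a gaze point. The `transition` and `emission` callables here are placeholders for ScanDMM's learned semantics-guided networks, not the paper's models:

```python
import numpy as np

def rollout_scanpath(z0, transition, emission, steps, rng, noise=0.05):
    """Sample a scanpath from a toy deep Markov model: latent state z_t evolves
    as z_t = transition(z_{t-1}) + noise, and each state emits one gaze point."""
    z, path = np.asarray(z0, dtype=float), []
    for _ in range(steps):
        z = transition(z) + rng.normal(scale=noise, size=z.shape)
        path.append(emission(z))
    return np.array(path)
```

The state initialization strategy described above corresponds to choosing `z0` from the viewer's starting point rather than arbitrarily, so the sampled dynamics launch from the right place.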

A Light Weight Model for Active Speaker Detection
Liao, Junhua and Duan, Haihan and Feng, Kanghui and Zhao, Wanbing and Yang, Yanbing and Chen, Liangyin



Research question: How to effectively detect who is speaking in audio-visual scenarios.
Motivation: Existing methods improve performance but require large amounts of computation and memory, making them unsuitable for resource-limited environments.
Method: A lightweight active speaker detection architecture is built by reducing the number of input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, and applying gated recurrent units (GRUs) with low computational complexity for cross-modal modeling.
Results: Experiments show that the method achieves competitive mAP on the AVA-ActiveSpeaker dataset (94.1% vs. 94.2%) while its resource costs are significantly lower than existing methods, especially in model parameters (1.0M vs. 22.5M, roughly 23x) and FLOPs (0.6G vs. 2.6G, roughly 4x). The method also performs well on the Columbia dataset, demonstrating good robustness.

Active speaker detection is a challenging task in audio-visual scenarios, with the aim to detect who is speaking in one or more speaker scenarios. This task has received considerable attention because it is crucial in many applications. Existing studies have attempted to improve the performance by inputting multiple candidate information and designing complex models. Although these methods have achieved excellent performance, their high memory and computational power consumption render their application to resource-limited scenarios difficult. Therefore, in this study, a lightweight active speaker detection architecture is constructed by reducing the number of input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, and applying gated recurrent units with low computational complexity for cross-modal modeling. Experimental results on the AVA-ActiveSpeaker dataset reveal that the proposed framework achieves competitive mAP performance (94.1% vs. 94.2%), while the resource costs are significantly lower than the state-of-the-art method, particularly in model parameters (1.0M vs. 22.5M, approximately 23x) and FLOPs (0.6G vs. 2.6G, approximately 4x). Additionally, the proposed framework also performs well on the Columbia dataset, thus demonstrating good robustness. The code and model weights are available at https://github.com/Junhua-Liao/Light-ASD.
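A GRU's low cost comes from its two-gate recurrence over a small hidden state. The step below is the standard GRU cell; in the paper's setting `x` would be the fused audio-visual feature for the current frame, and the dict-based weight layout is purely for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W, U):
    """One standard GRU step.  W and U are dicts with keys 'z', 'r', 'h'
    holding input and recurrent weight matrices (a minimal sketch)."""
    z = sigmoid(x @ W['z'] + h @ U['z'])             # update gate
    r = sigmoid(x @ W['r'] + h @ U['r'])             # reset gate
    h_cand = np.tanh(x @ W['h'] + (r * h) @ U['h'])  # candidate state
    return (1.0 - z) * h + z * h_cand                # new hidden state
```

Per frame this costs a handful of small matrix-vector products, which is why a GRU-based cross-modal model can stay under the FLOP budgets quoted above.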

Self-Supervised Video Forensics by Audio-Visual Anomaly Detection
Feng, Chao and Chen, Ziyang and Owens, Andrew



Research question: How to identify audio-visual inconsistencies in videos via anomaly detection, training a video forensics method using only real, unlabeled data.
Motivation: Manipulated videos often contain subtle inconsistencies between their audio and visual signals, and these need to be found.
Method: Train an autoregressive model to generate sequences of audio-visual features that capture the temporal synchronization between video frames and sound; at test time, flag videos to which the model assigns low probability.
Results: Despite being trained entirely on real videos, the model performs strongly on detecting manipulated speech videos.

Manipulated videos often contain subtle inconsistencies between their visual and audio signals. We propose a video forensics method, based on anomaly detection, that can identify these inconsistencies, and that can be trained solely using real, unlabeled data. We train an autoregressive model to generate sequences of audio-visual features, using feature sets that capture the temporal synchronization between video frames and sound. At test time, we then flag videos that the model assigns low probability. Despite being trained entirely on real videos, our model obtains strong performance on the task of detecting manipulated speech videos. Project site: https://cfeng16.github.io/audio-visual-forensics.
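The test-time decision rule can be sketched as scoring each video by its likelihood under the autoregressive model and flagging low scores. The percentile calibration below is a common choice for picking the threshold from real videos only, and is an assumption rather than the paper's exact rule:

```python
import numpy as np

def calibrate_threshold(real_video_scores, fpr=0.05):
    """Pick a log-likelihood threshold so that roughly `fpr` of real videos
    would be flagged (calibration uses only real, unlabeled data)."""
    return np.percentile(real_video_scores, 100 * fpr)

def flag_manipulated(avg_log_prob, threshold):
    """Flag a video whose average audio-visual feature log-likelihood under
    the autoregressive model falls below the threshold."""
    return avg_log_prob < threshold
```

Because both the model and the threshold are fit on real videos alone, no manipulated examples are ever needed during training, which is the self-supervised aspect emphasized above.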

LOGO: A Long-Form Video Dataset for Group Action Quality Assessment
Zhang, Shiyi and Dai, Wenxun and Wang, Sujia and Shen, Xiangwei and Lu, Jiwen and Zhou, Jie and Tang, Yansong



Research question: This paper addresses the problem that existing action quality assessment (AQA) methods focus mainly on single-person, short-sequence scenes and struggle with more complex situations.
Motivation: To broaden the applicability of AQA, we construct LOGO, a multi-person, long-form video dataset for more complex scenarios.
Method: We design a simple yet effective method to model relations among athletes and reason about the latent temporal logic in long-form videos. Specifically, we design a group-aware attention module, which can be easily plugged into existing AQA methods, to enrich clip-wise representations with contextual group information.
Results: Experiments show that our approach achieves state-of-the-art results on the LOGO dataset; the dataset and code will be released on GitHub.

Action quality assessment (AQA) has become an emerging topic since it can be extensively applied in numerous scenarios. However, most existing methods and datasets focus on single-person short-sequence scenes, hindering the application of AQA in more complex situations. To address this issue, we construct a new multi-person long-form video dataset for action quality assessment named LOGO. Distinguished in scenario complexity, our dataset contains 200 videos from 26 artistic swimming events with 8 athletes in each sample along with an average duration of 204.2 seconds. As for richness in annotations, LOGO includes formation labels to depict group information of multiple athletes and detailed annotations on action procedures. Furthermore, we propose a simple yet effective method to model relations among athletes and reason about the potential temporal logic in long-form videos. Specifically, we design a group-aware attention module, which can be easily plugged into existing AQA methods, to enrich the clip-wise representations based on contextual group information. To benchmark LOGO, we systematically conduct investigations on the performance of several popular methods in AQA and action segmentation. The results reveal the challenges our dataset brings. Extensive experiments also show that our approach achieves state-of-the-art on the LOGO dataset. The dataset and code will be released at https://github.com/shiyi-zh0408/LOGO.
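A plug-in group-aware attention module can be sketched as single-head dot-product attention from the clip representation over the athletes' features, with a residual add. The single head, the softmax form, and the residual enrichment are all illustrative assumptions about the module's design:

```python
import numpy as np

def group_aware_attention(clip_feat, athlete_feats, tau=1.0):
    """Enrich a clip-wise representation with a context vector attended over
    the individual athletes' features (hypothetical single-head form).
    Shapes: clip_feat (D,), athlete_feats (K, D)."""
    scores = athlete_feats @ clip_feat / tau
    w = np.exp(scores - scores.max())
    w /= w.sum()                        # attention weights over the K athletes
    context = w @ athlete_feats         # group context vector
    return clip_feat + context          # residual enrichment (an assumption)
```

Because the module maps a `(D,)` clip feature to another `(D,)` feature, it can be inserted between the backbone and the scoring head of an existing AQA model without changing either side.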

Learning To Detect Mirrors From Videos via Dual Correspondences
Lin, Jiaying and Tan, Xin and Lau, Rynson W.H.



Research question: How to detect mirrors in dynamic scenes; video mirror detection (VMD) remains under-explored due to the lack of a high-quality dataset and an effective method.
Motivation: The authors observe that correspondences usually exist between the contents inside and outside a mirror, but such correspondences may not appear in every frame, e.g., due to camera pose changes. This inspires a video mirror detection method that can tolerate spatially missing correspondences.
Method: The authors propose VMD-Net, a video mirror detection method that correlates mirror correspondences at both the intra-frame and inter-frame levels via a dual correspondence module. They also propose the first large-scale VMD dataset, named VMD-D, containing 14,987 image frames from 269 videos with corresponding manually annotated masks.
Results: Experiments show that the method outperforms state-of-the-art methods from relevant fields. To enable real-time VMD, the method uses backbone features efficiently, removing the multi-level module designs and output-map post-processing common in existing methods, making it efficient and practical for real-time video applications.

Detecting mirrors from static images has received significant research interest recently. However, detecting mirrors over dynamic scenes is still under-explored due to the lack of a high-quality dataset and an effective method for video mirror detection (VMD). To the best of our knowledge, this is the first work to address the VMD problem from a deep-learning-based perspective. Our observation is that there are often correspondences between the contents inside (reflected) and outside (real) of a mirror, but such correspondences may not always appear in every frame, e.g., due to the change of camera pose. This inspires us to propose a video mirror detection method, named VMD-Net, that can tolerate spatially missing correspondences by considering the mirror correspondences at both the intra-frame level as well as inter-frame level via a dual correspondence module that looks over multiple frames spatially and temporally for correlating correspondences. We further propose a first large-scale dataset for VMD (named VMD-D), which contains 14,987 image frames from 269 videos with corresponding manually annotated masks. Experimental results show that the proposed method outperforms SOTA methods from relevant fields. To enable real-time VMD, our method efficiently utilizes the backbone features by removing the redundant multi-level module design and gets rid of post-processing of the output maps commonly used in existing methods, making it very efficient and practical for real-time video-based applications. Code, dataset, and models are available at https://jiaying.link/cvpr2023-vmd/

Towards Scalable Neural Representation for Diverse Videos
He, Bo and Yang, Xitong and Wang, Hanyu and Wu, Zuxuan and Chen, Hao and Huang, Shuaiyi and Ren, Yixuan and Lim, Ser-Nam and Shrivastava, Abhinav



Research question: How to efficiently encode a large number of diverse videos.
Motivation: Existing implicit neural representation (INR) methods work well on a handful of redundant videos but are limited when encoding many diverse videos.
Method: D-NeRV, a novel neural representation framework, is proposed: it decouples clip-specific visual content from motion information, introduces temporal reasoning into the implicit neural network, and employs task-oriented flow as an intermediate output to reduce spatial redundancies.
Results: Experiments show that D-NeRV largely surpasses NeRV and traditional video compression techniques on the video compression task, and achieves 3%-10% higher accuracy than NeRV on action recognition under the same compression ratios.

Implicit neural representations (INR) have gained increasing attention in representing 3D scenes and images, and have been recently applied to encode videos (e.g., NeRV, E-NeRV). While achieving promising results, existing INR-based methods are limited to encoding a handful of short videos (e.g., seven 5-second videos in the UVG dataset) with redundant visual content, leading to a model design that fits individual video frames independently and is not efficiently scalable to a large number of diverse videos. This paper focuses on developing neural representations for a more practical setup -- encoding long and/or a large number of videos with diverse visual content. We first show that instead of dividing videos into small subsets and encoding them with separate models, encoding long and diverse videos jointly with a unified model achieves better compression results. Based on this observation, we propose D-NeRV, a novel neural representation framework designed to encode diverse videos by (i) decoupling clip-specific visual content from motion information, (ii) introducing temporal reasoning into the implicit neural network, and (iii) employing the task-oriented flow as intermediate output to reduce spatial redundancies. Our new model largely surpasses NeRV and traditional video compression techniques on UCF101 and UVG datasets on the video compression task. Moreover, when used as an efficient data-loader, D-NeRV achieves 3%-10% higher accuracy than NeRV on action recognition tasks on the UCF101 dataset under the same compression ratios.

Language-Guided Audio-Visual Source Separation via Trimodal Consistency
Tan, Reuben and Ray, Arijit and Burns, Andrea and Plummer, Bryan A. and Salamon, Justin and Nieto, Oriol and Russell, Bryan and Saenko, Kate



Research question: Propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries.
Motivation: The key challenge of this task is learning to associate the linguistic description of a sound-emitting object with its visual features and the corresponding components of the audio waveform, all without access to annotations during training.
Method: Adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions that encourage stronger alignment among the audio, visual, and natural language modalities.
Results: On three audio-visual separation datasets (MUSIC, SOLOS, and AudioSet), the approach outperforms state-of-the-art strongly supervised methods, despite using no object detectors or text labels during training.

We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data. A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform, all without access to annotations during training. To overcome this challenge, we adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions and encourage a stronger alignment between the audio, visual and natural language modalities. During inference, our approach can separate sounds given text, video and audio input, or given text and audio input alone. We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets, including MUSIC, SOLOS and AudioSet, where we outperform state-of-the-art strongly supervised approaches despite not using object detectors or text labels during training. Finally, we also include samples of our separated audios in the supplemental for reference.

CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition With Variational Alignment
Zheng, Jiangbin and Wang, Yile and Tan, Cheng and Li, Siyuan and Wang, Ge and Xia, Jun and Chen, Yidong and Li, Stan Z.



Research question: Sign language recognition (SLR) is a weakly supervised task that annotates sign videos as textual glosses; insufficient training caused by the lack of large-scale available sign datasets is the main bottleneck for SLR.
Motivation: Current SLR work mostly adopts pretrained visual modules and follows two mainstream solutions. Multi-stream architectures extend multi-cue visual features and yield the current state-of-the-art performance, but their designs are complex and may introduce noise. In contrast, advanced single-cue SLR frameworks that use explicit cross-modal alignment between the visual and textual modalities are simple and effective, and potentially competitive with the multi-cue framework.
Method: We propose CVT-SLR, a novel contrastive visual-textual transformation model that fully exploits pretrained knowledge from both the visual and language modalities. Building on the single-cue cross-modal alignment framework, we propose a variational autoencoder (VAE) for pretrained contextual knowledge while introducing a complete pretrained language module. The VAE implicitly aligns the visual and textual modalities while benefiting from pretrained contextual knowledge, like a traditional contextual module. Meanwhile, a contrastive cross-modal alignment algorithm is designed to explicitly strengthen the consistency constraints.
Results: Extensive experiments on public datasets (PHOENIX-2014 and PHOENIX-2014T) show that our CVT-SLR consistently outperforms existing single-cue methods and even surpasses state-of-the-art multi-cue methods.

Sign language recognition (SLR) is a weakly supervised task that annotates sign videos as textual glosses. Recent studies show that insufficient training caused by the lack of large-scale available sign datasets becomes the main bottleneck for SLR. Most SLR works thereby adopt pretrained visual modules and develop two mainstream solutions. The multi-stream architectures extend multi-cue visual features, yielding the current SOTA performances but requiring complex designs and might introduce potential noise. Alternatively, the advanced single-cue SLR frameworks using explicit cross-modal alignment between visual and textual modalities are simple and effective, potentially competitive with the multi-cue framework. In this work, we propose a novel contrastive visual-textual transformation for SLR, CVT-SLR, to fully explore the pretrained knowledge of both the visual and language modalities. Based on the single-cue cross-modal alignment framework, we propose a variational autoencoder (VAE) for pretrained contextual knowledge while introducing the complete pretrained language module. The VAE implicitly aligns visual and textual modalities while benefiting from pretrained contextual knowledge as the traditional contextual module. Meanwhile, a contrastive cross-modal alignment algorithm is designed to explicitly enhance the consistency constraints. Extensive experiments on public datasets (PHOENIX-2014 and PHOENIX-2014T) demonstrate that our proposed CVT-SLR consistently outperforms existing single-cue methods and even outperforms SOTA multi-cue methods.

Ego-Body Pose Estimation via Ego-Head Pose Estimation
Li, Jiaman and Liu, Karen and Wu, Jiajun



Research question: How to estimate 3D human motion from an egocentric video sequence, for human behavior understanding and VR/AR applications.
Motivation: Because the user's body is usually out of view of the front-facing head-mounted camera, directly learning a mapping between egocentric video and human motion is challenging. Moreover, collecting large-scale, high-quality datasets of paired egocentric videos and 3D human motions requires accurate motion capture devices, which typically restricts the video scenes to lab-like environments.
Method: We propose a new method, Ego-Body Pose Estimation via Ego-Head Pose Estimation (EgoEgo), which decomposes the problem into two stages connected by head motion as an intermediate representation. EgoEgo first integrates SLAM and a learning approach to estimate accurate head motion; then, with the estimated head pose as input, it uses conditional diffusion to generate multiple plausible full-body motions. This decoupling of head and body pose removes the need for training datasets with paired egocentric video and 3D human motion, letting us leverage large-scale egocentric video datasets and motion capture datasets separately.
Results: Systematic benchmarking on ARES and real data shows that our EgoEgo model significantly outperforms current state-of-the-art methods on both.

Estimating 3D human motion from an egocentric video sequence plays a critical role in human behavior understanding and has various applications in VR/AR. However, naively learning a mapping between egocentric videos and human motions is challenging, because the user's body is often unobserved by the front-facing camera placed on the head of the user. In addition, collecting large-scale, high-quality datasets with paired egocentric videos and 3D human motions requires accurate motion capture devices, which often limit the variety of scenes in the videos to lab-like environments. To eliminate the need for paired egocentric video and human motions, we propose a new method, Ego-Body Pose Estimation via Ego-Head Pose Estimation (EgoEgo), which decomposes the problem into two stages, connected by the head motion as an intermediate representation. EgoEgo first integrates SLAM and a learning approach to estimate accurate head motion. Subsequently, leveraging the estimated head pose as input, EgoEgo utilizes conditional diffusion to generate multiple plausible full-body motions. This disentanglement of head and body pose eliminates the need for training datasets with paired egocentric videos and 3D human motion, enabling us to leverage large-scale egocentric video datasets and motion capture datasets separately. Moreover, for systematic benchmarking, we develop a synthetic dataset, AMASS-Replica-Ego-Syn (ARES), with paired egocentric videos and human motion. On both ARES and real data, our EgoEgo model performs significantly better than the current state-of-the-art methods.

Hierarchical Video-Moment Retrieval and Step-Captioning
Zala, Abhay and Cho, Jaemin and Kottur, Satwik and Chen, Xilun and Oguz, Barlas and Mehdad, Yashar and Bansal, Mohit



Research question: This paper addresses searching for information in large video corpora, and how to jointly search a video corpus and generate summaries.
Motivation: Prior work mostly studies text-based video retrieval, moment retrieval, video summarization, and video captioning in isolation, without an end-to-end setup that can jointly search a video corpus and generate summaries.
Method: The authors present the HiREST (HIerarchical REtrieval and STep-captioning) dataset and propose a new benchmark covering hierarchical information retrieval and visual/textual stepwise summarization over an instructional video corpus.
Results: Experiments show that while the baseline models give some promising results, there remains large room for improvement.

There is growing interest in searching for information from large video corpora. Prior works have studied relevant tasks, such as text-based video retrieval, moment retrieval, video summarization, and video captioning in isolation, without an end-to-end setup that can jointly search from video corpora and generate summaries. Such an end-to-end setup would allow for many interesting applications, e.g., a text-based search that finds a relevant video from a video corpus, extracts the most relevant moment from that video, and segments the moment into important steps with captions. To address this, we present the HiREST (HIerarchical REtrieval and STep-captioning) dataset and propose a new benchmark that covers hierarchical information retrieval and visual/textual stepwise summarization from an instructional video corpus. HiREST consists of 3.4K text-video pairs from an instructional video dataset, where 1.1K videos have annotations of moment spans relevant to text query and breakdown of each moment into key instruction steps with caption and timestamps (totaling 8.6K step captions). Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks. In moment segmentation, models break down a video moment into instruction steps and identify start-end boundaries. In step captioning, models generate a textual summary for each step. We also present starting point task-specific and end-to-end joint baseline models for our new benchmark. While the baseline models show some promising results, there still exists large room for future improvement by the community.

EvShutter: Transforming Events for Unconstrained Rolling Shutter Correction
Erbach, Julius and Tulyakov, Stepan and Vitoria, Patricia and Bochicchio, Alfredo and Li, Yuanyou



Research question: How to use a single RGB image plus high-temporal-resolution event information to undistort images affected by motion blur and rolling shutter (RS) distortion.
Motivation: Existing RS correction algorithms based on a constant velocity assumption need multiple frames to predict a dense displacement field, whereas the newly proposed Eventful Shutter (EvShutter) method corrects RS using event information and a single RGB image, without relying on the constant velocity assumption.
Method: EvShutter first removes blur with a novel flow-based deblurring module, then compensates RS with a double-encoder hourglass network. Unlike previous methods, it does not rely on a constant velocity assumption and uses a simple architecture, thanks to an RS-dedicated event transformation called Filter and Flip (FnF) that transforms input events to encode only the changes between GS and RS images.
Results: Evaluation on RS-ERGB, the first dataset with real events and high-quality RS images with optional blur, shows that the method outperforms state-of-the-art image-based and event-based methods by 9.16 dB and 0.75 dB in PSNR, and by 23% and 21% in LPIPS, respectively.

Widely used Rolling Shutter (RS) CMOS sensors capture high resolution images at the expense of introducing distortions and artifacts in the presence of motion. In such situations, RS distortion correction algorithms are critical. Recent methods rely on a constant velocity assumption and require multiple frames to predict the dense displacement field. In this work, we introduce a new method, called Eventful Shutter (EvShutter), that corrects RS artifacts using a single RGB image and event information with high temporal resolution. The method firstly removes blur using a novel flow-based deblurring module and then compensates RS using a double encoder hourglass network. In contrast to previous methods, it does not rely on a constant velocity assumption and uses a simple architecture thanks to an event transformation dedicated to RS, called Filter and Flip (FnF), that transforms input events to encode only the changes between GS and RS images. To evaluate the proposed method and facilitate future research, we collect the first dataset with real events and high-quality RS images with optional blur, called RS-ERGB. We generate the RS images from GS images using a newly proposed simulator based on adaptive interpolation. The simulator permits the use of inexpensive cameras with long exposure to capture high-quality GS images. We show that on this realistic dataset the proposed method outperforms the state-of-the-art image- and event-based methods by 9.16 dB and 0.75 dB respectively in terms of PSNR and an improvement of 23% and 21% in LPIPS.

Hierarchical Neural Memory Network for Low Latency Event Processing
Hamaguchi, Ryuhei and Furukawa, Yasutaka and Onishi, Masaki and Sakurada, Ken



Research question: This paper proposes a low-latency neural network architecture for event-based dense prediction tasks.
Motivation: Conventional architectures encode entire scene contents at a fixed rate regardless of their temporal characteristics.
Method: A temporal hierarchy is built from stacked latent memories operating at different rates, so that contents are encoded at a temporal scale matched to their movement speed.
Results: The architecture not only reduces the redundancy of conventional architectures but also exploits long-term dependencies. Extensive evaluation on three event-based dense prediction tasks shows the method outperforms existing approaches in accuracy and latency, while demonstrating effective event and image fusion capabilities.

This paper proposes a low latency neural network architecture for event-based dense prediction tasks. Conventional architectures encode entire scene contents at a fixed rate regardless of their temporal characteristics. Instead, the proposed network encodes contents at a proper temporal scale depending on its movement speed. We achieve this by constructing temporal hierarchy using stacked latent memories that operate at different rates. Given low latency event steams, the multi-level memories gradually extract dynamic to static scene contents by propagating information from the fast to the slow memory modules. The architecture not only reduces the redundancy of conventional architectures but also exploits long-term dependencies. Furthermore, an attention-based event representation efficiently encodes sparse event streams into the memory cells. We conduct extensive evaluations on three event-based dense prediction tasks, where the proposed approach outperforms the existing methods on accuracy and latency, while demonstrating effective event and image fusion capabilities. The code is available at https://hamarh.github.io/hmnet/
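The multi-rate memory hierarchy can be illustrated with a toy update loop: level k refreshes only every `rates[k]` steps, and information flows from fast to slow levels, so the slow memories end up holding the static content. The EMA update below is a placeholder for the paper's learned memory cells, not their actual mechanism:

```python
import numpy as np

def update_hierarchical_memories(memories, feat, rates, step, mix=0.1):
    """Toy multi-rate memory hierarchy.  memories: list of (D,) arrays from
    fast to slow; rates[k]: refresh period of level k in steps."""
    x = feat
    for k, rate in enumerate(rates):
        if step % rate == 0:
            # Refresh this level from the faster level's content (EMA placeholder).
            memories[k] = (1.0 - mix) * memories[k] + mix * x
        x = memories[k]   # the slower level reads from the faster one
    return memories
```

Skipping updates of the slow levels on most steps is exactly where the latency and redundancy savings come from: fast-changing content is handled by cheap frequent updates, while static content is revisited rarely.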

Mutual Information-Based Temporal Difference Learning for Human Pose Estimation in Video
Feng, Runyang and Gao, Yixing and Ma, Xueqing and Tse, Tze Ho Elden and Chang, Hyung Jin



Research question: How to perform multi-frame human pose estimation effectively.
Motivation: Existing methods directly use optical flow or deformable convolution to predict full-spectrum motion fields, which may introduce many irrelevant cues, such as nearby people or background, leading to suboptimal results.
Method: This paper presents a novel multi-frame human pose estimation framework that uses temporal differences across frames to encode dynamic contexts and engages a mutual information objective to disentangle useful motion information. Specifically, a multi-stage Temporal Difference Encoder performs incremental cascaded learning conditioned on multi-stage feature-difference sequences to derive informative motion representations. A Representation Disentanglement module is further proposed from the mutual information perspective: it grasps discriminative task-relevant motion signals by explicitly defining the useful and noisy constituents of the raw motion features and minimizing their mutual information.
Results: The framework ranks No. 1 in the Crowd Pose Estimation in Complex Events challenge (on the HiEve benchmark) and achieves state-of-the-art performance on the PoseTrack2017, PoseTrack2018, and PoseTrack21 benchmarks.

Temporal modeling is crucial for multi-frame human pose estimation. Most existing methods directly employ optical flow or deformable convolution to predict full-spectrum motion fields, which might incur numerous irrelevant cues, such as a nearby person or background. Without further efforts to excavate meaningful motion priors, their results are suboptimal, especially in complicated spatiotemporal interactions. On the other hand, the temporal difference has the ability to encode representative motion information which can potentially be valuable for pose estimation but has not been fully exploited. In this paper, we present a novel multi-frame human pose estimation framework, which employs temporal differences across frames to model dynamic contexts and engages mutual information objectively to facilitate useful motion information disentanglement. To be specific, we design a multi-stage Temporal Difference Encoder that performs incremental cascaded learning conditioned on multi-stage feature difference sequences to derive informative motion representation. We further propose a Representation Disentanglement module from the mutual information perspective, which can grasp discriminative task-relevant motion signals by explicitly defining useful and noisy constituents of the raw motion features and minimizing their mutual information. These place us to rank No.1 in the Crowd Pose Estimation in Complex Events Challenge on benchmark dataset HiEve, and achieve state-of-the-art performance on three benchmarks PoseTrack2017, PoseTrack2018, and PoseTrack21.

SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision
Liu, Xubo and Lakomkin, Egor and Vougioukas, Konstantinos and Ma, Pingchuan and Chen, Honglie and Xie, Ruiming and Doulaty, Morrie and Moritz, Niko and Kolar, Jachym and Petridis, Stavros and Pantic, Maja and Fuegen, Christian



Research question: How to leverage synthetic visual data to improve visual speech recognition (VSR).
Motivation: Publicly available transcribed video datasets are limited in size, while state-of-the-art VSR results rely on ever-larger amounts of video data.
Method: Proposes SynthVSR, the first method to improve VSR systems with synthetic lip movements generated by a speech-driven lip animation model. The animation model is trained on an unlabeled audio-visual dataset and can be further optimized toward a pre-trained VSR model when labeled videos are available.
Results: On the largest public VSR benchmark, Lip Reading Sentences 3 (LRS3), SynthVSR achieves a WER of 43.3% with only 30 hours of real labeled data, outperforming off-the-shelf approaches that use thousands of hours of video. With all 438 hours of labeled LRS3 data, the WER drops to 27.9%, on par with the state-of-the-art self-supervised AV-HuBERT method. Combined with large-scale pseudo-labeled audio-visual data, SynthVSR reaches a new state-of-the-art VSR WER of 16.9% using publicly available data only, surpassing recent methods trained on 90,000 hours of non-public machine-transcribed video data.

Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on increasingly large amounts of video data, while the publicly available transcribed video datasets are limited in size. In this paper, for the first time, we study the potential of leveraging synthetic visual data for VSR. Our method, termed SynthVSR, substantially improves the performance of VSR systems with synthetic lip movements. The key idea behind SynthVSR is to leverage a speech-driven lip animation model that generates lip movements conditioned on the input speech. The speech-driven lip animation model is trained on an unlabeled audio-visual dataset and could be further optimized towards a pre-trained VSR model when labeled videos are available. As plenty of transcribed acoustic data and face images are available, we are able to generate large-scale synthetic data using the proposed lip animation model for semi-supervised VSR training. We evaluate the performance of our approach on the largest public VSR benchmark - Lip Reading Sentences 3 (LRS3). SynthVSR achieves a WER of 43.3% with only 30 hours of real labeled data, outperforming off-the-shelf approaches using thousands of hours of video. The WER is further reduced to 27.9% when using all 438 hours of labeled data from LRS3, which is on par with the state-of-the-art self-supervised AV-HuBERT method. Furthermore, when combined with large-scale pseudo-labeled audio-visual data SynthVSR yields a new state-of-the-art VSR WER of 16.9% using publicly available data only, surpassing the recent state-of-the-art approaches trained with 29 times more non-public machine-transcribed video data (90,000 hours). Finally, we perform extensive ablation studies to understand the effect of each component in our proposed method.

Search-Map-Search: A Frame Selection Paradigm for Action Recognition
Zhao, Mingjun and Yu, Yakun and Wang, Xiaoli and Yang, Lei and Niu, Di



Research question: Despite the success of deep learning in video understanding, processing every frame of a video is computationally expensive and often unnecessary for real-time applications.
Motivation: Existing frame selection methods either sample frames individually based on per-frame importance predictions, ignoring interactions among frames, or adopt reinforcement learning agents to find representative frames in succession, which is costly to train and can suffer from stability issues.
Method: We propose a Search-Map-Search learning paradigm that combines the advantages of heuristic search and supervised learning to select the best combination of frames from a video as a single entity. By combining search with learning, the method better captures frame interactions while incurring low inference overhead.
Results: Extensive experiments show that our frame selection method effectively improves the performance of action recognition models and significantly outperforms a number of competitive baselines.

Despite the success of deep learning in video understanding tasks, processing every frame in a video is computationally expensive and often unnecessary in real-time applications. Frame selection aims to extract the most informative and representative frames to help a model better understand video content. Existing frame selection methods either individually sample frames based on per-frame importance prediction, without considering interaction among frames, or adopt reinforcement learning agents to find representative frames in succession, which are costly to train and may lead to potential stability issues. To overcome the limitations of existing methods, we propose a Search-Map-Search learning paradigm which combines the advantages of heuristic search and supervised learning to select the best combination of frames from a video as one entity. By combining search with learning, the proposed method can better capture frame interactions while incurring a low inference overhead. Specifically, we first propose a hierarchical search method conducted on each training video to search for the optimal combination of frames with the lowest error on the downstream task. A feature mapping function is then learned to map the frames of a video to the representation of its target optimal frame combination. During inference, another search is performed on an unseen video to select a combination of frames whose feature representation is close to the projected feature representation. Extensive experiments based on several action recognition benchmarks demonstrate that our frame selection method effectively improves performance of action recognition models, and significantly outperforms a number of competitive baselines.
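Treating the frame combination as one entity means the search scores whole candidate sets rather than individual frames. As a toy illustration, here is a greedy variant of such a search; the paper uses a more elaborate hierarchical search, and `score` is a hypothetical stand-in for the (negated) downstream-task error.

```python
def greedy_frame_search(frames, k, score):
    """Greedily grow a combination of k frame indices.

    `score` evaluates a whole sorted combination (higher is better),
    so interactions between frames influence every selection step.
    """
    chosen = []
    remaining = list(range(len(frames)))
    while len(chosen) < k and remaining:
        best = max(remaining, key=lambda i: score(sorted(chosen + [i])))
        chosen.append(best)
        remaining.remove(best)
    return sorted(chosen)
```

With a set-level `score`, even this greedy sketch avoids the purely per-frame ranking that the paper criticizes.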

Uncovering the Missing Pattern: Unified Framework Towards Trajectory Imputation and Prediction
Xu, Yi and Bazarjani, Armin and Chi, Hyung-gun and Choi, Chiho and Fu, Yun



Research question: Current trajectory prediction methods usually assume the observed sequence is complete, ignoring missing values caused by object occlusion, scope limitation, sensor failure, etc., which limits prediction accuracy.
Motivation: To address this, the paper proposes a unified framework, the Graph-based Conditional Variational Recurrent Neural Network (GC-VRNN), which performs trajectory imputation and prediction simultaneously.
Method: We introduce a novel Multi-Space Graph Neural Network (MS-GNN) that extracts spatial features from incomplete observations and leverages missing patterns. In addition, a Conditional VRNN with a purpose-built Temporal Decay (TD) module captures temporal dependencies and temporal missing patterns in incomplete trajectories.
Results: Extensive experiments verify the exceptional performance of the proposed method. To our knowledge, this is the first work to address the lack of benchmarks and techniques for trajectory imputation and prediction in a unified manner.

Trajectory prediction is a crucial undertaking in understanding entity movement or human behavior from observed sequences. However, current methods often assume that the observed sequences are complete while ignoring the potential for missing values caused by object occlusion, scope limitation, sensor failure, etc. This limitation inevitably hinders the accuracy of trajectory prediction. To address this issue, our paper presents a unified framework, the Graph-based Conditional Variational Recurrent Neural Network (GC-VRNN), which can perform trajectory imputation and prediction simultaneously. Specifically, we introduce a novel Multi-Space Graph Neural Network (MS-GNN) that can extract spatial features from incomplete observations and leverage missing patterns. Additionally, we employ a Conditional VRNN with a specifically designed Temporal Decay (TD) module to capture temporal dependencies and temporal missing patterns in incomplete trajectories. The inclusion of the TD module allows for valuable information to be conveyed through the temporal flow. We also curate and benchmark three practical datasets for the joint problem of trajectory imputation and prediction. Extensive experiments verify the exceptional performance of our proposed method. As far as we know, this is the first work to address the lack of benchmarks and techniques for trajectory imputation and prediction in a unified manner.
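A temporal decay term of the kind used to down-weight information across missing time steps can be sketched as below. The exact parameterization of the paper's TD module is not given in the abstract; `w` and `b` are assumed learnable scalars, and the exponential-of-negated-ReLU form is a common convention for such decays.

```python
import math

def temporal_decay(delta_t, w=1.0, b=0.0):
    """Decay factor gamma = exp(-max(0, w*delta_t + b)).

    delta_t is the elapsed time since the last observed value; larger
    gaps yield a factor closer to 0, shrinking stale hidden state.
    """
    return math.exp(-max(0.0, w * delta_t + b))
```

A recurrent cell would multiply its carried-over state by this factor before consuming the next (possibly imputed) observation.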

3D Video Loops From Asynchronous Input
Ma, Li and Li, Xiaoyu and Liao, Jing and Sander, Pedro V.



Research question: How to enable an immersive experience of dynamic 3D looping scenes.
Motivation: Existing methods are mostly limited to 2D representations; the key challenge is to handle the per-view looping conditions of asynchronous input while maintaining view consistency in the 3D representation.
Method: We propose a novel sparse 3D video representation, the Multi-Tile Video (MTV), which provides a view-consistent prior and greatly reduces memory usage, making optimization of a 4D volume tractable. We then introduce a two-stage pipeline that constructs a 3D looping MTV from completely asynchronous multi-view videos with no time overlap. During optimization, a novel looping loss based on video temporal retargeting algorithms is adopted to loop the 3D scene.
Results: Experiments show our framework can generate and render photorealistic 3D looping videos in real time, even on mobile devices.

Looping videos are short video clips that can be looped endlessly without visible seams or artifacts. They provide a very attractive way to capture the dynamism of natural scenes. Existing methods have been mostly limited to 2D representations. In this paper, we take a step forward and propose a practical solution that enables an immersive experience on dynamic 3D looping scenes. The key challenge is to consider the per-view looping conditions from asynchronous input while maintaining view consistency for the 3D representation. We propose a novel sparse 3D video representation, namely Multi-Tile Video (MTV), which not only provides a view-consistent prior, but also greatly reduces memory usage, making the optimization of a 4D volume tractable. Then, we introduce a two-stage pipeline to construct the 3D looping MTV from completely asynchronous multi-view videos with no time overlap. A novel looping loss based on video temporal retargeting algorithms is adopted during the optimization to loop the 3D scene. Experiments of our framework have shown promise in successfully generating and rendering photorealistic 3D looping videos in real time even on mobile devices. The code, dataset, and live demos are available in https://limacv.github.io/VideoLoop3D_web/.

Frame Interpolation Transformer and Uncertainty Guidance
Plack, Markus and Briedis, Karlis Martins and Djelouah, Abdelaziz and Hullin, Matthias B. and Gross, Markus and Schroers, Christopher



Research question: Video frame interpolation has made important progress in recent years, but problems remain under more challenging conditions such as complex lighting or large motion.
Motivation: To address these, we propose a novel transformer-based interpolation network architecture capable of estimating the expected error together with the interpolated frame.
Method: In contrast to prior directions that leverage improved optical-flow methods with better splatting strategies, additional depth cues, or direct prediction, our transformer-based network jointly predicts the interpolated frame and an error map estimating where the interpolation is likely to fail.
Results: Experiments show improved visual quality across multiple datasets, further confirmed by a user study. The estimated error maps are essential for practical use on long video sequences, where problematic frames must be flagged. Finally, for rendered content, a partial rendering pass of the intermediate frame guided by the predicted error can be used during interpolation to produce a new frame of superior quality at a fraction of the cost of a full render.

Video frame interpolation has seen important progress in recent years, thanks to developments in several directions. Some works leverage better optical flow methods with improved splatting strategies or additional cues from depth, while others have investigated alternative approaches through direct predictions or transformers. Still, the problem remains unsolved in more challenging conditions such as complex lighting or large motion. In this work, we are bridging the gap towards video production with a novel transformer-based interpolation network architecture capable of estimating the expected error together with the interpolated frame. This offers several advantages that are of key importance for frame interpolation usage: First, we obtained improved visual quality over several datasets. The improvement in terms of quality is also clearly demonstrated through a user study. Second, our method estimates error maps for the interpolated frame, which are essential for real-life applications on longer video sequences where problematic frames need to be flagged. Finally, for rendered content a partial rendering pass of the intermediate frame, guided by the predicted error, can be utilized during the interpolation to generate a new frame of superior quality. Through this error estimation, our method can produce even higher-quality intermediate frames using only a fraction of the time compared to a full rendering.

QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation
Yang, Sicheng and Wu, Zhiyong and Li, Minglei and Zhang, Zhensong and Hao, Lei and Bao, Weihong and Zhuang, Haolin



Research question: How to address the challenges of speech-driven gesture generation caused by the random jitter of human motion and the inherently asynchronous relationship between speech and gestures.
Motivation: To tackle these challenges, we propose a novel quantization-based and phase-guided motion matching framework.
Method: We first present a gesture VQ-VAE module that learns a codebook to summarize meaningful gesture units. We then use Levenshtein distance to align diverse gestures with different speech. We further introduce phase to guide the optimal gesture matching according to the semantics of context or the rhythm of audio.
Results: Extensive experiments show our method outperforms recent approaches on speech-driven gesture generation.

Speech-driven gesture generation is highly challenging due to the random jitters of human motion. In addition, there is an inherent asynchronous relationship between human speech and gestures. To tackle these challenges, we introduce a novel quantization-based and phase-guided motion matching framework. Specifically, we first present a gesture VQ-VAE module to learn a codebook to summarize meaningful gesture units. With each code representing a unique gesture, random jittering problems are alleviated effectively. We then use Levenshtein distance to align diverse gestures with different speech. Levenshtein distance based on audio quantization as a similarity metric of corresponding speech of gestures helps match more appropriate gestures with speech, and solves the alignment problem of speech and gestures well. Moreover, we introduce phase to guide the optimal gesture matching based on the semantics of context or rhythm of audio. Phase guides when text-based or speech-based gestures should be performed to make the generated gestures more natural. Extensive experiments show that our method outperforms recent approaches on speech-driven gesture generation. Our code, database, pre-trained models and demos are available at https://github.com/YoungSeng/QPGesture.
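The Levenshtein distance the abstract names as its speech-gesture similarity metric is the standard edit distance, computable over sequences of quantized audio codes just as over strings. A compact single-row dynamic-programming implementation:

```python
def levenshtein(a, b):
    """Edit distance between two sequences (e.g., quantized audio tokens).

    Uses one DP row plus a scalar for the diagonal, O(len(b)) memory.
    """
    m, n = len(a), len(b)
    dp = list(range(n + 1))          # distances from a[:0] to every prefix of b
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i       # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = min(dp[j] + 1,                      # deletion
                      dp[j - 1] + 1,                  # insertion
                      prev + (a[i - 1] != b[j - 1]))  # substitution / match
            prev, dp[j] = dp[j], cur
    return dp[n]
```

Because it works on arbitrary hashable items, the same function aligns codebook-index sequences from audio quantization without modification.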

On the Benefits of 3D Pose and Tracking for Human Action Recognition
Rajasegaran, Jathushan and Pavlakos, Georgios and Kanazawa, Angjoo and Feichtenhofer, Christoph and Malik, Jitendra



Research question: This paper studies the benefits of using tracking and 3D pose for action recognition.
Motivation: Analyzing actions over a trajectory of human motion, rather than at a fixed point in space, allows better prediction of people's actions.
Method: First, we show the benefits of using 3D pose to infer actions and study person-person interactions. We then propose a Lagrangian Action Recognition model that fuses 3D pose with contextualized appearance over tracklets.
Results: On the AVA v2.2 dataset, the method achieves state-of-the-art performance in both the pose-only setting and the standard benchmark setting. When reasoning about actions with pose cues only, our pose model gains +10.0 mAP over the corresponding state of the art, and our fused model gains +2.8 mAP over the best state-of-the-art model.

In this work we study the benefits of using tracking and 3D poses for action recognition. To achieve this, we take the Lagrangian view on analysing actions over a trajectory of human motion rather than at a fixed point in space. Taking this stand allows us to use the tracklets of people to predict their actions. In this spirit, first we show the benefits of using 3D pose to infer actions, and study person-person interactions. Subsequently, we propose a Lagrangian Action Recognition model by fusing 3D pose and contextualized appearance over tracklets. To this end, our method achieves state-of-the-art performance on the AVA v2.2 dataset on both pose only settings and on standard benchmark settings. When reasoning about the action using only pose cues, our pose model achieves +10.0 mAP gain over the corresponding state-of-the-art while our fused model has a gain of +2.8 mAP over the best state-of-the-art model. Code and results are available at: https://brjathu.github.io/LART

Continuous Sign Language Recognition With Correlation Network
Hu, Lianyu and Gao, Liqing and Liu, Zekang and Feng, Wei



Research question: Current continuous sign language recognition (CSLR) methods usually process frames independently to capture frame-wise features, failing to identify signs effectively.
Motivation: Human body trajectories are a salient cue for identifying actions in video, and in sign language they are conveyed mainly by the hands and face across consecutive frames.
Method: We propose a correlation network (CorrNet) that explicitly leverages cross-frame body trajectories to identify signs. It comprises an identification module that emphasizes informative regions in each frame beneficial for expressing a sign, and a correlation module that dynamically computes correlation maps between the current frame and its adjacent neighbors to capture cross-frame trajectories.
Results: Thanks to its special attention to body trajectories, CorrNet achieves new state-of-the-art accuracy on four large-scale datasets: PHOENIX14, PHOENIX14-T, CSL-Daily, and CSL. A comprehensive comparison with previous spatial-temporal reasoning methods verifies its effectiveness, and visualizations demonstrate how CorrNet emphasizes human body trajectories across adjacent frames.

Human body trajectories are a salient cue to identify actions in video. Such body trajectories are mainly conveyed by hands and face across consecutive frames in sign language. However, current methods in continuous sign language recognition(CSLR) usually process frames independently to capture frame-wise features, thus failing to capture cross-frame trajectories to effectively identify a sign. To handle this limitation, we propose correlation network (CorrNet) to explicitly leverage body trajectories across frames to identify signs. In specific, an identification module is first presented to emphasize informative regions in each frame that are beneficial in expressing a sign. A correlation module is then proposed to dynamically compute correlation maps between current frame and adjacent neighbors to capture cross-frame trajectories. As a result, the generated features are able to gain an overview of local temporal movements to identify a sign. Thanks to its special attention on body trajectories, CorrNet achieves new state-of-the-art accuracy on four large-scale datasets, PHOENIX14, PHOENIX14-T, CSL-Daily, and CSL. A comprehensive comparison between CorrNet and previous spatial-temporal reasoning methods verifies its effectiveness. Visualizations are given to demonstrate the effects of CorrNet on emphasizing human body trajectories across adjacent frames.

An Actor-Centric Causality Graph for Asynchronous Temporal Inference in Group Activity
Xie, Zhao and Gao, Tian and Wu, Kewei and Chang, Jiao



Research question: Modeling causality relations remains a challenge in group activity recognition.
Motivation: Most existing graph models focus on learning actor relations with synchronous temporal features, which is insufficient for causality relations that carry asynchronous temporal features.
Method: This paper proposes an Actor-Centric Causality Graph Model that learns asynchronous temporal causality relations with three modules: an asynchronous temporal causality relation detection module, a causality feature fusion module, and a causality relation graph inference module.
Results: Extensive experiments show that the method achieves state-of-the-art performance on the Volleyball and Collective Activity datasets.

The causality relation modeling remains a challenging task for group activity recognition. The causality relations describe the influence of some actors (cause actors) on other actors (effect actors). Most existing graph models focus on learning the actor relation with synchronous temporal features, which is insufficient to deal with the causality relation with asynchronous temporal features. In this paper, we propose an Actor-Centric Causality Graph Model, which learns the asynchronous temporal causality relation with three modules, i.e., an asynchronous temporal causality relation detection module, a causality feature fusion module, and a causality relation graph inference module. First, given a centric actor and correlative actor, we analyze their influences to detect causality relation. We estimate the self influence of the centric actor with self regression. We estimate the correlative influence from the correlative actor to the centric actor with correlative regression, which uses asynchronous features at different timestamps. Second, we synchronize the two action features by estimating the temporal delay between the cause action and the effect action. The synchronized features are used to enhance the feature of the effect action with a channel-wise fusion. Third, we describe the nodes (actors) with causality features and learn the edges by fusing the causality relation with the appearance relation and distance relation. The causality relation graph inference provides crucial features of effect action, which are complementary to the base model using synchronous relation inference. The two relation inferences are aggregated to enhance group relation learning. Extensive experiments show that our method achieves state-of-the-art performance on the Volleyball dataset and Collective Activity dataset.
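Synchronizing the cause and effect actions requires estimating the temporal delay between them. The paper does this with correlative regression; as a toy stand-in, the sketch below simply picks the shift that maximizes the correlation between two scalar feature sequences (an assumed simplification, not the paper's estimator).

```python
def estimate_delay(cause, effect, max_delay):
    """Return the shift d in [0, max_delay] maximizing correlation
    between cause[t] and effect[t + d] over their overlap."""
    def corr(d):
        return sum(cause[t] * effect[t + d]
                   for t in range(max(0, len(cause) - d)))
    return max(range(max_delay + 1), key=corr)
```

Once the delay is known, the cause features can be shifted to line up with the effect features before channel-wise fusion.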

How You Feelin'? Learning Emotions and Mental States in Movie Scenes
Srivastava, Dhruv and Singh, Aditya Kumar and Tapaswi, Makarand



Research question: How to analyze movie stories by understanding characters' emotions and mental states.
Motivation: Existing methods cannot comprehensively predict the diverse, multi-label set of emotions and mental states of each character in a movie scene.
Method: We propose EmoTx, a multimodal Transformer-based architecture that ingests videos, multiple characters, and dialog utterances to make joint predictions.
Results: Experiments show EmoTx is effective at predicting classic emotions and other mental states, outperforming adapted state-of-the-art emotion recognition approaches.

Movie story analysis requires understanding characters' emotions and mental states. Towards this goal, we formulate emotion understanding as predicting a diverse and multi-label set of emotions at the level of a movie scene and for each character. We propose EmoTx, a multimodal Transformer-based architecture that ingests videos, multiple characters, and dialog utterances to make joint predictions. By leveraging annotations from the MovieGraphs dataset, we aim to predict classic emotions (e.g. happy, angry) and other mental states (e.g. honest, helpful). We conduct experiments on the most frequently occurring 10 and 25 labels, and a mapping that clusters 181 labels to 26. Ablation studies and comparison against adapted state-of-the-art emotion recognition approaches shows the effectiveness of EmoTx. Analyzing EmoTx's self-attention scores reveals that expressive emotions often look at character tokens while other mental states rely on video and dialog cues.

Diverse Embedding Expansion Network and Low-Light Cross-Modality Benchmark for Visible-Infrared Person Re-Identification
Zhang, Yukang and Wang, Hanzi



Research question: In visible-infrared person re-identification (VIReID), one of the major challenges is the modality gap between visible (VIS) and infrared (IR) images.
Motivation: Training samples are usually limited while the modality gap is large, so existing methods cannot effectively mine diverse cross-modality cues.
Method: We propose a novel augmentation network in the embedding space, the Diverse Embedding Expansion Network (DEEN), which effectively generates diverse embeddings to learn informative feature representations and reduce the modality discrepancy between VIS and IR images.
Results: Extensive experiments on the SYSU-MM01, RegDB, and LLCM datasets show that DEEN outperforms several other state-of-the-art methods.

For the visible-infrared person re-identification (VIReID) task, one of the major challenges is the modality gaps between visible (VIS) and infrared (IR) images. However, the training samples are usually limited, while the modality gaps are too large, so the existing methods cannot effectively mine diverse cross-modality clues. To handle this limitation, we propose a novel augmentation network in the embedding space, called diverse embedding expansion network (DEEN). The proposed DEEN can effectively generate diverse embeddings to learn the informative feature representations and reduce the modality discrepancy between the VIS and IR images. Moreover, the VIReID model may be seriously affected by drastic illumination changes, while all the existing VIReID datasets are captured under sufficient illumination without significant light changes. Thus, we provide a low-light cross-modality (LLCM) dataset, which contains 46,767 bounding boxes of 1,064 identities captured by 9 RGB/IR cameras. Extensive experiments on the SYSU-MM01, RegDB and LLCM datasets show the superiority of the proposed DEEN over several other state-of-the-art methods. The code and dataset are released at: https://github.com/ZYK100/LLCM

Weakly Supervised Video Representation Learning With Unaligned Text for Sequential Videos
Dong, Sixun and Hu, Huazhang and Lian, Dongze and Luo, Weixin and Qian, Yicheng and Gao, Shenghua



Research question: This paper addresses weakly supervised sequential video understanding, i.e., video understanding without accurate timestamp-level text-video alignment.
Motivation: Sequential video understanding, as an emerging task, has attracted broad attention from researchers because of its goal-oriented nature.
Method: Borrowing ideas from CLIP, the paper uses a transformer to aggregate frame-level features as the video representation and a pre-trained text encoder to encode the texts of each action and of the whole video, respectively. To model the correspondence between text and video, it proposes a multiple-granularity loss: a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. As frame-sentence correspondence is unavailable, the paper exploits the fact that video actions happen sequentially in the temporal domain to generate pseudo frame-sentence correspondences and supervises network training with these pseudo labels.
Results: Extensive experiments on video sequence verification and text-to-video matching show the method outperforms baselines by a large margin, validating the effectiveness of the proposed approach. Code is available at https://github.com/svip-lab/WeakSVR.

Sequential video understanding, as an emerging video understanding task, has driven lots of researchers' attention because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding where the accurate time-stamp level text-video alignment is not provided. We solve this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and the whole video, respectively. To model the correspondence between text and video, we propose a multiple granularity loss, where the video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces the matching between each action and its description. As the frame-sentence correspondence is not available, we propose to use the fact that video actions happen sequentially in the temporal domain to generate pseudo frame-sentence correspondence and supervise the network training with the pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, which validates the effectiveness of our proposed approach. Code is available at https://github.com/svip-lab/WeakSVR.
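Both granularities of the contrastive objective reduce to an InfoNCE-style term per anchor. A minimal sketch for one anchor: given its similarities to all candidates and the index of the true match (video-paragraph or frame-sentence pair), the loss is the negative log-softmax of the positive. The temperature value is illustrative, not taken from the paper.

```python
import math

def info_nce(sim_row, pos_index, temperature=0.07):
    """-log softmax(sim/temperature)[pos_index], computed stably.

    sim_row: similarities of one anchor to all candidates in the batch.
    pos_index: which candidate is the matching one.
    """
    logits = [s / temperature for s in sim_row]
    m = max(logits)                                   # for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[pos_index]
```

Averaging this term over all video-paragraph anchors gives the coarse loss; doing the same over pseudo-matched frame-sentence pairs gives the fine-grained loss.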

Chat2Map: Efficient Scene Mapping From Multi-Ego Conversations
Majumder, Sagnik and Jiang, Hao and Moulon, Pierre and Henderson, Ethan and Calamia, Paul and Grauman, Kristen and Ithapu, Vamsi Krishna



Research question: Can conversational videos captured from multiple egocentric viewpoints reveal the map of a scene in a cost-efficient way?
Motivation: As multiple people move through a scene and talk among themselves, the rich audio-visual cues they receive can help uncover unseen areas of the scene.
Method: We propose a new problem: efficiently building the map of a previously unseen 3D environment by exploiting shared information in the egocentric audio-visual observations of participants in a natural conversation.
Results: Our model outperforms previous state-of-the-art mapping methods and achieves an excellent cost-accuracy tradeoff.

Can conversational videos captured from multiple egocentric viewpoints reveal the map of a scene in a cost-efficient way? We seek to answer this question by proposing a new problem: efficiently building the map of a previously unseen 3D environment by exploiting shared information in the egocentric audio-visual observations of participants in a natural conversation. Our hypothesis is that as multiple people ("egos") move in a scene and talk among themselves, they receive rich audio-visual cues that can help uncover the unseen areas of the scene. Given the high cost of continuously processing egocentric visual streams, we further explore how to actively coordinate the sampling of visual information, so as to minimize redundancy and reduce power use. To that end, we present an audio-visual deep reinforcement learning approach that works with our shared scene mapper to selectively turn on the camera to efficiently chart out the space. We evaluate the approach using a state-of-the-art audio-visual simulator for 3D scenes as well as real-world video. Our model outperforms previous state-of-the-art mapping methods, and achieves an excellent cost-accuracy tradeoff. Project: https://vision.cs.utexas.edu/projects/chat2map.

Executing Your Commands via Motion Diffusion in Latent Space
Chen, Xin and Jiang, Biao and Liu, Wen and Huang, Zilong and Fu, Bin and Chen, Tao and Yu, Gang



Research question: How to generate plausible human motion sequences according to various conditional inputs, such as action classes or textual descriptors.
Motivation: Human motions are highly diverse and differ markedly in distribution from conditional modalities such as natural-language textual descriptors, making it hard to learn a probabilistic mapping from the desired conditional modality to human motion sequences. In addition, raw motion-capture data can be redundant in sequences and contain noise; directly modeling the joint distribution over raw motion sequences and conditional modalities requires heavy computation and can introduce artifacts from the captured noise.
Method: A powerful Variational AutoEncoder (VAE) is designed to obtain a representative, low-dimensional latent code for a human motion sequence. The diffusion process is then performed in this motion latent space, rather than using a diffusion model to connect raw motion sequences and conditional inputs directly.
Results: The proposed Motion Latent-based Diffusion model (MLD) produces vivid motion sequences conforming to the given conditional inputs and substantially reduces computational overhead in both training and inference. Extensive experiments on various human motion generation tasks show that MLD significantly outperforms state-of-the-art methods and is two orders of magnitude faster than previous diffusion models operating on raw motion sequences.

We study a challenging task, conditional human motion generation, which produces plausible human motion sequences according to various conditional inputs, such as action classes or textual descriptors. Since human motions are highly diverse and have a property of quite different distribution from conditional modalities, such as textual descriptors in natural languages, it is hard to learn a probabilistic mapping from the desired conditional modality to the human motion sequences. Besides, the raw motion data from the motion capture system might be redundant in sequences and contain noises; directly modeling the joint distribution over the raw motion sequences and conditional modalities would need a heavy computational overhead and might result in artifacts introduced by the captured noises. To learn a better representation of the various human motion sequences, we first design a powerful Variational AutoEncoder (VAE) and arrive at a representative and low-dimensional latent code for a human motion sequence. Then, instead of using a diffusion model to establish the connections between the raw motion sequences and the conditional inputs, we perform a diffusion process on the motion latent space. Our proposed Motion Latent-based Diffusion model (MLD) could produce vivid motion sequences conforming to the given conditional inputs and substantially reduce the computational overhead in both the training and inference stages. Extensive experiments on various human motion generation tasks demonstrate that our MLD achieves significant improvements over the state-of-the-art methods among extensive human motion generation tasks, with two orders of magnitude faster than previous diffusion models on raw motion sequences.
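Performing diffusion in the latent space means noising the VAE code rather than raw poses. A one-shot forward-diffusion step under the usual DDPM parameterization is sketched below; `alpha_bar` is the cumulative noise-schedule product at the chosen timestep, and this generic form is an assumption, not MLD's exact schedule.

```python
import math
import random

def diffuse(z, alpha_bar, rng=random):
    """Noise a latent vector z to level alpha_bar in (0, 1]:
    z_t = sqrt(alpha_bar) * z + sqrt(1 - alpha_bar) * eps, eps ~ N(0, I).

    alpha_bar = 1 leaves z untouched; alpha_bar -> 0 yields pure noise.
    """
    return [math.sqrt(alpha_bar) * v +
            math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
            for v in z]
```

Because z is low-dimensional compared with a raw pose sequence, each denoising step is correspondingly cheaper, which is the source of MLD's speedup.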

Adaptive Human Matting for Dynamic Videos
Lin, Chung-Ching and Wang, Jiang and Luo, Kun and Lin, Kevin and Li, Linjie and Wang, Lijuan and Liu, Zicheng



Research question: In video matting, human-annotated trimaps are expensive and ill-suited to real-time applications, so eliminating the trimap dependency is an important problem.
Motivation: Although the latest trimap-free methods show promising results, their performance often degrades on highly diverse and unstructured videos.
Method: We propose AdaM, an adaptive matting framework for dynamic videos that simultaneously differentiates foregrounds from backgrounds and captures alpha-matte details of the foreground human subjects. Two interconnected networks achieve this: (1) an encoder-decoder network that produces alpha mattes and intermediate masks, which are used to guide the transformer in adaptively decoding foregrounds and backgrounds, and (2) a transformer network in which long- and short-term attention combine to retain spatial and temporal contexts, facilitating the decoding of foreground details.
Results: Benchmarked on recently introduced datasets, our model notably improves matting realism and temporal coherence on complex real-world videos and achieves new best-in-class generalizability.

The most recent efforts in video matting have focused on eliminating trimap dependency since trimap annotations are expensive and trimap-based methods are less adaptable for real-time applications. Despite the latest trimap-free methods showing promising results, their performance often degrades when dealing with highly diverse and unstructured videos. We address this limitation by introducing Adaptive Matting for Dynamic Videos, termed AdaM, which is a framework designed for simultaneously differentiating foregrounds from backgrounds and capturing alpha matte details of human subjects in the foreground. Two interconnected network designs are employed to achieve this goal: (1) an encoder-decoder network that produces alpha mattes and intermediate masks which are used to guide the transformer in adaptively decoding foregrounds and backgrounds, and (2) a transformer network in which long- and short-term attention combine to retain spatial and temporal contexts, facilitating the decoding of foreground details. We benchmark and study our methods on recently introduced datasets, showing that our model notably improves matting realism and temporal coherence in complex real-world videos and achieves new best-in-class generalizability. Further details and examples are available at https://github.com/microsoft/AdaM.

UDE: A Unified Driving Engine for Human Motion Generation
Zhou, Zixiang and Wang, Baoyuan



Research question: How to generate controllable and editable 3D human motion sequences.
Motivation: Although learning-based approaches have been developed and applied, generating and animating human motion remains labor-intensive, and these approaches are typically task-specific or modality-specific.
Method: The paper proposes UDE, the first unified driving engine that generates human motion sequences from natural language or audio sequences. It consists of a VQ-VAE-based motion quantization module that represents continuous motion sequences as discrete latent codes, a modality-agnostic transformer encoder that learns to map modality-aware driving signals into a joint space, a unified token transformer (GPT-like) network that predicts quantized latent code indices auto-regressively, and a diffusion motion decoder that decodes the motion tokens into diverse motion sequences.
Results: Evaluated on the HumanML3D and AIST++ benchmarks, the experiments demonstrate that the method achieves state-of-the-art performance.

Generating controllable and editable human motion sequences is a key challenge in 3D Avatar generation. It has been labor-intensive to generate and animate human motion for a long time until learning-based approaches have been developed and applied recently. However, these approaches are still task-specific or modality-specific. In this paper, we propose "UDE", the first unified driving engine that enables generating human motion sequences from natural language or audio sequences (see Fig. 1). Specifically, UDE consists of the following key components: 1) a motion quantization module based on VQVAE that represents continuous motion sequence as discrete latent code, 2) a modality-agnostic transformer encoder that learns to map modality-aware driving signals to a joint space, and 3) a unified token transformer (GPT-like) network to predict the quantized latent code index in an auto-regressive manner. 4) a diffusion motion decoder that takes as input the motion tokens and decodes them into motion sequences with high diversity. We evaluate our method on HumanML3D and AIST++ benchmarks, and the experiment results demonstrate our method achieves state-of-the-art performance.
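The VQ-VAE bottleneck in component 1 maps each continuous motion feature to the index of its nearest codebook entry; the token transformer then operates on these indices. A minimal nearest-neighbor quantizer sketch (squared Euclidean distance, the standard VQ-VAE choice):

```python
def quantize(vec, codebook):
    """Return the index of the codebook entry closest to vec.

    vec: continuous motion feature (list of floats).
    codebook: list of learned code vectors of the same dimension.
    """
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: d2(vec, codebook[i]))
```

Applying this per frame (or per motion segment) turns a continuous sequence into the discrete index sequence that the GPT-like network predicts auto-regressively.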

PivoTAL: Prior-Driven Supervision for Weakly-Supervised Temporal Action Localization
Rizve, Mamshad Nayeem and Mittal, Gaurav and Yu, Ye and Hall, Matthew and Sajeev, Sandra and Shah, Mubarak and Chen, Mei



Research question: How to localize actions in untrimmed videos using only weak, video-level supervision.
Motivation: Existing methods approach the task from a localization-by-classification perspective; lacking an explicit understanding of action boundaries, they tend to produce incomplete action localization.
Method: The paper presents PivoTAL, which approaches WTAL from a localization-by-localization perspective by learning to localize action snippets directly, leveraging the underlying spatio-temporal regularities in videos, in the form of action-specific scene priors, action snippet generation priors, and a learnable Gaussian prior, to supervise localization-based training.
Results: PivoTAL improves over all existing methods by at least 3% average mAP on the THUMOS-14 and ActivityNet-v1.3 benchmarks.

Weakly-supervised Temporal Action Localization (WTAL) attempts to localize the actions in untrimmed videos using only video-level supervision. Most recent works approach WTAL from a localization-by-classification perspective where these methods try to classify each video frame followed by a manually-designed post-processing pipeline to aggregate these per-frame action predictions into action snippets. Due to this perspective, the model lacks any explicit understanding of action boundaries and tends to focus only on the most discriminative parts of the video resulting in incomplete action localization. To address this, we present PivoTAL, Prior-driven Supervision for Weakly-supervised Temporal Action Localization, to approach WTAL from a localization-by-localization perspective by learning to localize the action snippets directly. To this end, PivoTAL leverages the underlying spatio-temporal regularities in videos in the form of action-specific scene prior, action snippet generation prior, and learnable Gaussian prior to supervise the localization-based training. PivoTAL shows significant improvement (of at least 3% avg mAP) over all existing methods on the benchmark datasets, THUMOS-14 and ActivityNet-v1.3.

The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction
Stergiou, Alexandros and Damen, Dima



Research question: This paper addresses predicting an ongoing action from partially observed video.
Motivation: Early action prediction must infer the ongoing action from the outset of a video, which matters in many applications.
Method: A bottleneck-based attention model is proposed that captures the evolution of the action through progressive sampling over fine-to-coarse scales. The Temporal Progressive (TemPr) model is composed of multiple attention towers, one per scale; the predicted action label is determined by the collective agreement of these towers, taking their confidences into account.
Results: Extensive experiments on four video datasets demonstrate state-of-the-art performance on early action prediction across a range of encoder architectures, and detailed ablations show the effectiveness and consistency of TemPr.

Early action prediction deals with inferring the ongoing action from partially-observed videos, typically at the outset of the video. We propose a bottleneck-based attention model that captures the evolution of the action, through progressive sampling over fine-to-coarse scales. Our proposed Temporal Progressive (TemPr) model is composed of multiple attention towers, one for each scale. The predicted action label is based on the collective agreement considering confidences of these towers. Extensive experiments over four video datasets showcase state-of-the-art performance on the task of Early Action Prediction across a range of encoder architectures. We demonstrate the effectiveness and consistency of TemPr through detailed ablations.
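How the towers' "collective agreement considering confidences" is aggregated is not fully specified in the abstract. One plausible reading, sketched below under that assumption, is confidence-weighted averaging of the per-tower class distributions, with each tower's maximum probability serving as its confidence.

```python
def collective_prediction(tower_probs):
    """Fuse per-scale class distributions into one predicted class index.

    tower_probs: list of probability vectors, one per attention tower.
    Confidence (here, the max probability) weights each tower's vote.
    """
    n_cls = len(tower_probs[0])
    weights = [max(p) for p in tower_probs]
    agg = [sum(w * p[c] for w, p in zip(weights, tower_probs))
           for c in range(n_cls)]
    return max(range(n_cls), key=lambda c: agg[c])
```

Under this scheme a highly confident tower can outvote several uncertain ones, which matches the intent of weighting agreement by confidence.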

StepFormer: Self-Supervised Step Discovery and Localization in Instructional Videos
Dvornik, Nikita and Hadji, Isma and Zhang, Ran and Derpanis, Konstantinos G. and Wildes, Richard P. and Jepson, Allan D.



Research question: How to learn procedural tasks effectively from videos of human demonstrations, in particular the instruction steps they contain.
Motivation: Traditional methods for localizing instruction steps in videos require human annotations and thus do not scale to large datasets.
Method: StepFormer, a self-supervised model that automatically discovers and localizes instruction steps in a video. StepFormer is an attention-based transformer decoder that attends to the video with learnable queries and produces a sequence of slots capturing the key steps in the video.
Results: On three challenging benchmarks, StepFormer outperforms all previous unsupervised and weakly-supervised methods on step detection and localization. The model also exhibits an emergent ability to solve zero-shot multi-step localization, surpassing all relevant baselines on that task.

Instructional videos are an important resource to learn procedural tasks from human demonstrations. However, the instruction steps in such videos are typically short and sparse, with most of the video being irrelevant to the procedure. This motivates the need to temporally localize the instruction steps in such videos, i.e. the task called key-step localization. Traditional methods for key-step localization require video-level human annotations and thus do not scale to large datasets. In this work, we tackle the problem with no human supervision and introduce StepFormer, a self-supervised model that discovers and localizes instruction steps in a video. StepFormer is a transformer decoder that attends to the video with learnable queries, and produces a sequence of slots capturing the key-steps in the video. We train our system on a large dataset of instructional videos, using their automatically-generated subtitles as the only source of supervision. In particular, we supervise our system with a sequence of text narrations using an order-aware loss function that filters out irrelevant phrases. We show that our model outperforms all previous unsupervised and weakly-supervised approaches on step detection and localization by a large margin on three challenging benchmarks. Moreover, our model demonstrates an emergent property to solve zero-shot multi-step localization and outperforms all relevant baselines at this task.

Learning Procedure-Aware Video Representation From Instructional Videos and Their Narrations
Zhong, Yiwu and Yu, Licheng and Bai, Yang and Li, Shangwen and Yan, Xueting and Li, Yin



Research question: How to learn video representations that encode action steps and their temporal ordering from large-scale web instructional videos and their narrations, without human annotations.
Motivation: The abundance of instructional videos and narrations on the Internet offers an exciting avenue for understanding procedural activities.
Method: Based on a large-scale dataset of web instructional videos and narrations, jointly learn a video representation that encodes individual step concepts and a deep probabilistic model that captures both temporal dependencies and the immense individual variation in step ordering.
Results: Experiments show that learning temporal ordering not only enables new capabilities for procedure reasoning but also strengthens the recognition of individual steps. The method significantly advances the state-of-the-art results on step classification (+2.8% / +3.3% on COIN / EPIC-Kitchens) and step forecasting (+7.4% on COIN), and also attains promising results in zero-shot inference and in predicting diverse, plausible steps for incomplete procedures.

The abundance of instructional videos and their narrations over the Internet offers an exciting avenue for understanding procedural activities. In this work, we propose to learn video representation that encodes both action steps and their temporal ordering, based on a large-scale dataset of web instructional videos and their narrations, without using human annotations. Our method jointly learns a video representation to encode individual step concepts, and a deep probabilistic model to capture both temporal dependencies and immense individual variations in the step ordering. We empirically demonstrate that learning temporal ordering not only enables new capabilities for procedure reasoning, but also reinforces the recognition of individual steps. Our model significantly advances the state-of-the-art results on step classification (+2.8% / +3.3% on COIN / EPIC-Kitchens) and step forecasting (+7.4% on COIN). Moreover, our model attains promising results in zero-shot inference for step classification and forecasting, as well as in predicting diverse and plausible steps for incomplete procedures. Our code is available at https://github.com/facebookresearch/ProcedureVRL.

NeuralPCI: Spatio-Temporal Neural Field for 3D Point Cloud Multi-Frame Non-Linear Interpolation
Zheng, Zehan and Wu, Danni and Lu, Ruisi and Lu, Fan and Chen, Guang and Jiang, Changjun



Research question: This paper addresses point cloud interpolation in computer vision, in particular for scenarios with nonlinear large motions.
Motivation: Despite remarkable progress in video interpolation, point cloud interpolation remains little explored, and the abundance of nonlinear large motions in real-world scenes makes the task more challenging.
Method: NeuralPCI, an end-to-end 4D spatio-temporal neural field for 3D point cloud interpolation that implicitly integrates multi-frame information to handle nonlinear large motions in both indoor and outdoor scenes.
Results: NeuralPCI achieves state-of-the-art performance on both the DHB (Dynamic Human Bodies) and NL-Drive datasets, and extends naturally to point cloud extrapolation, morphing, and auto-labeling.

In recent years, there has been a significant increase in focus on the interpolation task of computer vision. Despite the tremendous advancement of video interpolation, point cloud interpolation remains insufficiently explored. Meanwhile, the existence of numerous nonlinear large motions in real-world scenarios makes the point cloud interpolation task more challenging. In light of these issues, we present NeuralPCI: an end-to-end 4D spatio-temporal Neural field for 3D Point Cloud Interpolation, which implicitly integrates multi-frame information to handle nonlinear large motions for both indoor and outdoor scenarios. Furthermore, we construct a new multi-frame point cloud interpolation dataset called NL-Drive for large nonlinear motions in autonomous driving scenes to better demonstrate the superiority of our method. Ultimately, NeuralPCI achieves state-of-the-art performance on both DHB (Dynamic Human Bodies) and NL-Drive datasets. Beyond the interpolation task, our method can be naturally extended to point cloud extrapolation, morphing, and auto-labeling, which indicates substantial potential in other domains. Codes are available at https://github.com/ispc-lab/NeuralPCI.

A Generalized Framework for Video Instance Segmentation
Heo, Miran and Hwang, Sukjun and Hyun, Jeongseok and Kim, Hanjung and Oh, Seoung Wug and Lee, Joon-Young and Kim, Seon Joo



Research question: In video instance segmentation (VIS), handling long videos with complex, occluded sequences has emerged as a new challenge.
Motivation: Existing methods have limitations in addressing this challenge; the main bottleneck is the discrepancy between training and inference.
Method: GenVIS, a generalized VIS framework with a sequential learning strategy built on a query-based training pipeline and a novel target label assignment, plus a memory that effectively acquires information from previous states.
Results: The method achieves state-of-the-art results on popular VIS benchmarks, including YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS); notably, on the long-video benchmark (OVIS) it improves by 5.6 AP with a ResNet-50 backbone.

The handling of long videos with complex and occluded sequences has recently emerged as a new challenge in the video instance segmentation (VIS) community. However, existing methods have limitations in addressing this challenge. We argue that the biggest bottleneck in current approaches is the discrepancy between training and inference. To effectively bridge this gap, we propose a Generalized framework for VIS, namely GenVIS, that achieves state-of-the-art performance on challenging benchmarks without designing complicated architectures or requiring extra post-processing. The key contribution of GenVIS is the learning strategy, which includes a query-based training pipeline for sequential learning with a novel target label assignment. Additionally, we introduce a memory that effectively acquires information from previous states. Thanks to the new perspective, which focuses on building relationships between separate frames or clips, GenVIS can be flexibly executed in both online and semi-online manner. We evaluate our approach on popular VIS benchmarks, achieving state-of-the-art results on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS). Notably, we greatly outperform the state-of-the-art on the long VIS benchmark (OVIS), improving 5.6 AP with ResNet-50 backbone. Code is available at https://github.com/miranheo/GenVIS.

Temporal Attention Unit: Towards Efficient Spatiotemporal Predictive Learning
Tan, Cheng and Gao, Zhangyang and Wu, Lirong and Xu, Yongjie and Xia, Jun and Li, Siyuan and Li, Stan Z.



Research question: This paper investigates existing spatiotemporal predictive learning methods and proposes a general framework for the task.
Motivation: Mainstream methods use recurrent units to capture long-term temporal dependencies, but suffer from low computational efficiency due to their unparallelizable architectures.
Method: The Temporal Attention Unit (TAU) decomposes temporal attention into intra-frame static attention and inter-frame dynamic attention so that the temporal module can be parallelized; a novel differential divergence regularization additionally accounts for inter-frame variations.
Results: Experiments show that the method achieves competitive performance on various spatiotemporal prediction benchmarks.

Spatiotemporal predictive learning aims to generate future frames by learning from historical frames. In this paper, we investigate existing methods and present a general framework of spatiotemporal predictive learning, in which the spatial encoder and decoder capture intra-frame features and the middle temporal module catches inter-frame correlations. While the mainstream methods employ recurrent units to capture long-term temporal dependencies, they suffer from low computational efficiency due to their unparallelizable architectures. To parallelize the temporal module, we propose the Temporal Attention Unit (TAU), which decomposes temporal attention into intra-frame statical attention and inter-frame dynamical attention. Moreover, while the mean squared error loss focuses on intra-frame errors, we introduce a novel differential divergence regularization to take inter-frame variations into account. Extensive experiments demonstrate that the proposed method enables the derived model to achieve competitive performance on various spatiotemporal prediction benchmarks.
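The inter-frame regularizer described above can be sketched concretely. This is a hedged reconstruction, not the paper's exact loss: the array shapes, the softmax normalization, and the name `differential_divergence` are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def differential_divergence(pred, target, eps=1e-8):
    """Hypothetical sketch of a differential divergence regularizer.

    pred, target: arrays of shape (T, D) -- T frames, D flattened pixels.
    Compares the *inter-frame differences* of prediction and ground truth
    with a KL divergence, so that -- unlike the mean squared error, which
    only measures intra-frame error -- the loss focuses on frame-to-frame
    variation.
    """
    dp = softmax(pred[1:] - pred[:-1])      # predicted frame differences
    dt = softmax(target[1:] - target[:-1])  # true frame differences
    return float(np.mean(np.sum(dt * np.log((dt + eps) / (dp + eps)), axis=-1)))
```

The divergence is zero when the predicted frame-to-frame changes match the ground truth exactly, and grows as the predicted dynamics drift, regardless of any shared static background.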

Listening Human Behavior: 3D Human Pose Estimation With Acoustic Signals
Shibata, Yuto and Kawashima, Yutaka and Isogawa, Mariko and Irie, Go and Kimura, Akisato and Aoki, Yoshimitsu



Research question: How much can we infer about human behavior from acoustic signals alone?
Motivation: Existing methods raise privacy concerns because they use signals containing human speech or the sounds of specific actions. We explore estimating 3D human pose via active acoustic sensing with a single pair of microphones and loudspeakers, where low-level acoustic signals provide sufficient cues.
Method: We introduce a framework that encodes multichannel audio features into 3D human poses. To capture subtle sound changes that reveal detailed pose information, phase features are explicitly extracted from the acoustic signals alongside typical spectrum features and fed into our pose estimation network.
Results: Experiments show that, using only low-dimensional acoustic information, our method outperforms baseline methods. The dataset and code used in this project will be publicly released.

Given only acoustic signals without any high-level information, such as voices or sounds of scenes/actions, how much can we infer about the behavior of humans? Unlike existing methods, which suffer from privacy issues because they use signals that include human speech or the sounds of specific actions, we explore how low-level acoustic signals can provide enough clues to estimate 3D human poses by active acoustic sensing with a single pair of microphones and loudspeakers (see Fig. 1). This is a challenging task since sound is much more diffractive than other signals and therefore covers up the shape of objects in a scene. Accordingly, we introduce a framework that encodes multichannel audio features into 3D human poses. Aiming to capture subtle sound changes to reveal detailed pose information, we explicitly extract phase features from the acoustic signals together with typical spectrum features and feed them into our human pose estimation network. Also, we show that reflected or diffracted sounds are easily influenced by subjects' physique differences e.g., height and muscularity, which deteriorates prediction accuracy. We reduce these gaps by using a subject discriminator to improve accuracy. Our experiments suggest that with the use of only low-dimensional acoustic information, our method outperforms baseline methods. The datasets and codes used in this project will be publicly available.
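The featurization step — keeping an explicit phase feature alongside the usual magnitude spectrum — can be sketched with a plain FFT. This is an illustrative assumption, not the paper's actual pipeline; the frame length `n_fft` and the function name are made up.

```python
import numpy as np

def phase_and_spectrum(signal, n_fft=256):
    """Hedged sketch: compute both features from one audio frame.

    Returns (magnitude, phase) per frequency bin. The magnitude is the
    typical spectrum feature; the phase is kept explicitly because subtle
    phase shifts of the reflected sound can carry fine-grained pose cues.
    """
    spec = np.fft.rfft(signal[:n_fft])      # complex spectrum of one frame
    return np.abs(spec), np.angle(spec)     # magnitude and phase features
```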

SViTT: Temporal Learning of Sparse Video-Text Transformers
Li, Yi and Min, Kyle and Tripathi, Subarna and Vasconcelos, Nuno



Research question: Do video-text transformers learn temporal relationships across frames?
Motivation: Despite their immense capacity and abundant multimodal training data, recent work shows that video-text models tend toward frame-based spatial representations, while temporal reasoning remains largely unsolved.
Method: We propose SViTT, a sparse video-text architecture that performs multi-frame reasoning at significantly lower computational cost than naive transformers, by limiting query-key communication between tokens in self-attention and discarding uninformative visual tokens.
Results: On multiple video-text retrieval and question-answering benchmarks, SViTT outperforms dense transformer baselines at a fraction of the computational cost.

Do video-text transformers learn to model temporal relationships across frames? Despite their immense capacity and the abundance of multimodal training data, recent work has revealed the strong tendency of video-text models towards frame-based spatial representations, while temporal reasoning remains largely unsolved. In this work, we identify several key challenges in temporal learning of video-text transformers: the spatiotemporal trade-off from limited network size; the curse of dimensionality for multi-frame modeling; and the diminishing returns of semantic information by extending clip length. Guided by these findings, we propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention. Analogous to graph-based networks, SViTT employs two forms of sparsity: edge sparsity that limits the query-key communications between tokens in self-attention, and node sparsity that discards uninformative visual tokens. Trained with a curriculum which increases model sparsity with the clip length, SViTT outperforms dense transformer baselines on multiple video-text retrieval and question answering benchmarks, with a fraction of computational cost. Project page: http://svcl.ucsd.edu/projects/svitt.
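Node sparsity — discarding uninformative visual tokens — might look roughly like the following sketch, where tokens are ranked by the attention a [CLS]-style summary token pays them. The scoring signal and keep ratio are assumptions for illustration; the paper's exact pruning criterion may differ.

```python
import numpy as np

def prune_tokens(tokens, cls_attention, keep_ratio=0.5):
    """Hedged sketch of node sparsity in a sparse video-text transformer.

    tokens: (N, D) visual tokens; cls_attention: (N,) attention weights a
    [CLS] token assigns to each visual token. Tokens receiving little
    attention are treated as uninformative and dropped, shrinking the
    self-attention cost for all later layers.
    """
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(cls_attention)[-k:]   # indices of top-k scored tokens
    return tokens[np.sort(keep)]            # keep original (temporal) order
```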

Large-Capacity and Flexible Video Steganography via Invertible Neural Network
Mou, Chong and Xu, Youmin and Song, Jiechong and Zhao, Chen and Ghanem, Bernard and Zhang, Jian



Research question: This paper addresses the low capacity and fixed schemes of existing video steganography.
Motivation: Most current video steganography methods have limited capacity and fixed schemes, and cannot meet diverse requirements.
Method: We propose a Large-capacity and Flexible Video Steganography Network (LF-VSN). A single invertible neural network hides and recovers multiple videos, increasing capacity; a key-controllable scheme and a scalable multi-video hiding strategy add flexibility.
Results: Experiments demonstrate that LF-VSN significantly improves video steganography performance, with high security, large hiding capacity, and flexibility.

Video steganography is the art of unobtrusively concealing secret data in a cover video and then recovering the secret data through a decoding protocol at the receiver end. Although several attempts have been made, most of them are limited to low-capacity and fixed steganography. To rectify these weaknesses, we propose a Large-capacity and Flexible Video Steganography Network (LF-VSN) in this paper. For large-capacity, we present a reversible pipeline to perform multiple videos hiding and recovering through a single invertible neural network (INN). Our method can hide/recover 7 secret videos in/from 1 cover video with promising performance. For flexibility, we propose a key-controllable scheme, enabling different receivers to recover particular secret videos from the same cover video through specific keys. Moreover, we further improve the flexibility by proposing a scalable strategy in multiple videos hiding, which can hide variable numbers of secret videos in a cover video with a single model and a single training session. Extensive experiments demonstrate that with the significant improvement of the video steganography performance, our proposed LF-VSN has high security, large hiding capacity, and flexibility. The source code is available at https://github.com/MC-E/LF-VSN.

EVAL: Explainable Video Anomaly Localization
Singh, Ashish and Jones, Michael J. and Learned-Miller, Erik G.



Research question: Develop a novel single-scene video anomaly localization framework whose decisions come with human-understandable reasons.
Motivation: Existing video anomaly detection methods lack explainability for why something is anomalous; we aim to provide human-understandable reasons via deep models.
Method: First learn general representations of objects and their motions with deep networks, then use these representations to build a high-level, location-dependent model of a particular scene for detecting anomalies in new videos of the same scene.
Results: Experiments on standard video anomaly detection datasets show significant improvements over the previous state of the art. All code and extra datasets will be made publicly available.

We develop a novel framework for single-scene video anomaly localization that allows for human-understandable reasons for the decisions the system makes. We first learn general representations of objects and their motions (using deep networks) and then use these representations to build a high-level, location-dependent model of any particular scene. This model can be used to detect anomalies in new videos of the same scene. Importantly, our approach is explainable -- our high-level appearance and motion features can provide human-understandable reasons for why any part of a video is classified as normal or anomalous. We conduct experiments on standard video anomaly detection datasets (Street Scene, CUHK Avenue, ShanghaiTech and UCSD Ped1, Ped2) and show significant improvements over the previous state-of-the-art. All of our code and extra datasets will be made publicly available.

SeqTrack: Sequence to Sequence Learning for Visual Object Tracking
Chen, Xin and Peng, Houwen and Wang, Dong and Lu, Huchuan and Hu, Han



Research question: This paper presents SeqTrack, a new sequence-to-sequence learning framework for visual tracking.
Motivation: Casting visual tracking as a sequence generation problem, predicting object bounding boxes autoregressively, avoids designing complicated head networks.
Method: SeqTrack adopts only a simple encoder-decoder transformer architecture: the encoder extracts visual features with a bidirectional transformer, while the decoder generates the bounding-box sequence autoregressively with a causal transformer.
Results: This sequence learning paradigm not only simplifies the tracking framework but also achieves competitive performance on benchmarks, e.g., 72.5% AUC on LaSOT, establishing a new state of the art.

In this paper, we present a new sequence-to-sequence learning framework for visual tracking, dubbed SeqTrack. It casts visual tracking as a sequence generation problem, which predicts object bounding boxes in an autoregressive fashion. This is different from prior Siamese trackers and transformer trackers, which rely on designing complicated head networks, such as classification and regression heads. SeqTrack only adopts a simple encoder-decoder transformer architecture. The encoder extracts visual features with a bidirectional transformer, while the decoder generates a sequence of bounding box values autoregressively with a causal transformer. The loss function is a plain cross-entropy. Such a sequence learning paradigm not only simplifies tracking framework, but also achieves competitive performance on benchmarks. For instance, SeqTrack gets 72.5% AUC on LaSOT, establishing a new state-of-the-art performance. Code and models are available at https://github.com/microsoft/VideoX.
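The sequence-generation view rests on turning a continuous box into discrete tokens that a causal decoder can emit one at a time under a plain cross-entropy loss. A minimal sketch of such coordinate quantization; the bin count `n_bins` and helper names are assumptions, not the paper's setting.

```python
def box_to_tokens(box, n_bins=1000):
    """Hedged sketch: quantize a normalized box into a 4-token sequence.

    box: [x1, y1, x2, y2], each coordinate normalized to [0, 1]. Each
    coordinate becomes one integer token, so the tracker can generate the
    box autoregressively like a short sentence.
    """
    return [min(n_bins - 1, int(c * n_bins)) for c in box]

def tokens_to_box(tokens, n_bins=1000):
    # decode by mapping each token back to the center of its bin
    return [(t + 0.5) / n_bins for t in tokens]
```

A round trip through the tokenizer loses at most half a bin width per coordinate, which is the usual accuracy/vocabulary-size trade-off of this paradigm.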

Aligning Step-by-Step Instructional Diagrams to Video Demonstrations
Zhang, Jiahao and Cherian, Anoop and Liu, Yanbin and Ben-Shabat, Yizhak and Rodriguez, Cristian and Gould, Stephen



Research question: How to achieve multimodal alignment, i.e., retrieving instances of one modality from a query in another.
Motivation: This work addresses a novel multimodal alignment problem: aligning instruction steps in assembly diagrams (common in Ikea assembly manuals) with video segments from real-world videos.
Method: A novel supervised contrastive learning method that learns to align videos with the subtle details of assembly diagrams, guided by a set of novel loss functions.
Results: Extensive experiments on the IAW (Ikea assembly in the wild) dataset show that the method outperforms alternatives on two tasks: nearest-neighbor retrieval between video segments and illustrations, and alignment of instruction steps with the segments of each video.

Multimodal alignment facilitates the retrieval of instances from one modality when queried using another. In this paper, we consider a novel setting where such an alignment is between (i) instruction steps that are depicted as assembly diagrams (commonly seen in Ikea assembly manuals) and (ii) video segments from in-the-wild videos; these videos comprising an enactment of the assembly actions in the real world. To learn this alignment, we introduce a novel supervised contrastive learning method that learns to align videos with the subtle details in the assembly diagrams, guided by a set of novel losses. To study this problem and demonstrate the effectiveness of our method, we introduce a novel dataset: IAW---for Ikea assembly in the wild---consisting of 183 hours of videos from diverse furniture assembly collections and nearly 8,300 illustrations from their associated instruction manuals and annotated for their ground truth alignments. We define two tasks on this dataset: First, nearest neighbor retrieval between video segments and illustrations, and, second, alignment of instruction steps and the segments for each video. Extensive experiments on IAW demonstrate superior performances of our approach against alternatives.
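The paper's losses are novel and not reproduced here, but the general shape of contrastive video-diagram alignment can be sketched with a standard symmetric InfoNCE objective: matched pairs sit on the diagonal of a similarity matrix, and each embedding is pulled toward its partner and pushed from the rest. The temperature and batch layout are assumptions.

```python
import numpy as np

def info_nce(video_emb, diagram_emb, temperature=0.07):
    """Hedged sketch of a symmetric contrastive alignment loss.

    video_emb, diagram_emb: (B, D) embeddings where row i of each matrix
    forms a matched pair. Lower loss means matched pairs are more similar
    than mismatched ones, in both retrieval directions.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    d = diagram_emb / np.linalg.norm(diagram_emb, axis=1, keepdims=True)
    logits = v @ d.T / temperature          # (B, B) scaled cosine similarities
    labels = np.arange(len(v))

    def xent(lg):
        # cross-entropy with the diagonal as the correct class
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (xent(logits) + xent(logits.T))
```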

Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception
Gao, Junyu and Chen, Mengyuan and Xu, Changsheng



Research question: This paper addresses weakly-supervised audio-visual event perception (WS-AVEP), i.e., temporally localizing and categorizing audio-visual events given only video-level event labels.
Motivation: Despite recent progress, most existing methods either ignore the unsynchronized nature of audio-visual tracks or overlook explicit enhancement from the complementary modality.
Method: A unified framework that collects Cross-Modal Presence-Absence Evidence (CMPAE). Specifically, leveraging uni-modal and cross-modal representations, a presence-absence evidence collector (PAEC) is designed under Subjective Logic theory. To keep the evidence in a reliable range, a joint-modal mutual learning (JML) process adaptively and dynamically calibrates the evidence of diverse audible, visible, and audible-visible events.
Results: Extensive experiments show that the method surpasses the state of the art (e.g., absolute gains of 3.6% and 6.1% on event-level visual and audio metrics). Code is available at github.com/MengyuanChen21/CVPR2023-CMPAE.

With only video-level event labels, this paper targets at the task of weakly-supervised audio-visual event perception (WS-AVEP), which aims to temporally localize and categorize events belonging to each modality. Despite the recent progress, most existing approaches either ignore the unsynchronized property of audio-visual tracks or discount the complementary modality for explicit enhancement. We argue that, for an event residing in one modality, the modality itself should provide ample presence evidence of this event, while the other complementary modality is encouraged to afford the absence evidence as a reference signal. To this end, we propose to collect Cross-Modal Presence-Absence Evidence (CMPAE) in a unified framework. Specifically, by leveraging uni-modal and cross-modal representations, a presence-absence evidence collector (PAEC) is designed under Subjective Logic theory. To learn the evidence in a reliable range, we propose a joint-modal mutual learning (JML) process, which calibrates the evidence of diverse audible, visible, and audi-visible events adaptively and dynamically. Extensive experiments show that our method surpasses state-of-the-arts (e.g., absolute gains of 3.6% and 6.1% in terms of event-level visual and audio metrics). Code is available in github.com/MengyuanChen21/CVPR2023-CMPAE.

Co-Speech Gesture Synthesis by Reinforcement Learning With Contrastive Pre-Trained Rewards
Sun, Mingyang and Zhao, Mengchen and Hou, Yaqing and Li, Minglei and Xu, Huang and Xu, Songcen and Hao, Jianye



Research question: The demand for automatically synthesizing co-speech gestures for virtual characters is growing, but the complex relationship between input speech and target gestures makes this a challenge.
Motivation: Most existing work focuses on predicting the next gesture that best fits the data; such methods are myopic and lack the ability to plan for future gestures.
Method: This paper proposes RACER, a novel reinforcement learning (RL) framework that generates gesture sequences maximizing overall satisfaction. RACER uses a vector-quantized variational autoencoder to learn compact gesture representations and a GPT-based policy architecture to generate coherent gesture sequences autoregressively.
Results: Experiments show that the method significantly outperforms existing baselines on both objective metrics and subjective human judgments.

There is a growing demand of automatically synthesizing co-speech gestures for virtual characters. However, it remains a challenge due to the complex relationship between input speeches and target gestures. Most existing works focus on predicting the next gesture that fits the data best, however, such methods are myopic and lack the ability to plan for future gestures. In this paper, we propose a novel reinforcement learning (RL) framework called RACER to generate sequences of gestures that maximize the overall satisfactory. RACER employs a vector quantized variational autoencoder to learn compact representations of gestures and a GPT-based policy architecture to generate coherent sequence of gestures autoregressively. In particular, we propose a contrastive pre-training approach to calculate the rewards, which integrates contextual information into action evaluation and successfully captures the complex relationships between multi-modal speech-gesture data. Experimental results show that our method significantly outperforms existing baselines in terms of both objective metrics and subjective human judgements. Demos can be found at https://github.com/RLracer/RACER.git.

Reconstructing Signing Avatars From Video Using Linguistic Priors
Forte, Maria-Paola and Kulits, Peter and Huang, Chun-Hao P. and Choutas, Vasileios and Tzionas, Dimitrios and Kuchenbecker, Katherine J. and Black, Michael J.



Research question: How to automatically capture fine-grained hand pose, facial expression, and body movement from sign-language videos to create expressive 3D avatars.
Motivation: Existing sign-language learning tools are mostly dictionaries of isolated-sign videos; 3D avatars could aid learning and enable AR/VR applications, improving access to technology and online media.
Method: Novel linguistic priors that are universally applicable to sign language and constrain 3D hand pose, helping resolve ambiguities within isolated signs. The resulting method, SGNify, captures fine-grained hand pose, facial expression, and body movement fully automatically from in-the-wild monocular sign-language videos.
Results: SGNify is evaluated quantitatively using a commercial motion-capture system to compute 3D avatars synchronized with monocular video. SGNify outperforms state-of-the-art 3D body-pose and shape-estimation methods on sign-language videos, and a perceptual study shows its 3D reconstructions are more comprehensible and natural than those of previous methods and on par with the source videos.

Sign language (SL) is the primary method of communication for the 70 million Deaf people around the world. Video dictionaries of isolated signs are a core SL learning tool. Replacing these with 3D avatars can aid learning and enable AR/VR applications, improving access to technology and online media. However, little work has attempted to estimate expressive 3D avatars from SL video; occlusion, noise, and motion blur make this task difficult. We address this by introducing novel linguistic priors that are universally applicable to SL and provide constraints on 3D hand pose that help resolve ambiguities within isolated signs. Our method, SGNify, captures fine-grained hand pose, facial expression, and body movement fully automatically from in-the-wild monocular SL videos. We evaluate SGNify quantitatively by using a commercial motion-capture system to compute 3D avatars synchronized with monocular video. SGNify outperforms state-of-the-art 3D body-pose- and shape-estimation methods on SL videos. A perceptual study shows that SGNify's 3D reconstructions are significantly more comprehensible and natural than those of previous methods and are on par with the source videos. Code and data are available at sgnify.is.tue.mpg.de.

TempSAL - Uncovering Temporal Information for Deep Saliency Prediction
Aydemir, Bahar and Hoffstetter, Ludo and Zhang, Tong and Salzmann, Mathieu and Süsstrunk, Sabine



Research question: This paper proposes a new saliency prediction model that exploits human temporal attention patterns to learn to output saliency maps over sequential time intervals.
Motivation: Existing saliency prediction algorithms typically rely on additional information such as scene context, semantic relationships, gaze direction, and object dissimilarity, but do not consider the temporal nature of gaze shifts during image observation.
Method: Our approach locally modulates the saliency predictions by combining the learned temporal maps.
Results: Experiments show that our method outperforms state-of-the-art models, including a multi-duration saliency model, on the SALICON benchmark and the CodeCharts1k dataset.

Deep saliency prediction algorithms complement the object recognition features; they typically rely on additional information such as scene context, semantic relationships, gaze direction, and object dissimilarity. However, none of these models consider the temporal nature of gaze shifts during image observation. We introduce a novel saliency prediction model that learns to output saliency maps in sequential time intervals by exploiting human temporal attention patterns. Our approach locally modulates the saliency predictions by combining the learned temporal maps. Our experiments show that our method outperforms the state-of-the-art models, including a multi-duration saliency model, on the SALICON benchmark and CodeCharts1k dataset. Our code is publicly available on GitHub.

A Unified Pyramid Recurrent Network for Video Frame Interpolation
Jin, Xin and Wu, Longhai and Chen, Jie and Chen, Youxin and Koo, Jayoon and Hahm, Cheul-hee



Research question: This paper proposes a Unified Pyramid Recurrent Network (UPR-Net) for frame interpolation, addressing both optical flow estimation and intermediate frame synthesis.
Motivation: Existing frame interpolation methods need more robustness on large-motion cases.
Method: A flexible pyramid framework exploits lightweight recurrent modules for bi-directional optical flow estimation and intermediate frame synthesis. At each pyramid level, the estimated bi-directional flows generate forward-warped representations for frame synthesis; across pyramid levels, both the optical flow and the intermediate frame are iteratively refined.
Results: Experiments show that the iterative synthesis strategy significantly improves the robustness of frame interpolation on large-motion cases. Despite being extremely lightweight (1.7M parameters), the base version of UPR-Net performs excellently on a large range of benchmarks.

Flow-guided synthesis provides a common framework for frame interpolation, where optical flow is estimated to guide the synthesis of intermediate frames between consecutive inputs. In this paper, we present UPR-Net, a novel Unified Pyramid Recurrent Network for frame interpolation. Cast in a flexible pyramid framework, UPR-Net exploits lightweight recurrent modules for both bi-directional flow estimation and intermediate frame synthesis. At each pyramid level, it leverages estimated bi-directional flow to generate forward-warped representations for frame synthesis; across pyramid levels, it enables iterative refinement for both optical flow and intermediate frame. In particular, we show that our iterative synthesis strategy can significantly improve the robustness of frame interpolation on large motion cases. Despite being extremely lightweight (1.7M parameters), our base version of UPR-Net achieves excellent performance on a large range of benchmarks. Code and trained models of our UPR-Net series are available at: https://github.com/srcn-ivl/UPR-Net.

PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation
Zhao, Qitao and Zheng, Ce and Liu, Mengyuan and Wang, Pichao and Chen, Chen



Research question: Existing transformer-based methods for human pose estimation are limited by the length of the input joint sequence and the quality of 2D joint detection.
Motivation: To address these issues, this paper proposes PoseFormerV2, which exploits a compact frequency-domain representation of lengthy skeleton sequences to efficiently scale up the receptive field and boost robustness to noise.
Method: PoseFormerV2 effectively fuses features in both the time and frequency domains, achieving a better speed-accuracy trade-off than its precursor.
Results: Extensive experiments on two benchmark datasets (Human3.6M and MPI-INF-3DHP) show that the method significantly outperforms the original PoseFormer and other transformer-based variants.

Recently, transformer-based methods have gained significant success in sequential 2D-to-3D lifting human pose estimation. As a pioneering work, PoseFormer captures spatial relations of human joints in each video frame and human dynamics across frames with cascaded transformer layers and has achieved impressive performance. However, in real scenarios, the performance of PoseFormer and its follow-ups is limited by two factors: (a) The length of the input joint sequence; (b) The quality of 2D joint detection. Existing methods typically apply self-attention to all frames of the input sequence, causing a huge computational burden when the frame number is increased to obtain advanced estimation accuracy, and they are not robust to noise naturally brought by the limited capability of 2D joint detectors. In this paper, we propose PoseFormerV2, which exploits a compact representation of lengthy skeleton sequences in the frequency domain to efficiently scale up the receptive field and boost robustness to noisy 2D joint detection. With minimum modifications to PoseFormer, the proposed method effectively fuses features both in the time domain and frequency domain, enjoying a better speed-accuracy trade-off than its precursor. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that the proposed approach significantly outperforms the original PoseFormer and other transformer-based variants. Code is released at https://github.com/QitaoZhao/PoseFormerV2.
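The frequency-domain idea — compressing a long joint sequence into a few low-frequency coefficients — can be sketched with a DCT along the time axis. This is a hedged illustration under assumed shapes and coefficient counts; the paper's exact transform and tensor layout are not reproduced here.

```python
import numpy as np

def dct_compress(joints, n_coeffs=3):
    """Hedged sketch of frequency-domain compression of a skeleton sequence.

    joints: (T, J) joint trajectories over T frames. An orthonormal DCT-II
    along time turns each trajectory into frequency coefficients; keeping
    only the lowest n_coeffs yields a compact representation that covers
    the whole sequence (large receptive field) while discarding the
    high-frequency bins where detection jitter concentrates.
    """
    T = joints.shape[0]
    t = np.arange(T)
    # orthonormal DCT-II basis: entry [n, k] weights frame n in coefficient k
    basis = np.cos(np.pi * (t[:, None] + 0.5) * np.arange(T)[None, :] / T)
    basis *= np.sqrt(2.0 / T)
    basis[:, 0] /= np.sqrt(2.0)
    coeffs = basis.T @ joints               # (T, J) frequency coefficients
    return coeffs[:n_coeffs]                # keep low frequencies only
```

A static (constant) trajectory collapses entirely into the DC coefficient, which is why a handful of low-frequency terms can summarize many frames.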

Neural Koopman Pooling: Control-Inspired Temporal Dynamics Encoding for Skeleton-Based Action Recognition
Wang, Xinghan and Xu, Xin and Mu, Yadong



Research question: Existing skeleton-based action recognition methods fall short in capturing high-order dynamics information.
Motivation: To address this, we propose Koopman pooling, a parameterized high-order pooling technique based on Koopman theory.
Method: A CNN or GCN backbone is trained to extract spatial-temporal features, which are aggregated by the Koopman pooling module. We also propose an eigenvalue normalization method that encourages the learned dynamics to be stable and non-decaying, and show that, combined with Dynamic Mode Decomposition, the framework extends easily to one-shot action recognition.
Results: Evaluations on three benchmark datasets, NTU RGB+D 60, 120, and NW-UCLA, show that Koopman pooling significantly improves performance under both full-dataset and one-shot settings.

Skeleton-based human action recognition is becoming increasingly important in a variety of fields. Most existing works train a CNN or GCN based backbone to extract spatial-temporal features, and use temporal average/max pooling to aggregate the information. However, these pooling methods fail to capture high-order dynamics information. To address the problem, we propose a plug-and-play module called Koopman pooling, which is a parameterized high-order pooling technique based on Koopman theory. The Koopman operator linearizes a non-linear dynamics system, thus providing a way to represent the complex system through the dynamics matrix, which can be used for classification. We also propose an eigenvalue normalization method to encourage the learned dynamics to be non-decaying and stable. Besides, we also show that our Koopman pooling framework can be easily extended to one-shot action recognition when combined with Dynamic Mode Decomposition. The proposed method is evaluated on three benchmark datasets, namely NTU RGB+D 60, 120 and NW-UCLA. Our experiments clearly demonstrate that Koopman pooling significantly improves the performance under both full-dataset and one-shot settings.
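The core of Koopman pooling — summarizing a feature sequence by the linear dynamics that best explains it, instead of averaging over time — can be sketched with a least-squares fit, as in Dynamic Mode Decomposition. This sketch omits the paper's learned lifting and eigenvalue normalization; shapes and the function name are assumptions.

```python
import numpy as np

def koopman_pool(features):
    """Hedged sketch of Koopman pooling over a feature sequence.

    features: (T, D) per-frame features. Fit a linear dynamics matrix K
    with x_{t+1} ~= K x_t in the least-squares sense and use K itself as
    a fixed-size descriptor: unlike average/max pooling, K encodes *how*
    the features evolve, i.e. the high-order dynamics of the action.
    """
    X, Y = features[:-1], features[1:]
    # solve min_K ||Y - X K^T||_F  =>  K^T = pinv(X) @ Y
    K = (np.linalg.pinv(X) @ Y).T           # (D, D) dynamics matrix
    return K.ravel()                         # flatten into a pooled vector
```

On features generated by a true linear system, the fit recovers the generating matrix exactly, which is the property the classifier then exploits.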

Few-Shot Referring Relationships in Videos
Kumar, Yogesh and Mishra, Anand



Research question: Given a query visual relationship, localize in a test video the subject and object connected via the predicate.
Motivation: Although modern visio-lingual understanding could solve this, annotating every combination of subject, object, and predicate is cumbersome, expensive, and possibly infeasible. Hence the need for a model that learns to spatially and temporally localize subjects and objects connected via an unseen predicate, using only support-set videos sharing that predicate.
Method: The problem is posed as minimizing an objective function defined over a T-partite random field, whose vertices are candidate bounding boxes for the subject and object and where T is the number of frames in the test video. The objective consists of frame-level and visual-relationship similarity potentials. To learn these potentials, a relation network takes query-conditioned translational relationship embeddings as input and is meta-trained on support-set videos. The objective is then minimized by belief-propagation-based message passing on the random field to obtain the spatiotemporal localization, i.e., the subject and object trajectories.
Results: Extensive experiments on two public benchmarks, ImageNet-VidVRD and VidOR, compare the proposed method against competitive baselines to assess its efficacy.

Interpreting visual relationships is a core aspect of comprehensive video understanding. Given a query visual relationship and a test video, our objective is to localize the subject and object that are connected via the predicate. Given modern visio-lingual understanding capabilities, solving this problem is achievable, provided that there are large-scale annotated training examples available. However, annotating for every combination of subject, object, and predicate is cumbersome, expensive, and possibly infeasible. Therefore, there is a need for models that can learn to spatially and temporally localize subjects and objects that are connected via an unseen predicate using only a few support set videos sharing the common predicate. We address this challenging problem, referred to as few-shot referring relationships in videos, for the first time. To this end, we pose the problem as a minimization of an objective function defined over a T-partite random field. Here, the vertices of the random field correspond to candidate bounding boxes for the subject and object, and T represents the number of frames in the test video. This objective function is composed of frame level and visual relationship similarity potentials. To learn these potentials, we use a relation network that takes query-conditioned translational relationship embedding as inputs and is meta-trained using support set videos in an episodic manner. Further, the objective function is minimized using a belief propagation-based message passing on the random field to obtain the spatiotemporal localization of subject and object trajectories. We perform extensive experiments using two public benchmarks, namely ImageNet-VidVRD and VidOR, and compare the proposed approach with competitive baselines to assess its efficacy.
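Because the random field is a chain over frames, max-product belief propagation reduces to a Viterbi-style dynamic program. A sketch under simplifying assumptions (a fixed number N of candidate boxes per frame, and the learned potentials already collapsed into plain score arrays):

```python
import numpy as np

def chain_map_inference(unary, pairwise):
    """Hedged sketch of MAP inference on a T-partite chain of candidates.

    unary: (T, N) frame-level score of candidate box i at frame t.
    pairwise: (T-1, N, N) consistency score between box i at frame t and
    box j at frame t+1. Returns the best trajectory of candidate indices
    and its total score, via the standard Viterbi recursion.
    """
    T, N = unary.shape
    score = unary[0].copy()
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        total = score[:, None] + pairwise[t - 1]   # (N, N) transition scores
        back[t] = total.argmax(axis=0)             # best predecessor per box
        score = total.max(axis=0) + unary[t]
    # backtrack the highest-scoring trajectory
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(score.max())
```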

Frame-Event Alignment and Fusion Network for High Frame Rate Tracking
Zhang, Jiqing and Wang, Yuanchen and Liu, Wenxi and Li, Meng and Bai, Jinpeng and Yin, Baocai and Yang, Xin



Research question: How to combine conventional frames and events for high frame rate object tracking.
Motivation: Existing RGB-based trackers mostly target low-frame-rate benchmarks of around 30 frames per second, which restricts their real-world functionality, especially under fast motion. Event cameras, with their high temporal resolution, offer considerable potential for high-frame-rate tracking, but they cannot provide fine-grained texture information like conventional cameras. This unique complementarity motivates combining conventional frames and events to handle various challenging conditions.
Method: We propose an end-to-end network with multi-modality alignment and fusion modules to effectively extract meaningful information from both modalities. The alignment module performs cross-modality and cross-frame-rate alignment between the frame and event modalities, guided by the motion cues furnished by events; the fusion module emphasizes valuable features and suppresses noise through the mutual complementation of the two modalities.
Results: Extensive experiments show that the method significantly outperforms state-of-the-art trackers on high-frame-rate tracking. With the FE240hz dataset, the method achieves high-frame-rate tracking up to 240 Hz.

Most existing RGB-based trackers target low frame rate benchmarks of around 30 frames per second. This setting restricts the tracker's functionality in the real world, especially for fast motion. Event-based cameras, as bio-inspired sensors, provide considerable potential for high frame rate tracking due to their high temporal resolution. However, event-based cameras cannot offer fine-grained texture information like conventional cameras. This unique complementarity motivates us to combine conventional frames and events for high frame rate object tracking under various challenging conditions. In this paper, we propose an end-to-end network consisting of multi-modality alignment and fusion modules to effectively combine meaningful information from both modalities at different measurement rates. The alignment module is responsible for cross-modality and cross-frame-rate alignment between frame and event modalities under the guidance of the moving cues furnished by events, while the fusion module emphasizes valuable features and suppresses noise information through the mutual complement between the two modalities. Extensive experiments show that the proposed approach outperforms state-of-the-art trackers by a significant margin in high frame rate tracking. With the FE240hz dataset, our approach achieves high frame rate tracking up to 240 Hz.

Event-Based Video Frame Interpolation With Cross-Modal Asymmetric Bidirectional Motion Fields
Kim, Taewoo and Chae, Yujeong and Jang, Hyun-Kurl and Yoon, Kuk-Jin



Research problem: Video frame interpolation (VFI) aims to generate intermediate video frames between consecutive input frames, but existing methods estimate bidirectional inter-frame motion fields with only events or approximations, failing to account for the complex motion of real-world scenes.
Motivation: Because event cameras are bio-inspired sensors that encode only brightness changes at microsecond temporal resolution, several works have exploited them to improve VFI performance.
Method: We propose a novel event-based VFI framework with cross-modal asymmetric bidirectional motion field estimation. Specifically, our EIF-BiOFNet estimates inter-frame motion fields directly, without any approximation, fully exploiting the valuable characteristics of both events and images. We also develop an interactive attention-based frame synthesis network that effectively leverages warping-based and synthesis-based features.
Results: We build ERF-X170FPS, a large-scale event-based VFI dataset with a high frame rate, extreme motion, and dynamic textures, overcoming the limitations of previous event-based VFI datasets. Extensive experiments show that our method achieves significant performance improvements over state-of-the-art VFI methods across various datasets.

Video Frame Interpolation (VFI) aims to generate intermediate video frames between consecutive input frames. Since event cameras are bio-inspired sensors that only encode brightness changes with a micro-second temporal resolution, several works utilized the event camera to enhance the performance of VFI. However, existing methods estimate bidirectional inter-frame motion fields with only events or approximations, which cannot account for the complex motion in real-world scenarios. In this paper, we propose a novel event-based VFI framework with cross-modal asymmetric bidirectional motion field estimation. In detail, our EIF-BiOFNet utilizes each valuable characteristic of the events and images for direct estimation of inter-frame motion fields without any approximation methods. Moreover, we develop an interactive attention-based frame synthesis network to efficiently leverage the complementary warping-based and synthesis-based features. Finally, we build a large-scale event-based VFI dataset, ERF-X170FPS, with a high frame rate, extreme motion, and dynamic textures to overcome the limitations of previous event-based VFI datasets. Extensive experimental results validate that our method shows significant performance improvement over the state-of-the-art VFI methods on various datasets. Our project pages are available at: https://github.com/intelpro/CBMNet

MD-VQA: Multi-Dimensional Quality Assessment for UGC Live Videos
Zhang, Zicheng and Wu, Wei and Sun, Wei and Tu, Danyang and Lu, Wei and Min, Xiongkuo and Chen, Ying and Zhai, Guangtao



Research problem: UGC live videos suffer various distortions during capture and thus exhibit diverse visual quality; they are further compressed and transcoded by media server providers during distribution.
Motivation: Given the prevalence of UGC live videos, effective video quality assessment (VQA) tools are needed to monitor and optimize the quality of live streaming videos.
Method: We construct a UGC Live VQA database containing 418 source UGC videos and 3,762 videos compressed at different bit rates, and on top of it develop a Multi-Dimensional VQA (MD-VQA) evaluator that measures the visual quality of UGC live videos from the semantic, distortion, and motion aspects.
Results: Experiments show that MD-VQA achieves state-of-the-art performance on both our UGC Live VQA database and existing compressed UGC VQA databases.

User-generated content (UGC) live videos are often bothered by various distortions during capture procedures and thus exhibit diverse visual qualities. Such source videos are further compressed and transcoded by media server providers before being distributed to end-users. Because of the flourishing of UGC live videos, effective video quality assessment (VQA) tools are needed to monitor and perceptually optimize live streaming videos in the distributing process. Unfortunately, existing compressed UGC VQA databases are either small in scale or employ high-quality UGC videos as source videos, so VQA models developed on these databases have limited abilities to evaluate UGC live videos. In this paper, we address UGC Live VQA problems by constructing a first-of-a-kind subjective UGC Live VQA database and developing an effective evaluation tool. Concretely, 418 source UGC videos are collected in real live streaming scenarios and 3,762 compressed ones at different bit rates are generated for the subsequent subjective VQA experiments. Based on the built database, we develop a Multi-Dimensional VQA (MD-VQA) evaluator to measure the visual quality of UGC live videos from semantic, distortion, and motion aspects respectively. Extensive experimental results show that MD-VQA achieves state-of-the-art performance on both our UGC Live VQA database and existing compressed UGC VQA databases.

Natural Language-Assisted Sign Language Recognition
Zuo, Ronglai and Wei, Fangyun and Mak, Brian



Research problem: Sign language recognition is hampered by a large number of visually indistinguishable signs (VISigns), which limits the recognition capacity of vision neural networks.
Motivation: To address this, we propose the Natural Language-Assisted Sign Language Recognition (NLA-SLR) framework, which exploits semantic information to improve recognition.
Method: First, for VISigns with similar semantic meanings, we propose language-aware label smoothing, which generates soft labels for each training sign to ease training. Second, for VISigns with distinct semantic meanings, we propose an inter-modality mixup that blends vision and gloss features to further maximize the separability of different signs. In addition, we introduce a novel backbone, the video-keypoint network, which not only models both RGB videos and human body keypoints but also derives knowledge from sign videos of different temporal receptive fields.
Results: Experiments show that our method achieves state-of-the-art performance on three widely adopted benchmarks: MSASL, WLASL, and NMFs-CSL.

Sign languages are visual languages which convey information by signers' handshape, facial expression, body movement, and so forth. Due to the inherent restriction of combinations of these visual ingredients, there exist a significant number of visually indistinguishable signs (VISigns) in sign languages, which limits the recognition capacity of vision neural networks. To mitigate the problem, we propose the Natural Language-Assisted Sign Language Recognition (NLA-SLR) framework, which exploits semantic information contained in glosses (sign labels). First, for VISigns with similar semantic meanings, we propose language-aware label smoothing by generating soft labels for each training sign whose smoothing weights are computed from the normalized semantic similarities among the glosses to ease training. Second, for VISigns with distinct semantic meanings, we present an inter-modality mixup technique which blends vision and gloss features to further maximize the separability of different signs under the supervision of blended labels. Besides, we also introduce a novel backbone, video-keypoint network, which not only models both RGB videos and human body keypoints but also derives knowledge from sign videos of different temporal receptive fields. Empirically, our method achieves state-of-the-art performance on three widely-adopted benchmarks: MSASL, WLASL, and NMFs-CSL. Codes are available at https://github.com/FangyunWei/SLRT.
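The language-aware label smoothing above can be sketched as follows: the smoothing mass is spread over the non-target classes in proportion to their normalized gloss similarity to the ground-truth class. The softmax with temperature `tau` and the smoothing weight `eps` are our assumed parameterization, not values from the paper:

```python
import numpy as np

def language_aware_soft_labels(target, sim, tau=0.1, eps=0.2):
    """Soft label for one training sign: (1 - eps) on the ground truth,
    eps distributed over the other classes according to normalized
    semantic similarity between their glosses and the target gloss.

    sim: (C, C) semantic similarity matrix between gloss embeddings.
    """
    w = np.exp(sim[target] / tau)
    w[target] = 0.0              # smoothing mass goes only to *other* classes
    w = w / w.sum()
    soft = eps * w
    soft[target] = 1.0 - eps
    return soft
```

Semantically close glosses thus receive more of the smoothing mass, easing training on VISigns with similar meanings.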

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
Ruan, Ludan and Ma, Yiyang and Yang, Huan and He, Huiguo and Liu, Bei and Fu, Jianlong and Yuan, Nicholas Jing and Jin, Qin and Guo, Baining



Research problem: We propose the first framework that generates audio and video jointly, towards high-quality realistic videos.
Motivation: Existing models cannot deliver an engaging audio-visual experience simultaneously; our goal is to generate high-quality realistic videos by jointly training over audio and video.
Method: We propose a novel Multi-Modal Diffusion model (MM-Diffusion) with two coupled denoising autoencoders. Unlike existing single-modal diffusion models, MM-Diffusion consists of a sequential multi-modal U-Net designed for a joint denoising process; its two subnets learn to gradually generate aligned audio-video pairs from Gaussian noise. To ensure semantic consistency across modalities, we propose a random-shift-based attention block bridging the two subnets, which enables efficient cross-modal alignment and reinforces the mutual fidelity between audio and video.
Results: Extensive experiments show superior results on unconditional audio-video generation and zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve the best FVD and FAD on the Landscape and AIST++ dancing datasets, and Turing tests with 10k votes show a clear preference for our model.

We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously, towards high-quality realistic videos. To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion), with two-coupled denoising autoencoders. In contrast to existing single-modal diffusion models, MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design. Two subnets for audio and video learn to gradually generate aligned audio-video pairs from Gaussian noises. To ensure semantic consistency across modalities, we propose a novel random-shift based attention block bridging over the two subnets, which enables efficient cross-modal alignment, and thus reinforces the audio-video fidelity for each other. Extensive experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve the best FVD and FAD on Landscape and AIST++ dancing datasets. Turing tests of 10k votes further demonstrate dominant preferences for our model.

MAGVIT: Masked Generative Video Transformer
Yu, Lijun and Cheng, Yong and Sohn, Kihyuk and Lezama, Jos\'e and Zhang, Han and Chang, Huiwen and Hauptmann, Alexander G. and Yang, Ming-Hsuan and Hao, Yuan and Essa, Irfan and Jiang, Lu



Research problem: This paper aims to tackle various video synthesis tasks with a single model.
Motivation: Existing approaches typically require multiple models to handle different video synthesis tasks, which is inefficient.
Method: We introduce the Masked Generative VIdeo Transformer (MAGVIT) with a 3D tokenizer that quantizes a video into spatial-temporal visual tokens, and propose an embedding method for masked video token modeling to facilitate multi-task learning.
Results: Experiments show that MAGVIT outperforms existing methods across metrics and establishes the best published FVD on three video generation benchmarks, including the challenging Kinetics-600. Moreover, MAGVIT's inference is two orders of magnitude faster than diffusion models and 60x faster than autoregressive models. A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains.

We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu.

End-to-End Video Matting With Trimap Propagation
Huang, Wei-Lun and Lee, Ming-Sui



Research problem: Video matting research mainly focuses on temporal coherence and has improved markedly with neural networks. However, matting usually relies on user-annotated trimaps to estimate alpha values, a labor-intensive requirement.
Motivation: Although some recent studies propagate the given trimaps with video object segmentation methods, their results are inconsistent. We therefore present a more robust and faster end-to-end video matting model, FTP-VM (Fast Trimap Propagation - Video Matting).
Method: FTP-VM combines trimap propagation and video matting in one model, replacing the additional memory-matching backbone with a proposed lightweight trimap fusion module. We also adapt a segmentation consistency loss from automotive segmentation to trimap segmentation, and employ an RNN (recurrent neural network) to improve temporal coherence.
Results: Experiments show that FTP-VM performs competitively on both composited and real videos using only a few given trimaps. Its efficiency is eight times higher than that of state-of-the-art methods, confirming its robustness and applicability in real-time scenarios.

The research of video matting mainly focuses on temporal coherence and has gained significant improvement via neural networks. However, matting usually relies on user-annotated trimaps to estimate alpha values, which is a labor-intensive issue. Although recent studies exploit video object segmentation methods to propagate the given trimaps, they suffer from inconsistent results. Here we present a more robust and faster end-to-end video matting model equipped with trimap propagation, called FTP-VM (Fast Trimap Propagation - Video Matting). The FTP-VM combines trimap propagation and video matting in one model, where the additional backbone in memory matching is replaced with the proposed lightweight trimap fusion module. A segmentation consistency loss is adopted from automotive segmentation and fitted to trimap segmentation, in collaboration with an RNN (recurrent neural network), to improve the temporal coherence. The experimental results demonstrate that the FTP-VM performs competitively on both composited and real videos with only a few given trimaps. The efficiency is eight times higher than the state-of-the-art methods, which confirms its robustness and applicability in real-time scenarios. The code is available at https://github.com/csvt32745/FTP-VM

DropMAE: Masked Autoencoders With Spatial-Attention Dropout for Tracking Tasks
Wu, Qiangqiang and Yang, Tianyu and Liu, Ziquan and Wu, Baoyuan and Shan, Ying and Chan, Antoni B.



Research problem: This paper studies masked autoencoder pre-training on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS).
Motivation: Existing masked autoencoders (MAE) rely heavily on spatial cues while ignoring temporal relations during video frame reconstruction, leading to sub-optimal performance on matching tasks such as VOT and VOS.
Method: We propose DropMAE, which adaptively performs spatial-attention dropout during frame reconstruction to facilitate temporal correspondence learning in videos.
Results: Experiments show that DropMAE is a strong and efficient temporal matching learner, pre-training 2x faster than the ImageNet-based MAE while achieving better fine-tuning results on matching tasks. We also find that motion diversity in pre-training videos matters more than scene diversity for improving VOT and VOS performance.

In this paper, we study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS). A simple extension of MAE is to randomly mask out frame patches in videos and reconstruct the frame pixels. However, we find that this simple baseline heavily relies on spatial cues while ignoring temporal relations for frame reconstruction, thus leading to sub-optimal temporal matching representations for VOT and VOS. To alleviate this problem, we propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos. We show that our DropMAE is a strong and efficient temporal matching learner, which achieves better fine-tuning results on matching-based tasks than the ImageNet-based MAE with 2x faster pre-training speed. Moreover, we also find that motion diversity in pre-training videos is more important than scene diversity for improving the performance on VOT and VOS. Our pre-trained DropMAE model can be directly loaded in existing ViT-based trackers for fine-tuning without further modifications. Notably, DropMAE sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets. Our code and pre-trained models are available at https://github.com/jimmy-dq/DropMAE.git.
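As an illustration of spatial-attention dropout, the sketch below zeroes out a fraction of the same-frame attention weights per query and renormalizes, forcing queries to lean on temporal (cross-frame) keys instead. Sampling the dropped keys in proportion to their attention weight is our stand-in for the paper's adaptive scheme, and the interface is assumed, not DropMAE's actual code:

```python
import numpy as np

def spatial_attention_dropout(attn, p=0.1, rng=None):
    """Drop heavily attended spatial (within-frame) keys per query and
    renormalize the remaining attention weights.

    attn: (Q, S) attention weights over S same-frame keys; rows sum to 1.
    p: fraction of spatial keys to drop per query.
    """
    rng = np.random.default_rng(rng)
    Q, S = attn.shape
    k = max(1, int(p * S))
    out = attn.copy()
    for q in range(Q):
        # sample k keys to drop, biased toward high-attention positions
        drop = rng.choice(S, size=k, replace=False, p=attn[q])
        out[q, drop] = 0.0
        out[q] /= out[q].sum()  # renormalize surviving weights
    return out
```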

Are Binary Annotations Sufficient? Video Moment Retrieval via Hierarchical Uncertainty-Based Active Learning
Ji, Wei and Liang, Renjie and Zheng, Zhedong and Zhang, Wenqiao and Zhang, Shengyu and Li, Juncheng and Li, Mengze and Chua, Tat-seng



Research problem: Research on video moment retrieval has mostly focused on improving accuracy, efficiency, and robustness, all of which rely on abundant high-quality annotations.
Motivation: Precise frame-level annotations are time-consuming and expensive, yet the labeling process itself has received little attention.
Method: We explore a new interactive manner to stimulate human participation in the annotation process for video moment retrieval. The key challenge is selecting "ambiguous" frames and videos for binary annotation to facilitate network training. Specifically, we propose a novel hierarchical uncertainty-based model that explicitly models the uncertainty of each frame within the video sequence corresponding to the query description, and selects the frame with the highest uncertainty for annotation.
Results: We find that a small number of expert-provided labels suffices to learn a competitive video moment retrieval model in this harsh setting. Moreover, by treating a video's frame uncertainty scores as a whole, we estimate each video's difficulty, which further relieves the burden of video selection. Overall, our active learning strategy works at both the frame level and the sequence level. Experiments on two public datasets validate the effectiveness of the proposed method.

Recent research on video moment retrieval has mostly focused on enhancing accuracy, efficiency, and robustness, all of which largely rely on the abundance of high-quality annotations. While precise frame-level annotations are time-consuming and expensive, little attention has been paid to the labeling process. In this work, we explore a new interactive manner to stimulate the process of human-in-the-loop annotation in the video moment retrieval task. The key challenge is to select "ambiguous" frames and videos for binary annotations to facilitate the network training. To be specific, we propose a new hierarchical uncertainty-based model that explicitly models the uncertainty of each frame within the entire video sequence corresponding to the query description, and selects the frame with the highest uncertainty. Only the selected frame will be annotated by human experts, which can largely reduce the workload. After obtaining a small number of labels provided by the experts, we show that it is sufficient to learn a competitive video moment retrieval model in such a harsh environment. Moreover, we treat the uncertainty scores of frames in a video as a whole, and estimate the difficulty of each video, which can further relieve the burden of video selection. In general, our active learning strategy for video moment retrieval works not only at the frame level but also at the sequence level. Experiments on two public datasets validate the effectiveness of our proposed method.
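The frame-selection step can be sketched with Bernoulli entropy as the uncertainty measure, and the sequence-level score as the mean frame uncertainty. The entropy criterion is our choice for illustration; the paper's hierarchical uncertainty model is richer:

```python
import numpy as np

def select_most_uncertain_frame(frame_probs):
    """Pick the frame whose binary (matches-the-query?) prediction is most
    uncertain, measured by Bernoulli entropy; only that frame would be sent
    to the human expert for a yes/no annotation."""
    p = np.clip(np.asarray(frame_probs, dtype=float), 1e-6, 1 - 1e-6)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return int(entropy.argmax()), entropy

def video_difficulty(frame_probs):
    """Sequence-level score: treat the frame uncertainties as a whole and
    use their mean as an estimate of how hard the video is to annotate."""
    return float(select_most_uncertain_frame(frame_probs)[1].mean())
```

A frame predicted near 0.5 has maximal entropy and gets annotated; videos with high mean entropy would be prioritized for selection.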

Multi-Label Compound Expression Recognition: C-EXPR Database \& Network
Kollias, Dimitrios



Research problem: This paper addresses compound expression recognition (CER): existing facial expression analysis mainly targets the seven basic expressions, whereas compound expressions more accurately reflect the complexity and subtlety of our daily affective displays.
Motivation: Research on compound expression recognition has been very limited, since only a few small, lab-controlled, imbalanced, and static databases exist.
Method: We present C-EXPR-DB, an in-the-wild audio/video database of 400 videos and 200K frames, annotated with 13 compound expressions as well as emotion descriptors, action units, speech, facial landmarks, and attributes. We also propose C-EXPR-NET, a multi-task learning method for compound expression recognition and action unit detection.
Results: An extensive experimental study validates the excellent performance of C-EXPR-NET and shows that it generalizes effectively to new contexts in a zero-shot manner.

Research in automatic analysis of facial expressions mainly focuses on recognising the seven basic ones. However, compound expressions are more diverse and represent the complexity and subtlety of our daily affective displays more accurately. Limited research has been conducted on compound expression recognition (CER), because only a few databases exist, which are small, lab-controlled, imbalanced and static. In this paper, we present an in-the-wild A/V database, C-EXPR-DB, consisting of 400 videos of 200K frames, annotated in terms of 13 compound expressions, valence-arousal emotion descriptors, action units, speech, facial landmarks and attributes. We also propose C-EXPR-NET, a multi-task learning (MTL) method for CER and AU detection (AU-D); the latter task is introduced to enhance CER performance. For AU-D we incorporate AU semantic descriptions along with visual information. For CER we use a multi-label formulation and the KL-divergence loss. We also propose a distribution matching loss for coupling the CER and AU-D tasks to boost their performance and alleviate negative transfer (i.e., when the MTL model's performance is worse than that of at least one single-task model). An extensive experimental study has been conducted, illustrating the excellent performance of C-EXPR-NET and validating the theoretical claims. Finally, C-EXPR-NET is shown to effectively generalize its knowledge in new emotion recognition contexts, in a zero-shot manner.

Physics-Driven Diffusion Models for Impact Sound Synthesis From Videos
Su, Kun and Qian, Kaizhi and Shlizerman, Eli and Torralba, Antonio and Gan, Chuang



Research problem: How to effectively synthesize object-interaction sounds for immersive perceptual experiences in real and virtual worlds.
Motivation: Traditional impact sound synthesis requires fine details of object geometry and impact locations, which are rarely available in the real world and cannot be applied to synthesize impact sounds for common videos. Existing video-driven deep learning methods, lacking physics knowledge, capture only a weak correspondence between visual content and impact sounds.
Method: We propose a physics-driven diffusion model that synthesizes high-fidelity impact sounds for silent video clips. In addition to the video content, we use extra physics priors to guide the synthesis process: physics parameters estimated directly from noisy real-world impact sound examples, and learned residual parameters that interpret the sound environment via neural networks.
Results: Experiments show that our model outperforms several existing systems in generating realistic impact sounds. More importantly, the physics-based representations are fully interpretable and transparent, enabling flexible sound editing.

Modeling sounds emitted from physical object interactions is critical for immersive perceptual experiences in real and virtual worlds. Traditional methods of impact sound synthesis use physics simulation to obtain a set of physics parameters that could represent and synthesize the sound. However, they require fine details of both the object geometries and impact locations, which are rarely available in the real world and can not be applied to synthesize impact sounds from common videos. On the other hand, existing video-driven deep learning-based approaches could only capture the weak correspondence between visual content and impact sounds since they lack physics knowledge. In this work, we propose a physics-driven diffusion model that can synthesize high-fidelity impact sound for a silent video clip. In addition to the video content, we propose to use additional physics priors to guide the impact sound synthesis procedure. The physics priors include both physics parameters that are directly estimated from noisy real-world impact sound examples without sophisticated setup and learned residual parameters that interpret the sound environment via neural networks. We further implement a novel diffusion model with specific training and inference strategies to combine physics priors and visual information for impact sound synthesis. Experimental results show that our model outperforms several existing systems in generating realistic impact sounds. More importantly, the physics-based representations are fully interpretable and transparent, thus enabling us to perform sound editing flexibly. We encourage the readers to visit our project page to watch demo videos with audio turned on to experience the results.
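The physics parameters referred to above are those of the classic modal-sound model: an impact excites a set of vibration modes, each a damped sinusoid with its own frequency, amplitude, and decay rate. A minimal renderer under that standard model (the parameter values below are illustrative, not estimated ones from the paper):

```python
import numpy as np

def modal_impact_sound(freqs, amps, decays, dur=0.5, sr=16000):
    """Render an impact as a sum of exponentially damped sinusoids,
    one per vibration mode.

    freqs: per-mode frequencies in Hz.
    amps: per-mode initial amplitudes.
    decays: per-mode decay rates in 1/s (larger = dies out faster).
    """
    t = np.arange(int(dur * sr)) / sr
    sound = np.zeros_like(t)
    for f, a, d in zip(freqs, amps, decays):
        sound += a * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
    return sound
```

Editing the sound then amounts to editing these interpretable parameters, e.g. raising a mode's frequency or shortening its decay.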

Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding
Lin, Zihang and Tan, Chaolei and Hu, Jian-Fang and Jin, Zhi and Ye, Tiancai and Zheng, Wei-Shi



Research problem: This paper addresses spatio-temporal video grounding (STVG), i.e., localizing a target object in space and time according to a given language query.
Motivation: The task is challenging: the model must understand both dynamic visual cues (e.g., motion) and static visual cues (e.g., object appearance) in the language description, which requires effective joint modeling of spatio-temporal visual-linguistic dependencies.
Method: We propose a new framework in which a static vision-language stream and a dynamic vision-language stream collaboratively reason about the target object. The static stream performs cross-modal understanding within a single frame and learns to attend to the target object spatially based on intra-frame visual cues such as object appearance. The dynamic stream models visual-linguistic dependencies across multiple consecutive frames to capture dynamic cues such as motion. We further design a novel cross-stream collaborative block that lets the static and dynamic streams exchange useful, complementary information for collaborative reasoning.
Results: Experiments show that the collaboration of the two streams is effective, and our overall framework achieves new state-of-the-art performance on both the HCSTVG and VidSTG datasets.

Spatio-Temporal Video Grounding (STVG) aims to localize the target object spatially and temporally according to the given language query. It is a challenging task in which the model should thoroughly understand dynamic visual cues (e.g., motions) and static visual cues (e.g., object appearances) in the language description, which requires effective joint modeling of spatio-temporal visual-linguistic dependencies. In this work, we propose a novel framework in which a static vision-language stream and a dynamic vision-language stream are developed to collaboratively reason about the target tube. The static stream performs cross-modal understanding in a single frame and learns to attend to the target object spatially according to intra-frame visual cues like object appearances. The dynamic stream models visual-linguistic dependencies across multiple consecutive frames to capture dynamic cues like motions. We further design a novel cross-stream collaborative block between the two streams, which enables the static and dynamic streams to transfer useful and complementary information from each other to achieve collaborative reasoning. Experimental results show the effectiveness of the collaboration of the two streams and our overall framework achieves new state-of-the-art performance on both HCSTVG and VidSTG datasets.

Micron-BERT: BERT-Based Facial Micro-Expression Recognition
Nguyen, Xuan-Bac and Duong, Chi Nhan and Li, Xin and Gauch, Susan and Seo, Han-Seok and Luu, Khoa



Research problem: How to recognize micro-expressions, i.e., brief and tiny facial movements, more accurately.
Motivation: Although pre-trained deep bidirectional transformers (BERT) have significantly advanced self-supervised learning in computer vision, the standard BERT design for vision learns only from full images or videos and cannot accurately detect the fine details of facial micro-expressions.
Method: We propose Micron-BERT (u-BERT), a novel approach to facial micro-expression recognition. It automatically captures these movements based on two key ideas: Diagonal Micro-Attention (DMA) to detect tiny differences between two frames, and a new Patch of Interest (PoI) module to localize and highlight micro-expression regions of interest while suppressing noisy backgrounds and distractions.
Results: Integrated into an end-to-end deep network, u-BERT substantially outperforms previous work across micro-expression tasks. It can be trained on large-scale unlabeled datasets and achieves high accuracy on new, unseen micro-expression datasets. Experiments show that u-BERT consistently surpasses state-of-the-art performance by clear margins on four micro-expression benchmarks: SAMM, CASME II, SMIC, and CASME3. Code will be available at https://github.com/uark-cviu/Micron-BERT.

Micro-expression recognition is one of the most challenging topics in affective computing. It aims to recognize tiny facial movements difficult for humans to perceive in a brief period, i.e., 0.25 to 0.5 seconds. Recent advances in pre-training deep Bidirectional Transformers (BERT) have significantly improved self-supervised learning tasks in computer vision. However, the standard BERT in vision problems is designed to learn only from full images or videos, and the architecture cannot accurately detect details of facial micro-expressions. This paper presents Micron-BERT (u-BERT), a novel approach to facial micro-expression recognition. The proposed method can automatically capture these movements in an unsupervised manner based on two key ideas. First, we employ Diagonal Micro-Attention (DMA) to detect tiny differences between two frames. Second, we introduce a new Patch of Interest (PoI) module to localize and highlight micro-expression interest regions and simultaneously reduce noisy backgrounds and distractions. By incorporating these components into an end-to-end deep network, the proposed u-BERT significantly outperforms all previous work in various micro-expression tasks. u-BERT can be trained on a large-scale unlabeled dataset, i.e., up to 8 million images, and achieves high accuracy on new unseen facial micro-expression datasets. Empirical experiments show u-BERT consistently outperforms state-of-the-art performance on four micro-expression benchmarks, including SAMM, CASME II, SMIC, and CASME3, by significant margins. Code will be available at https://github.com/uark-cviu/Micron-BERT

SVFormer: Semi-Supervised Video Transformer for Action Recognition
Xing, Zhen and Dai, Qi and Hu, Han and Chen, Jingjing and Wu, Zuxuan and Jiang, Yu-Gang



Research problem: Semi-supervised action recognition is an important but difficult task because of the high cost of video annotation.
Motivation: Existing approaches mainly use convolutional neural networks, while today's revolutionary video transformer models remain under-explored.
Method: We propose SVFormer, a transformer model that adopts a steady pseudo-labeling framework (i.e., EMA-Teacher) to handle unlabeled video samples, together with two novel data augmentation strategies: Tube TokenMix and a temporal warping augmentation.
Results: Extensive experiments on Kinetics-400, UCF-101, and HMDB-51 verify the advantages of SVFormer. In particular, under the 1% labeling rate on Kinetics-400, SVFormer outperforms the state of the art by 31.5% with fewer training epochs.

Semi-supervised action recognition is a challenging but critical task due to the high cost of video annotations. Existing approaches mainly use convolutional neural networks, yet current revolutionary vision transformer models have been less explored. In this paper, we investigate the use of transformer models under the SSL setting for action recognition. To this end, we introduce SVFormer, which adopts a steady pseudo-labeling framework (i.e., EMA-Teacher) to cope with unlabeled video samples. While a wide range of data augmentations have been shown effective for semi-supervised image classification, they generally produce limited results for video recognition. We therefore introduce a novel augmentation strategy, Tube TokenMix, tailored for video data where video clips are mixed via a mask with consistent masked tokens over the temporal axis. In addition, we propose a temporal warping augmentation to cover the complex temporal variation in videos, which stretches selected frames to various temporal durations in the clip. Extensive experiments on three datasets Kinetics-400, UCF-101, and HMDB-51 verify the advantage of SVFormer. In particular, SVFormer outperforms the state-of-the-art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400. Our method can hopefully serve as a strong benchmark and encourage future research on semi-supervised action recognition with Transformer networks.
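Tube TokenMix can be sketched as mixing two clips' token grids with one spatial mask that is held fixed across the temporal axis, so every spatial location forms a "tube" drawn entirely from a single clip. The tensor shapes, the Bernoulli per-token mask, and the `mix_ratio` parameter are our assumptions for illustration:

```python
import numpy as np

def tube_token_mix(tokens_a, tokens_b, mix_ratio=0.5, rng=None):
    """Mix two clips' tokens with a spatial mask shared over all frames,
    keeping the masked token positions consistent along the temporal axis.

    tokens_*: (T, N, D) arrays -- T frames, N spatial tokens, dim D.
    Returns the mixed tokens and the mixing coefficient lambda for labels.
    """
    rng = np.random.default_rng(rng)
    T, N, D = tokens_a.shape
    mask = rng.random(N) < mix_ratio            # one mask reused by all T frames
    mixed = np.where(mask[None, :, None], tokens_b, tokens_a)
    lam = float(mask.mean())                    # fraction of tokens from clip b
    return mixed, lam
```

The (pseudo-)labels of the two clips would then be mixed with the same coefficient `lam`, as in standard mixup-style training.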

NaQ: Leveraging Narrations As Queries To Supervise Episodic Memory
Ramakrishnan, Santhosh Kumar and Al-Halah, Ziad and Grauman, Kristen



Research problem: How to effectively search long egocentric videos with natural language queries (NLQ), with applications in augmented reality and robotics.
Motivation: The structured nature of the learning problem (free-form text query inputs, localized video temporal window outputs) and its needle-in-a-haystack character make it both challenging and expensive to supervise.
Method: We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model.
Results: Validated on the Ego4D benchmark, the idea proves highly impactful in practice: NaQ improves multiple top models by substantial margins (even doubling their accuracy) and yields the best results to date on the Ego4D NLQ challenge, soundly outperforming all winners of the CVPR and ECCV 2022 competitions and topping the current public leaderboard.

Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (free-form text query inputs, localized video temporal window outputs) and its needle-in-a-haystack nature makes it both technically challenging and expensive to supervise. We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. Validating our idea on the Ego4D benchmark, we find it has tremendous impact in practice. NaQ improves multiple top models by substantial margins (even doubling their accuracy), and yields the very best results to date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in the CVPR and ECCV 2022 competitions and topping the current public leaderboard. Beyond achieving the state-of-the-art for NLQ, we also demonstrate unique properties of our approach such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories. Code and models: http://vision.cs.utexas.edu/projects/naq.
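The core NaQ conversion, turning timestamped narrations into localization supervision, can be sketched as below. Centering a fixed-length window on each narration timestamp is a simplification of how the paper derives temporal windows, and `window` is an assumed parameter:

```python
def narrations_to_queries(narrations, window=2.0):
    """Convert timestamped narrations into (query, start, end) training
    triples for an NLQ-style video query localization model.

    narrations: list of (timestamp_sec, text) pairs from video-text narrations.
    window: assumed temporal extent (seconds) centered on each timestamp.
    """
    queries = []
    for ts, text in narrations:
        start = max(0.0, ts - window / 2)   # clamp at the video start
        queries.append((text, start, start + window))
    return queries
```

Each narration sentence thus plays the role of a free-form query, and its window plays the role of the localized answer, exactly the input/output structure NLQ training expects.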

Unsupervised Space-Time Network for Temporally-Consistent Segmentation of Multiple Motions
Meunier, Etienne and Bouthemy, Patrick



Research problem: This paper tackles motion segmentation, one of the main tasks in computer vision, with a novel unsupervised spatio-temporal framework.
Motivation: Motion segmentation is an important computer vision task, yet its key property, temporal consistency, is often neglected.
Method: We propose a novel, fully unsupervised spatio-temporal framework for motion segmentation from optical flow. Specifically, we define a 3D network for multiple motion segmentation that takes a sub-volume of successive optical flows as input and delivers a corresponding sub-volume of coherent segmentation maps.
Results: Experiments on several VOS benchmarks show convincing quantitative results without using appearance or training on any ground-truth data. Visual results further highlight the distinctive short- and long-term temporal consistency brought by our optical-flow segmentation method.

Motion segmentation is one of the main tasks in computer vision and is relevant for many applications. The optical flow (OF) is the input generally used to segment every frame of a video sequence into regions of coherent motion. Temporal consistency is a key feature of motion segmentation, but it is often neglected. In this paper, we propose an original unsupervised spatio-temporal framework for motion segmentation from optical flow that fully investigates the temporal dimension of the problem. More specifically, we have defined a 3D network for multiple motion segmentation that takes as input a sub-volume of successive optical flows and delivers accordingly a sub-volume of coherent segmentation maps. Our network is trained in a fully unsupervised way, and the loss function combines a flow reconstruction term involving spatio-temporal parametric motion models, and a regularization term enforcing temporal consistency on the masks. We have specified an easy temporal linkage of the predicted segments. Besides, we have proposed a flexible and efficient way of coding U-nets. We report experiments on several VOS benchmarks with convincing quantitative results, while not using appearance and not training with any ground-truth data. We also highlight through visual results the distinctive contribution of the short- and long-term temporal consistency brought by our OF segmentation method.
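The temporal-consistency regularization term on the masks can be sketched as a squared difference between consecutive soft segmentation maps (a deliberate simplification of the paper's loss, which also includes a flow reconstruction term with spatio-temporal parametric motion models):

```python
import numpy as np

def temporal_consistency_loss(masks):
    """Regularizer penalizing changes in the segmentation masks between
    consecutive frames, encouraging temporally coherent motion segments.

    masks: (T, K, H, W) soft segmentation maps over K motion segments.
    """
    diffs = masks[1:] - masks[:-1]      # frame-to-frame mask changes
    return float((diffs ** 2).mean())
```

In training, this term would be added to the (unsupervised) flow reconstruction loss with a weighting coefficient.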

iQuery: Instruments As Queries for Audio-Visual Sound Separation
Chen, Jiaben and Zhang, Renrui and Lian, Dongze and Yang, Jiaqi and Zeng, Ziyao and Shi, Jianbo



Research problem: Current audio-visual separation methods share a common architectural design flaw: the audio encoder-decoder network is fused with visual encoding features at the encoder bottleneck, which confounds the learning of multi-modal feature encoding with robust sound decoding.
Motivation: To generalize to a new instrument, the entire visual and audio network must be fine-tuned for all instruments.
Method: We re-formulate the audio-visual separation task and propose Instruments as Queries (iQuery) with a flexible query expansion mechanism. We use "visually named" queries to initiate the learning of audio queries and use cross-modal attention to remove potential sound source interference in the estimated waveforms. To generalize to a new instrument or event class, drawing inspiration from text-prompt design, we insert additional queries as audio prompts while freezing the attention mechanism.
Results: Experiments show that our iQuery approach improves audio-visual sound source separation performance.

Current audio-visual separation methods share a standard architecture design where an audio encoder-decoder network is fused with visual encoding features at the encoder bottleneck. This design confounds the learning of multi-modal feature encoding with robust sound decoding for audio separation. To generalize to a new instrument, one must fine-tune the entire visual and audio network for all musical instruments. We re-formulate the visual-sound separation task and propose Instruments as Queries (iQuery) with a flexible query expansion mechanism. Our approach ensures cross-modal consistency and cross-instrument disentanglement. We utilize "visually named" queries to initiate the learning of audio queries and use cross-modal attention to remove potential sound source interference at the estimated waveforms. To generalize to a new instrument or event class, drawing inspiration from the text-prompt design, we insert additional queries as audio prompts while freezing the attention mechanism. Experimental results on three benchmarks demonstrate that our iQuery improves audio-visual sound source separation performance. Code is available at https://github.com/JiabenChen/iQuery.

Look Around for Anomalies: Weakly-Supervised Anomaly Detection via Context-Motion Relational Learning
Cho, MyeongAh and Kim, Minjung and Hwang, Sangwon and Park, Chaewon and Lee, Kyungjae and Lee, Sangyoun



Research question: How to perform video anomaly detection using weakly supervised, video-level labeled training data.
Motivation: In real-world scenarios, the boundary between normal and abnormal is ambiguous and situation-dependent, making it difficult to explore class-representative features with a single backbone branch.
Method: We propose Class-Activate Feature Learning (CLAV) and a Context-Motion Interrelation Module (CoMo): CLAV extracts features according to class-dependent weights and widens the relative gap between class features, while CoMo models the relationship between the appearance of the surroundings and motion, rather than relying only on temporal dependencies or motion information.
Results: The method achieves state-of-the-art performance on four benchmarks, including large-scale real-world datasets, and the importance of relational information is demonstrated through qualitative results and a generalization-ability analysis.

Weakly-supervised Video Anomaly Detection is the task of detecting frame-level anomalies using video-level labeled training data. It is difficult to explore class representative features using minimal supervision of weak labels with a single backbone branch. Furthermore, in real-world scenarios, the boundary between normal and abnormal is ambiguous and varies depending on the situation. For example, even for the same motion of running person, the abnormality varies depending on whether the surroundings are a playground or a roadway. Therefore, our aim is to extract discriminative features by widening the relative gap between classes' features from a single branch. In the proposed Class-Activate Feature Learning (CLAV), the features are extracted as per the weights that are implicitly activated depending on the class, and the gap is then enlarged through relative distance learning. Furthermore, as the relationship between context and motion is important in order to identify the anomalies in complex and diverse scenes, we propose a Context--Motion Interrelation Module (CoMo), which models the relationship between the appearance of the surroundings and motion, rather than utilizing only temporal dependencies or motion information. The proposed method shows SOTA performance on four benchmarks including large-scale real-world datasets, and we demonstrate the importance of relational information by analyzing the qualitative results and generalization ability.

Continuous Intermediate Token Learning With Implicit Motion Manifold for Keyframe Based Motion Interpolation
Mo, Clinton A. and Hu, Kun and Long, Chengjiang and Wang, Zhiyong



Research question: Deriving sophisticated 3D motions from sparse keyframes is particularly challenging because both continuity and skeletal precision must be preserved.
Motivation: Existing methods typically interpolate the keyframes with basic interpolation methods to produce intermediate frames, which leads to a trivial local minimum during training.
Method: This paper proposes a novel framework that formulates latent motion manifolds with keyframe-based constraints, taking the continuity of intermediate representations into account. The framework consists of two stages for identifying the latent motion subspace, a keyframe encoding stage and an intermediate token generation stage, followed by a motion synthesis stage that infers and composes motion data from the manifold.
Results: Extensive experiments on the LaFAN1 and CMU Mocap datasets demonstrate both superior interpolation accuracy and high visual similarity to ground-truth motions.

Deriving sophisticated 3D motions from sparse keyframes is a particularly challenging problem, due to continuity and exceptionally skeletal precision. The action features are often derivable accurately from the full series of keyframes, and thus, leveraging the global context with transformers has been a promising data-driven embedding approach. However, existing methods are often with inputs of interpolated intermediate frame for continuity using basic interpolation methods with keyframes, which result in a trivial local minimum during training. In this paper, we propose a novel framework to formulate latent motion manifolds with keyframe-based constraints, from which the continuous nature of intermediate token representations is considered. Particularly, our proposed framework consists of two stages for identifying a latent motion subspace, i.e., a keyframe encoding stage and an intermediate token generation stage, and a subsequent motion synthesis stage to extrapolate and compose motion data from manifolds. Through our extensive experiments conducted on both the LaFAN1 and CMU Mocap datasets, our proposed method demonstrates both superior interpolation accuracy and high visual similarity to ground truth motions.

HierVL: Learning Hierarchical Video-Language Embeddings
Ashutosh, Kumar and Girdhar, Rohit and Torresani, Lorenzo and Grauman, Kristen



Research question: This paper addresses the limitation that existing video-language embedding methods capture only short-term associations between seconds-long video clips and their accompanying text.
Motivation: Existing video-language embeddings model only these short-term associations and cannot account for long-term and short-term associations simultaneously.
Method: We propose HierVL, a novel hierarchical video-language embedding that accounts for both long-term and short-term associations. Using timestamped action descriptions together with a high-level text summary of the activity across the whole video as training data, we introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and the video level.
Results: Experiments show that HierVL's clip representation outperforms its single-level counterpart, while also achieving the best results on tasks requiring long-term video modeling. HierVL transfers successfully to multiple challenging downstream tasks in both zero-shot and fine-tuned settings.

Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity throughout the long video (as are available in Ego4D). We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level. While the clip-level constraints use the step-by-step descriptions to capture what is happening in that instant, the video-level constraints use the summary text to capture why it is happening, i.e., the broader context for the activity and the intent of the actor. Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart, as well as a long-term video representation that achieves SotA results on tasks requiring long-term video modeling. HierVL successfully transfers to multiple challenging downstream tasks (in EPIC-KITCHENS-100, Charades-Ego, HowTo100M) in both zero-shot and fine-tuned settings.
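The hierarchical contrastive objective can be pictured as two InfoNCE terms, one at clip level and one at video level. Below is a toy numpy sketch under assumed shapes (the `info_nce` and `hierarchical_loss` helpers and the 0.5 weighting are illustrative simplifications, not HierVL's actual implementation):

```python
import numpy as np

def info_nce(q, k, temp=0.1):
    """InfoNCE where row i of q should match row i of k."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / temp
    logits = logits - logits.max(axis=1, keepdims=True)      # stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_prob).mean())

def hierarchical_loss(clip_feats, step_texts, video_feats, summary_texts, w=0.5):
    """Clip-level alignment (clips <-> step captions) plus a weighted
    video-level alignment (pooled video <-> summary text)."""
    return info_nce(clip_feats, step_texts) + w * info_nce(video_feats, summary_texts)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
# Matching pairs on the diagonal give a low loss; shuffled pairs a higher one.
loss_aligned = hierarchical_loss(feats, feats, feats[:2], feats[:2])
loss_shuffled = hierarchical_loss(feats, feats[::-1], feats[:2], feats[:2][::-1])
```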

Watch or Listen: Robust Audio-Visual Speech Recognition With Visual Corruption Modeling and Reliability Scoring
Hong, Joanna and Kim, Minsu and Choi, Jeongsoo and Ro, Yong Man



Research question: This paper studies audio-visual speech recognition (AVSR) when both the audio and visual inputs are corrupted, a setting not well addressed by previous research directions.
Motivation: Prior work has focused on complementing corrupted audio inputs with clean visual inputs, but in real life clean visual inputs are not always available and can themselves be corrupted by occluded lip regions or noise.
Method: We first show that previous AVSR models are not as robust to corruption of the multimodal input streams as uni-modal models. We then design multimodal input corruption modeling to develop robust AVSR models. Finally, we propose a novel AVSR framework, the Audio-Visual Reliability Scoring module (AV-RelScore), which determines which input modal stream is reliable and exploits the more reliable streams in prediction.
Results: The effectiveness of the proposed method is evaluated with comprehensive experiments on the popular benchmark databases LRS2 and LRS3. We also find that the reliability scores obtained by AV-RelScore reflect the degree of corruption well and make the model focus on reliable multimodal representations.

This paper deals with Audio-Visual Speech Recognition (AVSR) under multimodal input corruption situation where audio inputs and visual inputs are both corrupted, which is not well addressed in previous research directions. Previous studies have focused on how to complement the corrupted audio inputs with the clean visual inputs with the assumption of the availability of clean visual inputs. However, in real life, the clean visual inputs are not always accessible and can even be corrupted by occluded lip region or with noises. Thus, we firstly analyze that the previous AVSR models are not indeed robust to the corruption of multimodal input streams, the audio and the visual inputs, compared to uni-modal models. Then, we design multimodal input corruption modeling to develop robust AVSR models. Lastly, we propose a novel AVSR framework, namely Audio-Visual Reliability Scoring module (AV-RelScore), that is robust to the corrupted multimodal inputs. The AV-RelScore can determine which input modal stream is reliable or not for the prediction and also can exploit the more reliable streams in prediction. The effectiveness of the proposed method is evaluated with comprehensive experiments on popular benchmark databases, LRS2 and LRS3. We also show that the reliability scores obtained by AV-RelScore well reflect the degree of corruption and make the proposed model focus on the reliable multimodal representations.
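A hedged sketch of the reliability-scoring idea: per-time-step scores are softmax-normalized into fusion weights so the more trustworthy stream dominates. The scalar scores and toy features below are assumptions; AV-RelScore's actual scoring network is learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reliability_fusion(audio_feat, visual_feat, audio_score, visual_score):
    """Per-time-step weighting: the stream with the higher reliability
    score contributes more to the fused representation."""
    w = softmax(np.stack([audio_score, visual_score], axis=-1), axis=-1)
    return w[..., :1] * audio_feat + w[..., 1:] * visual_feat

# 3 time steps, 4-dim features; pretend the audio is corrupted at t=1.
audio = np.ones((3, 4))
visual = np.zeros((3, 4))
audio_score = np.array([5.0, -5.0, 5.0])    # low score = unreliable audio
visual_score = np.zeros(3)
fused = reliability_fusion(audio, visual, audio_score, visual_score)
```

At t=1 the fused features fall back almost entirely to the visual stream, which is the qualitative behavior the reliability scores are meant to induce.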

Real-Time Multi-Person Eyeblink Detection in the Wild for Untrimmed Video
Zeng, Wenzheng and Xiao, Yang and Wei, Sicheng and Gan, Jinfang and Zhang, Xintao and Cao, Zhiguo and Fang, Zhiwen and Zhou, Joey Tianyi



Research question: Existing real-time eyeblink detection focuses on single-person cases in trimmed videos; the multi-person scenario in untrimmed videos, though important for practical applications, has not received adequate attention.
Motivation: To address this, we shed light on this research field for the first time, with essential contributions on dataset, theory, and practice.
Method: We propose a real-time multi-person eyeblink detection method that runs in a one-stage, end-to-end learned manner, simultaneously addressing the sub-tasks of face detection, face tracking, and human instance-level eyeblink detection.
Results: Experiments on the MPEblink dataset verify the essential challenges of real-time multi-person eyeblink detection in untrimmed videos. Our method outperforms existing approaches by large margins while maintaining a high inference speed.

Real-time eyeblink detection in the wild can widely serve for fatigue detection, face anti-spoofing, emotion analysis, etc. The existing research efforts generally focus on single-person cases towards trimmed video. However, multi-person scenario within untrimmed videos is also important for practical applications, which has not been well concerned yet. To address this, we shed light on this research field for the first time with essential contributions on dataset, theory, and practices. In particular, a large-scale dataset termed MPEblink that involves 686 untrimmed videos with 8748 eyeblink events is proposed under multi-person conditions. The samples are captured from unconstrained films to reveal "in the wild" characteristics. Meanwhile, a real-time multi-person eyeblink detection method is also proposed. Being different from the existing counterparts, our proposition runs in a one-stage spatio-temporal way with an end-to-end learning capacity. Specifically, it simultaneously addresses the sub-tasks of face detection, face tracking, and human instance-level eyeblink detection. This paradigm holds 2 main advantages: (1) eyeblink features can be facilitated via the face's global context (e.g., head pose and illumination condition) with joint optimization and interaction, and (2) addressing these sub-tasks in parallel instead of sequential manner can save time remarkably to meet the real-time running requirement. Experiments on MPEblink verify the essential challenges of real-time multi-person eyeblink detection in the wild for untrimmed video. Our method also outperforms existing approaches by large margins and with a high inference speed.

PDPP: Projected Diffusion for Procedure Planning in Instructional Videos
Wang, Hanlin and Wu, Yilu and Guo, Sheng and Wang, Limin



Research question: This paper studies procedure planning in instructional videos, i.e., making goal-directed plans given the current visual observations in unstructured real-life videos.
Motivation: Previous works cast this as a sequence planning problem and rely on heavy intermediate visual observations or natural language instructions as supervision, resulting in complex learning schemes and expensive annotation costs.
Method: We treat the problem as distribution fitting: we model the whole intermediate action sequence distribution with a diffusion model (PDPP), transforming planning into sampling from this distribution. We also remove the expensive intermediate supervision and use only task labels from instructional videos.
Results: Our model is a U-Net-based diffusion model that samples action sequences directly given the start and end observations. Experiments show that PDPP achieves state-of-the-art performance on multiple metrics, even without task supervision.

In this paper, we study the problem of procedure planning in instructional videos, which aims to make goal-directed plans given the current visual observations in unstructured real-life videos. Previous works cast this problem as a sequence planning problem and leverage either heavy intermediate visual observations or natural language instructions as supervision, resulting in complex learning schemes and expensive annotation costs. In contrast, we treat this problem as a distribution fitting problem. In this sense, we model the whole intermediate action sequence distribution with a diffusion model (PDPP), and thus transform the planning problem to a sampling process from this distribution. In addition, we remove the expensive intermediate supervision, and simply use task labels from instructional videos as supervision instead. Our model is a U-Net based diffusion model, which directly samples action sequences from the learned distribution with the given start and end observations. Furthermore, we apply an efficient projection method to provide accurate conditional guides for our model during the learning and sampling process. Experiments on three datasets with different scales show that our PDPP model can achieve the state-of-the-art performance on multiple metrics, even without the task supervision. Code and trained models are available at https://github.com/MCG-NJU/PDPP.
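The sampling-with-projection idea can be sketched as a toy denoising loop that, after every update, re-imposes the given start/end observations and task label on the sampled plan. The 0.8/0.1 "denoiser", the dimension layout, and the `project` helper below are illustrative stand-ins for PDPP's learned U-Net and projection method:

```python
import numpy as np

def project(x, start_obs, end_obs, task_label):
    """Between denoising steps, overwrite the dimensions reserved for
    the known conditions: the task label on every step, and the
    observed features on the first and last steps."""
    x = x.copy()
    c = len(task_label)
    x[:, :c] = task_label
    x[0, c:] = start_obs
    x[-1, c:] = end_obs
    return x

def sample_plan(T, obs_dim, task_label, start_obs, end_obs, steps=10, seed=0):
    """Toy denoising loop: shrink noise toward zero and re-apply the
    projection after every update so the plan stays condition-consistent."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(T, len(task_label) + obs_dim))
    for _ in range(steps):
        x = 0.8 * x + 0.1 * rng.normal(size=x.shape)   # stand-in denoiser
        x = project(x, start_obs, end_obs, task_label)
    return x

plan = sample_plan(4, 3, np.array([1.0, 0.0]), np.ones(3), -np.ones(3))
```

Whatever the denoiser does, the projection guarantees the returned plan still matches the conditioning at its endpoints.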

ProTeGe: Untrimmed Pretraining for Video Temporal Grounding by Video Temporal Grounding
Wang, Lan and Mittal, Gaurav and Sajeev, Sandra and Yu, Ye and Hall, Matthew and Boddeti, Vishnu Naresh and Chen, Mei



Research question: For video temporal grounding (VTG), the lack of a large-scale annotated untrimmed-video dataset for pretraining leaves pretrained features without a notion of temporal boundaries, which hurts video-text alignment.
Motivation: All existing VTG methods rely on video backbone features pretrained on trimmed videos, largely because no large-scale, well-annotated untrimmed-video dataset is available for pretraining. As a result, the pretrained features lack temporal-boundary awareness, making video-text alignment less distinguishable between correct and incorrect locations.
Method: This paper presents ProTeGe, the first VTG-based untrimmed pretraining method, to bridge the gap between trimmed pretrained backbones and downstream VTG tasks. ProTeGe reconfigures the HowTo100M dataset into a VTG dataset and introduces a novel Video-Text Similarity-based Grounding Module and a pretraining objective that make pretraining robust to the noise in HowTo100M.
Results: Extensive experiments on multiple datasets across downstream tasks with all variations of supervision validate that features pretrained with ProTeGe significantly outperform features from trimmed pretrained backbones on VTG.

Video temporal grounding (VTG) is the task of localizing a given natural language text query in an arbitrarily long untrimmed video. While the task involves untrimmed videos, all existing VTG methods leverage features from video backbones pretrained on trimmed videos. This is largely due to the lack of large-scale well-annotated VTG dataset to perform pretraining. As a result, the pretrained features lack a notion of temporal boundaries leading to the video-text alignment being less distinguishable between correct and incorrect locations. We present ProTeGe as the first method to perform VTG-based untrimmed pretraining to bridge the gap between trimmed pretrained backbones and downstream VTG tasks. ProTeGe reconfigures the HowTo100M dataset, with noisily correlated video-text pairs, into a VTG dataset and introduces a novel Video-Text Similarity-based Grounding Module and a pretraining objective to make pretraining robust to noise in HowTo100M. Extensive experiments on multiple datasets across downstream tasks with all variations of supervision validate that pretrained features from ProTeGe can significantly outperform features from trimmed pretrained backbones on VTG.

MotionTrack: Learning Robust Short-Term and Long-Term Motions for Multi-Object Tracking
Qin, Zheng and Zhou, Sanping and Wang, Le and Duan, Jinghai and Hua, Gang and Tang, Wei



Research question: The main challenge of multi-object tracking (MOT) is maintaining a continuous trajectory for each target.
Motivation: Existing methods learn reliable motion patterns to match the same target across adjacent frames and discriminative appearance features to re-identify targets lost for a long period; however, dense crowds and extreme occlusions easily hurt the reliability of motion prediction and the discriminability of appearance.
Method: This paper proposes a simple yet effective multi-object tracker, MotionTrack, which learns robust short-term and long-term motions in a unified framework to associate trajectories from short to long range. For dense crowds, a novel Interaction Module learns interaction-aware motions from short-term trajectories to estimate each target's complex movement. For extreme occlusions, a novel Refind Module learns reliable long-term motions from a target's history trajectory to link an interrupted trajectory with its corresponding detection. Both modules are embedded in the well-known tracking-by-detection paradigm and work in tandem to maintain superior performance.
Results: Extensive experiments on the MOT17 and MOT20 datasets demonstrate the superiority of the approach in challenging scenarios, achieving state-of-the-art performance on various MOT metrics. The code and trained models will be made publicly available.

The main challenge of Multi-Object Tracking (MOT) lies in maintaining a continuous trajectory for each target. Existing methods often learn reliable motion patterns to match the same target between adjacent frames and discriminative appearance features to re-identify the lost targets after a long period. However, the reliability of motion prediction and the discriminability of appearances can be easily hurt by dense crowds and extreme occlusions in the tracking process. In this paper, we propose a simple yet effective multi-object tracker, i.e., MotionTrack, which learns robust short-term and long-term motions in a unified framework to associate trajectories from a short to long range. For dense crowds, we design a novel Interaction Module to learn interaction-aware motions from short-term trajectories, which can estimate the complex movement of each target. For extreme occlusions, we build a novel Refind Module to learn reliable long-term motions from the target's history trajectory, which can link the interrupted trajectory with its corresponding detection. Our Interaction Module and Refind Module are embedded in the well-known tracking-by-detection paradigm, which can work in tandem to maintain superior performance. Extensive experimental results on MOT17 and MOT20 datasets demonstrate the superiority of our approach in challenging scenarios, and it achieves state-of-the-art performances at various MOT metrics. We will make the code and trained models publicly available.
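A minimal sketch of the short-term half of this idea: constant-velocity extrapolation of each track's last box, followed by greedy IoU association with new detections. MotionTrack's Interaction Module learns far richer, interaction-aware motions; the helpers below are hypothetical simplifications:

```python
import numpy as np

def predict_boxes(tracks):
    """Constant-velocity short-term motion: extrapolate each track's
    last box by the displacement between its last two boxes."""
    return [2 * t[-1] - t[-2] for t in tracks]

def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, thr=0.3):
    """Greedy IoU matching between motion-predicted boxes and detections."""
    matches, used = {}, set()
    for ti, pred in enumerate(predict_boxes(tracks)):
        best, best_iou = None, thr
        for di, det in enumerate(detections):
            if di in used:
                continue
            v = iou(pred, det)
            if v > best_iou:
                best, best_iou = di, v
        if best is not None:
            matches[ti] = best
            used.add(best)
    return matches

# One track moving right by 5 px per frame; its prediction overlaps detection 0.
tracks = [np.array([[0.0, 0.0, 10.0, 10.0], [5.0, 0.0, 15.0, 10.0]])]
detections = [np.array([10.0, 0.0, 20.0, 10.0]), np.array([50.0, 50.0, 60.0, 60.0])]
matches = associate(tracks, detections)
```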

TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning With Structure-Trajectory Prompted Reconstruction for Person Re-Identification
Rao, Haocong and Miao, Chunyan



Research question: This paper targets person re-identification (re-ID) from 3D skeleton data and proposes a new approach.
Motivation: Existing methods usually design skeleton descriptors or perform skeleton sequence representation learning, but they cannot concurrently model different body-component relations and rarely explore useful semantics from fine-grained representations of body joints.
Method: This paper proposes a Transformer-based Skeleton Graph prototype contrastive learning approach (TranSG) with structure-trajectory prompted reconstruction, to fully capture skeletal relations and valuable spatial-temporal semantics from skeleton graphs for person re-ID.
Results: Experiments show that TranSG significantly outperforms existing state-of-the-art methods and generalizes across different graph modeling choices, RGB-estimated skeletons, and unsupervised scenarios.

Person re-identification (re-ID) via 3D skeleton data is an emerging topic with prominent advantages. Existing methods usually design skeleton descriptors with raw body joints or perform skeleton sequence representation learning. However, they typically cannot concurrently model different body-component relations, and rarely explore useful semantics from fine-grained representations of body joints. In this paper, we propose a generic Transformer-based Skeleton Graph prototype contrastive learning (TranSG) approach with structure-trajectory prompted reconstruction to fully capture skeletal relations and valuable spatial-temporal semantics from skeleton graphs for person re-ID. Specifically, we first devise the Skeleton Graph Transformer (SGT) to simultaneously learn body and motion relations within skeleton graphs, so as to aggregate key correlative node features into graph representations. Then, we propose the Graph Prototype Contrastive learning (GPC) to mine the most typical graph features (graph prototypes) of each identity, and contrast the inherent similarity between graph representations and different prototypes from both skeleton and sequence levels to learn discriminative graph representations. Last, a graph Structure-Trajectory Prompted Reconstruction (STPR) mechanism is proposed to exploit the spatial and temporal contexts of graph nodes to prompt skeleton graph reconstruction, which facilitates capturing more valuable patterns and graph semantics for person re-ID. Empirical evaluations demonstrate that TranSG significantly outperforms existing state-of-the-art methods. We further show its generality under different graph modeling, RGB-estimated skeletons, and unsupervised scenarios.

You Can Ground Earlier Than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos
Fang, Xiang and Liu, Daizong and Zhou, Pan and Nan, Guoshun



Research question: This paper addresses the shortcomings of existing temporal sentence grounding methods when faced with compressed videos: insufficient representation capability and high computational complexity during training and testing.
Motivation: Existing temporal sentence grounding methods focus only on high-level visual features extracted from consecutively decoded frames and cannot handle compressed videos for query modeling.
Method: This paper poses a new compressed-domain temporal sentence grounding (TSG) setting that directly uses compressed videos rather than fully decoded frames as the visual input. To handle the raw video bit-stream, a novel Three-branch Compressed-domain Spatial-temporal Fusion (TCSF) framework extracts and aggregates three kinds of low-level visual features (I-frame, motion vector, and residual features) for effective and efficient grounding.
Results: Experiments show that TCSF outperforms other state-of-the-art methods on three challenging datasets with lower computational complexity.

Given an untrimmed video, temporal sentence grounding (TSG) aims to locate a target moment semantically according to a sentence query. Although previous respectable works have made decent success, they only focus on high-level visual features extracted from the consecutive decoded frames and fail to handle the compressed videos for query modelling, suffering from insufficient representation capability and significant computational complexity during training and testing. In this paper, we pose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input. To handle the raw video bit-stream input, we propose a novel Three-branch Compressed-domain Spatial-temporal Fusion (TCSF) framework, which extracts and aggregates three kinds of low-level visual features (I-frame, motion vector and residual features) for effective and efficient grounding. Particularly, instead of encoding the whole decoded frames like previous works, we capture the appearance representation by only learning the I-frame feature to reduce delay or latency. Besides, we explore the motion information not only by learning the motion vector feature, but also by exploring the relations of neighboring frames via the residual feature. In this way, a three-branch spatial-temporal attention layer with an adaptive motion-appearance fusion module is further designed to extract and aggregate both appearance and motion information for the final grounding. Experiments on three challenging datasets shows that our TCSF achieves better performance than other state-of-the-art methods with lower complexity.
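To see why motion vectors carry motion information cheaply, the toy sketch below rebuilds a P-frame from an I-frame plus block-wise motion vectors, the kind of raw compressed-domain quantities TCSF consumes without full decoding (the block size, vector layout, and `motion_compensate` helper are illustrative assumptions, not a real codec's semantics):

```python
import numpy as np

def motion_compensate(i_frame, mvs, block=4):
    """Rebuild a P-frame by copying, for each block, the I-frame patch
    its motion vector points back to (toy block-wise compensation)."""
    H, W = i_frame.shape
    out = np.zeros_like(i_frame)
    for by in range(0, H, block):
        for bx in range(0, W, block):
            dy, dx = mvs[by // block, bx // block]
            sy, sx = by - dy, bx - dx      # referenced source block
            out[by:by + block, bx:bx + block] = i_frame[sy:sy + block, sx:sx + block]
    return out

# 8x8 I-frame; the bottom-right block "moved" from the top-left corner.
i_frame = np.arange(64, dtype=float).reshape(8, 8)
mvs = np.zeros((2, 2, 2), dtype=int)
mvs[1, 1] = (4, 4)                          # points 4 px up and left
recon = motion_compensate(i_frame, mvs)
```

The residual feature in the abstract corresponds to whatever this compensation fails to explain, which is why the two together complement the I-frame appearance branch.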

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation
Zhu, Lingting and Liu, Xian and Liu, Xuanyu and Qian, Rui and Liu, Ziwei and Yu, Lequan



Research question: How to effectively capture cross-modal audio-to-gesture associations while preserving temporal coherence, for high-fidelity audio-driven co-speech gesture generation.
Motivation: Existing methods mainly rely on generative adversarial networks (GANs), which typically suffer from notorious mode collapse and unstable training, making it difficult to learn accurate audio-gesture joint distributions.
Method: We propose a novel diffusion-based framework, Diffusion Co-Speech Gesture (DiffGesture), which establishes a diffusion-conditional generation process on clips of skeleton sequences and audio to capture audio-to-gesture associations and preserve temporal coherence.
Results: Experiments show that DiffGesture achieves state-of-the-art performance, rendering coherent gestures with better mode coverage and stronger audio correlations.

Animating virtual avatars to make co-speech gestures facilitates various applications in human-machine interaction. The existing methods mainly rely on generative adversarial networks (GANs), which typically suffer from notorious mode collapse and unstable training, thus making it difficult to learn accurate audio-gesture joint distributions. In this work, we propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture), to effectively capture the cross-modal audio-to-gesture associations and preserve temporal coherence for high-fidelity audio-driven co-speech gesture generation. Specifically, we first establish the diffusion-conditional generation process on clips of skeleton sequences and audio to enable the whole framework. Then, a novel Diffusion Audio-Gesture Transformer is devised to better attend to the information from multiple modalities and model the long-term temporal dependency. Moreover, to eliminate temporal inconsistency, we propose an effective Diffusion Gesture Stabilizer with an annealed noise sampling strategy. Benefiting from the architectural advantages of diffusion models, we further incorporate implicit classifier-free guidance to trade off between diversity and gesture quality. Extensive experiments demonstrate that DiffGesture achieves state-of-the-art performance, which renders coherent gestures with better mode coverage and stronger audio correlations. Code is available at https://github.com/Advocate99/DiffGesture.
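The implicit classifier-free guidance mentioned above combines conditional and unconditional noise predictions at sampling time. A one-line numpy sketch of the standard formulation (the toy vectors are illustrative; in DiffGesture the predictions come from its transformer):

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the condition-aware one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 2.0])         # prediction with the audio condition
eps_u = np.array([0.0, 0.0])         # prediction with the condition dropped
eps = cfg_noise(eps_c, eps_u, 2.0)   # scale > 1 strengthens the condition
```

Raising the guidance scale trades sample diversity for tighter adherence to the audio condition, which is exactly the diversity/quality trade-off the abstract describes.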

HaLP: Hallucinating Latent Positives for Skeleton-Based Self-Supervised Learning of Actions
Shah, Anshul and Roy, Aniket and Shah, Ketul and Mishra, Shlok and Jacobs, David and Cherian, Anoop and Chellappa, Rama



Research question: How to train skeleton-sequence encoders for action recognition without labels.
Motivation: Although contrastive learning on pose sequences has shown promise, the quality of the learned representations is closely tied to the data augmentations used to craft positives, and augmenting pose sequences is a difficult task.
Method: We propose a new contrastive learning approach for training skeleton-based action recognition models without labels. The key contribution is a new module, HaLP, which generates new positives by exploring the latent space of poses.
Results: Experiments show that using these generated positives within a standard contrastive learning framework leads to consistent improvements on benchmarks such as NTU-60, NTU-120, and PKU-II.

Supervised learning of skeleton sequence encoders for action recognition has received significant attention in recent times. However, learning such encoders without labels continues to be a challenging problem. While prior works have shown promising results by applying contrastive learning to pose sequences, the quality of the learned representations is often observed to be closely tied to data augmentations that are used to craft the positives. However, augmenting pose sequences is a difficult task as the geometric constraints among the skeleton joints need to be enforced to make the augmentations realistic for that action. In this work, we propose a new contrastive learning approach to train models for skeleton-based action recognition without labels. Our key contribution is a simple module, HaLP - to Hallucinate Latent Positives for contrastive learning. Specifically, HaLP explores the latent space of poses in suitable directions to generate new positives. To this end, we present a novel optimization formulation to solve for the synthetic positives with an explicit control on their hardness. We propose approximations to the objective, making them solvable in closed form with minimal overhead. We show via experiments that using these generated positives within a standard contrastive learning framework leads to consistent improvements across benchmarks such as NTU-60, NTU-120, and PKU-II on tasks like linear evaluation, transfer learning, and kNN evaluation. Our code can be found at https://github.com/anshulbshah/HaLP.
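The core of HaLP, hallucinating positives in latent space with explicit control on hardness, can be pictured as moving from an anchor latent toward a class prototype. The linear scheme and unit-sphere projection below are a deliberate simplification of the paper's closed-form optimization, shown only to convey the idea:

```python
import numpy as np

def hallucinate_positive(anchor, prototype, hardness):
    """Slide from the anchor latent toward a class prototype; hardness
    in [0, 1] controls how far the synthetic positive drifts (harder
    positives sit closer to the prototype)."""
    z = (1.0 - hardness) * anchor + hardness * prototype
    return z / np.linalg.norm(z)    # keep positives on the unit sphere

anchor = np.array([1.0, 0.0])       # latent of the query pose sequence
proto = np.array([0.0, 1.0])        # class prototype in the same space
easy = hallucinate_positive(anchor, proto, 0.1)
hard = hallucinate_positive(anchor, proto, 0.9)
```

Both synthetic latents can then be fed to a standard contrastive loss as extra positives, which is how the paper reports its gains.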

Context-Aware Relative Object Queries To Unify Video Instance and Panoptic Segmentation
Choudhuri, Anwesa and Chowdhary, Girish and Schwing, Alexander G.



Research question: How to process video frames sequentially while propagating object queries seamlessly across frames, and how to produce temporally consistent yet expressive object queries.
Motivation: Existing object-query approaches handle temporal tasks like video segmentation poorly, struggling with tracking, occlusion, and object re-appearance.
Method: We propose "context-aware relative object queries", which are continuously propagated frame-by-frame; they seamlessly track objects and handle occlusion and re-appearance of objects, without post-processing.
Results: Experiments show that context-aware relative object queries better capture position changes of objects in motion, matching or surpassing state-of-the-art results on video instance segmentation, multi-object tracking and segmentation, and video panoptic segmentation.

Object queries have emerged as a powerful abstraction to generically represent object proposals. However, their use for temporal tasks like video segmentation poses two questions: 1) How to process frames sequentially and propagate object queries seamlessly across frames. Using independent object queries per frame doesn't permit tracking, and requires post-processing. 2) How to produce temporally consistent, yet expressive object queries that model both appearance and position changes. Using the entire video at once doesn't capture position changes and doesn't scale to long videos. As one answer to both questions we propose 'context-aware relative object queries', which are continuously propagated frame-by-frame. They seamlessly track objects and deal with occlusion and re-appearance of objects, without post-processing. Further, we find context-aware relative object queries better capture position changes of objects in motion. We evaluate the proposed approach across three challenging tasks: video instance segmentation, multi-object tracking and segmentation, and video panoptic segmentation. Using the same approach and architecture, we match or surpass state-of-the art results on the diverse and challenging OVIS, Youtube-VIS, Cityscapes-VPS, MOTS 2020 and KITTI-MOTS data.

Learning To Dub Movies via Hierarchical Prosody Models
Cong, Gaoxiang and Li, Liang and Qi, Yuankai and Zha, Zheng-Jun and Wu, Qi and Wang, Wenyu and Jiang, Bin and Yang, Ming-Hsuan and Huang, Qingming



Research question: Given a piece of text, a video clip, and a reference audio, the movie dubbing (visual voice clone, V2C) task aims to generate speech matching the speaker's emotion in the video, using the desired speaker voice as reference.
Motivation: V2C is more challenging than conventional text-to-speech tasks because the generated speech must additionally match the varying emotions and speaking speed presented in the video.
Method: This paper proposes a novel movie dubbing architecture based on hierarchical prosody modeling, which bridges visual information to speech prosody from three aspects: lip, face, and scene. Lip movement is aligned to speech duration; facial expression is conveyed to speech energy and pitch via an attention mechanism over valence and arousal representations; and an emotion booster captures the atmosphere of the global video scene.
Results: Extensive experiments on the V2C and Chem benchmark datasets demonstrate the favourable performance of the proposed method.

Given a piece of text, a video clip and a reference audio, the movie dubbing (also known as visual voice clone, V2C) task aims to generate speeches that match the speaker's emotion presented in the video using the desired speaker voice as reference. V2C is more challenging than conventional text-to-speech tasks as it additionally requires the generated speech to exactly match the varying emotions and speaking speed presented in the video. Unlike previous works, we propose a novel movie dubbing architecture to tackle these problems via hierarchical prosody modeling, which bridges the visual information to corresponding speech prosody from three aspects: lip, face, and scene. Specifically, we align lip movement to the speech duration, and convey facial expression to speech energy and pitch via attention mechanism based on valence and arousal representations inspired by the psychology findings. Moreover, we design an emotion booster to capture the atmosphere from global video scenes. All these embeddings are used together to generate mel-spectrogram, which is then converted into speech waves by an existing vocoder. Extensive experimental results on the V2C and Chem benchmark datasets demonstrate the favourable performance of the proposed method. The code and trained models will be made available at https://github.com/GalaxyCong/HPMDubbing.

Progressive Spatio-Temporal Alignment for Efficient Event-Based Motion Estimation
Huang, Xueyan and Zhang, Yueyi and Xiong, Zhiwei



Research question: This paper proposes an efficient event-based motion estimation framework for various motion models.
Motivation: Different from previous works, we design a progressive event-to-map alignment scheme and utilize spatio-temporal correlations for alignment.
Method: We progressively align events sampled from an event batch to the time-surface map and obtain the updated motion model by minimizing a novel time-surface loss. In addition, a dynamic batch-size strategy adaptively adjusts the batch size so that all events in the batch remain consistent with the current motion model.
Results: The framework has three advantages: a) the progressive scheme iteratively refines the motion parameters, achieving accurate motion estimation; b) within one iteration, only a small portion of events participate in the optimization, greatly reducing total runtime; c) the dynamic batch-size strategy ensures the constant-velocity assumption always holds. Comprehensive experiments on challenging high-speed scenes with three motion models (rotational, homography, and 6-DOF) show state-of-the-art estimation accuracy and efficiency.

In this paper, we propose an efficient event-based motion estimation framework for various motion models. Different from previous works, we design a progressive event-to-map alignment scheme and utilize the spatio-temporal correlations to align events. In detail, we progressively align sampled events in an event batch to the time-surface map and obtain the updated motion model by minimizing a novel time-surface loss. In addition, a dynamic batch size strategy is applied to adaptively adjust the batch size so that all events in the batch are consistent with the current motion model. Our framework has three advantages: a) the progressive scheme refines motion parameters iteratively, achieving accurate motion estimation; b) within one iteration, only a small portion of events are involved in optimization, which greatly reduces the total runtime; c) the dynamic batch size strategy ensures that the constant velocity assumption always holds. We conduct comprehensive experiments to evaluate our framework on challenging high-speed scenes with three motion models: rotational, homography, and 6-DOF models. Experimental results demonstrate that our framework achieves state-of-the-art estimation accuracy and efficiency.

VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic Scene Graph Prediction in Point Cloud
Wang, Ziqin and Cheng, Bowen and Zhao, Lichen and Xu, Dong and Tang, Yang and Sheng, Lu



Research question: 3D semantic scene graph (3DSSG) prediction in point clouds is challenging because, compared to 2D images, a 3D point cloud captures only geometric structures with limited semantics, and the long-tailed relation distribution hinders learning unbiased predictions.
Motivation: Since 2D images provide rich semantics and scene graphs are by nature coupled with language, this study proposes a Visual-Linguistic Semantics Assisted Training (VL-SAT) scheme that substantially improves 3DSSG prediction models' discrimination of long-tailed and ambiguous semantic relations.
Method: We train a powerful multi-modal oracle model to assist the 3D model. The oracle learns reliable structural representations from the semantics of vision, language, and 3D geometry, and its benefits are heterogeneously passed to the 3D model during training. By effectively utilizing visual-linguistic semantics in training, VL-SAT significantly boosts common 3DSSG prediction models that use only 3D inputs at inference, especially when dealing with tail relation triplets.
Results: Comprehensive evaluations and ablation studies on the 3DSSG dataset validate the effectiveness of the proposed scheme. Code is available at https://github.com/wz7in/CVPR2023-VLSAT.

The task of 3D semantic scene graph (3DSSG) prediction in the point cloud is challenging since (1) the 3D point cloud only captures geometric structures with limited semantics compared to 2D images, and (2) long-tailed relation distribution inherently hinders the learning of unbiased prediction. Since 2D images provide rich semantics and scene graphs are in nature coped with languages, in this study, we propose Visual-Linguistic Semantics Assisted Training (VL-SAT) scheme that can significantly empower 3DSSG prediction models with discrimination about long-tailed and ambiguous semantic relations. The key idea is to train a powerful multi-modal oracle model to assist the 3D model. This oracle learns reliable structural representations based on semantics from vision, language, and 3D geometry, and its benefits can be heterogeneously passed to the 3D model during the training stage. By effectively utilizing visual-linguistic semantics in training, our VL-SAT can significantly boost common 3DSSG prediction models, such as SGFN and SGGpoint, only with 3D inputs in the inference stage, especially when dealing with tail relation triplets. Comprehensive evaluations and ablation studies on the 3DSSG dataset have validated the effectiveness of the proposed scheme. Code is available at https://github.com/wz7in/CVPR2023-VLSAT.

Learning Emotion Representations From Verbal and Nonverbal Communication
Zhang, Sitao and Pan, Yimu and Wang, James Z.



Research question: This paper tackles emotion understanding, an essential but highly challenging component of artificial intelligence, in particular the absence of extensive annotated datasets.
Motivation: Existing emotion-understanding methods rely mainly on numerical labels or descriptions, whereas human emotional information is naturally conveyed through verbal and nonverbal communication; extracting emotion representations from communication is therefore more congruent with the human learning process.
Method: This paper presents EmotionCLIP, the first pre-training paradigm to extract visual emotion representations from verbal and nonverbal communication using only uncurated data. EmotionCLIP attends to nonverbal emotion cues through subject-aware context encoding and to verbal emotion cues via sentiment-guided contrastive learning.
Results: Extensive experiments validate the effectiveness and transferability of EmotionCLIP. Using merely a linear-probe evaluation protocol, EmotionCLIP outperforms state-of-the-art supervised visual emotion recognition methods and rivals many multimodal approaches across various benchmarks.

Emotion understanding is an essential but highly challenging component of artificial general intelligence. The absence of extensive annotated datasets has significantly impeded advancements in this field. We present EmotionCLIP, the first pre-training paradigm to extract visual emotion representations from verbal and nonverbal communication using only uncurated data. Compared to numerical labels or descriptions used in previous methods, communication naturally contains emotion information. Furthermore, acquiring emotion representations from communication is more congruent with the human learning process. We guide EmotionCLIP to attend to nonverbal emotion cues through subject-aware context encoding and verbal emotion cues using sentiment-guided contrastive learning. Extensive experiments validate the effectiveness and transferability of EmotionCLIP. Using merely linear-probe evaluation protocol, EmotionCLIP outperforms the state-of-the-art supervised visual emotion recognition methods and rivals many multimodal approaches across various benchmarks. We anticipate that the advent of EmotionCLIP will address the prevailing issue of data scarcity in emotion understanding, thereby fostering progress in related domains. The code and pre-trained models are available at https://github.com/Xeaver/EmotionCLIP.

Blur Interpolation Transformer for Real-World Motion From Blur
Zhong, Zhihang and Cao, Mingdeng and Ji, Xiang and Zheng, Yinqiang and Sato, Imari



Research question: This paper addresses the challenging problem of recovering motion from blur, also known as joint deblurring and interpolation or blur temporal super-resolution.
Motivation: Current methods still leave considerable room for improvement in visual quality, even on synthetic datasets, and generalize poorly to real-world data.
Method: A blur interpolation transformer (BiT) is proposed to effectively unravel the underlying temporal correlation encoded in blur. Built on multi-scale residual Swin transformer blocks, dual-end temporal supervision and a temporally symmetric ensembling strategy are introduced to generate effective features for time-varying motion rendering. In addition, a hybrid camera system is designed to collect the first real-world dataset of one-to-many blur-sharp video pairs.
Results: BiT shows a significant gain over state-of-the-art methods on the public Adobe240 dataset, and the proposed real-world dataset effectively helps the model generalize to real blurry scenarios.

This paper studies the challenging problem of recovering motion from blur, also known as joint deblurring and interpolation or blur temporal super-resolution. The challenges are twofold: 1) the current methods still leave considerable room for improvement in terms of visual quality even on the synthetic dataset, and 2) poor generalization to real-world data. To this end, we propose a blur interpolation transformer (BiT) to effectively unravel the underlying temporal correlation encoded in blur. Based on multi-scale residual Swin transformer blocks, we introduce dual-end temporal supervision and temporally symmetric ensembling strategies to generate effective features for time-varying motion rendering. In addition, we design a hybrid camera system to collect the first real-world dataset of one-to-many blur-sharp video pairs. Experimental results show that BiT has a significant gain over the state-of-the-art methods on the public dataset Adobe240. Besides, the proposed real-world dataset effectively helps the model generalize well to real blurry scenarios. Code and data are available at https://github.com/zzh-tech/BiT.

Procedure-Aware Pretraining for Instructional Video Understanding
Zhou, Honglu and Mart{\'\i



Research question: How to extract procedural knowledge from unlabeled videos, such as the identity of the task (e.g., "make latte"), its steps (e.g., "pour milk"), or the potential next steps given partial progress in its execution.
Motivation: Due to the small amount of available annotations, extracting procedural knowledge from unlabeled videos is a key challenge in understanding the procedures in instructional videos.
Method: A Procedural Knowledge Graph (PKG) is built by combining information from a text-based procedural knowledge database and an unlabeled instructional video corpus; the PKG is then used to generate pseudo labels to train a video representation that encodes procedural knowledge in a more accessible form and generalizes to multiple procedure understanding tasks.
Results: The resulting Paprika model improves over the state of the art on procedure understanding tasks such as task recognition, step recognition, and step forecasting on COIN and CrossTask, with up to 11.23% gains in accuracy across 12 evaluation settings.

Our goal is to learn a video representation that is useful for downstream procedure understanding tasks in instructional videos. Due to the small amount of available annotations, a key challenge in procedure understanding is to be able to extract from unlabeled videos the procedural knowledge such as the identity of the task (e.g., 'make latte'), its steps (e.g., 'pour milk'), or the potential next steps given partial progress in its execution. Our main insight is that instructional videos depict sequences of steps that repeat between instances of the same or different tasks, and that this structure can be well represented by a Procedural Knowledge Graph (PKG), where nodes are discrete steps and edges connect steps that occur sequentially in the instructional activities. This graph can then be used to generate pseudo labels to train a video representation that encodes the procedural knowledge in a more accessible form to generalize to multiple procedure understanding tasks. We build a PKG by combining information from a text-based procedural knowledge database and an unlabeled instructional video corpus and then use it to generate training pseudo labels with four novel pre-training objectives. We call this PKG-based pre-training procedure and the resulting model Paprika, Procedure-Aware PRe-training for Instructional Knowledge Acquisition. We evaluate Paprika on COIN and CrossTask for procedure understanding tasks such as task recognition, step recognition, and step forecasting. Paprika yields a video representation that improves over the state of the art: up to 11.23% gains in accuracy in 12 evaluation settings. Implementation is available at https://github.com/salesforce/paprika.

Therbligs in Action: Video Understanding Through Motion Primitives
Dessalene, Eadom and Maynord, Michael and Fermüller, Cornelia and Aloimonos, Yiannis



Research question: This paper introduces a rule-based, compositional, and hierarchical modeling of action using Therbligs as atoms.
Motivation: Existing action representations fall short in consistency, expressiveness, and contact-centeredness, motivating a Therblig-based representation of action.
Method: Over the atoms, a differentiable method of rule-based reasoning is introduced to regularize for logical consistency, yielding a Therblig-centered model.
Results: Experiments on two popular video datasets, EPIC Kitchens 100 and 50-Salads, show average relative improvements of 10.5%/7.53%/6.5% and 8.9%/6.63%/4.8%, respectively, across action segmentation, anticipation, and recognition. The Therblig-based representations are complementary to, rather than a replacement for, existing architectures' representations.

In this paper we introduce a rule-based, compositional, and hierarchical modeling of action using Therbligs as our atoms. Introducing these atoms provides us with a consistent, expressive, contact-centered representation of action. Over the atoms we introduce a differentiable method of rule-based reasoning to regularize for logical consistency. Our approach is complementary to other approaches in that the Therblig-based representations produced by our architecture augment rather than replace existing architectures' representations. We release the first Therblig-centered annotations over two popular video datasets - EPIC Kitchens 100 and 50-Salads. We also broadly demonstrate benefits to adopting Therblig representations through evaluation on the following tasks: action segmentation, action anticipation, and action recognition - observing an average 10.5%/7.53%/6.5% relative improvement, respectively, over EPIC Kitchens and an average 8.9%/6.63%/4.8% relative improvement, respectively, over 50 Salads. Code and data will be made publicly available.

Tell Me What Happened: Unifying Text-Guided Video Completion via Multimodal Masked Video Generation
Fu, Tsu-Jui and Yu, Licheng and Zhang, Ning and Fu, Cheng-Yang and Su, Jong-Chyi and Wang, William Yang and Bell, Sean



Research question: How to perform video completion guided by natural language, covering video prediction, rewind, and infilling.
Motivation: Generating video from the hints of just a few frames admits many plausible outcomes while requiring temporal coherence, so a system that follows natural language to perform video completion can significantly improve controllability.
Method: A new task, text-guided video completion (TVC), is introduced, and a Multimodal Masked Video Generation (MMVG) model is proposed to address it. During training, MMVG discretizes video frames into visual tokens and masks most of them to perform video completion from any time point; at inference, a single MMVG model handles all three TVC cases by applying the corresponding masking conditions.
Results: Evaluations in various video scenarios, including egocentric, animation, and gaming, show that MMVG effectively generates high-quality videos under text guidance.

Generating a video given the first several static frames is challenging as it anticipates reasonable future frames with temporal coherence. Besides video prediction, the ability to rewind from the last frame or infilling between the head and tail is also crucial, but they have rarely been explored for video completion. Since there could be different outcomes from the hints of just a few frames, a system that can follow natural language to perform video completion may significantly improve controllability. Inspired by this, we introduce a novel task, text-guided video completion (TVC), which requests the model to generate a video from partial frames guided by an instruction. We then propose Multimodal Masked Video Generation (MMVG) to address this TVC task. During training, MMVG discretizes the video frames into visual tokens and masks most of them to perform video completion from any time point. At inference time, a single MMVG model can address all 3 cases of TVC, including video prediction, rewind, and infilling, by applying corresponding masking conditions. We evaluate MMVG in various video scenarios, including egocentric, animation, and gaming. Extensive experimental results indicate that MMVG is effective in generating high-quality visual appearances with text guidance for TVC.
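The three TVC cases differ only in which frame positions are observed and which must be generated. As a hedged illustration (the function name and the exact masking scheme are assumptions, not MMVG's implementation), the per-case masking conditions could be sketched as:

```python
import numpy as np

def tvc_mask(num_frames, given, mode):
    """Build a boolean mask over frame positions for text-guided video
    completion: True marks frames the model must generate.
    mode = 'prediction' (first `given` frames observed),
           'rewind'     (last `given` frames observed),
           'infilling'  (`given` frames observed at each end).
    Hypothetical helper for illustration only.
    """
    mask = np.ones(num_frames, dtype=bool)
    if mode == "prediction":
        mask[:given] = False            # head frames are observed
    elif mode == "rewind":
        mask[-given:] = False           # tail frames are observed
    elif mode == "infilling":
        mask[:given] = False            # head observed
        mask[-given:] = False           # tail observed
    else:
        raise ValueError(f"unknown mode: {mode}")
    return mask
```

A single model trained over such masks can then be steered to any of the three cases at inference time simply by choosing the mask.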

Graph Representation for Order-Aware Visual Transformation
Qiu, Yue and Sun, Yanjun and Matsuzawa, Fumiya and Iwata, Kenji and Kataoka, Hirokatsu



Research question: How to discover changes between image pairs and their temporal orders, a fundamental aspect of human cognition.
Motivation: Although existing AI systems can identify and describe changes between image pairs, they mainly consider changes that occur synchronously, neglecting potential orders within those changes.
Method: A visual transformation graph structure is first proposed for conveying order-aware changes. Previous methods are then benchmarked on a newly generated dataset, identifying the issues of existing methods in change order recognition. Finally, a new model is introduced that explicitly associates different changes and then identifies changes and their orders in a graph representation.
Results: Experiments show that the new model achieves a significant improvement in recognizing order-aware changes between image pairs.

This paper proposes a new visual reasoning formulation that aims at discovering changes between image pairs and their temporal orders. Recognizing scene dynamics and their chronological orders is a fundamental aspect of human cognition. The aforementioned abilities make it possible to follow step-by-step instructions, reason about and analyze events, recognize abnormal dynamics, and restore scenes to their previous states. However, it remains unclear how well current AI systems perform in these capabilities. Although a series of studies have focused on identifying and describing changes from image pairs, they mainly consider those changes that occur synchronously, thus neglecting potential orders within those changes. To address the above issue, we first propose a visual transformation graph structure for conveying order-aware changes. Then, we benchmarked previous methods on our newly generated dataset and identified the issues of existing methods for change order recognition. Finally, we show a significant improvement in order-aware change recognition by introducing a new model that explicitly associates different changes and then identifies changes and their orders in a graph representation.

Exploring Discontinuity for Video Frame Interpolation
Lee, Sangjin and Lee, Hyeongmin and Shin, Chajin and Son, Hanbin and Lee, Sangyoun



Research question: This paper addresses the inability of existing deep-learning models for video frame interpolation (VFI) to handle objects with discontinuous motions, such as logos, user interfaces, and subtitles.
Motivation: Most prior VFI studies focus on appropriate frame warping operations and refinement of warped frames, which works for natural videos with continuous motion but performs poorly on practical videos containing discontinuously moving elements.
Method: Three techniques are proposed to make existing deep-learning-based VFI architectures robust to such elements. First, a novel data augmentation strategy, figure-text mixing (FTM), lets models learn discontinuous motions during training without any extra dataset. Second, a simple but effective module predicts a discontinuity map (D-map) that densely distinguishes between areas of continuous and discontinuous motion. Third, loss functions supervise the discontinuous-motion areas and can be applied together with FTM and the D-map.
Results: Applied to various state-of-the-art VFI networks, the method significantly improves interpolation quality not only on the GDM dataset but also on existing benchmarks containing only continuous motions, such as Vimeo90K, UCF101, and DAVIS.

Video frame interpolation (VFI) is the task that synthesizes the intermediate frame given two consecutive frames. Most of the previous studies have focused on appropriate frame warping operations and refinement modules for the warped frames. These studies have been conducted on natural videos containing only continuous motions. However, many practical videos contain various unnatural objects with discontinuous motions such as logos, user interfaces and subtitles. We propose three techniques that can make the existing deep learning-based VFI architectures robust to these elements. First is a novel data augmentation strategy called figure-text mixing (FTM) which can make the models learn discontinuous motions during training stage without any extra dataset. Second, we propose a simple but effective module that predicts a map called discontinuity map (D-map), which densely distinguishes between areas of continuous and discontinuous motions. Lastly, we propose loss functions to give supervisions of the discontinuous motion areas which can be applied along with FTM and D-map. We additionally collect a special test benchmark called Graphical Discontinuous Motion (GDM) dataset consisting of some mobile games and chatting videos. Applied to the various state-of-the-art VFI networks, our method significantly improves the interpolation qualities on the videos from not only GDM dataset, but also the existing benchmarks containing only continuous motions such as Vimeo90K, UCF101, and DAVIS.
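A map that densely separates continuous from discontinuous motion naturally acts as a per-pixel gate between two candidate outputs. The sketch below is a hedged illustration of that gating idea only; the function name and the exact blending rule are assumptions, not the paper's architecture:

```python
import numpy as np

def blend_with_dmap(warped, copied, d_map):
    """Per-pixel blend of two candidate interpolations:
    warped - frame synthesized by ordinary motion warping
             (suited to continuous motion),
    copied - frame taken directly from a neighboring input
             (suited to discontinuous elements like logos or subtitles),
    d_map  - values in [0, 1]; 1 marks discontinuous-motion pixels.
    Hypothetical helper showing how a discontinuity map could gate
    the final output.
    """
    d = np.clip(d_map, 0.0, 1.0)[..., None]   # broadcast over channels
    return (1.0 - d) * warped + d * copied
```

Pixels flagged as discontinuous then bypass warping entirely, which is why static overlays stop ghosting.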

DynamicStereo: Consistent Dynamic Depth From Stereo Videos
Karaev, Nikita and Rocco, Ignacio and Graham, Benjamin and Neverova, Natalia and Vedaldi, Andrea and Rupprecht, Christian



Research question: How to reconstruct depth from a dynamic scene observed with a stereo camera while overcoming the temporal inconsistency of existing methods.
Motivation: Temporally consistent depth predictions are needed for immersive AR or VR scenarios, where flickering greatly diminishes the user experience.
Method: DynamicStereo, a novel transformer-based architecture, is proposed to estimate disparity for stereo videos. It learns to pool information from neighboring frames to improve the temporal consistency of its predictions and processes stereo videos efficiently through divided attention layers.
Results: The Dynamic Replica dataset is introduced, providing training and evaluation data for dynamic stereo closer to real applications than existing datasets. Training on it further improves the prediction quality of both DynamicStereo and prior methods.

We consider the problem of reconstructing a dynamic scene observed from a stereo camera. Most existing methods for depth from stereo treat different stereo frames independently, leading to temporally inconsistent depth predictions. Temporal consistency is especially important for immersive AR or VR scenarios, where flickering greatly diminishes the user experience. We propose DynamicStereo, a novel transformer-based architecture to estimate disparity for stereo videos. The network learns to pool information from neighboring frames to improve the temporal consistency of its predictions. Our architecture is designed to process stereo videos efficiently through divided attention layers. We also introduce Dynamic Replica, a new benchmark dataset containing synthetic videos of people and animals in scanned environments, which provides complementary training and evaluation data for dynamic stereo closer to real applications than existing datasets. Training with this dataset further improves the quality of predictions of our proposed DynamicStereo as well as prior methods. Finally, it acts as a benchmark for consistent stereo methods.

Rethinking the Learning Paradigm for Dynamic Facial Expression Recognition
Wang, Hanyang and Li, Bo and Wu, Shuang and Shen, Siyuan and Liu, Feng and Ding, Shouhong and Zhou, Aimin



Research question: How to recognize dynamic facial expressions in videos more accurately.
Motivation: Previous research treats non-target frames as noisy frames, whereas the authors argue the task should be treated as a weakly supervised problem; they also identify an imbalance between short- and long-term temporal relationships in dynamic facial expression recognition.
Method: The Multi-3D Dynamic Facial Expression Learning (M3DFEL) framework uses Multi-Instance Learning (MIL) to handle inexact labels. It generates 3D instances to model strong short-term temporal relationships and uses 3DCNNs for feature extraction; a Dynamic Long-term Instance Aggregation Module (DLIAM) then learns long-term temporal relationships and dynamically aggregates the instances.
Results: Experiments on the DFEW and FERV39K datasets show that M3DFEL outperforms existing state-of-the-art approaches with a vanilla R3D18 backbone.

Dynamic Facial Expression Recognition (DFER) is a rapidly developing field that focuses on recognizing facial expressions in video format. Previous research has considered non-target frames as noisy frames, but we propose that it should be treated as a weakly supervised problem. We also identify the imbalance of short- and long-term temporal relationships in DFER. Therefore, we introduce the Multi-3D Dynamic Facial Expression Learning (M3DFEL) framework, which utilizes Multi-Instance Learning (MIL) to handle inexact labels. M3DFEL generates 3D-instances to model the strong short-term temporal relationship and utilizes 3DCNNs for feature extraction. The Dynamic Long-term Instance Aggregation Module (DLIAM) is then utilized to learn the long-term temporal relationships and dynamically aggregate the instances. Our experiments on DFEW and FERV39K datasets show that M3DFEL outperforms existing state-of-the-art approaches with a vanilla R3D18 backbone. The source code is available at https://github.com/faceeyes/M3DFEL.
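The MIL framing treats each video as a bag of short 3D instances carrying one inexact video-level label. As a minimal sketch of that grouping step (the function name and the grouping granularity are assumptions, not M3DFEL's implementation):

```python
import numpy as np

def make_3d_instances(frames, instance_len):
    """Group consecutive frames into short 3D instances; the set of
    instances from one video forms a weakly labeled bag, MIL-style.
    frames: array of shape (T, H, W, C).
    Returns an array of shape (T // instance_len, instance_len, H, W, C).
    Hypothetical helper; remainder frames are simply dropped here.
    """
    t = (frames.shape[0] // instance_len) * instance_len
    clip = frames[:t]                               # drop the remainder
    return clip.reshape(-1, instance_len, *frames.shape[1:])
```

Each instance can then be fed to a 3D feature extractor, with an aggregation module pooling instance features back to a video-level prediction.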

Ham2Pose: Animating Sign Language Notation Into Pose Sequences
Arkushin, Rotem Shalev and Moryossef, Amit and Fried, Ohad



Research question: How to translate spoken languages into Sign languages to enable open communication between the hearing and hearing-impaired communities.
Motivation: Toward this goal, a method is proposed for animating text written in HamNoSys, a universal lexical Sign language notation, into signed pose sequences.
Method: The method gradually generates pose predictions using transformer encoders that create meaningful representations of the text and poses while considering their spatial and temporal information. Training uses weak supervision, and the method is shown to learn from partial and inaccurate data.
Results: A new distance measurement that accounts for missing keypoints is proposed for comparing pose sequences. Its correctness is validated on AUTSL, a large-scale Sign language dataset, where it measures the distance between pose sequences more accurately than existing measurements. The code is publicly released for future research.

Translating spoken languages into Sign languages is necessary for open communication between the hearing and hearing-impaired communities. To achieve this goal, we propose the first method for animating a text written in HamNoSys, a lexical Sign language notation, into signed pose sequences. As HamNoSys is universal by design, our proposed method offers a generic solution invariant to the target Sign language. Our method gradually generates pose predictions using transformer encoders that create meaningful representations of the text and poses while considering their spatial and temporal information. We use weak supervision for the training process and show that our method succeeds in learning from partial and inaccurate data. Additionally, we offer a new distance measurement that considers missing keypoints, to measure the distance between pose sequences using DTW-MJE. We validate its correctness using AUTSL, a large-scale Sign language dataset, show that it measures the distance between pose sequences more accurately than existing measurements, and use it to assess the quality of our generated pose sequences. Code for the data pre-processing, the model, and the distance measurement is publicly released for future research.
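The core of a missing-keypoint-aware measure is a per-frame distance that only scores joints observed in both poses; the paper's DTW-MJE then wraps such a frame distance inside dynamic time warping. The sketch below illustrates only the per-frame idea, with the function name and the NaN convention for missing joints being assumptions:

```python
import numpy as np

def masked_pose_distance(p, q):
    """Mean joint-wise Euclidean distance between two poses, counting
    only keypoints present (non-NaN) in both. p, q: (J, 2) arrays.
    Hypothetical sketch of the idea behind a missing-keypoint-aware
    measure, not the paper's implementation.
    """
    valid = ~(np.isnan(p).any(axis=-1) | np.isnan(q).any(axis=-1))
    if not valid.any():
        return 0.0                       # no joint observed in both poses
    return float(np.linalg.norm(p[valid] - q[valid], axis=-1).mean())
```

Masking missing joints rather than imputing them keeps undetected keypoints from inflating the distance between otherwise similar pose sequences.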

"Seeing" Electric Network Frequency From Events
Xu, Lexuan and Hua, Guang and Zhang, Haijian and Yu, Lei and Qiao, Ning



Research question: How to estimate the Electric Network Frequency (ENF) beyond conventional frame-based videos.
Motivation: Existing video-based ENF (V-ENF) estimation relies on imaging quality and suffers from non-ideal sampling, motion, and extreme lighting conditions.
Method: This paper extracts the ENF without the above limitations from a new modality, the event camera, a neuromorphic sensor that encodes light intensity variations and asynchronously emits events with extremely high temporal resolution and dynamic range. The physical mechanism for the ENF captured in events is first formulated and validated, and a simple yet robust event-based ENF (E-ENF) estimation method is then proposed through mode filtering and harmonic enhancement.
Results: An Event-Video ENF Dataset (EV-ENFD) covering diverse scenes is built, and extensive experiments on it show that E-ENF extracts more accurate ENF traces, outperforming conventional V-ENF by a large margin, especially in challenging environments with object motions and extreme lighting conditions.

Most of the artificial lights fluctuate in response to the grid's alternating current and exhibit subtle variations in terms of both intensity and spectrum, providing the potential to estimate the Electric Network Frequency (ENF) from conventional frame-based videos. Nevertheless, the performance of Video-based ENF (V-ENF) estimation largely relies on the imaging quality and thus may suffer from significant interference caused by non-ideal sampling, motion, and extreme lighting conditions. In this paper, we show that the ENF can be extracted without the above limitations from a new modality provided by the so-called event camera, a neuromorphic sensor that encodes the light intensity variations and asynchronously emits events with extremely high temporal resolution and high dynamic range. Specifically, we first formulate and validate the physical mechanism for the ENF captured in events, and then propose a simple yet robust Event-based ENF (E-ENF) estimation method through mode filtering and harmonic enhancement. Furthermore, we build an Event-Video ENF Dataset (EV-ENFD) that records both events and videos in diverse scenes. Extensive experiments on EV-ENFD demonstrate that our proposed E-ENF method can extract more accurate ENF traces, outperforming the conventional V-ENF by a large margin, especially in challenging environments with object motions and extreme lighting conditions. The code and dataset are available at https://github.com/xlx-creater/E-ENF.

Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment
Sung-Bin, Kim and Senocak, Arda and Ha, Hyunwoo and Owens, Andrew and Oh, Tae-Hyun



Research question: How to generate an image of a scene from sound.
Motivation: To bridge the large information gaps that often exist between sight and sound.
Method: A model is designed that learns to associate audio-visual modalities despite their information gaps; the key idea is to enrich audio features with visual information by learning to align audio to the visual latent space. The input audio is translated to visual features, and a pre-trained generator produces the image.
Results: Substantially better results than prior approaches on the VEGAS and VGGSound datasets; the model's predictions can also be controlled by applying simple manipulations to the input waveform or to the latent space.

How does audio describe the world around us? In this paper, we propose a method for generating an image of a scene from sound. Our method addresses the challenges of dealing with the large gaps that often exist between sight and sound. We design a model that works by scheduling the learning procedure of each model component to associate audio-visual modalities despite their information gaps. The key idea is to enrich the audio features with visual information by learning to align audio to visual latent space. We translate the input audio to visual features, then use a pre-trained generator to produce an image. To further improve the quality of our generated images, we use sound source localization to select the audio-visual pairs that have strong cross-modal correlations. We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches. We also show that we can control our model's predictions by applying simple manipulations to the input waveform, or to the latent space.

Text With Knowledge Graph Augmented Transformer for Video Captioning
Gu, Xin and Chen, Guang and Wang, Yufei and Zhang, Libo and Luo, Tiejian and Wen, Longyin



Research question: Video captioning aims to describe video content in natural language but faces the long-tail and open-vocabulary issues of words.
Motivation: To address these issues, this paper proposes a text with knowledge graph augmented transformer (TextKG) for video captioning.
Method: TextKG is a two-stream transformer formed by an external stream and an internal stream. The external stream absorbs external knowledge, such as a pre-built knowledge graph, to mitigate the open-vocabulary challenge; the internal stream exploits the multi-modality information in the original videos (e.g., the appearance of video frames, speech transcripts, and video captions) to handle the long-tail issue. A cross-attention mechanism is used in both streams to share information.
Results: Extensive experiments on four challenging video captioning datasets show that the method outperforms state-of-the-art approaches; in particular, TextKG improves the best published result on the YouCookII dataset by 18.7% absolute CIDEr score.

Video captioning aims to describe the content of videos using natural language. Although significant progress has been made, there is still much room to improve the performance for real-world applications, mainly due to the long-tail and open set issues of words. In this paper, we propose a text with knowledge graph augmented transformer (TextKG) for video captioning. Notably, TextKG is a two-stream transformer, formed by the external stream and internal stream. The external stream is designed to absorb external knowledge, which models the interactions between the external knowledge, e.g., pre-built knowledge graph, and the built-in information of videos, e.g., the salient object regions, speech transcripts, and video captions, to mitigate the open set of words challenge. Meanwhile, the internal stream is designed to exploit the multi-modality information in original videos (e.g., the appearance of video frames, speech transcripts, and video captions) to deal with the long-tail issue. In addition, the cross attention mechanism is also used in both streams to share information. In this way, the two streams can help each other for more accurate results. Extensive experiments conducted on four challenging video captioning datasets, i.e., YouCookII, ActivityNet Captions, MSR-VTT, and MSVD, demonstrate that the proposed method performs favorably against the state-of-the-art methods. Specifically, the proposed TextKG method outperforms the best published results by improving 18.7% absolute CIDEr scores on the YouCookII dataset.

IS-GGT: Iterative Scene Graph Generation With Generative Transformers
Kundu, Sanjoy and Aakur, Sathyanarayanan N.



Research question: How to generate scene graphs from images effectively while reducing computational overhead.
Motivation: Existing scene graph generation approaches label all possible edges between objects in a scene, which adds computational overhead.
Method: A generative transformer-based approach is proposed: a possible scene graph structure is first sampled from the detected objects and their visual features, and predicate classification is then performed on the sampled edges to produce the final scene graph.
Results: Extensive experiments on the Visual Genome dataset demonstrate the efficiency of the approach, which obtains, on average, 20.7% mean recall (mR@100) across settings for scene graph generation (SGG), outperforming state-of-the-art SGG approaches while remaining competitive with unbiased SGG approaches.

Scene graphs provide a rich, structured representation of a scene by encoding the entities (objects) and their spatial relationships in a graphical format. This representation has proven useful in several tasks, such as question answering, captioning, and even object detection, to name a few. Current approaches take a generation-by-classification approach where the scene graph is generated through labeling of all possible edges between objects in a scene, which adds computational overhead to the approach. This work introduces a generative transformer-based approach to generating scene graphs beyond link prediction. Using two transformer-based components, we first sample a possible scene graph structure from detected objects and their visual features. We then perform predicate classification on the sampled edges to generate the final scene graph. This approach allows us to efficiently generate scene graphs from images with minimal inference overhead. Extensive experiments on the Visual Genome dataset demonstrate the efficiency of the proposed approach. Without bells and whistles, we obtain, on average, 20.7% mean recall (mR@100) across different settings for scene graph generation (SGG), outperforming state-of-the-art SGG approaches while offering competitive performance to unbiased SGG approaches.

SelfME: Self-Supervised Motion Learning for Micro-Expression Recognition
Fan, Xinqi and Chen, Xueli and Jiang, Mingjie and Shahid, Ali Raza and Yan, Hong



Research question: How to apply deep learning to facial micro-expression recognition while avoiding the conventional pre-processing it requires.
Motivation: Facial micro-expressions reveal a person's genuine emotion and are valuable in lie detection, criminal analysis, and other areas. Existing deep-learning methods rely on conventional optical-flow-based methods to extract facial motions as inputs, which is a limitation.
Method: SelfME, a micro-expression recognition framework based on self-supervised learning, is proposed; it is the first to use an automatically self-learned motion technique for this task. To address the risk of ignoring symmetrical facial actions on the left and right sides of the face, a symmetric contrastive vision transformer (SCViT) constrains the learning of similar facial action features for the left and right parts of faces.
Results: Experiments on two benchmark datasets show state-of-the-art performance, and ablation studies demonstrate the effectiveness of the method.

Facial micro-expressions (MEs) refer to brief spontaneous facial movements that can reveal a person's genuine emotion. They are valuable in lie detection, criminal analysis, and other areas. While deep learning-based ME recognition (MER) methods achieved impressive success, these methods typically require pre-processing using conventional optical flow-based methods to extract facial motions as inputs. To overcome this limitation, we proposed a novel MER framework using self-supervised learning to extract facial motion for ME (SelfME). To the best of our knowledge, this is the first work using an automatically self-learned motion technique for MER. However, the self-supervised motion learning method might suffer from ignoring symmetrical facial actions on the left and right sides of faces when extracting fine features. To address this issue, we developed a symmetric contrastive vision transformer (SCViT) to constrain the learning of similar facial action features for the left and right parts of faces. Experiments were conducted on two benchmark datasets showing that our method achieved state-of-the-art performance, and ablation studies demonstrated the effectiveness of our method.

NewsNet: A Novel Dataset for Hierarchical Temporal Segmentation
Wu, Haoqian and Chen, Keyu and Liu, Haozhe and Zhuge, Mingchen and Li, Bing and Qiao, Ruizhi and Shu, Xiujun and Gan, Bei and Xu, Liangsheng and Ren, Bo and Xu, Mengmeng and Zhang, Wentian and Ramachandra, Raghavendra and Lin, Chia-Wen and Ghanem, Bernard



Research question: This paper addresses the inability of existing video analysis methods to comprehend larger temporal spans when processing complex, structured videos.
Motivation: Existing temporal video segmentation methods can decompose a video at fine granularity, but lack a wider view of larger temporal spans, especially when the video is complex and structured.
Method: Two abstractive levels of temporal video segmentation are presented, and their hierarchy with respect to the existing fine-grained levels is studied. NewsNet, a news video dataset consisting of 1,000 videos in over 900 hours, is collected for hierarchical temporal video segmentation.
Results: The study on NewsNet can advance the understanding of complex structured video and benefit areas such as short-video creation, personalized advertisement, digital instruction, and education.

Temporal video segmentation is the get-to-go automatic video analysis, which decomposes a long-form video into smaller components for the following-up understanding tasks. Recent works have studied several levels of granularity to segment a video, such as shot, event, and scene. Those segmentations can help compare the semantics in the corresponding scales, but lack a wider view of larger temporal spans, especially when the video is complex and structured. Therefore, we present two abstractive levels of temporal segmentations and study their hierarchy to the existing fine-grained levels. Accordingly, we collect NewsNet, the largest news video dataset consisting of 1,000 videos in over 900 hours, associated with several tasks for hierarchical temporal video segmentation. Each news video is a collection of stories on different topics, represented as aligned audio, visual, and textual data, along with extensive frame-wise annotations in four granularities. We assert that the study on NewsNet can advance the understanding of complex structured video and benefit more areas such as short-video creation, personalized advertisement, digital instruction, and education. Our dataset and code is publicly available at: https://github.com/NewsNet-Benchmark/NewsNet.

LSTFE-Net: Long Short-Term Feature Enhancement Network for Video Small Object Detection
Xiao, Jinsheng and Wu, Yuanxu and Chen, Yunhua and Wang, Shurui and Wang, Zhongyuan and Ma, Jiayi



Research question: Video small object detection is challenging due to the lack of object information.
Motivation: Existing methods add more temporal information to obtain more potent high-level features, but often fail to specify the most vital information for small objects, resulting in insufficient or inappropriate features.
Method: A long short-term feature enhancement network (LSTFE-Net) is proposed for video small object detection. A plug-and-play spatio-temporal feature alignment module first creates temporal correspondences between the short-term and current frames; a frame selection module then selects the long-term frame that provides the most additional context information; finally, a long short-term feature aggregation module fuses the long- and short-term features.
Results: Compared to other state-of-the-art methods, LSTFE-Net achieves a 4.4% absolute boost in AP on the FL-Drones dataset.

Video small object detection is a difficult task due to the lack of object information. Recent methods focus on adding more temporal information to obtain more potent high-level features, which often fail to specify the most vital information for small objects, resulting in insufficient or inappropriate features. Since information from frames at different positions contributes differently to small objects, it is not ideal to assume that using one universal method will extract proper features. We find that context information from the long-term frame and temporal information from the short-term frame are two useful cues for video small object detection. To fully utilize these two cues, we propose a long short-term feature enhancement network (LSTFE-Net) for video small object detection. First, we develop a plug-and-play spatio-temporal feature alignment module to create temporal correspondences between the short-term and current frames. Then, we propose a frame selection module to select the long-term frame that can provide the most additional context information. Finally, we propose a long short-term feature aggregation module to fuse long short-term features. Compared to other state-of-the-art methods, our LSTFE-Net achieves 4.4% absolute boosts in AP on the FL-Drones dataset. More details can be found at https://github.com/xiaojs18/LSTFE-Net.

Joint Video Multi-Frame Interpolation and Deblurring Under Unknown Exposure Time
Shang, Wei and Ren, Dongwei and Yang, Yi and Zhang, Hongzhi and Ma, Kede and Zuo, Wangmeng



Research question: How to mitigate the low frame rate and motion blur caused by dynamic scene complexity, lens and sensor imperfection, and less-than-ideal exposure settings.
Motivation: Existing joint video frame interpolation and deblurring methods assume the exposure time is known and fixed, which does not hold in practice; this work therefore targets the more realistic and challenging task of joint video multi-frame interpolation and deblurring under unknown exposure time.
Method: A variant of supervised contrastive learning first constructs an exposure-aware representation from the input blurred frames. Two U-Nets are then trained for intra-motion and inter-motion analysis, respectively, adapting to the learned exposure representation via gain tuning. Finally, a video reconstruction network is built upon the exposure and motion representations through progressive exposure-adaptive convolution and motion refinement.
Results: Extensive experiments on both simulated and real-world datasets show notable performance gains over the state of the art on the joint video x8 interpolation and deblurring task; on the seemingly implausible x16 interpolation task, the method outperforms existing methods by more than 1.5 dB in PSNR.

Natural videos captured by consumer cameras often suffer from low framerate and motion blur due to the combination of dynamic scene complexity, lens and sensor imperfection, and less than ideal exposure setting. As a result, computational methods that jointly perform video frame interpolation and deblurring begin to emerge with the unrealistic assumption that the exposure time is known and fixed. In this work, we aim ambitiously for a more realistic and challenging task - joint video multi-frame interpolation and deblurring under unknown exposure time. Toward this goal, we first adopt a variant of supervised contrastive learning to construct an exposure-aware representation from input blurred frames. We then train two U-Nets for intra-motion and inter-motion analysis, respectively, adapting to the learned exposure representation via gain tuning. We finally build our video reconstruction network upon the exposure and motion representation by progressive exposure-adaptive convolution and motion refinement. Extensive experiments on both simulated and real-world datasets show that our optimized method achieves notable performance gains over the state-of-the-art on the joint video x8 interpolation and deblurring task. Moreover, on the seemingly implausible x16 interpolation task, our method outperforms existing methods by more than 1.5 dB in terms of PSNR.

MMG-Ego4D: Multimodal Generalization in Egocentric Action Recognition
Gong, Xinyu and Mohan, Sreyas and Dhingra, Naina and Bazin, Jean-Charles and Li, Yilei and Wang, Zhangyang and Ranjan, Rakesh



Research question: This paper studies a novel problem in egocentric action recognition, termed "Multimodal Generalization" (MMG), which asks how systems can generalize when data from certain modalities is limited or even completely missing.
Motivation: To support security and efficiency considerations in real-world applications, two novel scenarios are designed: missing modality generalization, where some modalities present during training are missing at inference, and cross-modal zero-shot generalization, where the modalities present at inference and at training are disjoint.
Method: A new dataset, MMG-Ego4D, is constructed with data points in video, audio, and inertial motion sensor (IMU) modalities. A new fusion module is introduced with modality dropout training, contrastive-based alignment training, and a novel cross-modal prototypical loss for better few-shot performance.
Results: A diverse array of models is evaluated on MMG-Ego4D, and the proposed methods achieve improved generalization ability in both the missing-modality and cross-modal zero-shot settings, serving as a benchmark for future research on multimodal generalization.

In this paper, we study a novel problem in egocentric action recognition, which we term as "Multimodal Generalization" (MMG). MMG aims to study how systems can generalize when data from certain modalities is limited or even completely missing. We thoroughly investigate MMG in the context of standard supervised action recognition and the more challenging few-shot setting for learning new action categories. MMG consists of two novel scenarios, designed to support security, and efficiency considerations in real-world applications: (1) missing modality generalization where some modalities that were present during the train time are missing during the inference time, and (2) cross-modal zero-shot generalization, where the modalities present during the inference time and the training time are disjoint. To enable this investigation, we construct a new dataset MMG-Ego4D containing data points with video, audio, and inertial motion sensor (IMU) modalities. Our dataset is derived from Ego4D dataset, but processed and thoroughly re-annotated by human experts to facilitate research in the MMG problem. We evaluate a diverse array of models on MMG-Ego4D and propose new methods with improved generalization ability. In particular, we introduce a new fusion module with modality dropout training, contrastive-based alignment training, and a novel cross-modal prototypical loss for better few-shot performance. We hope this study will serve as a benchmark and guide future research in multimodal generalization problems. The benchmark and code are available at https://github.com/facebookresearch/MMG_Ego4D

Panoptic Video Scene Graph Generation
Yang, Jingkang and Peng, Wenxuan and Li, Xiangtai and Guo, Zujin and Chen, Liangyu and Li, Bo and Ma, Zheng and Zhou, Kaiyang and Zhang, Wayne and Loy, Chen Change and Liu, Ziwei



Research problem: This paper targets a new problem toward comprehensive real-world visual perception systems: panoptic scene graph generation (PVSG).
Motivation: The existing video scene graph generation (VidSGG) problem focuses on temporal interactions between humans and objects, but detecting non-rigid objects and backgrounds with bounding boxes often causes VidSGG systems to miss details that are crucial for comprehensive video understanding. In contrast, PVSG grounds scene-graph nodes with more precise, pixel-level segmentation masks, tying them to holistic scene understanding.
Method: To advance research in this new area, we contribute a high-quality PVSG dataset of 400 videos (289 third-person + 111 egocentric) with 150K frames in total, annotated with panoptic segmentation masks and fine, temporal scene graphs. We also provide a variety of baseline methods and useful design practices for future work.
Results: The dataset and baselines establish a reference benchmark for the new PVSG task.

Towards building comprehensive real-world visual perception systems, we propose and study a new problem called panoptic scene graph generation (PVSG). PVSG is related to the existing video scene graph generation (VidSGG) problem, which focuses on temporal interactions between humans and objects localized with bounding boxes in videos. However, the limitation of bounding boxes in detecting non-rigid objects and backgrounds often causes VidSGG systems to miss key details that are crucial for comprehensive video understanding. In contrast, PVSG requires nodes in scene graphs to be grounded by more precise, pixel-level segmentation masks, which facilitate holistic scene understanding. To advance research in this new area, we contribute a high-quality PVSG dataset, which consists of 400 videos (289 third-person + 111 egocentric videos) with totally 150K frames labeled with panoptic segmentation masks as well as fine, temporal scene graphs. We also provide a variety of baseline methods and share useful design practices for future work.

Class Prototypes Based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos
Gupta, Rohit and Roy, Anirban and Christensen, Claire and Kim, Sujeong and Gerard, Sarah and Cincebeaux, Madeline and Divakaran, Ajay and Grindal, Todd and Shah, Mubarak



Research problem: How to filter out appropriate online educational content for young children.
Motivation: The growth in children's consumption of online media necessitates data-driven tools that help educators curate appropriate educational content for young learners.
Method: We present an approach for detecting educational content in online videos, focusing on two widely used content classes: literacy and math. For each class, prominent codes (sub-classes) are chosen based on the Common Core Standards. We pose this as a fine-grained multi-label classification problem, since a video can contain multiple types of educational content and the content classes can be visually similar. We propose a class-prototypes-based supervised contrastive learning approach that handles fine-grained samples associated with multiple labels: a prototype is learned for each class, and a loss function minimizes the distances between a class prototype and the samples of that class while maximizing its distances to samples of other classes. Since the alignment between visual and audio cues is crucial for effective comprehension, a multimodal transformer network captures the interaction between visual and audio cues while learning the video embeddings.
Results: We evaluate on APPROVE, a dataset of educational videos from YouTube annotated with fine-grained education classes by education researchers, comprising 193 hours of expert-annotated video across 19 classes. The proposed method outperforms strong baselines on APPROVE and on other benchmarks such as YouTube-8M and COIN.

The recent growth in the consumption of online media by children during early childhood necessitates data-driven tools enabling educators to filter out appropriate educational content for young learners. This paper presents an approach for detecting educational content in online videos. We focus on two widely used educational content classes: literacy and math. For each class, we choose prominent codes (sub-classes) based on the Common Core Standards. For example, literacy codes include 'letter names', 'letter sounds', and math codes include 'counting', 'sorting'. We pose this as a fine-grained multilabel classification problem as videos can contain multiple types of educational content and the content classes can get visually similar (e.g., 'letter names' vs 'letter sounds'). We propose a novel class prototypes based supervised contrastive learning approach that can handle fine-grained samples associated with multiple labels. We learn a class prototype for each class and a loss function is employed to minimize the distances between a class prototype and the samples from the class. Similarly, distances between a class prototype and the samples from other classes are maximized. As the alignment between visual and audio cues are crucial for effective comprehension, we consider a multimodal transformer network to capture the interaction between visual and audio cues in videos while learning the embedding for videos. For evaluation, we present a dataset, APPROVE, employing educational videos from YouTube labeled with fine-grained education classes by education researchers. APPROVE consists of 193 hours of expert-annotated videos with 19 classes. The proposed approach outperforms strong baselines on APPROVE and other benchmarks such as Youtube-8M, and COIN. The dataset is available at https://nusci.csl.sri.com/project/APPROVE.
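The class-prototype contrastive loss described above (pull each sample toward the prototypes of its labels, push it away from the others) can be sketched roughly as follows. This is a hypothetical NumPy rendering, not the APPROVE authors' implementation, and it omits log-sum-exp stabilization for brevity:

```python
import numpy as np

def prototype_contrastive_loss(embeddings, labels, prototypes, tau=0.1):
    """Multi-label prototype contrastive loss (toy version).

    embeddings: (N, D) L2-normalized sample embeddings
    labels:     (N, C) multi-hot label matrix
    prototypes: (C, D) L2-normalized class prototypes
    """
    sims = embeddings @ prototypes.T / tau                     # (N, C)
    # softmax over prototypes; unstabilized for brevity
    log_prob = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    # average negative log-likelihood over each sample's positive labels
    pos_counts = labels.sum(axis=1).clip(min=1)
    return float((-(labels * log_prob).sum(axis=1) / pos_counts).mean())
```

A sample embedded near its own class prototype yields a near-zero loss, while one sitting on a wrong prototype is penalized heavily, which is the behaviour the abstract describes.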

Decoupled Multimodal Distilling for Emotion Recognition
Li, Yong and Wang, Yuanzhi and Cui, Zhen



Research problem: This paper addresses multimodal heterogeneity and the large variation in modality contributions in multimodal emotion recognition (MER).
Motivation: Despite the strong performance of existing MER methods, modality heterogeneity and unequal modality contributions remain a challenge.
Method: We propose decoupled multimodal distillation (DMD), which enhances the discriminative features of each modality through flexible and adaptive cross-modal knowledge distillation. Specifically, the representation of each modality is decoupled into two parts, modality-irrelevant and modality-exclusive spaces, and a graph distillation unit (GD-Unit) performs knowledge distillation in each.
Results: Experiments show that DMD consistently outperforms state-of-the-art MER methods, and its graph edges exhibit meaningful distributional patterns.

Human multimodal emotion recognition (MER) aims to perceive human emotions via language, visual and acoustic modalities. Despite the impressive performance of previous MER approaches, the inherent multimodal heterogeneities still haunt and the contribution of different modalities varies significantly. In this work, we mitigate this issue by proposing a decoupled multimodal distillation (DMD) approach that facilitates flexible and adaptive crossmodal knowledge distillation, aiming to enhance the discriminative features of each modality. Specially, the representation of each modality is decoupled into two parts, i.e., modality-irrelevant/-exclusive spaces, in a self-regression manner. DMD utilizes a graph distillation unit (GD-Unit) for each decoupled part so that each GD can be performed in a more specialized and effective manner. A GD-Unit consists of a dynamic graph where each vertice represents a modality and each edge indicates a dynamic knowledge distillation. Such GD paradigm provides a flexible knowledge transfer manner where the distillation weights can be automatically learned, thus enabling diverse crossmodal knowledge transfer patterns. Experimental results show DMD consistently obtains superior performance than state-of-the-art MER methods. Visualization results show the graph edges in DMD exhibit meaningful distributional patterns w.r.t. the modality-irrelevant/-exclusive feature spaces. Codes are released at https://github.com/mdswyz/DMD.
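The GD-Unit is described as a dynamic graph whose vertices are modalities and whose edges carry learnable distillation weights. A toy sketch of that idea follows; it is an assumed simplification (hand-set edge logits, plain L2 feature matching), whereas the paper's GD-Unit learns the weights and operates on the decoupled feature spaces:

```python
import numpy as np

def graph_distillation_loss(feats, edge_logits):
    """Weighted pairwise distillation over a fully connected modality graph.

    feats:       dict modality name -> (D,) feature vector
    edge_logits: (M, M) scores; column-softmax gives incoming-edge weights
    """
    names = sorted(feats)
    w = np.exp(edge_logits)
    np.fill_diagonal(w, 0.0)                  # no self-distillation
    w = w / w.sum(axis=0, keepdims=True)      # normalize incoming edges
    loss = 0.0
    for j, tgt in enumerate(names):
        for i, src in enumerate(names):
            if i != j:
                # distill along edge src -> tgt via an L2 feature match
                loss += w[i, j] * np.sum((feats[tgt] - feats[src]) ** 2)
    return float(loss)
```

In the paper, gradients through such weights would let the model learn which cross-modal transfer directions matter most.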

Multivariate, Multi-Frequency and Multimodal: Rethinking Graph Neural Networks for Emotion Recognition in Conversation
Chen, Feiyu and Shao, Jie and Zhu, Shuyuan and Shen, Heng Tao



Research problem: High-arity relationships across modality and context dimensions are a critical challenge in the Emotion Recognition in Conversation (ERC) task.
Motivation: Previous methods tend to encode multimodal and contextual relationships in a loosely coupled manner, which may harm relationship modelling. Graph neural networks (GNNs), which excel at capturing data relations, offer a new solution for ERC.
Method: We propose a GNN-based model that explores multivariate relationships and captures the varying importance of emotion discrepancy and commonality by valuing multi-frequency signals.
Results: Experiments show the proposed method outperforms previous state-of-the-art work on two popular multimodal ERC datasets.

Complex relationships of high arity across modality and context dimensions is a critical challenge in the Emotion Recognition in Conversation (ERC) task. Yet, previous works tend to encode multimodal and contextual relationships in a loosely-coupled manner, which may harm relationship modelling. Recently, Graph Neural Networks (GNN) which show advantages in capturing data relations, offer a new solution for ERC. However, existing GNN-based ERC models fail to address some general limits of GNNs, including assuming pairwise formulation and erasing high-frequency signals, which may be trivial for many applications but crucial for the ERC task. In this paper, we propose a GNN-based model that explores multivariate relationships and captures the varying importance of emotion discrepancy and commonality by valuing multi-frequency signals. We empower GNNs to better capture the inherent relationships among utterances and deliver more sufficient multimodal and contextual modelling. Experimental results show that our proposed method outperforms previous state-of-the-art works on two popular multimodal ERC datasets.

DeFeeNet: Consecutive 3D Human Motion Prediction With Deviation Feedback
Sun, Xiaoning and Sun, Huaijiang and Li, Bin and Wei, Dong and Li, Weiqing and Lu, Jianfeng



Research problem: Current human motion prediction techniques fall short of practical needs such as human-robot collaboration.
Motivation: Existing models simplify motion prediction into a one-off forecast of a short future sequence from a historical observation, ignoring that prediction in real applications is a consecutive process in which each round's result becomes observable, and verifiable, in the next round.
Method: We propose DeFeeNet, a simple yet effective network that can be added to existing one-off prediction models to realize deviation perception and feedback in consecutive motion prediction. At each prediction round, the deviation produced by the previous round is first encoded by DeFeeNet and then incorporated into the existing predictor, enabling a deviation-aware prediction manner.
Results: Experiments show that the proposed network improves consecutive human motion prediction regardless of the base model.

Let us rethink the real-world scenarios that require human motion prediction techniques, such as human-robot collaboration. Current works simplify the task of predicting human motions into a one-off process of forecasting a short future sequence (usually no longer than 1 second) based on a historical observed one. However, such simplification may fail to meet practical needs due to the neglect of the fact that motion prediction in real applications is not an isolated "observe then predict" unit, but a consecutive process composed of many rounds of such unit, semi-overlapped along the entire sequence. As time goes on, the predicted part of previous round has its corresponding ground truth observable in the new round, but their deviation in-between is neither exploited nor able to be captured by existing isolated learning fashion. In this paper, we propose DeFeeNet, a simple yet effective network that can be added on existing one-off prediction models to realize deviation perception and feedback when applied to consecutive motion prediction task. At each prediction round, the deviation generated by previous unit is first encoded by our DeFeeNet, and then incorporated into the existing predictor to enable a deviation-aware prediction manner, which, for the first time, allows for information transmit across adjacent prediction units. We design two versions of DeFeeNet as MLP-based and GRU-based, respectively. On Human3.6M and more complicated BABEL, experimental results indicate that our proposed network improves consecutive human motion prediction performance regardless of the basic model.
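The deviation-feedback loop can be illustrated with a deliberately simple toy: linear extrapolation stands in for the one-off predictor, and a scalar mean deviation stands in for DeFeeNet's learned encoding (the real DeFeeNet is an MLP- or GRU-based network). The point is only the control flow: each round's error is observed and fed into the next round:

```python
import numpy as np

def deviation_feedback_rounds(sequence, horizon=2):
    """Toy consecutive predictor with deviation feedback.

    Each round forecasts the next `horizon` values by extrapolating the
    last observed step, then the following round adds the (now observable)
    mean deviation of the previous forecast as a correction signal.
    """
    correction = 0.0
    preds = []
    t = horizon
    while t + horizon <= len(sequence):
        last, prev = sequence[t - 1], sequence[t - 2]
        step = last - prev
        pred = np.array([last + step * (k + 1)
                         for k in range(horizon)]) + correction
        truth = np.asarray(sequence[t:t + horizon])
        correction = float((truth - pred).mean())  # deviation fed forward
        preds.append(pred)
        t += horizon
    return preds
```

On a perfectly linear signal the deviations vanish and the correction stays at zero; on drifting signals the feedback nudges later rounds toward the truth.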

Align and Attend: Multimodal Summarization With Dual Contrastive Losses
He, Bo and Wang, Jun and Qiu, Jielin and Bui, Trung and Shrivastava, Abhinav and Wang, Zhaowen



Research problem: Multimodal summarization aims to extract the most important information from different modalities to form summaries, but existing methods fail to exploit the temporal correspondence and intrinsic correlations between modalities.
Motivation: To address this, we propose Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model that effectively aligns and attends the multimodal input.
Method: We introduce two novel contrastive losses to model inter-sample and intra-sample correlations.
Results: Extensive experiments on two standard video summarization datasets (TVSum and SumMe) and two multimodal summarization datasets (Daily Mail and CNN) show that A2Summ achieves state-of-the-art performance on all of them. We also collect BLiSS, a large-scale multimodal summarization dataset containing livestream videos and transcribed texts with annotated summaries. Our code and dataset are publicly available at https://boheumd.github.io/A2Summ/.

The goal of multimodal summarization is to extract the most important information from different modalities to form summaries. Unlike unimodal summarization, the multimodal summarization task explicitly leverages cross-modal information to help generate more reliable and high-quality summaries. However, existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples. To address this issue, we introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend the multimodal input. In addition, we propose two novel contrastive losses to model both inter-sample and intra-sample correlations. Extensive experiments on two standard video summarization datasets (TVSum and SumMe) and two multimodal summarization datasets (Daily Mail and CNN) demonstrate the superiority of A2Summ, achieving state-of-the-art performances on all datasets. Moreover, we collected a large-scale multimodal summarization dataset BLiSS, which contains livestream videos and transcribed texts with annotated summaries. Our code and dataset are publicly available at https://boheumd.github.io/A2Summ/.

Movies2Scenes: Using Movie Metadata To Learn Scene Representation
Chen, Shixing and Liu, Chun-Hao and Hao, Xiang and Nie, Xiaohan and Arap, Maxim and Hamid, Raffay



Research problem: How to effectively understand scenes in movies for applications such as video moderation, search, and recommendation.
Motivation: Labeling individual scenes is time-consuming, whereas movie-level metadata (e.g., genre, synopsis) is produced as part of the film-making process and is therefore far more commonly available.
Method: We propose a novel contrastive learning approach that uses movie metadata to learn a general-purpose scene representation. Specifically, the metadata defines a measure of movie similarity, which is used during contrastive learning to restrict the search for positive scene pairs to movies considered similar to each other.
Results: The learned scene representation consistently outperforms existing state-of-the-art methods across a diverse set of tasks on multiple benchmark datasets. Notably, it offers an average improvement of 7.9% on the seven classification tasks and 9.7% on the two regression tasks of the LVU dataset. Using a newly collected movie dataset, we further demonstrate the representation's generalizability on a range of video moderation tasks.

Understanding scenes in movies is crucial for a variety of applications such as video moderation, search, and recommendation. However, labeling individual scenes is a time-consuming process. In contrast, movie level metadata (e.g., genre, synopsis, etc.) regularly gets produced as part of the film production process, and is therefore significantly more commonly available. In this work, we propose a novel contrastive learning approach that uses movie metadata to learn a general-purpose scene representation. Specifically, we use movie metadata to define a measure of movie similarity, and use it during contrastive learning to limit our search for positive scene-pairs to only the movies that are considered similar to each other. Our learned scene representation consistently outperforms existing state-of-the-art methods on a diverse set of tasks evaluated using multiple benchmark datasets. Notably, our learned representation offers an average improvement of 7.9% on the seven classification tasks and 9.7% improvement on the two regression tasks in LVU dataset. Furthermore, using a newly collected movie dataset, we present comparative results of our scene representation on a set of video moderation tasks to demonstrate its generalizability on previously less explored tasks.
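The key sampling rule, restricting contrastive positives to scene pairs drawn from similar movies, could look roughly like this. The data layout is hypothetical; the paper's similarity measure is derived from metadata such as genre and synopsis rather than a hand-filled table:

```python
from itertools import combinations

def positive_scene_pairs(scene_movie, movie_sim, threshold=0.5):
    """Keep only scene pairs whose source movies are similar enough.

    scene_movie: list mapping scene index -> movie id
    movie_sim:   dict of (movie_a, movie_b) -> similarity in [0, 1]
    """
    def sim(a, b):
        if a == b:
            return 1.0  # scenes from the same movie are always similar
        return movie_sim.get((a, b), movie_sim.get((b, a), 0.0))
    return [(i, j)
            for i, j in combinations(range(len(scene_movie)), 2)
            if sim(scene_movie[i], scene_movie[j]) >= threshold]
```

The returned pairs would then feed the positive branch of a standard contrastive loss, with all remaining pairs acting as negatives.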

Neural Video Compression With Diverse Contexts
Li, Jiahao and Li, Bin and Lu, Yan



Research problem: How to improve the efficiency and compression ratio of video coding.
Motivation: Traditional codecs spend substantial computation searching for relevant contexts, while the emerging neural video codec (NVC) has limited contexts, leading to low compression ratios.
Method: We propose increasing context diversity in both temporal and spatial dimensions: learning hierarchical quality patterns across frames to enrich long-term, high-quality temporal contexts; introducing a group-based offset diversity on top of the optical-flow-based coding framework for better context mining; and adopting a quadtree-based partition to encode the latent representation in parallel with greater spatial context diversity.
Results: Experiments show the codec saves 23.5% bitrate over the previous SOTA NVC and surpasses ECM, the under-development next-generation traditional codec, in both RGB and YUV420 color spaces in terms of PSNR.

For any video codecs, the coding efficiency highly relies on whether the current signal to be encoded can find the relevant contexts from the previous reconstructed signals. Traditional codec has verified more contexts bring substantial coding gain, but in a time-consuming manner. However, for the emerging neural video codec (NVC), its contexts are still limited, leading to low compression ratio. To boost NVC, this paper proposes increasing the context diversity in both temporal and spatial dimensions. First, we guide the model to learn hierarchical quality patterns across frames, which enriches long-term and yet high-quality temporal contexts. Furthermore, to tap the potential of optical flow-based coding framework, we introduce a group-based offset diversity where the cross-group interaction is proposed for better context mining. In addition, this paper also adopts a quadtree-based partition to increase spatial context diversity when encoding the latent representation in parallel. Experiments show that our codec obtains 23.5% bitrate saving over previous SOTA NVC. Better yet, our codec has surpassed the under-developing next generation traditional codec/ECM in both RGB and YUV420 colorspaces, in terms of PSNR. The codes are at https://github.com/microsoft/DCVC.
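The quadtree-based partition used for parallel latent coding can be sketched as a plain recursive split into quadrants; this is illustrative only, as the codec's actual grouping and cross-group conditioning are more involved:

```python
def quadtree_partition(y0, x0, y1, x1, depth):
    """Recursively split the [y0:y1, x0:x1] latent grid into quadrants.

    Returns a list of (y0, x0, y1, x1) blocks that tile the grid; each
    block could then be encoded in parallel with the others at its level.
    """
    if depth == 0 or y1 - y0 < 2 or x1 - x0 < 2:
        return [(y0, x0, y1, x1)]
    my, mx = (y0 + y1) // 2, (x0 + x1) // 2
    blocks = []
    for a, b, c, d in [(y0, x0, my, mx), (y0, mx, my, x1),
                       (my, x0, y1, mx), (my, mx, y1, x1)]:
        blocks += quadtree_partition(a, b, c, d, depth - 1)
    return blocks
```

Each level of the recursion quadruples the number of independently encodable blocks, which is what gives spatial context diversity without serializing the whole latent map.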

Event-Guided Person Re-Identification via Sparse-Dense Complementary Learning
Cao, Chengzhi and Fu, Xueyang and Liu, Hongjian and Huang, Yukun and Wang, Kunyu and Luo, Jiebo and Zha, Zheng-Jun



Research problem: Video-based person re-identification (Re-ID) is a prominent computer vision topic due to its wide range of video surveillance applications.
Motivation: Existing methods mainly exploit spatial and temporal correlations in frame sequences to obtain discriminative person features. However, inevitable degradations such as motion blur in frames cause ambiguous texture noise and temporal disturbance, losing identity-discriminating cues.
Method: The emerging bio-inspired event camera brings new vitality to the Re-ID task: it asynchronously records intensity changes with microsecond resolution and low latency, accurately capturing pedestrian motion even under the degradations above. We therefore propose a Sparse-Dense Complementary Learning framework that effectively extracts identity features by fully exploiting the complementary information of dense frames and sparse events.
Results: Experiments show that, by employing events and spiking neural networks (SNNs) for Re-ID, our method significantly outperforms competitive methods.

Video-based person re-identification (Re-ID) is a prominent computer vision topic due to its wide range of video surveillance applications. Most existing methods utilize spatial and temporal correlations in frame sequences to obtain discriminative person features. However, inevitable degradations, e.g., motion blur contained in frames often cause ambiguity texture noise and temporal disturbance, leading to the loss of identity-discriminating cues. Recently, a new bio-inspired sensor called event camera, which can asynchronously record intensity changes, brings new vitality to the Re-ID task. With the microsecond resolution and low latency, event cameras can accurately capture the movements of pedestrians even in the aforementioned degraded environments. Inspired by the properties of event cameras, in this work, we propose a Sparse-Dense Complementary Learning Framework, which effectively extracts identity features by fully exploiting the complementary information of dense frames and sparse events. Specifically, for frames, we build a CNN-based module to aggregate the dense features of pedestrian appearance step-by-step, while for event streams, we design a bio-inspired spiking neural backbone, which encodes event signals into sparse feature maps in a spiking form, to present the dynamic motion cues of pedestrians. Finally, a cross feature alignment module is constructed to complementarily fuse motion information from events and appearance cues from frames to enhance identity representation learning. Experiments on several benchmarks show that by employing events and SNN into Re-ID, our method significantly outperforms competitive methods.

Unsupervised Contour Tracking of Live Cells by Mechanical and Cycle Consistency Losses
Jang, Junbong and Lee, Kwonmoo and Kim, Tae-Kyun



Research problem: Analyzing dynamic changes in cellular morphology is essential for understanding the functions and characteristics of live cells.
Motivation: Because of the fluidity of cells and the complex motions of local contour features, including expansion and contraction, local shapes and textures on the contour are hard to observe; hence all points on the highly deformable cellular contour must be tracked in every frame of a live-cell video.
Method: We propose the first deep learning-based contour tracking of cells (or, more generally, viscoelastic materials) with point correspondence, achieved by fusing dense representations between two contours with cross attention.
Results: Quantitative evaluation on two live-cell datasets taken with phase-contrast and confocal fluorescence microscopes shows that our contour tracker outperforms the compared methods and produces qualitatively more favorable results.

Analyzing the dynamic changes of cellular morphology is important for understanding the various functions and characteristics of live cells, including stem cells and metastatic cancer cells. To this end, we need to track all points on the highly deformable cellular contour in every frame of live cell video. Local shapes and textures on the contour are not evident, and their motions are complex, often with expansion and contraction of local contour features. The prior arts for optical flow or deep point set tracking are unsuited due to the fluidity of cells, and previous deep contour tracking does not consider point correspondence. We propose the first deep learning-based tracking of cellular (or more generally viscoelastic materials) contours with point correspondence by fusing dense representation between two contours with cross attention. Since it is impractical to manually label dense tracking points on the contour, unsupervised learning comprised of the mechanical and cyclical consistency losses is proposed to train our contour tracker. The mechanical loss forcing the points to move perpendicular to the contour effectively helps out. For quantitative evaluation, we labeled sparse tracking points along the contour of live cells from two live cell datasets taken with phase contrast and confocal fluorescence microscopes. Our contour tracker quantitatively outperforms compared methods and produces qualitatively more favorable results. Our code and data are publicly available at https://github.com/JunbongJang/contour-tracking/

VideoTrack: Learning To Track Objects via Video Transformer
Xie, Fei and Chu, Lei and Li, Jiahao and Lu, Yan and Ma, Chao



Research problem: Existing Siamese tracking methods are limited in efficiency and industrial deployment because they rely heavily on pair-wise matching between two single frames and need sophisticated extra mechanisms to exploit temporal information among successive video frames.
Motivation: To address this, we turn to sequence-level target matching, which encodes temporal context into spatial features through a feedforward video model.
Method: We adapt the standard video transformer architecture to the tracking task, enabling spatiotemporal feature learning directly from frame-level patch sequences, and blend the spatiotemporal information in video clips through sequential multi-branch triplet blocks to form a video transformer backbone.
Results: Experiments show that our method, named VideoTrack, achieves state-of-the-art results while running in real time.

Existing Siamese tracking methods, which are built on pair-wise matching between two single frames, heavily rely on additional sophisticated mechanism to exploit temporal information among successive video frames, hindering them from high efficiency and industrial deployments. In this work, we resort to sequence-level target matching that can encode temporal contexts into the spatial features through a neat feedforward video model. Specifically, we adapt the standard video transformer architecture to visual tracking by enabling spatiotemporal feature learning directly from frame-level patch sequences. To better adapt to the tracking task, we carefully blend the spatiotemporal information in the video clips through sequential multi-branch triplet blocks, which formulates a video transformer backbone. Our experimental study compares different model variants, such as tokenization strategies, hierarchical structures, and video attention schemes. Then, we propose a disentangled dual-template mechanism that decouples static and dynamic appearance changes over time, and reduces the temporal redundancy in video frames. Extensive experiments show that our method, named as VideoTrack, achieves state-of-the-art results while running in real-time.

Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning
Zhang, Heng and Liu, Daqing and Zheng, Qi and Su, Bing



Research problem: Existing fine-grained video representation learning methods learn frame-wise features while neglecting the inherent dynamic process of each video.
Motivation: We propose a novel process-based contrastive learning framework that learns video representations by discriminating between video processes and capturing the dynamics within each process.
Method: Videos are modeled as stochastic processes: a process-based contrastive loss enforces the embeddings of a target frame sequence to approximate a Brownian bridge in the latent space, enabling both process discrimination and dynamics capture.
Results: Experiments on four datasets show the method performs strongly on various video understanding tasks, including phase progression, phase classification, and frame retrieval.

A meaningful video is semantically coherent and changes smoothly. However, most existing fine-grained video representation learning methods learn frame-wise features by aligning frames across videos or exploring relevance between multiple views, neglecting the inherent dynamic process of each video. In this paper, we propose to learn video representations by modeling Video as Stochastic Processes (VSP) via a novel process-based contrastive learning framework, which aims to discriminate between video processes and simultaneously capture the temporal dynamics in the processes. Specifically, we enforce the embeddings of the frame sequence of interest to approximate a goal-oriented stochastic process, i.e., Brownian bridge, in the latent space via a process-based contrastive loss. To construct the Brownian bridge, we adapt specialized sampling strategies under different annotations for both self-supervised and weakly-supervised learning. Experimental results on four datasets show that VSP stands as a state-of-the-art method for various video understanding tasks, including phase progression, phase classification and frame retrieval. Code is available at 'https://github.com/hengRUC/VSP'.
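The Brownian-bridge constraint has a closed form: pinned at z_0 and z_T, the bridge at time t has mean (1 - t/T) z_0 + (t/T) z_T and variance t(T - t)/T. Below is a minimal sketch of a loss built on that density; it is an assumed simplification of the paper's process-based contrastive loss, keeping only the positive (bridge-fitting) term and dropping the negatives:

```python
import numpy as np

def brownian_bridge_mean_var(z0, zT, t, T):
    """Mean and variance of a Brownian bridge pinned at z0 (t=0), zT (t=T)."""
    alpha = t / T
    return (1 - alpha) * z0 + alpha * zT, t * (T - t) / T

def bridge_contrastive_loss(frames, tau=1.0):
    """Score intermediate frame embeddings against the bridge implied by
    the clip's first and last embeddings; lower means 'more bridge-like'."""
    T = len(frames) - 1
    loss = 0.0
    for t in range(1, T):
        mean, var = brownian_bridge_mean_var(frames[0], frames[-1], t, T)
        loss += np.sum((frames[t] - mean) ** 2) / (2 * var * tau)
    return float(loss / max(T - 1, 1))
```

A clip whose embeddings drift smoothly between its endpoints scores near zero, while one that jumps away from the endpoint interpolation is penalized, matching the intuition that a meaningful video changes smoothly.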

CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning
Cheng, Yiting and Wei, Fangyun and Bao, Jianmin and Chen, Dong and Zhang, Wenqiang



Research problem: This paper addresses sign language retrieval, a new task in sign language understanding comprising text-to-sign-video (T2V) and sign-video-to-text (V2T) retrieval.
Motivation: Sign languages are natural languages carrying rich semantics, so traditional video-text retrieval methods fall short. Moreover, sign language datasets are orders of magnitude smaller than speech recognition datasets, raising a data scarcity issue.
Method: We formulate sign language retrieval as both a cross-lingual retrieval problem and a video-text retrieval task, using cross-lingual contrastive learning to contrast sign videos and natural-language texts in a joint embedding space while identifying fine-grained cross-lingual (i.e., sign-to-word) mappings.
Results: By pre-training a domain-agnostic sign encoder on large-scale sign videos and adapting it to the target domain via pseudo-labeling, the proposed framework outperforms previous methods by large margins on various datasets, e.g., +22.4 T2V and +28.0 V2T R@1 on How2Sign, and +13.7 T2V and +17.1 V2T R@1 on PHOENIX-2014T. Code and models are available at https://github.com/FangyunWei/SLRT.

This work focuses on sign language retrieval--a recently proposed task for sign language understanding. Sign language retrieval consists of two sub-tasks: text-to-sign-video (T2V) retrieval and sign-video-to-text (V2T) retrieval. Different from traditional video-text retrieval, sign language videos, not only contain visual signals but also carry abundant semantic meanings by themselves due to the fact that sign languages are also natural languages. Considering this character, we formulate sign language retrieval as a cross-lingual retrieval problem as well as a video-text retrieval task. Concretely, we take into account the linguistic properties of both sign languages and natural languages, and simultaneously identify the fine-grained cross-lingual (i.e., sign-to-word) mappings while contrasting the texts and the sign videos in a joint embedding space. This process is termed as cross-lingual contrastive learning. Another challenge is raised by the data scarcity issue--sign language datasets are orders of magnitude smaller in scale than that of speech recognition. We alleviate this issue by adopting a domain-agnostic sign encoder pre-trained on large-scale sign videos into the target domain via pseudo-labeling. Our framework, termed as domain-aware sign language retrieval via Cross-lingual Contrastive learning or CiCo for short, outperforms the pioneering method by large margins on various datasets, e.g., +22.4 T2V and +28.0 V2T R@1 improvements on How2Sign dataset, and +13.7 T2V and +17.1 V2T R@1 improvements on PHOENIX-2014T dataset. Code and models are available at: https://github.com/FangyunWei/SLRT.

Relational Space-Time Query in Long-Form Videos
Yang, Xitong and Chu, Fu-Jen and Feiszli, Matt and Goyal, Raghav and Torresani, Lorenzo and Tran, Du



Research problem: Current video benchmarks study activities, objects, and their interactions independently and on short, curated clips, whereas real-world applications such as AR assistants require bundling these problems for both model development and evaluation.
Motivation: To address this, we propose studying these problems in a joint framework for long-video understanding.
Method: First, we propose an integrated framework, Relational Space-Time Query (ReST), which evaluates video understanding models via templated spatiotemporal queries. Second, we introduce two new benchmarks, ReST-ADL and ReST-Ego4D, which augment existing egocentric video datasets with abundant query annotations generated by the ReST framework.
Results: We view the ReST framework and benchmarks as a step towards comprehensive, multi-step reasoning in long videos, and believe they will facilitate the development of the next generation of video understanding models.

Egocentric videos are often available in the form of uninterrupted, uncurated long videos capturing the camera wearers' daily life activities.Understanding these videos requires models to be able to reason about activities, objects, and their interactions. However, current video benchmarks study these problems independently and under short, curated clips. In contrast, real-world applications, e.g., AR assistants, require bundling these problems for both model development and evaluation. In this paper, we propose to study these problems in a joint framework for long video understanding. Our contributions are three-fold. First, we propose an integrated framework, namely Relational Space-Time Query (ReST), for evaluating video understanding models via templated spatiotemporal queries. Second, we introduce two new benchmarks, ReST-ADL and ReST-Ego4D, which augment the existing egocentric video datasets with abundant query annotations generated by the ReST framework. Finally, we present a set of baselines and in-depth analysis on the two benchmarks and provide insights about the query tasks. We view our integrated framework and benchmarks as a step towards comprehensive, multi-step reasoning in long videos, and believe it will facilitate the development of next generations of video understanding models.

Video Dehazing via a Multi-Range Temporal Alignment Network With Physical Prior
Xu, Jiaqi and Hu, Xiaowei and Zhu, Lei and Dou, Qi and Dai, Jifeng and Qiao, Yu and Heng, Pheng-Ann



Research problem: This paper proposes a new video dehazing framework that recovers haze-free frames with high visibility and contrast.
Motivation: Existing video dehazing methods often neglect physical haze priors and the aggregation of temporal information, yielding suboptimal results.
Method: We design a memory-based physical prior guidance module that encodes prior-related features into long-range memory, and a multi-range scene radiance recovery module that captures space-time dependencies over multiple ranges to effectively aggregate temporal information from adjacent frames.
Results: We also construct the first large-scale outdoor video dehazing benchmark dataset, covering various real-world scenarios. Experiments under both synthetic and real conditions show the superiority of our method.

Video dehazing aims to recover haze-free frames with high visibility and contrast. This paper presents a novel framework to effectively explore the physical haze priors and aggregate temporal information. Specifically, we design a memory-based physical prior guidance module to encode the prior-related features into long-range memory. Besides, we formulate a multi-range scene radiance recovery module to capture space-time dependencies in multiple space-time ranges, which helps to effectively aggregate temporal information from adjacent frames. Moreover, we construct the first large-scale outdoor video dehazing benchmark dataset, which contains videos in various real-world scenarios. Experimental results on both synthetic and real conditions show the superiority of our proposed method.

BiFormer: Learning Bilateral Motion Estimation via Bilateral Transformer for 4K Video Frame Interpolation
Park, Junheum and Kim, Jintae and Kim, Chang-Su



Research problem: Proposing a bilateral-transformer-based 4K video frame interpolator.
Motivation: To improve on existing video frame interpolation methods and deliver better interpolation quality.
Method: Interpolation proceeds in three steps: global motion estimation, local motion refinement, and frame synthesis. Global motion estimation uses a bilateral transformer to predict symmetric bilateral motion fields; local motion refinement efficiently refines the global fields using blockwise bilateral cost volumes; finally, the input frames are warped with the refined motion fields and blended to synthesize the intermediate frame.
Results: Experiments demonstrate that the proposed BiFormer algorithm achieves excellent interpolation performance on 4K datasets.

A novel 4K video frame interpolator based on bilateral transformer (BiFormer) is proposed in this paper, which performs three steps: global motion estimation, local motion refinement, and frame synthesis. First, in global motion estimation, we predict symmetric bilateral motion fields at a coarse scale. To this end, we propose BiFormer, the first transformer-based bilateral motion estimator. Second, we refine the global motion fields efficiently using blockwise bilateral cost volumes (BBCVs). Third, we warp the input frames using the refined motion fields and blend them to synthesize an intermediate frame. Extensive experiments demonstrate that the proposed BiFormer algorithm achieves excellent interpolation performance on 4K datasets. The source codes are available at https://github.com/JunHeum/BiFormer.
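The final warp-and-blend step can be illustrated in one dimension with a symmetric motion field: the middle frame samples frame 0 at x - V/2 and frame 1 at x + V/2, then averages the two. This nearest-neighbour toy is only a sketch of the general idea, not BiFormer's actual warping or blending:

```python
import numpy as np

def warp_1d(frame, flow):
    """Backward-warp a 1-D 'frame': output[x] = frame[x + flow[x]]."""
    idx = np.clip(np.round(np.arange(len(frame)) + flow).astype(int),
                  0, len(frame) - 1)
    return frame[idx]

def synthesize_middle(frame0, frame1, motion):
    """Symmetric bilateral interpolation of the temporal midpoint."""
    half = motion / 2.0
    w0 = warp_1d(frame0, -half)   # sample frame0 at x - V/2
    w1 = warp_1d(frame1, half)    # sample frame1 at x + V/2
    return 0.5 * (w0 + w1)
```

With an object moving two pixels between the inputs, a constant motion of 2.0 places the synthesized object exactly one pixel along, i.e., at the temporal midpoint.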

Learning From Unique Perspectives: User-Aware Saliency Modeling
Chen, Shi and Valliappan, Nachiappan and Shen, Shaolei and Ye, Xinyu and Kohlhoff, Kai and He, Junfeng



Research problem: How to exploit visual preferences for user-aware saliency modeling.
Motivation: Existing saliency models typically target the general population and ignore the variability between users' behaviors. Understanding visual preferences enables finer-grained analysis of different users' attention patterns and can benefit the development of customized applications.
Method: We present a new model that flexibly captures the attention patterns of various combinations of users, adaptively predicting personalized attention, user-group attention, and general saliency at the same time; a principled, progressive learning method further augments the model with knowledge about how attention composes across different users.
Results: Experiments on diverse stimuli, including naturalistic images and web pages, show the method effectively captures the distinct visual behaviors of different users and the general saliency of visual stimuli.

Everyone is unique. Given the same visual stimuli, people's attention is driven by both salient visual cues and their own inherent preferences. Knowledge of visual preferences not only facilitates understanding of fine-grained attention patterns of diverse users, but also has the potential of benefiting the development of customized applications. Nevertheless, existing saliency models typically limit their scope to attention as it applies to the general population and ignore the variability between users' behaviors. In this paper, we identify the critical roles of visual preferences in attention modeling, and for the first time study the problem of user-aware saliency modeling. Our work aims to advance attention research from three distinct perspectives: (1) We present a new model with the flexibility to capture attention patterns of various combinations of users, so that we can adaptively predict personalized attention, user group attention, and general saliency at the same time with one single model; (2) To augment models with knowledge about the composition of attention from different users, we further propose a principled learning method to understand visual attention in a progressive manner; and (3) We carry out extensive analyses on publicly available saliency datasets to shed light on the roles of visual preferences. Experimental results on diverse stimuli, including naturalistic images and web pages, demonstrate the advantages of our method in capturing the distinct visual behaviors of different users and the general saliency of visual stimuli.

MoStGAN-V: Video Generation With Temporal Motion Styles
Shen, Xiaoqian and Li, Xiang and Elhoseiny, Mohamed



Research problem: Video generation remains difficult due to spatiotemporal complexity and the need to synthesize diverse motions with temporal coherence.
Motivation: Existing methods generate videos of arbitrary length either autoregressively or by treating time as a continuous signal, yet they struggle to synthesize detailed, diverse motion with temporal coherence and tend to produce repetitive scenes after a few time steps.
Method: We introduce additional time-dependent motion styles to model diverse motion patterns, together with a motion style attention modulation mechanism, dubbed MoStAtt, that augments frames with vivid dynamics at each specific scale (i.e., layer).
Results: Experiments show state-of-the-art performance on four unconditional 256^2 video synthesis benchmarks, trained with only 3 frames per clip, along with better dynamic motion quality.

Video generation remains a challenging task due to spatiotemporal complexity and the requirement of synthesizing diverse motions with temporal consistency. Previous works attempt to generate videos in arbitrary lengths either in an autoregressive manner or regarding time as a continuous signal. However, they struggle to synthesize detailed and diverse motions with temporal coherence and tend to generate repetitive scenes after a few time steps. In this work, we argue that a single time-agnostic latent vector of style-based generator is insufficient to model various and temporally-consistent motions. Hence, we introduce additional time-dependent motion styles to model diverse motion patterns. In addition, a Motion Style Attention modulation mechanism, dubbed as MoStAtt, is proposed to augment frames with vivid dynamics for each specific scale (i.e., layer), which assigns attention score for each motion style w.r.t deconvolution filter weights in the target synthesis layer and softly attends different motion styles for weight modulation. Experimental results show our model achieves state-of-the-art performance on four unconditional 256^2 video synthesis benchmarks trained with only 3 frames per clip and produces better qualitative results with respect to dynamic motions. Code and videos have been made available at https://github.com/xiaoqian-shen/MoStGAN-V.

ARKitTrack: A New Diverse Dataset for Tracking Using Mobile RGB-D Data
Zhao, Haojie and Chen, Junsong and Wang, Lijun and Lu, Huchuan



Research problem: Compared with traditional RGB-only visual tracking, few datasets exist for RGB-D tracking; this paper proposes ARKitTrack, a new RGB-D tracking dataset.
Motivation: RGB-D tracking datasets are scarce, yet RGB-D data provide richer information that can improve tracking accuracy and robustness.
Method: RGB-D sequences of static and dynamic scenes are captured with consumer-grade LiDAR scanners to build ARKitTrack, which contains 300 RGB-D sequences, 455 targets, and 229.7K video frames, along with bounding-box annotations, frame-level attributes, and 123.9K pixel-level target masks.
Results: Experiments on ARKitTrack verify that the dataset facilitates RGB-D tracking. A new baseline that integrates RGB features with bird's-eye-view representations to better exploit cross-modality 3D geometry outperforms existing methods in tracking accuracy and robustness.

Compared with traditional RGB-only visual tracking, few datasets have been constructed for RGB-D tracking. In this paper, we propose ARKitTrack, a new RGB-D tracking dataset for both static and dynamic scenes captured by consumer-grade LiDAR scanners equipped on Apple's iPhone and iPad. ARKitTrack contains 300 RGB-D sequences, 455 targets, and 229.7K video frames in total. Along with the bounding box annotations and frame-level attributes, we also annotate this dataset with 123.9K pixel-level target masks. Besides, the camera intrinsic and camera pose of each frame are provided for future developments. To demonstrate the potential usefulness of this dataset, we further present a unified baseline for both box-level and pixel-level tracking, which integrates RGB features with bird's-eye-view representations to better explore cross-modality 3D geometry. In-depth empirical analysis has verified that the ARKitTrack dataset can significantly facilitate RGB-D tracking and that the proposed baseline method compares favorably against the state of the arts. The source code and dataset will be released.

Learning Action Changes by Measuring Verb-Adverb Textual Relationships
Moltisanti, Davide and Keller, Frank and Bilen, Hakan and Sevilla-Lara, Laura



Research problem: Understanding how actions are performed in videos, i.e., given a video, predicting the adverb that modifies the action (e.g., cut "finely").
Motivation: Existing datasets for adverb recognition are either noisy, which makes learning difficult, or contain actions whose appearance is not influenced by adverbs, which makes evaluation unreliable.
Method: The problem is cast as a regression task: textual relationships between verbs and adverbs are measured to generate a regression target representing the action change to be learned. The approach is tested on a range of datasets and achieves state-of-the-art results on both adverb prediction and antonym classification.
Results: A new high-quality dataset, Adverbs in Recipes (AIR), is collected. Models learn better from AIR's cleaner videos, while adverb prediction on AIR remains challenging, showing considerable room for improvement.

The goal of this work is to understand the way actions are performed in videos. That is, given a video, we aim to predict an adverb indicating a modification applied to the action (e.g. cut "finely"). We cast this problem as a regression task. We measure textual relationships between verbs and adverbs to generate a regression target representing the action change we aim to learn. We test our approach on a range of datasets and achieve state-of-the-art results on both adverb prediction and antonym classification. Furthermore, we outperform previous work when we lift two commonly assumed conditions: the availability of action labels during testing and the pairing of adverbs as antonyms. Existing datasets for adverb recognition are either noisy, which makes learning difficult, or contain actions whose appearance is not influenced by adverbs, which makes evaluation less reliable. To address this, we collect a new high quality dataset: Adverbs in Recipes (AIR). We focus on instructional recipes videos, curating a set of actions that exhibit meaningful visual changes when performed differently. Videos in AIR are more tightly trimmed and were manually reviewed by multiple annotators to ensure high labelling quality. Results show that models learn better from AIR given its cleaner videos. At the same time, adverb prediction on AIR is challenging, demonstrating that there is considerable room for improvement.

Feature Aggregated Queries for Transformer-Based Video Object Detectors
Cui, Yiming



Research problem: Video object detection must handle feature degradation that rarely occurs in the image domain.
Motivation: Existing Transformer-based video object detectors still follow the same pipeline as classical object detectors, e.g., enhancing object feature representations by aggregation.
Method: We propose a vanilla query aggregation module that weighted-averages the queries according to the features of neighboring frames, and extend it to a more practical version that generates and aggregates queries according to the features of the input frames.
Results: On the challenging ImageNet VID benchmark, current state-of-the-art Transformer-based object detectors improve by more than 2.4% mAP and 4.2% AP50 when integrated with the proposed modules.

Video object detection needs to solve feature degradation situations that rarely happen in the image domain. One solution is to use the temporal information and fuse the features from the neighboring frames. With Transformer-based object detectors getting a better performance on the image domain tasks, recent works began to extend those methods to video object detection. However, those existing Transformer-based video object detectors still follow the same pipeline as those used for classical object detectors, like enhancing the object feature representations by aggregation. In this work, we take a different perspective on video object detection. In detail, we improve the qualities of queries for the Transformer-based models by aggregation. To achieve this goal, we first propose a vanilla query aggregation module that weighted averages the queries according to the features of the neighboring frames. Then, we extend the vanilla module to a more practical version, which generates and aggregates queries according to the features of the input frames. Extensive experimental results validate the effectiveness of our proposed methods: On the challenging ImageNet VID benchmark, when integrated with our proposed modules, the current state-of-the-art Transformer-based object detectors can be improved by more than 2.4% on mAP and 4.2% on AP50.
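The vanilla query aggregation step can be pictured as a similarity-weighted average. The following sketch assumes a simple softmax over dot-product similarities between the current frame feature and each neighbor's feature (an illustrative choice, not necessarily the paper's exact weighting):

```python
import math

def aggregate_queries(neighbor_queries, frame_feat, neighbor_feats):
    # Similarity between the current frame feature and each neighbor feature.
    sims = [sum(a * b for a, b in zip(frame_feat, f)) for f in neighbor_feats]
    # Softmax the similarities into aggregation weights.
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted average of the neighboring frames' queries.
    dim = len(neighbor_queries[0])
    return [sum(w * q[i] for w, q in zip(weights, neighbor_queries))
            for i in range(dim)]
```

Neighbors whose features resemble the current frame dominate the averaged query, which is how degraded frames can borrow cleaner queries from their neighbors.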

Decomposed Cross-Modal Distillation for RGB-Based Temporal Action Detection
Lee, Pilhyeon and Kim, Taeoh and Shim, Minho and Wee, Dongyoon and Byun, Hyeran



Research problem: This paper addresses the high computational cost and slow inference of temporal action detection.
Motivation: Existing two-stream models rely on computationally expensive optical flow, leading to slow inference.
Method: A decomposed cross-modal distillation framework builds a strong RGB-based detector by transferring knowledge from the motion modality. Specifically, RGB and motion representations are learned separately and then combined to perform action localization.
Results: Extensive experiments verify that the method is highly effective in enhancing RGB-based action detectors. Notably, the framework is agnostic to the model combination and brings consistent gains.

Temporal action detection aims to predict the time intervals and the classes of action instances in the video. Despite the promising performance, existing two-stream models exhibit slow inference speed due to their reliance on computationally expensive optical flow. In this paper, we introduce a decomposed cross-modal distillation framework to build a strong RGB-based detector by transferring knowledge of the motion modality. Specifically, instead of direct distillation, we propose to separately learn RGB and motion representations, which are in turn combined to perform action localization. The dual-branch design and the asymmetric training objectives enable effective motion knowledge transfer while preserving RGB information intact. In addition, we introduce a local attentive fusion to better exploit the multimodal complementarity. It is designed to preserve the local discriminability of the features that is important for action localization. Extensive experiments on the benchmarks verify the effectiveness of the proposed method in enhancing RGB-based action detectors. Notably, our framework is agnostic to backbones and detection heads, bringing consistent gains across different model combinations.

Event-Based Frame Interpolation With Ad-Hoc Deblurring
Sun, Lei and Sakaridis, Christos and Liang, Jingyun and Sun, Peng and Cao, Jiezhang and Zhang, Kai and Jiang, Qi and Wang, Kaiwei and Van Gool, Luc



Research problem: The performance of video frame interpolation is inherently tied to the ability to handle motion in the input scene.
Motivation: Although prior works recognize the utility of asynchronous event information for this task, they ignore that motion may or may not blur the input video, depending on the frames' exposure time and the motion speed. They either assume the input video is sharp, restricting themselves to frame interpolation, or assume it is blurry and insert an explicit, separate deblurring stage before interpolation in their pipeline.
Method: We propose a general event-based frame interpolation method that performs deblurring ad hoc and thus works on both sharp and blurry input videos. The model is a bidirectional recurrent network that naturally incorporates the temporal dimension of interpolation and adaptively fuses information from the input frames and the events based on their temporal proximity. We also introduce a novel real-world high-resolution dataset with events and color videos, providing a challenging evaluation setting for the task.
Results: Extensive experiments on the standard GoPro benchmark and on our dataset show that the network consistently outperforms previous state-of-the-art methods on frame interpolation, single-image deblurring, and the joint interpolation-and-deblurring task. Our code and dataset will be available at https://github.com/AHupuJR/REFID.

The performance of video frame interpolation is inherently correlated with the ability to handle motion in the input scene. Even though previous works recognize the utility of asynchronous event information for this task, they ignore the fact that motion may or may not result in blur in the input video to be interpolated, depending on the length of the exposure time of the frames and the speed of the motion, and assume either that the input video is sharp, restricting themselves to frame interpolation, or that it is blurry, including an explicit, separate deblurring stage before interpolation in their pipeline. We instead propose a general method for event-based frame interpolation that performs deblurring ad-hoc and thus works both on sharp and blurry input videos. Our model consists in a bidirectional recurrent network that naturally incorporates the temporal dimension of interpolation and fuses information from the input frames and the events adaptively based on their temporal proximity. In addition, we introduce a novel real-world high-resolution dataset with events and color videos which provides a challenging evaluation setting for the examined task. Extensive experiments on the standard GoPro benchmark and on our dataset show that our network consistently outperforms previous state-of-the-art methods on frame interpolation, single image deblurring and the joint task of interpolation and deblurring. Our code and dataset will be available at https://github.com/AHupuJR/REFID.

Egocentric Video Task Translation
Xue, Zihui and Song, Yale and Grauman, Kristen and Torresani, Lorenzo



Research problem: This paper addresses the lack of a unified treatment of different video understanding tasks.
Motivation: Video understanding tasks are typically handled in isolation, yet wearable cameras offer a continuous, goal-driven, immersive egocentric view of a person interacting with the world, which calls for a more unified approach.
Method: EgoTask Translation (EgoT2) takes models optimized on separate tasks and learns to translate their outputs for improved performance on any or all tasks at once. Unlike traditional transfer or multi-task learning, EgoT2's "flipped design" pairs a task translator shared across all tasks with task-specific backbones, capturing synergies between even heterogeneous tasks and mitigating task competition.
Results: Demonstrated on a wide array of Ego4D video tasks, EgoT2 outperforms existing transfer paradigms and achieves top-ranked results on four of the Ego4D 2022 benchmark challenges.

Different video understanding tasks are typically treated in isolation, and even with distinct types of curated data (e.g., classifying sports in one dataset, tracking animals in another). However, in wearable cameras, the immersive egocentric perspective of a person engaging with the world around them presents an interconnected web of video understanding tasks---hand-object manipulations, navigation in the space, or human-human interactions---that unfold continuously, driven by the person's goals. We argue that this calls for a much more unified approach. We propose EgoTask Translation (EgoT2), which takes a collection of models optimized on separate tasks and learns to translate their outputs for improved performance on any or all of them at once. Unlike traditional transfer or multi-task learning, EgoT2's "flipped design" entails separate task-specific backbones and a task translator shared across all tasks, which captures synergies between even heterogeneous tasks and mitigates task competition. Demonstrating our model on a wide array of video tasks from Ego4D, we show its advantages over existing transfer paradigms and achieve top-ranked results on four of the Ego4D 2022 benchmark challenges.

AdamsFormer for Spatial Action Localization in the Future
Chi, Hyung-gun and Lee, Kwonjoon and Agarwal, Nakul and Xu, Yi and Ramani, Karthik and Choi, Chiho



Research problem: How to accurately predict the locations of actions in future frames.
Motivation: Predicting future action locations is vital for applications such as human-robot collaboration, but existing methods still leave room for improvement in this area.
Method: A new task, spatial action localization in the future (SALF), is introduced and tackled with the concept of NeuralODE, which models the latent dynamics of sequential data by solving ordinary differential equations with neural networks.
Results: The proposed AdamsFormer outperforms existing long-range temporal modeling methods on the UCF101-24 and JHMDB-21 datasets, significantly improving frame-mAP.

Predicting future action locations is vital for applications like human-robot collaboration. While some computer vision tasks have made progress in predicting human actions, accurately localizing these actions in future frames remains an area with room for improvement. We introduce a new task called spatial action localization in the future (SALF), which aims to predict action locations in both observed and future frames. SALF is challenging because it requires understanding the underlying physics of video observations to predict future action locations accurately. To address SALF, we use the concept of NeuralODE, which models the latent dynamics of sequential data by solving ordinary differential equations (ODE) with neural networks. We propose a novel architecture, AdamsFormer, which extends observed frame features to future time horizons by modeling continuous temporal dynamics through ODE solving. Specifically, we employ the Adams method, a multi-step approach that efficiently uses information from previous steps without discarding it. Our extensive experiments on UCF101-24 and JHMDB-21 datasets demonstrate that our proposed model outperforms existing long-range temporal modeling methods by a significant margin in terms of frame-mAP.
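The Adams method the abstract refers to is a multi-step ODE integrator that reuses derivatives from previous steps rather than discarding them. A minimal two-step Adams-Bashforth sketch (a standard numerical scheme, not the paper's learned variant) illustrates the idea:

```python
import math

def adams_bashforth2(f, y0, t0, h, steps):
    """Two-step Adams-Bashforth integration of y' = f(t, y)."""
    t, y = t0, y0
    f_prev = f(t, y)
    y = y + h * f_prev          # bootstrap the multi-step scheme with one Euler step
    t += h
    traj = [y0, y]
    for _ in range(steps - 1):
        f_curr = f(t, y)
        # Reuse the previous step's derivative instead of discarding it.
        y = y + h * (1.5 * f_curr - 0.5 * f_prev)
        f_prev = f_curr
        t += h
        traj.append(y)
    return traj

# Integrate y' = y from y(0) = 1 up to t = 1; the result approximates e.
traj = adams_bashforth2(lambda t, y: y, 1.0, 0.0, 0.1, 10)
```

In AdamsFormer the same multi-step principle is applied to latent frame features, extending them to future time horizons.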

Learning Discriminative Representations for Skeleton Based Action Recognition
Zhou, Huanyu and Liu, Qingjie and Wang, Yunhong



Research problem: Human action recognition aims to classify the category of a human action from a video segment.
Motivation: Although GCN-based models operating on skeleton data are more efficient and robust than other modalities such as RGB frames, skeleton data discard important clues like related objects, so ambiguous actions that are hard to distinguish tend to be misclassified.
Method: An auxiliary feature refinement head (FR Head), consisting of spatial-temporal decoupling and contrastive feature refinement, is proposed to obtain discriminative skeleton representations. Ambiguous samples are dynamically discovered and calibrated in the feature space. Furthermore, FR Head can be imposed on different stages of the GCN to build multi-level refinement for stronger supervision.
Results: Extensive experiments on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets show results competitive with state-of-the-art methods and better discrimination of ambiguous samples. Code is available at https://github.com/zhysora/FR-Head.

Human action recognition aims at classifying the category of human action from a segment of a video. Recently, people have dived into designing GCN-based models to extract features from skeletons for performing this task, because skeleton representations are much more efficient and robust than other modalities such as RGB frames. However, when employing the skeleton data, some important clues like related items are also discarded. It results in some ambiguous actions that are hard to be distinguished and tend to be misclassified. To alleviate this problem, we propose an auxiliary feature refinement head (FR Head), which consists of spatial-temporal decoupling and contrastive feature refinement, to obtain discriminative representations of skeletons. Ambiguous samples are dynamically discovered and calibrated in the feature space. Furthermore, FR Head could be imposed on different stages of GCNs to build a multi-level refinement for stronger supervision. Extensive experiments are conducted on NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets. Our proposed models obtain competitive results from state-of-the-art methods and can help to discriminate those ambiguous samples. Codes are available at https://github.com/zhysora/FR-Head.

Token Turing Machines
Ryoo, Michael S. and Gopalakrishnan, Keerthana and Kahatapitiya, Kumara and Xiao, Ted and Rao, Kanishka and Stone, Austin and Lu, Yao and Ibarz, Julian and Arnab, Anurag



Research problem: Token Turing Machines (TTM), a sequential, autoregressive Transformer model with memory, is proposed for real-world sequential visual understanding.
Motivation: Inspired by the Neural Turing Machine, an external memory module consisting of a set of tokens that summarize the previous history is designed to address the high computational cost of processing long sequences.
Method: A Transformer serves as the processing unit/controller that efficiently addresses, reads, and writes the memory at each step. The memory module ensures that a new observation is processed only together with the memory contents (not the entire history), so long sequences are handled with a bounded computational cost per step.
Results: Experiments show that TTM outperforms alternatives such as other Transformer models designed for long sequences and recurrent neural networks on two real-world sequential visual understanding tasks: online temporal activity detection from videos and vision-based robot action policy learning.

We propose Token Turing Machines (TTM), a sequential, autoregressive Transformer model with memory for real-world sequential visual understanding. Our model is inspired by the seminal Neural Turing Machine, and has an external memory consisting of a set of tokens which summarise the previous history (i.e., frames). This memory is efficiently addressed, read and written using a Transformer as the processing unit/controller at each step. The model's memory module ensures that a new observation will only be processed with the contents of the memory (and not the entire history), meaning that it can efficiently process long sequences with a bounded computational cost at each step. We show that TTM outperforms other alternatives, such as other Transformer models designed for long sequences and recurrent neural networks, on two real-world sequential visual understanding tasks: online temporal activity detection from videos and vision-based robot action policy learning. Code is publicly available at: https://github.com/google-research/scenic/tree/main/scenic/projects/token_turing.
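The bounded-cost property comes from the read/write interface: each step consumes the memory plus the new observation and writes back a fixed-size memory. The toy sketch below captures only that interface; the "controller" here is a recency-based truncation standing in for the paper's learned Transformer read/write:

```python
def ttm_step(memory, observation, mem_size):
    combined = memory + observation      # read: memory tokens ++ new observation tokens
    output = combined[-1]                # stand-in for the controller's output
    new_memory = combined[-mem_size:]    # write: keep only a bounded token set
    return output, new_memory

# The per-step cost stays constant no matter how long the stream runs,
# because each step sees at most mem_size + |observation| tokens.
memory = []
for step in range(1000):
    out, memory = ttm_step(memory, [step], mem_size=8)
```

In the real model, the write step is a learned summarization of old and new tokens rather than a simple truncation.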

Learning Event Guided High Dynamic Range Video Reconstruction
Yang, Yixin and Han, Jin and Liang, Jinxiu and Sato, Imari and Shi, Boxin



Research problem: How to reconstruct HDR video from the visual signals captured by event cameras and conventional cameras.
Motivation: Conventional frame-based HDR video reconstruction suffers from exposure-ratio balancing and ghosting artifacts in dynamic scenes, whereas event cameras provide much higher dynamic range and temporal resolution, offering effective guidance for HDR imaging from LDR videos.
Method: A multimodal learning framework with a representation alignment strategy that learns a shared latent space, and a fusion module tailored to complementing the two types of signals across different dynamic ranges in different regions. Temporal correlations are also exploited to suppress flickering in the reconstructed HDR video.
Results: The proposed HDRev-Net achieves state-of-the-art performance on both synthetic and real-world data.

Limited by the trade-off between frame rate and exposure time when capturing moving scenes with conventional cameras, frame based HDR video reconstruction suffers from scene-dependent exposure ratio balancing and ghosting artifacts. Event cameras provide an alternative visual representation with a much higher dynamic range and temporal resolution free from the above issues, which could be an effective guidance for HDR imaging from LDR videos. In this paper, we propose a multimodal learning framework for event guided HDR video reconstruction. In order to better leverage the knowledge of the same scene from the two modalities of visual signals, a multimodal representation alignment strategy to learn a shared latent space and a fusion module tailored to complementing two types of signals for different dynamic ranges in different regions are proposed. Temporal correlations are utilized recurrently to suppress the flickering effects in the reconstructed HDR video. The proposed HDRev-Net demonstrates state-of-the-art performance quantitatively and qualitatively for both synthetic and real-world data.

CASP-Net: Rethinking Video Saliency Prediction From an Audio-Visual Consistency Perceptual Perspective
Xiong, Junwen and Wang, Ganglai and Zhang, Peng and Huang, Wei and Zha, Yufei and Zhai, Guangtao



Research problem: How to exploit the audio stream for video saliency prediction, imitating the selective attention mechanism of the human brain.
Motivation: Most video saliency prediction methods focus only on the semantic correlation between the visual and auditory modalities, ignoring the negative effects of the intrinsic temporal inconsistency between audio and vision.
Method: A consistency-aware audio-visual saliency prediction network (CASP-Net) jointly considers audio-visual semantic interaction and consistent perception. A two-stream encoder elegantly associates video frames with the corresponding sound source, and a novel consistency-aware predictive coding iteratively improves the consistency of the audio and visual representations. A saliency decoder is introduced to further aggregate multi-scale audio-visual information and generate the final saliency map.
Results: The proposed CASP-Net outperforms other state-of-the-art methods on six challenging audio-visual eye-tracking datasets.

Incorporating the audio stream enables Video Saliency Prediction (VSP) to imitate the selective attention mechanism of human brain. By focusing on the benefits of joint auditory and visual information, most VSP methods are capable of exploiting semantic correlation between vision and audio modalities but ignoring the negative effects due to the temporal inconsistency of audio-visual intrinsics. Inspired by the biological inconsistency-correction within multi-sensory information, in this study, a consistency-aware audio-visual saliency prediction network (CASP-Net) is proposed, which takes a comprehensive consideration of the audio-visual semantic interaction and consistent perception. In addition a two-stream encoder for elegant association between video frames and corresponding sound source, a novel consistency-aware predictive coding is also designed to improve the consistency within audio and visual representations iteratively. To further aggregate the multi-scale audio-visual information, a saliency decoder is introduced for the final saliency map generation. Substantial experiments demonstrate that the proposed CASP-Net outperforms the other state-of-the-art methods on six challenging audio-visual eye-tracking datasets. For a demo of our system please see https://woshihaozhu.github.io/CASP-Net/.

Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition From Egocentric RGB Videos
Wen, Yilin and Pan, Hao and Yang, Lei and Pan, Jia and Komura, Taku and Wang, Wenping



Research problem: Understanding dynamic hand motions and actions from egocentric RGB videos, which is challenging due to self-occlusion and ambiguity.
Motivation: Self-occlusion and ambiguity make this a fundamental yet difficult task.
Method: A transformer-based framework exploits temporal information for robust estimation. A network hierarchy with two cascaded transformer encoders is built: the first exploits short-term temporal cues for hand pose estimation, and the second aggregates per-frame pose and object information over a longer time span to recognize the action.
Results: The method achieves competitive results on two first-person hand action benchmarks (FPHA and H2O), and extensive ablation studies verify the design choices.

Understanding dynamic hand motions and actions from egocentric RGB videos is a fundamental yet challenging task due to self-occlusion and ambiguity. To address occlusion and ambiguity, we develop a transformer-based framework to exploit temporal information for robust estimation. Noticing the different temporal granularity of and the semantic correlation between hand pose estimation and action recognition, we build a network hierarchy with two cascaded transformer encoders, where the first one exploits the short-term temporal cue for hand pose estimation, and the latter aggregates per-frame pose and object information over a longer time span to recognize the action. Our approach achieves competitive results on two first-person hand action benchmarks, namely FPHA and H2O. Extensive ablation studies verify our design choices.

Simultaneously Short- and Long-Term Temporal Modeling for Semi-Supervised Video Semantic Segmentation
Lao, Jiangwei and Hong, Weixiang and Guo, Xin and Zhang, Yingying and Wang, Jian and Chen, Jingdong and Chu, Wei



Research problem: How to exploit unlabeled frames to reduce the cost of video semantic segmentation.
Motivation: Existing methods use unlabeled frames mainly by assigning pseudo labels or performing feature enhancement, but they focus on short-term correspondence and overlook long-term temporal correlation.
Method: This paper proposes a novel feature enhancement network that simultaneously models short- and long-term temporal correlation. Compared with methods that only exploit short-term correspondence, the long-term correlation obtained from distant frames effectively expands the temporal perception field and provides richer contextual priors. More importantly, modeling adjacent and distant frames together alleviates the risk of over-fitting, producing high-quality feature representations for distant unlabeled frames in the training set and unseen videos in the test set.
Results: With only one annotated frame per video, the method outperforms state-of-the-art methods by 2%-3% mIoU on the challenging VSPW dataset. Moreover, when combined with a pseudo-label-based method such as MeanTeacher, the final model is only 0.13% mIoU below the ceiling performance of manually annotating every frame.

In order to tackle video semantic segmentation task at a lower cost, e.g., only one frame annotated per video, lots of efforts have been devoted to investigate the utilization of those unlabeled frames by either assigning pseudo labels or performing feature enhancement. In this work, we propose a novel feature enhancement network to simultaneously model short- and long-term temporal correlation. Compared with existing work that only leverage short-term correspondence, the long-term temporal correlation obtained from distant frames can effectively expand the temporal perception field and provide richer contextual prior. More importantly, modeling adjacent and distant frames together can alleviate the risk of over-fitting, hence produce high-quality feature representation for the distant unlabeled frames in training set and unseen videos in testing set. To this end, we term our method SSLTM, short for Simultaneously Short- and Long-Term Temporal Modeling. In the setting of only one frame annotated per video, SSLTM significantly outperforms the state-of-the-art methods by 2% 3% mIoU on the challenging VSPW dataset. Furthermore, when working with a pseudo label based method such as MeanTeacher, our final model only exhibits 0.13% mIoU less than the ceiling performance (i.e., all frames are manually annotated).

Conditional Generation of Audio From Video via Foley Analogies
Du, Yuexi and Chen, Ziyang and Salamon, Justin and Russell, Bryan and Owens, Andrew



Research problem: How to create matching sound effects for a video whose designed soundtrack differs from the scene's true sound.
Motivation: To address the mismatch between designed video sound effects and real scene sounds, the problem of conditional Foley is proposed.
Method: A pretext task trains the model to predict the sound for an input video clip using a conditional audio-visual clip sampled from another time within the same source video; a model is then proposed that generates a soundtrack for a silent input video given a user-supplied example.
Results: Human studies and automated evaluation metrics show that the model successfully generates sound from video and varies its output according to the supplied example.

The sound effects that designers add to videos are designed to convey a particular artistic effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges of creating a soundtrack for a video that differs from its true sound, but that nonetheless matches the actions occurring on screen, we propose the problem of conditional Foley. We present the following contributions to address this problem. First, we propose a pretext task for training our model to predict sound for an input video clip using a conditional audio-visual clip sampled from another time within the same source video. Second, we propose a model for generating a soundtrack for a silent input video, given a user-supplied example that specifies what the video should "sound like". We show through human studies and automated evaluation metrics that our model successfully generates sound from video, while varying its output according to the content of a supplied example.
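The pretext task's data sampling can be sketched as follows. The minimum-gap constraint is an assumed detail added for illustration, since the abstract only specifies that the conditional clip comes from "another time within the same source video":

```python
import random

def sample_conditional_pair(num_frames, clip_len, min_gap, rng):
    """Pick a target clip and a conditional clip from the same video."""
    # Target clip whose sound the model must predict.
    target = rng.randrange(0, num_frames - clip_len + 1)
    # Conditional audio-visual clip sampled from another time
    # within the same video (assumed: at least min_gap frames away).
    while True:
        cond = rng.randrange(0, num_frames - clip_len + 1)
        if abs(cond - target) >= min_gap:
            return (target, target + clip_len), (cond, cond + clip_len)
```

At inference time the conditional clip is replaced by the user-supplied example that specifies what the silent video should "sound like".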

Diverse 3D Hand Gesture Prediction From Body Dynamics by Bilateral Hand Disentanglement
Qi, Xingqun and Liu, Chen and Sun, Muyi and Li, Lincheng and Fan, Changjie and Yu, Xin



Research problem: How to predict natural and diverse 3D hand gestures from upper-body dynamics, a practical yet challenging task in virtual avatar creation.
Motivation: Previous works usually overlook the asymmetric motions between the two hands and generate both hands holistically, leading to unnatural results.
Method: This paper proposes a novel two-stage 3D hand generation method based on bilateral hand disentanglement to achieve natural and diverse 3D hand prediction from body dynamics. In the first stage, natural hand gestures are generated by two hand-disentanglement branches. Considering the asymmetric poses and motions of the two hands, a Spatial-Residual Memory (SRM) module models the spatial interaction between the body and each hand via residual learning. To holistically enhance the coordination of the two hand motions with respect to the body dynamics, a Temporal-Motion Memory (TMM) module then models the temporal association between the body dynamics and the two hands' motions. The second stage builds on the insight that 3D hand predictions should be non-deterministic given sequential body postures, so the initial outputs of stage one are further diversified. Concretely, a Prototypical-Memory Sampling Strategy (PSS) generates non-deterministic hand gestures via gradient-based Markov Chain Monte Carlo (MCMC) sampling.
Results: Extensive experiments show that the method outperforms state-of-the-art models on the B2H dataset and the newly collected TED Hands dataset. The dataset and code are available at https://github.com/XingqunQi-lab/Diverse-3D-Hand-Gesture-Prediction.

Predicting natural and diverse 3D hand gestures from the upper body dynamics is a practical yet challenging task in virtual avatar creation. Previous works usually overlook the asymmetric motions between two hands and generate two hands in a holistic manner, leading to unnatural results. In this work, we introduce a novel bilateral hand disentanglement based two-stage 3D hand generation method to achieve natural and diverse 3D hand prediction from body dynamics. In the first stage, we intend to generate natural hand gestures by two hand-disentanglement branches. Considering the asymmetric gestures and motions of two hands, we introduce a Spatial-Residual Memory (SRM) module to model spatial interaction between the body and each hand by residual learning. To enhance the coordination of two hand motions wrt. body dynamics holistically, we then present a Temporal-Motion Memory (TMM) module. TMM can effectively model the temporal association between body dynamics and two hand motions. The second stage is built upon the insight that 3D hand predictions should be non-deterministic given the sequential body postures. Thus, we further diversify our 3D hand predictions based on the initial output from the stage one. Concretely, we propose a Prototypical-Memory Sampling Strategy (PSS) to generate the non-deterministic hand gestures by gradient-based Markov Chain Monte Carlo (MCMC) sampling. Extensive experiments demonstrate that our method outperforms the state-of-the-art models on the B2H dataset and our newly collected TED Hands dataset. The dataset and code are available at: https://github.com/XingqunQi-lab/Diverse-3D-Hand-Gesture-Prediction.

MOSO: Decomposing MOtion, Scene and Object for Video Prediction
Sun, Mingzhen and Wang, Weining and Zhu, Xinxin and Liu, Jing



Research problem: How to effectively decompose a video into its motion, scene, and object components and exploit them for video prediction.
Motivation: Motion, scene, and object are the three primary visual components of a video; understanding and effectively exploiting them is crucial for tasks such as video prediction.
Method: A two-stage MOtion, Scene and Object decomposition framework (MOSO) is proposed, consisting of MOSO-VQVAE and MOSO-Transformer. First, MOSO-VQVAE decomposes a previous video clip into motion, scene, and object components and represents them as distinct groups of discrete tokens. Then, MOSO-Transformer predicts the object and scene tokens of the subsequent clip from the previous tokens and adds dynamic motion at the token level to the generated object and scene tokens.
Results: Experiments show that the method achieves new state-of-the-art performance on five challenging benchmarks for video prediction and unconditional video generation: BAIR, RoboNet, KTH, KITTI, and UCF101. In addition, MOSO can produce realistic videos by combining objects and scenes from different videos.

Motion, scene and object are three primary visual components of a video. In particular, objects represent the foreground, scenes represent the background, and motion traces their dynamics. Based on this insight, we propose a two-stage MOtion, Scene and Object decomposition framework (MOSO) for video prediction, consisting of MOSO-VQVAE and MOSO-Transformer. In the first stage, MOSO-VQVAE decomposes a previous video clip into the motion, scene and object components, and represents them as distinct groups of discrete tokens. Then, in the second stage, MOSO-Transformer predicts the object and scene tokens of the subsequent video clip based on the previous tokens and adds dynamic motion at the token level to the generated object and scene tokens. Our framework can be easily extended to unconditional video generation and video frame interpolation tasks. Experimental results demonstrate that our method achieves new state-of-the-art performance on five challenging benchmarks for video prediction and unconditional video generation: BAIR, RoboNet, KTH, KITTI and UCF101. In addition, MOSO can produce realistic videos by combining objects and scenes from different videos.

Unifying Short and Long-Term Tracking With Graph Hierarchies
Cetintas, Orcun and Brasó, Guillem and Leal-Taixé, Laura



Research problem: How to effectively track multiple objects over long videos, covering both un-occluded objects and objects that reappear after occlusion.
Motivation: Methods for these two tasks are usually disjoint and crafted for specific scenarios, and top-performing approaches are often mixes of techniques, yielding engineering-heavy solutions that lack generality.
Method: We propose SUSHI, a unified and scalable multi-object tracker. Long videos are split into a hierarchy of subclips, which enables high scalability. Graph neural networks process all levels of the hierarchy, making the model unified across temporal scales and highly general.
Results: The method achieves significant improvements over the state of the art on four diverse datasets. Code and models are available at bit.ly/sushi-mot.

Tracking objects over long videos effectively means solving a spectrum of problems, from short-term association for un-occluded objects to long-term association for objects that are occluded and then reappear in the scene. Methods tackling these two tasks are often disjoint and crafted for specific scenarios, and top-performing approaches are often a mix of techniques, which yields engineering-heavy solutions that lack generality. In this work, we question the need for hybrid approaches and introduce SUSHI, a unified and scalable multi-object tracker. Our approach processes long clips by splitting them into a hierarchy of subclips, which enables high scalability. We leverage graph neural networks to process all levels of the hierarchy, which makes our model unified across temporal scales and highly general. As a result, we obtain significant improvements over state-of-the-art on four diverse datasets. Our code and models are available at bit.ly/sushi-mot.
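The hierarchy of subclips can be laid out as follows. This sketch assumes a simple pairwise-merge layout (the paper's actual clip sizes and branching factor may differ):

```python
def build_hierarchy(num_frames, base_len):
    """Level 0: short subclips; each higher level merges pairs of subclips
    until a single clip spans the whole video."""
    levels = [[(s, min(s + base_len, num_frames))
               for s in range(0, num_frames, base_len)]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([(prev[i][0], prev[min(i + 1, len(prev) - 1)][1])
                       for i in range(0, len(prev), 2)])
    return levels
```

Lower levels resolve short-term association within subclips, while higher levels link tracks across longer temporal gaps, which is how one model covers both short- and long-term tracking.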

Recurrence Without Recurrence: Stable Video Landmark Detection With Deep Equilibrium Models
Micaelli, Paul and Vahdat, Arash and Yin, Hongxu and Kautz, Jan and Molchanov, Pavlo



Research problem: How to improve the accuracy and stability of landmark detection models.
Motivation: When applied to video, current landmark detectors, trained on still images due to the lack of labeled video data, exhibit a "flickering" effect that harms prediction stability.
Method: A new paradigm called Recurrence without Recurrence (RwR) is proposed: by rephrasing DEQs as a constrained optimization, recurrence is emulated at inference time, reducing landmark flicker even without access to temporal data during training.
Results: Experiments show that the LDEQ with RwR achieves significant improvements on the WFLW-V dataset, improving NME and NMF by 10% and 13%, respectively.

Cascaded computation, whereby predictions are recurrently refined over several stages, has been a persistent theme throughout the development of landmark detection models. In this work, we show that the recently proposed Deep Equilibrium Model (DEQ) can be naturally adapted to this form of computation. Our Landmark DEQ (LDEQ) achieves state-of-the-art performance on the challenging WFLW facial landmark dataset, reaching 3.92 NME with fewer parameters and a training memory cost of O(1) in the number of recurrent modules. Furthermore, we show that DEQs are particularly suited for landmark detection in videos. In this setting, it is typical to train on still images due to the lack of labelled videos. This can lead to a "flickering" effect at inference time on video, whereby a model can rapidly oscillate between different plausible solutions across consecutive frames. By rephrasing DEQs as a constrained optimization, we emulate recurrence at inference time, despite not having access to temporal data at training time. This Recurrence without Recurrence (RwR) paradigm helps in reducing landmark flicker, which we demonstrate by introducing a new metric, normalized mean flicker (NMF), and contributing a new facial landmark video dataset (WFLW-V) targeting landmark uncertainty. On the WFLW-V hard subset made up of 500 videos, our LDEQ with RwR improves the NME and NMF by 10 and 13% respectively, compared to the strongest previously published model using a hand-tuned conventional filter.
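A flicker-style metric in the spirit of the paper's NMF can be sketched as the average frame-to-frame landmark displacement. This is an illustrative simplification; the paper's normalized mean flicker additionally normalizes and targets landmark uncertainty, so the exact definition differs:

```python
import math

def mean_flicker(frames):
    """frames: per-frame landmark lists, each landmark an (x, y) tuple."""
    per_frame = []
    for prev, curr in zip(frames, frames[1:]):
        # Mean displacement of corresponding landmarks between frames.
        dists = [math.dist(p, c) for p, c in zip(prev, curr)]
        per_frame.append(sum(dists) / len(dists))
    return sum(per_frame) / len(per_frame)
```

A model that oscillates between plausible solutions across consecutive frames scores high on such a metric even when its per-frame error (NME) is low, which is why a separate flicker metric is needed.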

Egocentric Audio-Visual Object Localization
Huang, Chao and Tian, Yapeng and Kumar, Anurag and Xu, Chenliang



Research problem: How to localize audio-visual objects from an egocentric perspective by jointly processing multimodal inputs.
Motivation: Humans naturally perceive surrounding scenes by unifying sound and sight; likewise, machines need to learn from multisensory inputs in an egocentric view to approach human intelligence.
Method: A geometry-aware temporal aggregation module is proposed to explicitly handle egomotion, mitigating its effect by estimating spatio-temporal geometric transformations and using them to update the visual representations. A cascaded feature enhancement module improves cross-modal localization robustness by disentangling visually indicated audio representations. During training, the naturally available audio-visual temporal synchronization serves as "free" self-supervision, avoiding costly labeling.
Results: Experiments show that the method achieves state-of-the-art localization performance in egocentric videos and generalizes to diverse audio-visual scenes.

Humans naturally perceive surrounding scenes by unifying sound and sight in a first-person view. Likewise, machines are advanced to approach human intelligence by learning with multisensory inputs from an egocentric perspective. In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; 2) The out-of-view sound components can be created while wearers shift their attention. To address the first problem, we propose a geometry-aware temporal aggregation module to handle the egomotion explicitly. The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations. Moreover, we propose a cascaded feature enhancement module to tackle the second issue. It improves cross-modal localization robustness by disentangling visually-indicated audio representation. During training, we take advantage of the naturally available audio-visual temporal synchronization as the "free" self-supervision to avoid costly labeling. We also annotate and create the Epic Sounding Object dataset for evaluation purposes. Extensive experiments show that our method achieves state-of-the-art localization performance in egocentric videos and can be generalized to diverse audio-visual scenes.

Unbiased Scene Graph Generation in Videos
Nag, Sayak and Min, Kyle and Tripathi, Subarna and Roy-Chowdhury, Amit K.



Research question: Dynamic scene graph generation (SGG) is complex and challenging due to the inherent dynamics of scenes, temporal fluctuation of model predictions, and the long-tailed distribution of visual relationships.
Motivation: Existing dynamic SGG methods capture spatio-temporal context with complex architectures but do not address the challenges above, especially the long-tailed relationship distribution, which often yields biased scene graphs.
Method: We propose TEMPURA, a new framework that enforces object-level temporal consistency via transformer-based sequence modeling, learns to synthesize unbiased relationship representations through memory-guided training, and attenuates the predictive uncertainty of visual relations with a Gaussian Mixture Model (GMM).
Results: Extensive experiments show performance gains of up to 10% over existing methods, demonstrating superiority in generating less biased scene graphs.

The task of dynamic scene graph generation (SGG) from videos is complicated and challenging due to the inherent dynamics of a scene, temporal fluctuation of model predictions, and the long-tailed distribution of the visual relationships in addition to the already existing challenges in image-based SGG. Existing methods for dynamic SGG have primarily focused on capturing spatio-temporal context using complex architectures without addressing the challenges mentioned above, especially the long-tailed distribution of relationships. This often leads to the generation of biased scene graphs. To address these challenges, we introduce a new framework called TEMPURA: TEmporal consistency and Memory Prototype guided UnceRtainty Attenuation for unbiased dynamic SGG. TEMPURA employs object-level temporal consistencies via transformer-based sequence modeling, learns to synthesize unbiased relationship representations using memory-guided training, and attenuates the predictive uncertainty of visual relations using a Gaussian Mixture Model (GMM). Extensive experiments demonstrate that our method achieves significant (up to 10% in some cases) performance gain over existing methods, highlighting its superiority in generating more unbiased scene graphs. Code: https://github.com/sayaknag/unbiasedSGG.git

MIST: Multi-Modal Iterative Spatial-Temporal Transformer for Long-Form Video Question Answering
Gao, Difei and Zhou, Luowei and Ji, Lei and Zhu, Linchao and Yang, Yi and Shou, Mike Zheng



Research question: How to build video question answering systems that can find answers in long-form videos containing complex events.
Motivation: Existing multi-modal VideoQA models perform well on images or short video clips but face new challenges when extended to long-form videos.
Method: We propose a new model, the Multi-modal Iterative Spatial-temporal Transformer (MIST), which decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules, better adapting pre-trained models to long-form VideoQA.
Results: Experiments on four VideoQA datasets show that MIST achieves state-of-the-art performance with superior computational efficiency and interpretability.

To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must. Existing multi-modal VQA models achieve promising performance on images or short video clips, especially with the recent success of large-scale multi-modal pre-training. However, when extending these methods to long-form videos, new challenges arise. On the one hand, using a dense video sampling strategy is computationally prohibitive. On the other hand, methods relying on sparse sampling struggle in scenarios where multi-event and multi-granularity visual reasoning are required. In this work, we introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA. Specifically, MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules that adaptively select frames and image regions that are closely relevant to the question itself. Visual concepts at different granularities are then processed efficiently through an attention module. In addition, MIST iteratively conducts selection and attention over multiple layers to support reasoning over multiple events. The experimental results on four VideoQA datasets, including AGQA, NExT-QA, STAR, and Env-QA, show that MIST achieves state-of-the-art performance and is superior at computation efficiency and interpretability.
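The cascaded selection above can be pictured as scoring candidate segments (and then regions) against the question embedding and keeping only the top-k before running full attention, which is what keeps long videos tractable. A minimal sketch of one question-conditioned top-k selection step (the shapes and the dot-product scoring are illustrative assumptions, not MIST's exact modules):

```python
import numpy as np

def select_topk(question, candidates, k):
    """question: (D,), candidates: (N, D). Keep the k candidates whose
    dot-product score with the question is highest, preserving order."""
    scores = candidates @ question                  # (N,) relevance scores
    keep = np.sort(np.argsort(scores)[-k:])         # top-k indices, in order
    return candidates[keep], keep

rng = np.random.default_rng(1)
question = np.array([1.0, 0.0, 0.0])
segments = rng.normal(size=(16, 3))                 # 16 candidate video segments
segments[5] = [3.0, 0.0, 0.0]                       # one clearly relevant segment
selected, idx = select_topk(question, segments, k=4)
print(5 in idx)  # the relevant segment survives the selection
```

Iterating such a selection over multiple layers, as the abstract describes, lets later layers attend only over a small, question-relevant subset.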

Two-Stage Co-Segmentation Network Based on Discriminative Representation for Recovering Human Mesh From Videos
Zhang, Boyang and Ma, Kehua and Wu, Suping and Yuan, Zhixiang



Research question: How to recover 3D human meshes from videos, especially under extreme illumination and cluttered backgrounds.
Motivation: Existing methods focus on the temporal consistency of videos while ignoring spatial representation in complex scenes, and thus fail to recover reasonable, smooth human mesh sequences under extreme illumination and cluttered backgrounds.
Method: A two-stage co-segmentation network based on discriminative representation for recovering human meshes from videos. The first stage segments the spatial domain of the video to highlight fine-grained spatial information, then learns and enhances intra-frame discriminative representations through a dual-excitation mechanism and a frequency-domain enhancement module while suppressing irrelevant information such as background. The second stage segments the temporal domain and builds inter-frame discriminative representations via a dynamic integration strategy.
Results: A carefully designed landmark anchor area loss constrains variation in the human motion area, effectively generating reasonable discriminative actions. Experiments on large public datasets show the method outperforms most state-of-the-art approaches.

Recovering 3D human mesh from videos has recently made significant progress. However, most of the existing methods focus on the temporal consistency of videos, while ignoring the spatial representation in complex scenes, thus failing to recover a reasonable and smooth human mesh sequence under extreme illumination and chaotic backgrounds. To alleviate this problem, we propose a two-stage co-segmentation network based on discriminative representation for recovering human body meshes from videos. Specifically, the first stage of the network segments the video spatial domain to spotlight spatially fine-grained information, and then learns and enhances the intra-frame discriminative representation through a dual-excitation mechanism and a frequency domain enhancement module, while suppressing irrelevant information (e.g., background). The second stage focuses on temporal context by segmenting the video temporal domain, and models inter-frame discriminative representation via a dynamic integration strategy. Further, to efficiently generate reasonable human discriminative actions, we carefully elaborate a landmark anchor area loss to constrain the variation of the human motion area. Extensive experimental results on large publicly available datasets indicate its superiority in comparison with most state-of-the-art methods. Code will be made public.

Actionlet-Dependent Contrastive Learning for Unsupervised Skeleton-Based Action Recognition
Lin, Lilang and Zhang, Jiahang and Liu, Jiaying



Research question: Existing self-supervised pretraining methods have succeeded at skeleton-based action recognition, but they treat the motion and static parts equally and lack an adaptive design for different parts, which hurts recognition accuracy.
Motivation: To model both the motion and static parts adaptively, we propose an Actionlet-Dependent Contrastive Learning method (ActCLR).
Method: We define the actionlet as a discriminative subset of the human skeleton, effectively decomposing motion regions for better action modeling. Concretely, by contrasting against a motion-free static anchor, we extract the motion region of the skeleton data, which serves as the actionlet, in an unsupervised manner. A motion-adaptive data transformation is then built around the actionlet: different transformations are applied to actionlet and non-actionlet regions to introduce more diversity while preserving their own characteristics. Meanwhile, we propose a semantic-aware feature pooling method to build feature representations that distinguish motion from static regions.
Results: Extensive experiments on NTU RGB+D and PKUMMD show remarkable action recognition performance; further visualization and quantitative experiments demonstrate the method's effectiveness.

The self-supervised pretraining paradigm has achieved great success in skeleton-based action recognition. However, these methods treat the motion and static parts equally, and lack an adaptive design for different parts, which has a negative impact on the accuracy of action recognition. To realize the adaptive action modeling of both parts, we propose an Actionlet-Dependent Contrastive Learning method (ActCLR). The actionlet, defined as the discriminative subset of the human skeleton, effectively decomposes motion regions for better action modeling. In detail, by contrasting with the static anchor without motion, we extract the motion region of the skeleton data, which serves as the actionlet, in an unsupervised manner. Then, centering on actionlet, a motion-adaptive data transformation method is built. Different data transformations are applied to actionlet and non-actionlet regions to introduce more diversity while maintaining their own characteristics. Meanwhile, we propose a semantic-aware feature pooling method to build feature representations among motion and static regions in a distinguished manner. Extensive experiments on NTU RGB+D and PKUMMD show that the proposed method achieves remarkable action recognition performance. More visualization and quantitative experiments demonstrate the effectiveness of our method.
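The actionlet idea of contrasting against a motion-free static anchor can be illustrated on raw coordinates: average the pose over time to get an anchor that contains no motion, then flag the joints that deviate most from it. A minimal sketch (the mean-based threshold and all names are illustrative assumptions, not the paper's learned-feature procedure):

```python
import numpy as np

def extract_actionlet(seq):
    """seq: (T, J, C) skeleton sequence. Returns a boolean mask over joints.

    The static anchor is the time-averaged pose; joints whose mean deviation
    from it exceeds the average deviation form the actionlet (motion region).
    """
    anchor = seq.mean(axis=0)                                      # (J, C)
    deviation = np.linalg.norm(seq - anchor, axis=2).mean(axis=0)  # (J,)
    return deviation > deviation.mean()

# Toy sequence: joint 0 swings along x, joints 1-3 stay perfectly still.
T, J = 8, 4
seq = np.zeros((T, J, 3))
seq[:, 0, 0] = np.linspace(-1.0, 1.0, T)  # the only moving joint
mask = extract_actionlet(seq)
print(mask)  # only joint 0 is flagged as the actionlet
```

Once such a mask exists, stronger augmentations can be applied outside the actionlet and motion-preserving ones inside it, which is the spirit of the paper's motion-adaptive transformation.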

ReVISE: Self-Supervised Speech Resynthesis With Visual Input for Universal and Generalized Speech Regeneration
Hsu, Wei-Ning and Remez, Tal and Shi, Bowen and Donley, Jacob and Adi, Yossi



Research question: Improving speech quality with visual input by unifying previously separate auditory distortion types (separation, inpainting, video-to-speech) and focusing on improving certain aspects of speech.
Motivation: Prior work studies each auditory distortion type separately and proposes tailored algorithms. This paper unifies these subjects as generalized speech regeneration, where the goal is not to reconstruct the exact clean reference signal but to improve certain aspects of speech, namely intelligibility, quality, and video synchronization.
Method: The problem is cast as audio-visual speech resynthesis, composed of two steps: pseudo audio-visual speech recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS), connected by discrete units derived from a self-supervised speech model. A self-supervised audio-visual speech model is also used to initialize P-AVSR. The proposed model is named ReVISE.
Results: ReVISE is the first high-quality in-the-wild video-to-speech synthesis model and achieves superior performance on all LRS3 audio-visual regeneration tasks. To demonstrate real-world applicability, it is also evaluated on EasyCom, an audio-visual benchmark collected under challenging acoustic conditions with only 1.6 hours of training data; ReVISE again greatly suppresses noise and improves quality.

Prior works on improving speech quality with visual input typically study each type of auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present tailored algorithms. This paper proposes to unify these subjects and study Generalized Speech Regeneration, where the goal is not to reconstruct the exact reference clean signal, but to focus on improving certain aspects of speech while not necessarily preserving the rest such as voice. In particular, this paper concerns intelligibility, quality, and video synchronization. We cast the problem as audio-visual speech resynthesis, which is composed of two steps: pseudo audio-visual speech recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS). P-AVSR and P-TTS are connected by discrete units derived from a self-supervised speech model. Moreover, we utilize a self-supervised audio-visual speech model to initialize P-AVSR. The proposed model is coined ReVISE. ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis and achieves superior performance on all LRS3 audio-visual regeneration tasks with a single model. To demonstrate its applicability in the real world, ReVISE is also evaluated on EasyCom, an audio-visual benchmark collected under challenging acoustic conditions with only 1.6 hours of training data. Similarly, ReVISE greatly suppresses noise and improves quality. Project page: https://wnhsu.github.io/ReVISE.

Two-Stream Networks for Weakly-Supervised Temporal Action Localization With Semantic-Aware Mechanisms
Wang, Yu and Li, Yadong and Wang, Hongbin



Research question: Weakly-supervised temporal action localization aims to detect action boundaries in untrimmed videos using only video-level annotations.
Motivation: Most existing schemes focus only on the temporal regions most responsive to video-level classification, overlooking the semantic consistency between frames.
Method: A learnable dictionary is devised whose entries are the class centroids of the corresponding action categories. Snippet representations identified as the same action class are induced to be close to the same class centroid, guiding the network to perceive frame semantics and avoid unreasonable localization. In addition, a two-stream framework integrates an attention mechanism and a multiple-instance learning strategy to extract fine-grained clues and salient features, respectively.
Results: Validated on the publicly available THUMOS-14 and ActivityNet-1.3 datasets, where extensive experiments and analyses demonstrate remarkable advances over existing methods.

Weakly-supervised temporal action localization aims to detect action boundaries in untrimmed videos with only video-level annotations. Most existing schemes detect temporal regions that are most responsive to video-level classification, but they overlook the semantic consistency between frames. In this paper, we hypothesize that snippets with similar representations should be considered as the same action class despite the absence of supervision signals on each snippet. To this end, we devise a learnable dictionary where entries are the class centroids of the corresponding action categories. The representations of snippets identified as the same action category are induced to be close to the same class centroid, which guides the network to perceive the semantics of frames and avoid unreasonable localization. Besides, we propose a two-stream framework that integrates the attention mechanism and the multiple-instance learning strategy to extract fine-grained clues and salient features respectively. Their complementarity enables the model to refine temporal boundaries. Finally, the developed model is validated on the publicly available THUMOS-14 and ActivityNet-1.3 datasets, where substantial experiments and analyses demonstrate that our model achieves remarkable advances over existing methods.
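The learnable dictionary can be pictured as a matrix of class centroids, with snippets pulled toward the centroid of their assigned class. A minimal numpy sketch of such a centroid-pull objective (the squared-distance form and all names are illustrative assumptions; the paper's exact loss may differ):

```python
import numpy as np

def centroid_pull_loss(snippets, labels, dictionary):
    """snippets: (N, D) snippet features; labels: (N,) assigned class ids;
    dictionary: (C, D) learnable class centroids.
    Mean squared distance between each snippet and its class centroid."""
    diffs = snippets - dictionary[labels]          # (N, D) residuals
    return np.mean(np.sum(diffs ** 2, axis=1))

rng = np.random.default_rng(0)
dictionary = rng.normal(size=(3, 4))               # 3 action-class centroids
labels = np.array([0, 0, 2])
snippets = dictionary[labels] + 0.1 * rng.normal(size=(3, 4))

loss = centroid_pull_loss(snippets, labels, dictionary)
print(loss)  # small, since snippets already sit near their centroids
```

Minimizing this term over both the snippets' encoder and the dictionary entries is what encourages semantically consistent snippets to cluster around a shared centroid.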

Egocentric Auditory Attention Localization in Conversations
Ryan, Fiona and Jiang, Hao and Shukla, Abhinav and Rehg, James M. and Ithapu, Vamsi Krishna



Research question: How to recognize which speaker a person is attending to in a noisy conversational environment.
Motivation: Developing technologies that understand social behavior and devices that augment human hearing.
Method: An end-to-end deep learning approach that uses egocentric video and multichannel audio to predict a heatmap of the camera wearer's auditory attention.
Results: The approach leverages spatiotemporal audiovisual features and holistic reasoning about the scene to make predictions, and outperforms a set of baselines on a challenging multi-speaker conversation dataset.

In a noisy conversation environment such as a dinner party, people often exhibit selective auditory attention, or the ability to focus on a particular speaker while tuning out others. Recognizing who somebody is listening to in a conversation is essential for developing technologies that can understand social behavior and devices that can augment human hearing by amplifying particular sound sources. The computer vision and audio research communities have made great strides towards recognizing sound sources and speakers in scenes. In this work, we take a step further by focusing on the problem of localizing auditory attention targets in egocentric video, or detecting who in a camera wearer's field of view they are listening to. To tackle the new and challenging Selective Auditory Attention Localization problem, we propose an end-to-end deep learning approach that uses egocentric video and multichannel audio to predict the heatmap of the camera wearer's auditory attention. Our approach leverages spatiotemporal audiovisual features and holistic reasoning about the scene to make predictions, and outperforms a set of baselines on a challenging multi-speaker conversation dataset. Project page: https://fkryan.github.io/saal

A New Comprehensive Benchmark for Semi-Supervised Video Anomaly Detection and Anticipation
Cao, Congqi and Lu, Yue and Wang, Peng and Zhang, Yanning



Research question: This paper addresses scene-dependent anomalies, an important anomaly type in video anomaly detection (VAD), as well as anomaly anticipation, the more significant task of preventing anomalous events before they occur.
Motivation: Scene-dependent anomalies and anomaly anticipation remain underexplored, and large-scale semi-supervised VAD datasets are lacking.
Method: A new comprehensive dataset, NWPU Campus, containing 43 scenes, 28 classes of abnormal events, and 16 hours of video, currently the largest semi-supervised VAD dataset. A novel model capable of detecting and anticipating anomalous events simultaneously is also proposed.
Results: Compared with 7 outstanding VAD algorithms from recent years, the method handles both scene-dependent anomaly detection and anomaly anticipation well, achieving state-of-the-art performance on the ShanghaiTech, CUHK Avenue, IITB Corridor, and newly proposed NWPU Campus datasets.

Semi-supervised video anomaly detection (VAD) is a critical task in the intelligent surveillance system. However, an essential type of anomaly in VAD named scene-dependent anomaly has not received the attention of researchers. Moreover, there is no research investigating anomaly anticipation, a more significant task for preventing the occurrence of anomalous events. To this end, we propose a new comprehensive dataset, NWPU Campus, containing 43 scenes, 28 classes of abnormal events, and 16 hours of videos. At present, it is the largest semi-supervised VAD dataset with the largest number of scenes and classes of anomalies, the longest duration, and the only one considering the scene-dependent anomaly. Meanwhile, it is also the first dataset proposed for video anomaly anticipation. We further propose a novel model capable of detecting and anticipating anomalous events simultaneously. Compared with 7 outstanding VAD algorithms in recent years, our method can cope with scene-dependent anomaly detection and anomaly anticipation both well, achieving state-of-the-art performance on ShanghaiTech, CUHK Avenue, IITB Corridor and the newly proposed NWPU Campus datasets consistently. Our dataset and code is available at: https://campusvad.github.io.

3Mformer: Multi-Order Multi-Mode Transformer for Skeletal Action Recognition
Wang, Lei and Koniusz, Piotr



Research question: GCN-based skeletal action recognition models aggregate only one- or few-hop graph neighbourhoods and ignore dependencies between body joints that are not linked.
Motivation: Forming a hypergraph with hyper-edges between graph nodes (e.g., third- and fourth-order hyper-edges over three and four nodes) can capture higher-order motion patterns of groups of body joints.
Method: Action sequences are split into temporal blocks, and a Higher-order Transformer (HoT) produces embeddings of each block based on the body joints, pairwise joint links, and higher-order hyper-edges of the skeleton. HoT embeddings of hyper-edges of orders 1, ..., r are combined by a novel Multi-order Multi-mode Transformer (3Mformer) with two order-exchangeable modules: Multi-order Pooling (MP), which learns weighted aggregation along the hyper-edge mode, and Temporal block Pooling (TP), which aggregates along the temporal block mode.
Results: The end-to-end trainable network yields state-of-the-art results compared to GCN-, transformer-, and hypergraph-based counterparts.

Many skeletal action recognition models use GCNs to represent the human body by 3D body joints connected by body parts. GCNs aggregate one- or few-hop graph neighbourhoods, and ignore the dependency between body joints that are not linked. We propose to form a hypergraph to model hyper-edges between graph nodes (e.g., third- and fourth-order hyper-edges capture three and four nodes), which helps capture higher-order motion patterns of groups of body joints. We split action sequences into temporal blocks; a Higher-order Transformer (HoT) produces embeddings of each temporal block based on (i) the body joints, (ii) pairwise links of body joints and (iii) higher-order hyper-edges of skeleton body joints. We combine such HoT embeddings of hyper-edges of orders 1, ..., r by a novel Multi-order Multi-mode Transformer (3Mformer) with two modules whose order can be exchanged to achieve coupled-mode attention on coupled-mode tokens based on 'channel-temporal block', 'order-channel-body joint', 'channel-hyper-edge (any order)' and 'channel-only' pairs. The first module, called Multi-order Pooling (MP), additionally learns weighted aggregation along the hyper-edge mode, whereas the second module, Temporal block Pooling (TP), aggregates along the temporal block mode. Our end-to-end trainable network yields state-of-the-art results compared to GCN-, transformer- and hypergraph-based counterparts.

STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition
Zhu, Xiaoyu and Huang, Po-Yao and Liang, Junwei and de Melo, Celso M. and Hauptmann, Alexander G.



Research question: Human action recognition from motion capture (MoCap) sequences.
Motivation: Existing techniques require multiple manual steps to derive standardized skeleton representations as model input; we propose a novel Spatial-Temporal Mesh Transformer (STMT) that models the mesh sequences directly.
Method: The model uses a hierarchical transformer with intra-frame offset attention and inter-frame self-attention. The attention mechanism allows the model to freely attend between any two vertex patches to learn non-local relationships in the spatial-temporal domain.
Results: Masked vertex modeling and future frame prediction serve as two self-supervised tasks to fully activate the bi-directional and auto-regressive attention in the hierarchical transformer. The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models on common MoCap benchmarks.

We study the problem of human action recognition using motion capture (MoCap) sequences. Unlike existing techniques that take multiple manual steps to derive standardized skeleton representations as model input, we propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences. The model uses a hierarchical transformer with intra-frame off-set attention and inter-frame self-attention. The attention mechanism allows the model to freely attend between any two vertex patches to learn non-local relationships in the spatial-temporal domain. Masked vertex modeling and future frame prediction are used as two self-supervised tasks to fully activate the bi-directional and auto-regressive attention in our hierarchical transformer. The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models on common MoCap benchmarks. Code is available at https://github.com/zgzxy001/STMT.

Prototype-Based Embedding Network for Scene Graph Generation
Zheng, Chaofan and Lyu, Xinyu and Gao, Lianli and Dai, Bo and Song, Jingkuan



Research question: When predicting relationships between entity pairs, current scene graph generation (SGG) methods struggle to acquire reliable features due to the diversity of visual appearances and the similarity between categories.
Motivation: To address this, the paper proposes using the category-inherent semantics of predicates as class-wise prototypes, relieving the challenges brought by diverse visual appearances and inter-class similarity.
Method: The Prototype-based Embedding Network (PE-Net) models entities and predicates with prototype-aligned, compact, and distinctive representations, and establishes matching between entity pairs and predicates in a common embedding space for relation recognition. Prototype-guided Learning (PL) is introduced to help PE-Net learn this entity-predicate matching efficiently, and Prototype Regularization (PR) is devised to relieve the ambiguous entity-predicate matching caused by the semantic overlap among predicates.
Results: Experiments show superior relation recognition capability on SGG, achieving new state-of-the-art performance on both the Visual Genome and Open Images datasets.

Current Scene Graph Generation (SGG) methods explore contextual information to predict relationships among entity pairs. However, due to the diverse visual appearance of numerous possible subject-object combinations, there is a large intra-class variation within each predicate category, e.g., "man-eating-pizza, giraffe-eating-leaf", and the severe inter-class similarity between different classes, e.g., "man-holding-plate, man-eating-pizza", in model's latent space. The above challenges prevent current SGG methods from acquiring robust features for reliable relation prediction. In this paper, we claim that the predicate's category-inherent semantics can serve as class-wise prototypes in the semantic space for relieving the above challenges caused by the diverse visual appearances. To this end, we propose the Prototype-based Embedding Network (PE-Net), which models entities/predicates with prototype-aligned compact and distinctive representations and establishes matching between entity pairs and predicates in a common embedding space for relation recognition. Moreover, Prototype-guided Learning (PL) is introduced to help PE-Net efficiently learn such entity-predicate matching, and Prototype Regularization (PR) is devised to relieve the ambiguous entity-predicate matching caused by the predicate's semantic overlap. Extensive experiments demonstrate that our method gains superior relation recognition capability on SGG, achieving new state-of-the-art performances on both Visual Genome and Open Images datasets.

Music-Driven Group Choreography
Le, Nhat and Pham, Thang and Do, Tuong and Tjiputra, Erman and Tran, Quang D. and Nguyen, Anh



Research question: How to generate dance motions, especially group dances, from music.
Motivation: Existing methods can generate dance motions for a single dancer, but group dance generation remains an open problem.
Method: AIOZ-GDANCE, a new large-scale dataset with a corresponding semi-automatic labeling method for music-driven group dance generation. A new method is also proposed that takes a music sequence and a set of dancer positions as input to efficiently produce multiple group-coherent choreographies.
Results: Experiments show that naively applying single-dance generation techniques to group dance can produce unsatisfactory results, such as inconsistent movements and collisions between dancers, whereas the new method effectively generates group-coherent choreographies and facilitates future research on group dance generation.

Music-driven choreography is a challenging problem with a wide variety of industrial applications. Recently, many methods have been proposed to synthesize dance motions from music for a single dancer. However, generating dance motion for a group remains an open problem. In this paper, we present AIOZ-GDANCE, a new largescale dataset for music-driven group dance generation. Unlike existing datasets that only support single dance, our new dataset contains group dance videos, hence supporting the study of group choreography. We propose a semiautonomous labeling method with humans in the loop to obtain the 3D ground truth for our dataset. The proposed dataset consists of 16.7 hours of paired music and 3D motion from in-the-wild videos, covering 7 dance styles and 16 music genres. We show that naively applying single dance generation technique to creating group dance motion may lead to unsatisfactory results, such as inconsistent movements and collisions between dancers. Based on our new dataset, we propose a new method that takes an input music sequence and a set of 3D positions of dancers to efficiently produce multiple group-coherent choreographies. We propose new evaluation metrics for measuring group dance quality and perform intensive experiments to demonstrate the effectiveness of our method. Our project facilitates future research on group dance generation and is available at https://aioz-ai.github.io/AIOZ-GDANCE/.

Efficient Movie Scene Detection Using State-Space Transformers
Islam, Md Mohaiminul and Hasan, Mahmudul and Athrey, Kishan Shamsundar and Braskich, Tony and Bertasius, Gedas



Research question: How to accurately detect movie scenes and understand a movie's storyline.
Motivation: Existing video recognition models are mainly designed for short-range video analysis, making accurate scene detection in long movie videos challenging.
Method: A State-Space Transformer model built from a novel S4A building block that efficiently captures dependencies in long movie videos for accurate movie scene detection.
Results: The proposed TranS4mer model outperforms all prior methods on three movie scene detection datasets (MovieNet, BBC, and OVSD) while being 2x faster and requiring 3x less GPU memory than standard Transformer models.

The ability to distinguish between different movie scenes is critical for understanding the storyline of a movie. However, accurately detecting movie scenes is often challenging as it requires the ability to reason over very long movie segments. This is in contrast to most existing video recognition models, which are typically designed for short-range video analysis. This work proposes a State-Space Transformer model that can efficiently capture dependencies in long movie videos for accurate movie scene detection. Our model, dubbed TranS4mer, is built using a novel S4A building block, which combines the strengths of structured state-space sequence (S4) and self-attention (A) layers. Given a sequence of frames divided into movie shots (uninterrupted periods where the camera position does not change), the S4A block first applies self-attention to capture short-range intra-shot dependencies. Afterward, the state-space operation in the S4A block is used to aggregate long-range inter-shot cues. The final TranS4mer model, which can be trained end-to-end, is obtained by stacking the S4A blocks one after the other multiple times. Our proposed TranS4mer outperforms all prior methods in three movie scene detection datasets, including MovieNet, BBC, and OVSD, while also being 2x faster and requiring 3x less GPU memory than standard Transformer models. We will release our code and models.

Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert
Wang, Jiadong and Qian, Xinyuan and Zhang, Malu and Tan, Robby T. and Li, Haizhou



Research question: This paper addresses the content of lip movements, i.e., the visual intelligibility of the spoken words, in talking face generation.
Motivation: Despite much progress in lip-speech synchronization and visual quality, existing talking face generation methods rarely focus on the content of lip movements, i.e., the visual intelligibility of the spoken words, which is an important aspect of generation quality.
Method: We propose using a lip-reading expert to improve the intelligibility of the generated lip regions by penalizing incorrect generation results. To compensate for data scarcity, we train the lip-reading expert in an audio-visual self-supervised manner. We further propose a novel contrastive learning scheme to enhance lip-speech synchronization, and a transformer to encode audio synchronously with video while considering the global temporal dependency of audio.
Results: Experiments show our proposal is superior to other state-of-the-art methods such as Wav2Lip in reading intelligibility, with over 38% word error rate (WER) on the LRS2 dataset and 27.8% accuracy on the LRW dataset. We also achieve state-of-the-art performance in lip-speech synchronization and comparable performance in visual quality.

Talking face generation, also known as speech-to-lip generation, reconstructs facial motions concerning lips given coherent speech input. The previous studies revealed the importance of lip-speech synchronization and visual quality. Despite much progress, they hardly focus on the content of lip movements i.e., the visual intelligibility of the spoken words, which is an important aspect of generation quality. To address the problem, we propose using a lip-reading expert to improve the intelligibility of the generated lip regions by penalizing the incorrect generation results. Moreover, to compensate for data scarcity, we train the lip-reading expert in an audio-visual self-supervised manner. With a lip-reading expert, we propose a novel contrastive learning to enhance lip-speech synchronization, and a transformer to encode audio synchronously with video, while considering global temporal dependency of audio. For evaluation, we propose a new strategy with two different lip-reading experts to measure intelligibility of the generated videos. Rigorous experiments show that our proposal is superior to other State-of-the-art (SOTA) methods, such as Wav2Lip, in reading intelligibility i.e., over 38% Word Error Rate (WER) on LRS2 dataset and 27.8% accuracy on LRW dataset. We also achieve the SOTA performance in lip-speech synchronization and comparable performances in visual quality.

Range-Nullspace Video Frame Interpolation With Focalized Motion Estimation
Yu, Zhiyang and Zhang, Yu and Zou, Dongqing and Chen, Xijun and Ren, Jimmy S. and Ren, Shunqing



Research question: How to infer accurate intermediate motion and synthesize high-quality video frames in continuous-time video frame interpolation (VFI).
Motivation: Continuous-time VFI is a fundamental computer vision technique for its flexibility in synthesizing motion trajectories and novel frames at arbitrary intermediate time steps, yet accurate intermediate motion inference and high-quality frame synthesis remain two critical challenges.
Method: A novel VFI framework with two improved treatments: focalized trajectory fitting, which performs confidence-aware motion trajectory estimation by learning to focus on reliable optical flow candidates while suppressing outliers, and range-nullspace synthesis, a novel frame renderer cast as solving an ill-posed problem by learning decoupled components in orthogonal subspaces.
Results: The proposed framework sets new records on 7 of 10 public VFI benchmarks.

Continuous-time video frame interpolation is a fundamental technique in computer vision for its flexibility in synthesizing motion trajectories and novel video frames at arbitrary intermediate time steps. Yet, how to infer accurate intermediate motion and synthesize high-quality video frames are two critical challenges. In this paper, we present a novel VFI framework with improved treatment for these challenges. To address the former, we propose focalized trajectory fitting, which performs confidence-aware motion trajectory estimation by learning to pay focus to reliable optical flow candidates while suppressing the outliers. The second is range-nullspace synthesis, a novel frame renderer cast as solving an ill-posed problem addressed by learning decoupled components in orthogonal subspaces. The proposed framework sets new records on 7 of 10 public VFI benchmarks.

TransFlow: Transformer As Flow Learner
Lu, Yawen and Wang, Qifan and Ma, Siqi and Geng, Tong and Chen, Yingjie Victor and Chen, Huaijin and Liu, Dongfang



Research question: This paper proposes a pure transformer architecture for optical flow estimation, an indispensable building block for many computer vision tasks.
Motivation: Current CNN-based methods lose information in complex situations such as occlusion and motion blur, whereas spatial self-attention and cross-attention mechanisms can capture global dependencies more accurately and thus improve optical flow estimation.
Method: The proposed TransFlow model effectively captures global dependencies by applying spatial self-attention and cross-attention between adjacent frames, and recovers compromised information in dynamic scenes through long-range temporal association.
Results: Experiments show that TransFlow achieves state-of-the-art results on Sintel, KITTI-15, and other datasets, and also performs well on downstream tasks such as video object detection, interpolation, and stabilization.

Optical flow is an indispensable building block for various important computer vision tasks, including motion estimation, object tracking, and disparity measurement. In this work, we propose TransFlow, a pure transformer architecture for optical flow estimation. Compared to dominant CNN-based methods, TransFlow demonstrates three advantages. First, it provides more accurate correlation and trustworthy matching in flow estimation by utilizing spatial self-attention and cross-attention mechanisms between adjacent frames to effectively capture global dependencies; Second, it recovers more compromised information (e.g., occlusion and motion blur) in flow estimation through long-range temporal association in dynamic scenes; Third, it enables a concise self-learning paradigm and effectively eliminates the complex and laborious multi-stage pre-training procedures. We achieve the state-of-the-art results on the Sintel, KITTI-15, as well as several downstream tasks, including video object detection, interpolation and stabilization. For its efficacy, we hope TransFlow could serve as a flexible baseline for optical flow estimation.

Simple Cues Lead to a Strong Multi-Object Tracker
Seidenschwarz, Jenny and Brasó, Guillem and Serrano, Ví



Research question: This paper asks whether simple tracking-by-detection methods can match the performance of end-to-end models in multi-object tracking.
Motivation: Attention-based methods have achieved impressive results in multi-object tracking, and the authors ask whether classic tracking-by-detection can achieve the same.
Method: The authors propose two key ingredients that allow a standard re-identification network to excel at appearance-based tracking, combining appearance features with a simple motion model.
Results: Through extensive failure-case analysis, the authors show that this combination yields strong tracking results. The tracker generalizes to four public datasets, namely MOT17, MOT20, BDD100k, and DanceTrack, achieving state-of-the-art performance.

For a long time, the most common paradigm in MultiObject Tracking was tracking-by-detection (TbD), where objects are first detected and then associated over video frames. For association, most models resorted to motion and appearance cues, e.g., re-identification networks. Recent approaches based on attention propose to learn the cues in a data-driven manner, showing impressive results. In this paper, we ask ourselves whether simple good old TbD methods are also capable of achieving the performance of end-to-end models. To this end, we propose two key ingredients that allow a standard re-identification network to excel at appearance-based tracking. We extensively analyse its failure cases, and show that a combination of our appearance features with a simple motion model leads to strong tracking results. Our tracker generalizes to four public datasets, namely MOT17, MOT20, BDD100k, and DanceTrack, achieving state-of-the-art performance. https://github.com/dvl-tum/GHOST
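In tracking-by-detection, each new detection is matched to an existing track by blending an appearance cue (e.g., cosine distance between re-ID embeddings) with a simple motion cue (e.g., distance to a constant-velocity prediction). A minimal greedy-matching sketch of such an association step (the weighting, normalization, and greedy matcher are illustrative assumptions; GHOST's actual gating and matching are more involved):

```python
import numpy as np

def associate(track_emb, track_pred, det_emb, det_pos, w_app=0.7):
    """Greedy one-to-one matching on a blended appearance + motion cost.
    track_emb/det_emb: (T, D)/(N, D) unit-norm re-ID embeddings.
    track_pred/det_pos: (T, 2)/(N, 2) predicted and detected positions."""
    app = 1.0 - track_emb @ det_emb.T                            # cosine distance
    motion = np.linalg.norm(track_pred[:, None] - det_pos[None], axis=2)
    cost = w_app * app + (1.0 - w_app) * motion / (motion.max() + 1e-9)
    matches, used_t, used_d = [], set(), set()
    for t, d in sorted(np.ndindex(cost.shape), key=lambda td: cost[td]):
        if t not in used_t and d not in used_d:   # cheapest free pair first
            matches.append((t, d))
            used_t.add(t)
            used_d.add(d)
    return matches

emb = np.eye(2)                                   # two distinct appearances
tracks_pred = np.array([[0.0, 0.0], [5.0, 5.0]])  # constant-velocity forecasts
dets = np.array([[5.1, 5.0], [0.2, 0.1]])         # detections, order swapped
print(associate(emb, tracks_pred, emb[::-1], dets))  # [(1, 0), (0, 1)]
```

In a full tracker, unmatched detections above a cost threshold would spawn new tracks, and unmatched tracks would age out; a Hungarian solver can replace the greedy loop for optimal assignment.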

TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition
Dave, Ishan Rajendrakumar and Rizve, Mamshad Nayeem and Chen, Chen and Shah, Mubarak



Research question: Video understanding tasks require reasoning over both spatial and temporal dimensions; existing semi-supervised methods rely on hard input inductive biases such as two modalities (RGB and optical flow) or two streams at different playback rates.
Motivation: Semi-supervised learning can benefit the video domain more than images because of its higher annotation cost and dimensionality, and any video understanding task requires reasoning over both spatial and temporal dimensions.
Method: We propose TimeBalance, a student-teacher semi-supervised learning framework that exploits temporally-invariant and temporally-distinctive self-supervised video representations. The two representations complement each other depending on the nature of the action, and the knowledge of the two teachers is dynamically combined per unlabeled video via a novel temporal similarity-based reweighting scheme.
Results: Our method achieves state-of-the-art performance on three action recognition benchmarks: UCF101, HMDB51, and Kinetics400.

Semi-Supervised Learning can be more beneficial for the video domain compared to images because of its higher annotation cost and dimensionality. Besides, any video understanding task requires reasoning over both spatial and temporal dimensions. In order to learn both the static and motion related features for the semi-supervised action recognition task, existing methods rely on hard input inductive biases like using two-modalities (RGB and Optical-flow) or two-stream of different playback rates. Instead of utilizing unlabeled videos through diverse input streams, we rely on self-supervised video representations, particularly, we utilize temporally-invariant and temporally-distinctive representations. We observe that these representations complement each other depending on the nature of the action. Based on this observation, we propose a student-teacher semi-supervised learning framework, TimeBalance, where we distill the knowledge from a temporally-invariant and a temporally-distinctive teacher. Depending on the nature of the unlabeled video, we dynamically combine the knowledge of these two teachers based on a novel temporal similarity-based reweighting scheme. Our method achieves state-of-the-art performance on three action recognition benchmarks: UCF101, HMDB51, and Kinetics400. Code: https://github.com/DAVEISHAN/TimeBalance.
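The reweighting idea can be sketched as blending the two teachers' predictions with a weight derived from how temporally similar a video's clips are: temporally stable videos lean on the invariant teacher, fast-changing ones on the distinctive teacher. The weighting function below is a plausible stand-in under that assumption, not the paper's exact scheme:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def blended_teacher(logits_inv, logits_dis, clip_feats):
    """logits_inv/logits_dis: (C,) logits from the two teachers.
    clip_feats: (K, D) unit-norm features of K clips from one video.
    Mean pairwise clip similarity sets the invariant teacher's weight."""
    sim = clip_feats @ clip_feats.T                 # (K, K) cosine similarities
    k = len(clip_feats)
    mean_sim = (sim.sum() - k) / (k * (k - 1))      # exclude the diagonal
    w_inv = np.clip(mean_sim, 0.0, 1.0)             # similar clips -> invariant
    return softmax(w_inv * logits_inv + (1.0 - w_inv) * logits_dis)

clips_static = np.tile(np.array([[1.0, 0.0]]), (3, 1))   # identical clips
p = blended_teacher(np.array([2.0, 0.0]), np.array([0.0, 2.0]), clips_static)
print(p.argmax())  # 0: the temporally-invariant teacher dominates
```

The resulting distribution `p` would then serve as the soft pseudo-label that the student is distilled toward on unlabeled videos.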

Unified Keypoint-Based Action Recognition Framework via Structured Keypoint Pooling
Hachiuma, Ryo and Sato, Fumiaki and Sekii, Taiki



Research question: This paper addresses three limitations of conventional skeleton-based action recognition: skeleton detection and tracking errors, poor variety of the targeted actions, and person-wise and frame-wise action recognition.
Motivation: A point cloud deep-learning paradigm is introduced to action recognition, together with a novel deep neural network architecture, Structured Keypoint Pooling, to address these problems.
Method: A unified framework and novel architecture that sparsely aggregates keypoint features, achieving robustness against input errors and efficiently treating time-series keypoints consisting of human skeletons and non-human object contours as an input 3D point cloud.
Results: Experiments show the method performs well against all these limitations and outperforms state-of-the-art skeleton-based action recognition and spatio-temporal action localization methods.

This paper simultaneously addresses three limitations associated with conventional skeleton-based action recognition; skeleton detection and tracking errors, poor variety of the targeted actions, as well as person-wise and frame-wise action recognition. A point cloud deep-learning paradigm is introduced to the action recognition, and a unified framework along with a novel deep neural network architecture called Structured Keypoint Pooling is proposed. The proposed method sparsely aggregates keypoint features in a cascaded manner based on prior knowledge of the data structure (which is inherent in skeletons), such as the instances and frames to which each keypoint belongs, and achieves robustness against input errors. Its less constrained and tracking-free architecture enables time-series keypoints consisting of human skeletons and nonhuman object contours to be efficiently treated as an input 3D point cloud and extends the variety of the targeted action. Furthermore, we propose a Pooling-Switching Trick inspired by Structured Keypoint Pooling. This trick switches the pooling kernels between the training and inference phases to detect person-wise and frame-wise actions in a weakly supervised manner using only video-level action labels. This trick enables our training scheme to naturally introduce novel data augmentation, which mixes multiple point clouds extracted from different videos. In the experiments, we comprehensively verify the effectiveness of the proposed method against the limitations, and the method outperforms state-of-the-art skeleton-based action recognition and spatio-temporal action localization methods.

A Large-Scale Robustness Analysis of Video Action Recognition Models
Schiappa, Madeline Chantry and Biyani, Naman and Kamtam, Prudvi and Vyas, Shruti and Palangi, Hamid and Vineet, Vibhav and Rawat, Yogesh S.



Research question: This paper performs a large-scale robustness analysis of existing video action recognition models, focusing on robustness against real-world distribution-shift perturbations rather than adversarial perturbations.
Motivation: Video action recognition has progressed rapidly in recent years, with advanced methods based on convolutional neural networks (CNNs) and Transformers. However, whether these models remain robust under real-world distribution shifts is unclear.
Method: Four benchmark datasets (HMDB51-P, UCF101-P, Kinetics400-P, and SSv2-P) are proposed for the analysis, and six state-of-the-art action recognition models are studied under 90 different perturbations.
Results: The study finds that 1) Transformer-based methods are more robust than CNN-based ones; 2) pre-training improves robustness more for Transformer-based methods than for CNN-based methods; and 3) all studied models are robust to temporal perturbations on all datasets except SSv2, suggesting that the importance of temporal information for action recognition varies by dataset and activity. The study also examines the role of data augmentation in model robustness and presents UCF101-DS, a dataset with real-world distribution shifts, to further validate some of these findings.

We have seen great progress in video action recognition in recent years. There are several models based on convolutional neural network (CNN) and some recent transformer based approaches which provide top performance on existing benchmarks. In this work, we perform a large-scale robustness analysis of these existing models for video action recognition. We focus on robustness against real-world distribution shift perturbations instead of adversarial perturbations. We propose four different benchmark datasets, HMDB51-P, UCF101-P, Kinetics400-P, and SSv2-P to perform this analysis. We study robustness of six state-of-the-art action recognition models against 90 different perturbations. The study reveals some interesting findings, 1) Transformer based models are consistently more robust compared to CNN based models, 2) Pre-training improves robustness for Transformer based models more than CNN based models, and 3) All of the studied models are robust to temporal perturbations for all datasets but SSv2; suggesting the importance of temporal information for action recognition varies based on the dataset and activities. Next, we study the role of augmentations in model robustness and present a real-world dataset, UCF101-DS, which contains realistic distribution shifts, to further validate some of these findings. We believe this study will serve as a benchmark for future research in robust video action recognition.

Blind Video Deflickering by Neural Filtering With a Flawed Atlas
Lei, Chenyang and Ren, Xuanchi and Zhang, Zhaoxiang and Chen, Qifeng



Research question: How to effectively remove flickering artifacts from videos.
Motivation: Many videos contain flickering artifacts, and existing deflickering methods usually require specific guidance such as the flickering frequency, manual annotations, or extra consistent videos.
Method: A general deflickering framework is proposed that takes only a single flickering video as input, with no other guidance. Its core is the use of a neural atlas in cooperation with a neural filtering strategy.
Results: Extensive experiments on real-world flickering videos validate the effectiveness of the method, which even outperforms baselines that use extra guidance.

Many videos contain flickering artifacts; common causes of flicker include video processing algorithms, video generation algorithms, and capturing videos under specific situations. Prior work usually requires specific guidance such as the flickering frequency, manual annotations, or extra consistent videos to remove the flicker. In this work, we propose a general flicker removal framework that only receives a single flickering video as input without additional guidance. Since it is blind to a specific flickering type or guidance, we name this "blind deflickering." The core of our approach is utilizing the neural atlas in cooperation with a neural filtering strategy. The neural atlas is a unified representation for all frames in a video that provides temporal consistency guidance but is flawed in many cases. To this end, a neural network is trained to mimic a filter to learn the consistent features (e.g., color, brightness) and avoid introducing the artifacts in the atlas. To validate our method, we construct a dataset that contains diverse real-world flickering videos. Extensive experiments show that our method achieves satisfying deflickering performance and even outperforms baselines that use extra guidance on a public benchmark. The source code is publicly available at https://chenyanglei.github.io/deflicker.

Trajectory-Aware Body Interaction Transformer for Multi-Person Pose Forecasting
Peng, Xiaogang and Mao, Siyuan and Wu, Zizhao



Research question: Multi-person pose forecasting remains challenging, especially in modeling fine-grained human body interaction in complex crowds.
Motivation: Existing methods typically represent the whole pose sequence as a temporal series, overlooking interactive influences among people based on skeletal body parts.
Method: We propose a novel Trajectory-Aware Body Interaction Transformer (TBIFormer) for multi-person pose forecasting that effectively models interactions between body parts. Specifically, we construct a Temporal Body Partition Module that transforms all pose sequences into multi-person body-part sequences, retaining spatial and temporal information based on body semantics. We then design a Social Body Interaction Self-Attention (SBI-MSA) module that uses the transformed sequences to learn body-part dynamics for interactions. Furthermore, unlike prior Euclidean-distance-based spatial encodings, we present a novel and effective Trajectory-Aware Relative Position Encoding that provides SBI-MSA with discriminative spatial information and additional interaction cues.
Results: Empirical evaluation on CMU-Mocap, MuPoTS-3D, and synthesized datasets (6-10 persons) shows that our method greatly outperforms the state of the art on both short- and long-term horizons.

Multi-person pose forecasting remains a challenging problem, especially in modeling fine-grained human body interaction in complex crowd scenarios. Existing methods typically represent the whole pose sequence as a temporal series, yet overlook interactive influences among people based on skeletal body parts. In this paper, we propose a novel Trajectory-Aware Body Interaction Transformer (TBIFormer) for multi-person pose forecasting via effectively modeling body part interactions. Specifically, we construct a Temporal Body Partition Module that transforms all the pose sequences into a Multi-Person Body-Part sequence to retain spatial and temporal information based on body semantics. Then, we devise a Social Body Interaction Self-Attention (SBI-MSA) module, utilizing the transformed sequence to learn body part dynamics for inter- and intra-individual interactions. Furthermore, different from prior Euclidean distance-based spatial encodings, we present a novel and efficient Trajectory-Aware Relative Position Encoding for SBI-MSA to offer discriminative spatial information and additional interactive clues. On both short- and long-term horizons, we empirically evaluate our framework on CMU-Mocap, MuPoTS-3D as well as synthesized datasets (6-10 persons), and demonstrate that our method greatly outperforms the state-of-the-art methods.

Learning Spatial-Temporal Implicit Neural Representations for Event-Guided Video Super-Resolution
Lu, Yunfan and Wang, Zipeng and Liu, Minjie and Wang, Hongjian and Wang, Lin



Research question: How to exploit the high temporal resolution of events to achieve video super-resolution (VSR) at arbitrary scales.
Motivation: Event cameras sense intensity changes asynchronously with high dynamic range and low latency, which inspires using events to guide the challenging VSR task.
Method: A novel framework is proposed that incorporates the spatial-temporal interpolation of events into a unified VSR framework. The key idea is to learn implicit neural representations from queried spatial-temporal coordinates and from features of both RGB frames and events.
Results: Extensive experiments on real-world datasets show that the method significantly surpasses prior art and achieves VSR at arbitrary scales.

Event cameras sense the intensity changes asynchronously and produce event streams with high dynamic range and low latency. This has inspired research endeavors utilizing events to guide the challenging video super-resolution (VSR) task. In this paper, we make the first attempt to address a novel problem of achieving VSR at random scales by taking advantage of the high temporal resolution property of events. This is hampered by the difficulties of representing the spatial-temporal information of events when guiding VSR. To this end, we propose a novel framework that incorporates the spatial-temporal interpolation of events to VSR in a unified framework. Our key idea is to learn implicit neural representations from queried spatial-temporal coordinates and features from both RGB frames and events. Our method contains three parts. Specifically, the Spatial-Temporal Fusion (STF) module first learns the 3D features from events and RGB frames. Then, the Temporal Filter (TF) module unlocks more explicit motion information from the events near the queried timestamp and generates the 2D features. Lastly, the Spatial-Temporal Implicit Representation (STIR) module recovers the SR frame in arbitrary resolutions from the outputs of these two modules. In addition, we collect a real-world dataset with spatially aligned events and RGB frames. Extensive experiments show that our method significantly surpasses the prior arts and achieves VSR with random scales, e.g., 6.5. Code and dataset are available at https://.
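The central mechanism, querying an implicit field at arbitrary spatial-temporal coordinates to decode a frame at any resolution, can be sketched minimally. The stand-in below uses random, untrained weights and nearest-neighbour feature lookup; the actual STF/TF/STIR modules are learned networks, and every name and shape here is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, b1, W2, b2):
    """Tiny 2-layer MLP: the implicit decoder mapping (coords, feature) -> RGB."""
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2 + b2

C = 8  # feature channels produced by the (assumed) fused encoder
W1, b1 = rng.normal(size=(3 + C, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 3)), np.zeros(3)

def decode_frame(feat_grid, t, H_out, W_out):
    """Query the implicit field at any output resolution and timestamp t."""
    Hf, Wf, _ = feat_grid.shape
    ys, xs = np.meshgrid(np.linspace(0, 1, H_out),
                         np.linspace(0, 1, W_out), indexing="ij")
    # Nearest-neighbour lookup stands in for the learned feature sampling.
    fy = np.clip((ys * (Hf - 1)).round().astype(int), 0, Hf - 1)
    fx = np.clip((xs * (Wf - 1)).round().astype(int), 0, Wf - 1)
    coords = np.stack([xs, ys, np.full_like(xs, t)], axis=-1)       # (H, W, 3)
    queries = np.concatenate([coords, feat_grid[fy, fx]], axis=-1)  # (H, W, 3+C)
    return mlp(queries.reshape(-1, 3 + C), W1, b1, W2, b2).reshape(H_out, W_out, 3)

lr_feats = rng.normal(size=(4, 4, C))                   # low-res fused feature map
sr = decode_frame(lr_feats, t=0.5, H_out=13, W_out=9)   # arbitrary, non-integer scale
```

Because the decoder takes continuous coordinates, the output resolution (13x9 here) is decoupled from the 4x4 feature grid, which is what enables VSR at arbitrary scales.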

Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network
Zhang, Zhicheng and Wang, Lijuan and Yang, Jufeng



Research question: How to predict the emotions of user-generated videos (UGVs) more accurately.
Motivation: Existing methods mainly focus on a few key visual frames, which may limit their ability to encode the context that conveys the intended emotions.
Method: A cross-modal temporal erasing network is proposed that, in a weakly supervised manner, locates not only keyframes but also context- and audio-related information. Intra- and inter-modal relationships among segments are first leveraged to accurately select keyframes, which are then iteratively erased to encourage the model to focus on contexts that contain complementary information.
Results: Extensive experiments on three challenging video emotion benchmarks show that the method outperforms state-of-the-art approaches. Code is released at https://github.com/nku-zhichengzhang/WECL.

Automatically predicting the emotions of user-generated videos (UGVs) receives increasing interest recently. However, existing methods mainly focus on a few key visual frames, which may limit their capacity to encode the context that depicts the intended emotions. To tackle that, in this paper, we propose a cross-modal temporal erasing network that locates not only keyframes but also context and audio-related information in a weakly-supervised manner. In specific, we first leverage the intra- and inter-modal relationship among different segments to accurately select keyframes. Then, we iteratively erase keyframes to encourage the model to concentrate on the contexts that include complementary information. Extensive experiments on three challenging video emotion benchmarks demonstrate that our method performs favorably against state-of-the-art approaches. The code is released on https://github.com/nku-zhichengzhang/WECL.
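The iterative erasing step can be sketched in a few lines; this is a hedged toy version, not the released code, and the scoring/erasing details are assumptions. The highest-scoring frames are zeroed out so that subsequent passes must rely on contextual frames.

```python
import numpy as np

def erase_keyframes(frame_scores, features, n_rounds=2, k=1):
    """Iteratively zero out the k highest-scoring frames so later passes
    must attend to contextual frames (toy stand-in for the erasing step)."""
    feats = features.copy()
    scores = frame_scores.copy()
    erased = []
    for _ in range(n_rounds):
        top = np.argsort(scores)[-k:]  # indices of current keyframes
        erased.extend(top.tolist())
        feats[top] = 0.0               # erase their features
        scores[top] = -np.inf          # exclude them from later rounds
    return feats, sorted(erased)

scores = np.array([0.1, 0.9, 0.3, 0.7])  # per-frame relevance scores
feats = np.ones((4, 2))                  # per-frame features
out, erased = erase_keyframes(scores, feats)
```

In the real network, a new loss is computed after each erasure so the model learns to extract emotion cues from the remaining context rather than from the keyframes alone.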

Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline
Geng, Tiantian and Wang, Teng and Duan, Jinming and Cong, Runmin and Zheng, Feng



Research question: Existing audio-visual event localization methods handle only manually trimmed videos with a single instance each, which does not match real-life natural videos that contain multiple audio-visual events of different categories.
Motivation: To better fit real-life applications, this paper focuses on dense audio-visual event localization, which aims to jointly localize and recognize all audio-visual events occurring in untrimmed videos.
Method: The first Untrimmed Audio-Visual (UnAV-100) dataset is introduced, containing 10K untrimmed videos with over 30K audio-visual events. A new learning framework is then used to tackle the task, fully integrating audio and visual modalities to localize audio-visual events of various lengths and capture the dependencies between them in a single pass.
Results: Extensive experiments demonstrate the effectiveness of the method and the importance of multi-scale cross-modal perception and dependency modeling for this task.

Existing audio-visual event localization (AVE) handles manually trimmed videos with only a single instance in each of them. However, this setting is unrealistic as natural videos often contain numerous audio-visual events with different categories. To better adapt to real-life applications, in this paper we focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video. The problem is challenging as it requires fine-grained audio-visual scene and context understanding. To tackle this problem, we introduce the first Untrimmed Audio-Visual (UnAV-100) dataset, which contains 10K untrimmed videos with over 30K audio-visual events. Each video has 2.8 audio-visual events on average, and the events are usually related to each other and might co-occur as in real-life scenes. Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events with various lengths and capture dependencies between them in a single pass. Extensive experiments demonstrate the effectiveness of our method as well as the significance of multi-scale cross-modal perception and dependency modeling for this task.

Streaming Video Model
Zhao, Yucheng and Luo, Chong and Tang, Chuanxin and Chen, Dongdong and Codella, Noel and Zha, Zheng-Jun



Research question: This paper addresses an architectural problem in video understanding: sequence-based and frame-based video tasks are traditionally handled by two separate architectures.
Motivation: Treating sequence-based and frame-based tasks separately is inefficient, so the authors propose unifying both kinds of tasks in a single streaming video architecture.
Method: The authors propose a new streaming video architecture called the Streaming Vision Transformer (S-ViT). S-ViT first produces frame-level features with a memory-enabled, temporally-aware spatial encoder to serve frame-based video tasks; the frame features are then fed into a task-related temporal decoder to obtain spatiotemporal features for sequence-based tasks.
Results: Experiments show that S-ViT achieves state-of-the-art accuracy on sequence-based action recognition and a competitive advantage over conventional architectures on frame-based multiple object tracking. The authors view the streaming video model concept and the S-ViT implementation as solid steps toward a unified deep-learning architecture for video understanding.

Video understanding tasks have traditionally been modeled by two separate architectures, specially tailored for two distinct tasks. Sequence-based video tasks, such as action recognition, use a video backbone to directly extract spatiotemporal features, while frame-based video tasks, such as multiple object tracking (MOT), rely on single fixed-image backbone to extract spatial features. In contrast, we propose to unify video understanding tasks into one novel streaming video architecture, referred to as Streaming Vision Transformer (S-ViT). S-ViT first produces frame-level features with a memory-enabled temporally-aware spatial encoder to serve the frame-based video tasks. Then the frame features are input into a task-related temporal decoder to obtain spatiotemporal features for sequence-based tasks. The efficiency and efficacy of S-ViT is demonstrated by the state-of-the-art accuracy in the sequence-based action recognition task and the competitive advantage over conventional architecture in the frame-based MOT task. We believe that the concept of streaming video model and the implementation of S-ViT are solid steps towards a unified deep learning architecture for video understanding. Code will be available at https://github.com/yuzhms/Streaming-Video-Model.

Masked Motion Encoding for Self-Supervised Video Representation Learning
Sun, Xinyu and Chen, Peihao and Chen, Liangwei and Li, Changhao and Li, Thomas H. and Tan, Mingkui and Gan, Chuang



Research question: How to learn discriminative video representations from unlabeled videos, a challenging but crucial problem for video analysis.
Motivation: Existing methods learn representation models by predicting the appearance contents in masked regions, but this may be insufficient for modeling temporal clues, since appearance contents can easily be reconstructed from a single frame.
Method: Masked Motion Encoding (MME) is proposed, a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues. Specifically, two key challenges are addressed to improve representation quality: 1) how to properly represent possible long-term motion across multiple frames; and 2) how to obtain fine-grained temporal clues from sparsely sampled videos.
Results: Pre-trained with the MME paradigm, the model can anticipate long-term and fine-grained motion details. Code is available at https://github.com/XinyuSun/MME.

How to learn discriminative video representation from unlabeled videos is challenging but crucial for video analysis. The latest attempts seek to learn a representation model by predicting the appearance contents in the masked regions. However, simply masking and recovering appearance contents may not be sufficient to model temporal clues as the appearance contents can be easily reconstructed from a single frame. To overcome this limitation, we present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues. In MME, we focus on addressing two critical challenges to improve the representation performance: 1) how to well represent the possible long-term motion across multiple frames; and 2) how to obtain fine-grained temporal clues from sparsely sampled videos. Motivated by the fact that human is able to recognize an action by tracking objects' position changes and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions. Besides, given the sparse video input, we enforce the model to reconstruct dense motion trajectories in both spatial and temporal dimensions. Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details. Code is available at https://github.com/XinyuSun/MME.

Text-Visual Prompting for Efficient 2D Temporal Video Grounding
Zhang, Yimeng and Chen, Xin and Jia, Jinghan and Liu, Sijia and Ding, Ke



Research question: This paper studies temporal video grounding (TVG), which predicts the start/end time points of moments described by a text sentence within a long untrimmed video.
Motivation: Although TVG has made remarkable progress in recent years thanks to fine-grained 3D visual features, the high complexity of 3D convolutional neural networks makes extracting dense 3D visual features time-consuming and demanding in memory and compute.
Method: Toward efficient TVG, a novel text-visual prompting (TVP) framework is proposed that incorporates optimized perturbation patterns (called "prompts") into both the visual inputs and the textual features of a TVG model. In sharp contrast to 3D CNNs, TVP enables effective co-training of the vision encoder and language encoder in a 2D TVG model, improving cross-modal feature fusion using only low-complexity, sparse 2D visual features. A Temporal-Distance IoU (TDIoU) loss is also proposed for efficient TVG learning.
Results: Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions, show that TVP significantly boosts 2D TVG performance (e.g., a 9.79% improvement on Charades-STA and 30.77% on ActivityNet Captions) and achieves 5x inference acceleration over TVG with 3D visual features. Code is available at Open.

In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting/ending time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, the TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming, which calls for intensive memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (that we call 'prompts') into both visual inputs and textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train vision encoder and language encoder in a 2D TVG model and improves the performance of crossmodal feature fusion using only low-complexity sparse 2D visual features. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions datasets, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves 5x inference acceleration over TVG using 3D visual features. Codes are available at Open.Intel.
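The abstract does not spell out the TDIoU formulation, but a 1D DIoU-style loss over temporal intervals conveys the idea: penalize both low temporal overlap and the normalized distance between interval centres. The function below is an assumed stand-in, not the paper's exact loss.

```python
def tdiou_loss(pred, gt):
    """DIoU-style loss for 1D temporal intervals (start, end).

    Returns 1 - IoU plus a squared, normalized centre-distance penalty,
    a hedged stand-in for the paper's Temporal-Distance IoU loss.
    """
    ps, pe = pred
    gs, ge = gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    iou = inter / union if union > 0 else 0.0
    enclose = max(pe, ge) - min(ps, gs)  # smallest span covering both intervals
    if enclose == 0:
        return 1.0
    center_dist = abs((ps + pe) / 2.0 - (gs + ge) / 2.0)
    return 1.0 - iou + (center_dist / enclose) ** 2

perfect = tdiou_loss((0.0, 1.0), (0.0, 1.0))   # fully overlapping intervals
disjoint = tdiou_loss((0.0, 1.0), (2.0, 3.0))  # non-overlapping intervals
```

Unlike a plain 1 - IoU loss, the centre-distance term still provides a useful gradient signal when the predicted and ground-truth moments do not overlap at all.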

Fast Contextual Scene Graph Generation With Unbiased Context Augmentation
Jin, Tianlei and Guo, Fangtai and Meng, Qiwei and Zhu, Shiqiang and Xi, Xiangming and Wang, Wen and Mu, Zonghao and Song, Wei



Research question: This paper aims to address the long-tail bias and slow inference speed of scene graph generation methods.
Motivation: Humans can analyze relationships between objects from context descriptions alone, an abstract cognitive process that may be guided by experience.
Method: A contextual scene graph generation (C-SGG) method that uses no visual information is proposed, together with a context augmentation method. By slightly perturbing object positions and sizes in the original dataset, the context augmentation produces diverse context descriptions for unbiased C-SGG training, alleviating long-tail bias. A context-guided visual scene graph generation (CV-SGG) method is also introduced, which leverages the C-SGG experience to guide vision toward possible predicates.
Results: Experiments show that C-SGG alleviates long-tail bias and omits the heavy computation of visual feature extraction, achieving real-time scene graph generation; CV-SGG achieves a good trade-off between common predicates and tail predicates.

Scene graph generation (SGG) methods have historically suffered from long-tail bias and slow inference speed. In this paper, we notice that humans can analyze relationships between objects relying solely on context descriptions,and this abstract cognitive process may be guided by experience. For example, given descriptions of cup and table with their spatial locations, humans can speculate possible relationships < cup, on, table > or < table, near, cup >. Even without visual appearance information, some impossible predicates like flying in and looking at can be empirically excluded. Accordingly, we propose a contextual scene graph generation (C-SGG) method without using visual information and introduce a context augmentation method. We propose that slight perturbations in the position and size of objects do not essentially affect the relationship between objects. Therefore, at the context level, we can produce diverse context descriptions by using a context augmentation method based on the original dataset. These diverse context descriptions can be used for unbiased training of C-SGG to alleviate long-tail bias. In addition, we also introduce a context guided visual scene graph generation (CV-SGG) method, which leverages the C-SGG experience to guide vision to focus on possible predicates. Through extensive experiments on the publicly available dataset, C-SGG alleviates long-tail bias and omits the huge computation of visual feature extraction to realize real-time SGG. CV-SGG achieves a great trade-off between common predicates and tail predicates.
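The context augmentation rests on the stated observation that slight perturbations of object position and size do not essentially change the relationship between objects. A minimal sketch follows; the (cx, cy, w, h) box format and jitter magnitudes are illustrative assumptions.

```python
import random

def augment_context(boxes, pos_jitter=0.05, size_jitter=0.05, seed=None):
    """Slightly perturb object positions and sizes to synthesize diverse
    context descriptions. Boxes are (cx, cy, w, h), normalized to [0, 1]."""
    rng = random.Random(seed)
    out = []
    for cx, cy, w, h in boxes:
        out.append((
            min(1.0, max(0.0, cx + rng.uniform(-pos_jitter, pos_jitter))),
            min(1.0, max(0.0, cy + rng.uniform(-pos_jitter, pos_jitter))),
            max(1e-3, w * (1 + rng.uniform(-size_jitter, size_jitter))),
            max(1e-3, h * (1 + rng.uniform(-size_jitter, size_jitter))),
        ))
    return out

# The cup/table example from the abstract: jittered copies keep <cup, on, table>.
cup = (0.5, 0.4, 0.1, 0.1)
table = (0.5, 0.7, 0.8, 0.3)
aug = augment_context([cup, table], seed=0)
```

Each jittered copy is a new training sample with the same predicate label, so rare predicates can be oversampled at negligible cost, which is how the augmentation helps counter the long tail.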

Event-Based Blurry Frame Interpolation Under Blind Exposure
Weng, Wenming and Zhang, Yueyi and Xiong, Zhiwei



Research question: How to restore sharp, high frame-rate video from low frame-rate blurry video.
Motivation: Existing blurry frame interpolation methods require a predefined, known exposure time, and their performance degrades severely on videos captured in the wild.
Method: Blurry frame interpolation under blind exposure is performed with the assistance of an event camera. First, an exposure estimation strategy guided by event streams recovers the lost exposure prior, making the blind-exposure problem well-posed. Second, a temporal-exposure control strategy models the mutual constraint through iterative residual learning.
Results: Under blind exposure, the method outperforms existing methods on both synthetic and self-collected real-world datasets.

Restoring sharp high frame-rate videos from low frame-rate blurry videos is a challenging problem. Existing blurry frame interpolation methods assume a predefined and known exposure time, which suffer from severe performance drop when applied to videos captured in the wild. In this paper, we study the problem of blurry frame interpolation under blind exposure with the assistance of an event camera. The high temporal resolution of the event camera is beneficial to obtain the exposure prior that is lost during the imaging process. Besides, sharp frames can be restored using event streams and blurry frames relying on the mutual constraint among them. Therefore, we first propose an exposure estimation strategy guided by event streams to estimate the lost exposure prior, transforming the blind exposure problem well-posed. Second, we propose to model the mutual constraint with a temporal-exposure control strategy through iterative residual learning. Our blurry frame interpolation method achieves a distinct performance boost over existing methods on both synthetic and self-collected real-world datasets under blind exposure.

Modular Memorability: Tiered Representations for Video Memorability Prediction
Dumont, Théo and Hevia, Juan Segundo and Fosco, Camilo L.



Research question: How best to estimate the memorability of visual content, currently a source of debate in the memorability community.
Motivation: Understanding how different key properties of images and videos affect their consolidation into memory.
Method: The impact of several features is analyzed, and a model is developed that emulates the most important parts of a proposed "pathway to memory". This leads to the M3-S model, a novel memorability network that processes input videos in a modular fashion.
Results: The representations learned by the different modules are non-trivial and substantially different from one another, and some representations perform better at memorability prediction than others. The approach surpasses the state of the art on the two largest video memorability datasets and opens new applications in the field.

The question of how to best estimate the memorability of visual content is currently a source of debate in the memorability community. In this paper, we propose to explore how different key properties of images and videos affect their consolidation into memory. We analyze the impact of several features and develop a model that emulates the most important parts of a proposed "pathway to memory": a simple but effective way of representing the different hurdles that new visual content needs to surpass to stay in memory. This framework leads to the construction of our M3-S model, a novel memorability network that processes input videos in a modular fashion. Each module of the network emulates one of the four key steps of the pathway to memory: raw encoding, scene understanding, event understanding and memory consolidation. We find that the different representations learned by our modules are non-trivial and substantially different from each other. Additionally, we observe that certain representations tend to perform better at the task of memorability prediction than others, and we introduce an in-depth ablation study to support our results. Our proposed approach surpasses the state of the art on the two largest video memorability datasets and opens the door to new applications in the field.

Re2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization
Zhao, Chen and Liu, Shuming and Mangalam, Karttikeya and Ghanem, Bernard



Research question: How to perform the long-form reasoning needed to predict actions of various durations and complex content in temporal action localization (TAL).
Motivation: Given limited GPU memory, training TAL end to end on long videos is a major challenge. Most methods can only train on pre-extracted features without optimizing them for localization, which limits localization performance.
Method: A novel end-to-end method, Re2TAL, is proposed that rewires pretrained video backbones for reversible TAL. Re2TAL builds a backbone from reversible modules whose inputs can be recovered from their outputs, so the bulky intermediate activations can be cleared from memory during training.
Results: Using only the RGB modality, Re2TAL reaches 37.01% average mAP on ActivityNet-v1.3, a new state-of-the-art record, and 64.9% mAP at tIoU=0.5 on THUMOS-14, outperforming all other RGB-only methods.

Temporal action localization (TAL) requires long-form reasoning to predict actions of various durations and complex content. Given limited GPU memory, training TAL end to end (i.e., from videos to predictions) on long videos is a significant challenge. Most methods can only train on pre-extracted features without optimizing them for the localization problem, consequently limiting localization performance. In this work, to extend the potential in TAL networks, we propose a novel end-to-end method Re2TAL, which rewires pretrained video backbones for reversible TAL. Re2TAL builds a backbone with reversible modules, where the input can be recovered from the output such that the bulky intermediate activations can be cleared from memory during training. Instead of designing one single type of reversible module, we propose a network rewiring mechanism, to transform any module with a residual connection to a reversible module without changing any parameters. This provides two benefits: (1) a large variety of reversible networks are easily obtained from existing and even future model designs, and (2) the reversible models require much less training effort as they reuse the pre-trained parameters of their original non-reversible versions. Re2TAL, only using the RGB modality, reaches 37.01% average mAP on ActivityNet-v1.3, a new state-of-the-art record, and mAP 64.9% at tIoU=0.5 on THUMOS-14, outperforming all other RGB-only methods. Code is available at https://github.com/coolbay/Re2TAL.
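The memory saving comes from reversible coupling: since a reversible module's inputs can be recomputed exactly from its outputs, intermediate activations need not be stored for the backward pass. A generic two-stream sketch follows; the toy branch functions stand in for the rewired residual branches and are not the paper's actual modules.

```python
import numpy as np

def f(x):
    # Toy residual branch; in the paper this is a rewired pretrained sub-network.
    return np.tanh(x)

def g(x):
    # Second toy residual branch.
    return 0.5 * x

def rev_forward(x1, x2):
    """Reversible coupling over two activation streams."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_inverse(y1, y2):
    """Recompute the inputs from the outputs, so nothing need be cached."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

x1, x2 = np.array([1.0, -2.0]), np.array([0.5, 3.0])
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)  # exactly recovers (x1, x2)
```

Note that f and g keep their parameters unchanged; this mirrors the rewiring claim above that any residual-connected module can be made reversible while reusing its pretrained weights.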

Data-Driven Feature Tracking for Event Cameras
Messikommer, Nico and Fang, Carter and Gehrig, Mathias and Scaramuzza, Davide



Research question: How to improve the performance of event cameras on low-latency, low-bandwidth feature tracking tasks.
Motivation: Existing feature tracking methods for event cameras require extensive parameter tuning, are sensitive to noise, and do not generalize to different scenarios.
Method: A data-driven feature tracker for event cameras is proposed that leverages low-latency events to track features detected in a grayscale frame. Robust performance is achieved via a novel frame attention module that shares information across feature tracks.
Results: Transferring zero-shot from synthetic to real data, the tracker outperforms existing methods in relative feature age by up to 120% while also achieving the lowest latency. Adapting the tracker to real data with a novel self-supervision strategy widens this gap to 130%.

Because of their high temporal resolution, increased resilience to motion blur, and very sparse output, event cameras have been shown to be ideal for low-latency and low-bandwidth feature tracking, even in challenging scenarios. Existing feature tracking methods for event cameras are either handcrafted or derived from first principles but require extensive parameter tuning, are sensitive to noise, and do not generalize to different scenarios due to unmodeled effects. To tackle these deficiencies, we introduce the first data-driven feature tracker for event cameras, which leverages low-latency events to track features detected in a grayscale frame. We achieve robust performance via a novel frame attention module, which shares information across feature tracks. By directly transferring zero-shot from synthetic to real data, our data-driven tracker outperforms existing approaches in relative feature age by up to 120% while also achieving the lowest latency. This performance gap is further increased to 130% by adapting our tracker to real data with a novel self-supervision strategy.

Autoregressive Visual Tracking
Wei, Xing and Bai, Yifan and Zheng, Yongchao and Shi, Dahu and Gong, Yihong



Research question: This paper aims to develop an autoregressive framework for visual object tracking.
Motivation: Existing template-matching trackers consider only per-frame localization accuracy and ignore the target's motion trajectory across consecutive frames.
Method: ARTrack models the sequential evolution of the target's trajectory in a time-autoregressive way, so the object is traced continuously across frames.
Results: ARTrack is simple and direct, requiring no customized localization heads or post-processing; despite its simplicity, it achieves state-of-the-art performance on prevailing benchmark datasets.

We present ARTrack, an autoregressive framework for visual object tracking. ARTrack tackles tracking as a coordinate sequence interpretation task that estimates object trajectories progressively, where the current estimate is induced by previous states and in turn affects subsequences. This time-autoregressive approach models the sequential evolution of trajectories to keep tracing the object across frames, making it superior to existing template matching based trackers that only consider the per-frame localization accuracy. ARTrack is simple and direct, eliminating customized localization heads and post-processings. Despite its simplicity, ARTrack achieves state-of-the-art performance on prevailing benchmark datasets.
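The time-autoregressive idea, where the current estimate is induced by previous states, can be sketched generically. The constant-velocity step below is a toy stand-in for ARTrack's learned coordinate-sequence decoder; all names are illustrative.

```python
def autoregressive_track(init_boxes, n_steps, step_fn):
    """Estimate a trajectory sequentially: each new state is predicted from
    all previous states. In ARTrack, step_fn would be a learned decoder over
    discretized coordinate tokens; here it is pluggable."""
    traj = list(init_boxes)
    for _ in range(n_steps):
        traj.append(step_fn(traj))  # current estimate induced by previous states
    return traj

def const_velocity(traj):
    """Toy step: constant-velocity extrapolation of the object centre (x, y)."""
    (x0, y0), (x1, y1) = traj[-2], traj[-1]
    return (2 * x1 - x0, 2 * y1 - y0)

traj = autoregressive_track([(0.0, 0.0), (1.0, 0.5)], n_steps=3,
                            step_fn=const_velocity)
```

The key contrast with template matching is visible even in this toy: each prediction feeds back into the history, so errors and dynamics propagate through the sequence instead of each frame being localized independently.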

ASPnet: Action Segmentation With Shared-Private Representation of Multiple Data Sources
van Amsterdam, Beatrice and Kadkhodamohammadi, Abdolrahim and Luengo, Imanol and Stoyanov, Danail



Research question: How to effectively fuse complementary information to improve the robustness and accuracy of action segmentation models.
Motivation: Most state-of-the-art action segmentation methods rely on a single input modality or naive fusion of multiple data sources, whereas effective information fusion can strengthen segmentation models, making them more robust to sensor noise and more accurate with smaller training datasets.
Method: A multimodal segmentation model is proposed that disentangles hidden features into modality-shared and private components, then uses an attention bottleneck to capture long-range temporal dependencies in the data while preserving the disentanglement in consecutive processing layers.
Results: Evaluation on the 50salads, Breakfast, and RARP45 datasets shows that the multimodal approach outperforms different data fusion baselines on both multi-view and multimodal data sources, obtaining competitive or better results than the state of the art. The model is also more robust to additive sensor noise and matches strong video baselines even with less training data.

Most state-of-the-art methods for action segmentation are based on single input modalities or naive fusion of multiple data sources. However, effective fusion of complementary information can potentially strengthen segmentation models and make them more robust to sensor noise and more accurate with smaller training datasets. In order to improve multimodal representation learning for action segmentation, we propose to disentangle hidden features of a multi-stream segmentation model into modality-shared components, containing common information across data sources, and private components; we then use an attention bottleneck to capture long-range temporal dependencies in the data while preserving disentanglement in consecutive processing layers. Evaluation on 50salads, Breakfast and RARP45 datasets shows that our multimodal approach outperforms different data fusion baselines on both multiview and multimodal data sources, obtaining competitive or better results compared with the state-of-the-art. Our model is also more robust to additive sensor noise and can achieve performance on par with strong video baselines even with less training data.

Skinned Motion Retargeting With Residual Perception of Motion Semantics & Geometry
Zhang, Jiaxu and Weng, Junwu and Kang, Di and Zhao, Fang and Huang, Shaoli and Zhe, Xuefei and Bao, Linchao and Shan, Ying and Wang, Jue and Tu, Zhigang



Research question: How to perform motion retargeting effectively while accounting for source-target differences at both the skeleton and shape geometry levels.
Motivation: Existing motion retargeting methods handle source-target differences at the skeleton and shape geometry levels poorly, leading to unsatisfactory retargeting results.
Method: A novel Residual RETargeting network (R2ET) is proposed that progressively adjusts the source motion to fit the target skeleton and shape via two neural modification modules. Specifically, a skeleton-aware module preserves the source motion semantics, and a shape-aware module perceives the geometry of target characters to reduce interpenetration and contact-missing. Driven by explored distance-based losses that explicitly model motion semantics and geometry, the two modules learn residual motion modifications on the source motion, generating plausible retargeted motion in a single inference without post-processing. To balance the two modifications, a balancing gate performs linear interpolation between them.
Results: Extensive experiments on the public Mixamo dataset show that R2ET achieves state-of-the-art performance, preserving motion semantics while effectively reducing interpenetration and contact-missing.

A good motion retargeting cannot be reached without reasonable consideration of source-target differences on both the skeleton and shape geometry levels. In this work, we propose a novel Residual RETargeting network (R2ET) structure, which relies on two neural modification modules, to adjust the source motions to fit the target skeletons and shapes progressively. In particular, a skeleton-aware module is introduced to preserve the source motion semantics. A shape-aware module is designed to perceive the geometries of target characters to reduce interpenetration and contact-missing. Driven by our explored distance-based losses that explicitly model the motion semantics and geometry, these two modules can learn residual motion modifications on the source motion to generate plausible retargeted motion in a single inference without post-processing. To balance these two modifications, we further present a balancing gate to conduct linear interpolation between them. Extensive experiments on the public dataset Mixamo demonstrate that our R2ET achieves the state-of-the-art performance, and provides a good balance between the preservation of motion semantics as well as the attenuation of interpenetration and contact-missing. Code is available at https://github.com/Kebii/R2ET.

Learning Situation Hyper-Graphs for Video Question Answering
Urooj, Aisha and Kuehne, Hilde and Wu, Bo and Chheu, Kim and Bousselham, Walid and Gan, Chuang and Lobo, Niels and Shah, Mubarak



Research question: Video question answering requires capturing the presence and evolution of actors, objects, and their relationships.
Motivation: Existing video question answering models cannot fully capture these relationships, so a situation-hyper-graph-based architecture is proposed.
Method: A situation hyper-graph decoder is trained to implicitly identify graph representations of actions and human/object relationships from the input video clip, and cross-attention between the predicted situation hyper-graphs and the question embedding is used to predict the correct answer.
Results: Extensive evaluation on two challenging benchmarks, AGQA and STAR, shows that learning the underlying situation hyper-graphs helps the system significantly improve its performance on video question answering.

Answering questions about complex situations in videos requires not only capturing of the presence of actors, objects, and their relations, but also the evolution of these relationships over time. A situation hyper-graph is a representation that describes situations as scene sub-graphs for video frames and hyper-edges for connected sub-graphs, and has been proposed to capture all such information in a compact structured form. In this work, we propose an architecture for Video Question Answering (VQA) that enables answering questions related to video content by predicting situation hyper-graphs, coined Situation Hyper-Graph based Video Question Answering (SHG-VQA). To this end, we train a situation hyper-graph decoder to implicitly identify graph representations with actions and object/human-object relationships from the input video clip and to use cross-attention between the predicted situation hyper-graphs and the question embedding to predict the correct answer. The proposed method is trained in an end-to-end manner and optimized by a cross-entropy based VQA loss function and a Hungarian matching loss for the situation graph prediction. The effectiveness of the proposed architecture is extensively evaluated on two challenging benchmarks: AGQA and STAR. Our results show that learning the underlying situation hyper-graphs helps the system to significantly improve its performance for novel challenges of video question answering task.

Leveraging Temporal Context in Low Representational Power Regimes
Fosco, Camilo L. and Jin, SouYoung and Josephs, Emilie and Oliva, Aude



Research question: How to improve the performance of low-parameter models, such as those running on edge devices.
Motivation: Computer vision models are excellent at identifying and exploiting regularities in the world, but learning these regularities from scratch is computationally costly, which is a challenge for low-parameter models.
Method: An Event Transition Matrix (ETM) is computed from action labels in an untrimmed video dataset, capturing the temporal context of a given action (i.e., the likelihood that it is preceded or followed by each other action in the set); information from the ETM is included during training to improve action recognition and anticipation.
Results: Experiments show that including information from the ETM significantly improves action recognition and anticipation performance on various egocentric video datasets, and the benefit of this explicit representation of temporal context is most pronounced for smaller models.

Computer vision models are excellent at identifying and exploiting regularities in the world. However, it is computationally costly to learn these regularities from scratch. This presents a challenge for low-parameter models, like those running on edge devices (e.g. smartphones). Can the performance of models with low representational power be improved by supplementing training with additional information about these statistical regularities? We explore this in the domains of action recognition and action anticipation, leveraging the fact that actions are typically embedded in stereotypical sequences. We introduce the Event Transition Matrix (ETM), computed from action labels in an untrimmed video dataset, which captures the temporal context of a given action, operationalized as the likelihood that it was preceded or followed by each other action in the set. We show that including information from the ETM during training improves action recognition and anticipation performance on various egocentric video datasets. Through ablation and control studies, we show that the coherent sequence of information captured by our ETM is key to this effect, and we find that the benefit of this explicit representation of temporal context is most pronounced for smaller models. Code, matrices and models are available in our project page: https://camilofosco.com/etm_website.
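An Event Transition Matrix of the kind described can be estimated directly from per-video action-label sequences. Below is a minimal sketch with row-normalized "followed-by" probabilities; the exact normalization and any smoothing used in the paper are assumptions.

```python
import numpy as np

def event_transition_matrix(sequences, n_actions):
    """Estimate P(next action = j | current action = i) from ordered
    action-label sequences taken from untrimmed videos."""
    counts = np.zeros((n_actions, n_actions))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):  # consecutive action pairs
            counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Row-normalize; rows for never-observed actions stay all-zero.
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)

# Toy labels: 0 = wash, 1 = cut, 2 = cook (hypothetical action vocabulary).
seqs = [[0, 1, 2], [0, 1, 1, 2], [1, 2]]
etm = event_transition_matrix(seqs, 3)
```

At training time, such a matrix can supply a prior over plausible next actions (e.g., "wash" is always followed by "cut" in this toy data), which is the kind of statistical regularity a small model need not learn from scratch.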

Audio-Visual Grouping Network for Sound Localization From Mixtures
Mo, Shentong and Tian, Yapeng



Research question: This paper tackles sound source localization, i.e., predicting the location of sound sources in a video.
Motivation: Existing single-source methods mainly use audio-visual association as the cue to localize sounding objects in each frame, but their ability to handle mixtures of multiple sources is limited.
Method: A novel audio-visual grouping network (AVGN) is proposed that directly learns category-wise semantic features for each source from the input audio mixture and video frame, localizing multiple sources simultaneously.
Results: Experiments show that, compared with existing multi-source methods, the AVGN framework can localize a flexible number of sources and disentangle category-aware audio-visual representations for individual sound sources, achieving state-of-the-art sounding-object localization on the MUSIC, VGGSound-Instruments, and VGG-Sound Sources benchmarks.

Sound source localization is a typical and challenging task that predicts the location of sound sources in a video. Previous single-source methods mainly used the audio-visual association as clues to localize sounding objects in each frame. Due to the mixed property of multiple sound sources in the original space, there exist rare multi-source approaches to localizing multiple sources simultaneously, except for one recent work using a contrastive random walk in the graph with images and separated sound as nodes. Despite their promising performance, they can only handle a fixed number of sources, and they cannot learn compact class-aware representations for individual sources. To alleviate this shortcoming, in this paper, we propose a novel audio-visual grouping network, namely AVGN, that can directly learn category-wise semantic features for each source from the input audio mixture and frame to localize multiple sources simultaneously. Specifically, our AVGN leverages learnable audio-visual class tokens to aggregate class-aware source features. Then, the aggregated semantic features for each source can be used as guidance to localize the corresponding visual regions. Compared to existing multi-source methods, our new framework can localize a flexible number of sources and disentangle category-aware audio-visual representations for individual sound sources. We conduct extensive experiments on MUSIC, VGGSound-Instruments, and VGG-Sound Sources benchmarks. The results demonstrate that the proposed AVGN can achieve state-of-the-art sounding object localization performance on both single-source and multi-source scenarios.

Tracking Multiple Deformable Objects in Egocentric Videos
Huang, Mingzhen and Li, Xiaoxing and Hu, Jun and Peng, Honghong and Lyu, Siwei



Research question: Existing multiple object tracking methods struggle with highly deformable targets, especially in egocentric videos.
Motivation: To address these issues, we propose DETracker, a new multiple-object tracking method that effectively detects and tracks deformable targets in egocentric videos.
Method: DETracker uses three novel modules, the motion disentanglement network (MDN), the patch association network (PAN), and the patch memory network (PMN), to explicitly tackle the difficulties caused by severe ego motion and fast-morphing target objects.
Results: DETracker is end-to-end trainable and runs at near real-time speed. A large-scale deformable multi-object tracking dataset, DogThruGlasses, with 150 videos and 73K annotated frames, was collected with smart glasses. Experiments show that DETracker outperforms existing state-of-the-art methods on both the DogThruGlasses and YouTube-Hand datasets.

Most existing multiple object tracking (MOT) methods that solely rely on appearance features struggle to track highly deformable objects. Other MOT methods that use motion cues to associate identities across frames have difficulty handling egocentric videos effectively or efficiently. In this work, we propose DETracker, a new MOT method that jointly detects and tracks deformable objects in egocentric videos. DETracker uses three novel modules, namely the motion disentanglement network (MDN), the patch association network (PAN) and the patch memory network (PMN), to explicitly tackle the difficulties caused by severe ego motion and fast-morphing target objects. DETracker is end-to-end trainable and achieves near real-time speed. We also present DogThruGlasses, a large-scale deformable multi-object tracking dataset, with 150 videos and 73K annotated frames, collected by smart glasses. DETracker outperforms existing state-of-the-art methods on the DogThruGlasses dataset and the YouTube-Hand dataset.

Open Set Action Recognition via Multi-Label Evidential Learning
Zhao, Chen and Du, Dawei and Hoogs, Anthony and Funk, Christopher



Research question: Existing open-set action recognition methods focus on novelty detection under the assumption that a video clip shows a single action, which is unrealistic in the real world.
Motivation: We propose a new method for open-set action recognition and novelty detection via MUlti-Label Evidential learning (MULE), which goes beyond previous novel-action detection methods by addressing the more general setting of single or multiple actors in the same scene, with simultaneous actions by any actor.
Method: Our method uses a Beta Evidential Neural Network that estimates multi-action uncertainty with Beta densities based on actor-context-object relation representations. An evidence debiasing constraint is added to the objective function to reduce the static bias of video representations, which can incorrectly correlate predictions with static cues. We further develop a primal-dual average scheme update-based learning algorithm to optimize the proposed problem, provide the corresponding theoretical analysis, and formulate uncertainty- and belief-based novelty estimation mechanisms to detect novel actions.
Results: Extensive experiments on two real-world video datasets show that our method achieves promising performance in single/multi-actor and single/multi-action settings. Our code and models are available at https://github.com/charliezhaoyinpeng/mule.

Existing methods for open set action recognition focus on novelty detection that assumes video clips show a single action, which is unrealistic in the real world. We propose a new method for open set action recognition and novelty detection via MUlti-Label Evidential learning (MULE), that goes beyond previous novel action detection methods by addressing the more general problems of single or multiple actors in the same scene, with simultaneous action(s) by any actor. Our Beta Evidential Neural Network estimates multi-action uncertainty with Beta densities based on actor-context-object relation representations. An evidence debiasing constraint is added to the objective function for optimization to reduce the static bias of video representations, which can incorrectly correlate predictions and static cues. We develop a primal-dual average scheme update-based learning algorithm to optimize the proposed problem and provide corresponding theoretical analysis. In addition, uncertainty- and belief-based novelty estimation mechanisms are formulated to detect novel actions. Extensive experiments on two real-world video datasets show that our proposed approach achieves promising performance in single/multi-actor, single/multi-action settings. Our code and models are released at https://github.com/charliezhaoyinpeng/mule.

Rethinking Optical Flow From Geometric Matching Consistent Perspective
Dong, Qiaole and Cao, Chenjie and Fu, Yanwei



Research question: How to improve the accuracy and robustness of optical flow estimation.
Motivation: Current deep models are usually trained from scratch, which limits their ability to robustly and geometrically match image features.
Method: A novel optical flow estimation method, MatchFlow, is proposed that uses Geometric Image Matching (GIM) as a pre-training task, learning fundamental feature correlations of objects and scenes from large-scale real-world data.
Results: Experiments show strong cross-dataset generalization and significant performance gains on the Sintel and KITTI datasets.

Optical flow estimation is a challenging problem that remains unsolved. Recent deep-learning-based optical flow models have achieved considerable success. However, these models are often trained from scratch on standard optical flow data, which restricts their ability to robustly and geometrically match image features. In this paper, we rethink previous approaches to optical flow estimation. In particular, we leverage Geometric Image Matching (GIM) as a pre-training task for optical flow estimation (MatchFlow) to obtain better feature representations, as GIM shares common challenges with optical flow estimation and comes with massive labeled real-world data. Thus, matching static scenes helps to learn more fundamental feature correlations of objects and scenes with consistent displacements. Specifically, the proposed MatchFlow model employs a QuadTree attention-based network pre-trained on MegaDepth to extract coarse features for further flow regression. Extensive experiments show that our model has great cross-dataset generalization. Our method achieves 11.5% and 10.1% error reduction from GMA on the Sintel clean pass and the KITTI test set. At the time of anonymous submission, our MatchFlow(G) enjoys state-of-the-art performance on the Sintel clean and final passes compared to published approaches with comparable computation and memory footprint. Codes and models will be released at https://github.com/DQiaole/MatchFlow.

Learning Optical Expansion From Scale Matching
Ling, Han and Sun, Yinghui and Sun, Quansen and Ren, Zhenwen



Research question: This paper addresses optical expansion (OE), which describes the object scale change between two frames in monocular 3D vision tasks.
Motivation: Existing methods estimate optical expansion mainly from optical flow results, but this two-stage architecture leaves the results limited by the accuracy of the optical flow and less robust.
Method: We propose the concept of 3D optical flow by integrating optical expansion into the 2D optical flow, implemented by a plug-and-play module, TPCV. TPCV matches features at the correct location and scale, allowing simultaneous optimization of the optical flow and optical expansion tasks.
Results: Experiments show that applying TPCV to the RAFT optical flow baseline substantially improves the baseline flow performance. Furthermore, applying the optical flow and optical expansion results to various dynamic 3D vision tasks, including motion-in-depth, time-to-collision, and scene flow, often yields significant improvement over the prior best methods.

This paper addresses the problem of optical expansion (OE). OE describes the object scale change between two frames and is widely used in monocular 3D vision tasks. Previous methods estimate optical expansion mainly from optical flow results, but this two-stage architecture leaves their results limited by the accuracy of optical flow and less robust. To solve these problems, we propose the concept of 3D optical flow by integrating optical expansion into the 2D optical flow, implemented by a plug-and-play module, namely TPCV. TPCV matches features at the correct location and scale, thus allowing the simultaneous optimization of optical flow and optical expansion tasks. Experimentally, we apply TPCV to the RAFT optical flow baseline, and the results show that the baseline optical flow performance is substantially improved. Moreover, we apply the optical flow and optical expansion results to various dynamic 3D vision tasks, including motion-in-depth, time-to-collision, and scene flow, often achieving significant improvement over the prior SOTA. Code will be available at https://github.com/HanLingsgjk/TPCV.

Context De-Confounded Emotion Recognition
Yang, Dingkang and Chen, Zhaoyu and Wang, Yuzheng and Wang, Shunli and Li, Mingcheng and Liu, Siao and Zhao, Xiao and Huang, Shuai and Dong, Zhiyan and Zhai, Peng and Zhang, Lihua



Research question: This paper addresses the context bias in existing emotion recognition datasets, which leads to a severely unbalanced distribution of emotional states across context scenarios and hurts model performance.
Motivation: Current emotion recognition methods focus on extracting meaningful representations from subjects and contexts but overlook the context bias in datasets. This bias misleads models into learning spurious correlations via conventional likelihood estimation, limiting their performance.
Method: From a causal perspective, a tailored causal graph is built to formulate the causalities among the variables in the CAER task. A Contextual Causal Intervention Module (CCIM) based on the backdoor adjustment is then proposed to de-confound the confounder and exploit the true causal effect for model training.
Results: Extensive experiments on three benchmark datasets demonstrate the effectiveness of our CCIM and its causal insight, which significantly improves various state-of-the-art methods.

Context-Aware Emotion Recognition (CAER) is a crucial and challenging task that aims to perceive the emotional states of the target person with contextual information. Recent approaches invariably focus on designing sophisticated architectures or mechanisms to extract seemingly meaningful representations from subjects and contexts. However, a long-overlooked issue is that a context bias in existing datasets leads to a significantly unbalanced distribution of emotional states among different context scenarios. Concretely, the harmful bias is a confounder that misleads existing models to learn spurious correlations based on conventional likelihood estimation, significantly limiting the models' performance. To tackle the issue, this paper provides a causality-based perspective to disentangle the models from the impact of such bias, and formulate the causalities among variables in the CAER task via a tailored causal graph. Then, we propose a Contextual Causal Intervention Module (CCIM) based on the backdoor adjustment to de-confound the confounder and exploit the true causal effect for model training. CCIM is plug-in and model-agnostic, which improves diverse state-of-the-art approaches by considerable margins. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our CCIM and the significance of causal insight.

Breaking the ''Object'' in Video Object Segmentation
Tokmakov, Pavel and Li, Jie and Gaidon, Adrien



Research question: Existing video object segmentation (VOS) benchmarks largely ignore object transformations.
Motivation: As objects transform, their color, shape, and texture can change dramatically, yet existing VOS methods rely mainly on static appearance cues and handle such complex scenarios poorly.
Method: A new video object segmentation dataset, Video Object Segmentation under Transformations (VOST), is collected, containing more than 700 densely annotated high-resolution video clips. Based on a series of careful evaluations and experiments, several modifications to existing VOS methods are proposed.
Results: Experiments show that existing VOS methods perform poorly on object transformations, while the proposed modifications better capture spatio-temporal information and improve VOS performance.

The appearance of an object can be fleeting when it transforms. As eggs are broken or paper is torn, their color, shape, and texture can change dramatically, preserving virtually nothing of the original except for the identity itself. Yet, this important phenomenon is largely absent from existing video object segmentation (VOS) benchmarks. In this work, we close the gap by collecting a new dataset for Video Object Segmentation under Transformations (VOST). It consists of more than 700 high-resolution videos, captured in diverse environments, which are 20 seconds long on average and densely labeled with instance masks. A careful, multi-step approach is adopted to ensure that these videos focus on complex object transformations, capturing their full temporal extent. We then extensively evaluate state-of-the-art VOS methods and make a number of important discoveries. In particular, we show that existing methods struggle when applied to this novel task and that their main limitation lies in over-reliance on static, appearance cues. This motivates us to propose a few modifications for the top-performing baseline that improve its performance by better capturing spatio-temporal information. But more broadly, the hope is to stimulate discussion on learning more robust video object representations.

CIGAR: Cross-Modality Graph Reasoning for Domain Adaptive Object Detection
Liu, Yabo and Wang, Jinghua and Huang, Chao and Wang, Yaowei and Xu, Yong



Research question: This paper addresses unsupervised domain adaptive object detection (UDA-OD), i.e., generalizing knowledge from a labeled source domain to an unlabeled target domain.
Motivation: Although existing graph-based methods for UDA-OD perform well, they cannot learn a proper node set for the graph, and they build the graph solely on visual features without considering the linguistic knowledge carried by semantic prototypes.
Method: To overcome these problems, we propose a cross-modality graph reasoning adaptation (CIGAR) method that exploits both visual and linguistic knowledge. Specifically, our method performs cross-modality graph reasoning between the linguistic-modality graph and visual-modality graphs to enhance their representations. We also propose a discriminative feature selector that finds the most discriminative features and takes them as the nodes of the visual graph for both efficiency and effectiveness. In addition, a linguistic graph matching loss regulates the update of the linguistic graph and maintains its semantic representation during training.
Results: Comprehensive experiments validate the effectiveness of our proposed CIGAR.

Unsupervised domain adaptive object detection (UDA-OD) aims to learn a detector by generalizing knowledge from a labeled source domain to an unlabeled target domain. Though the existing graph-based methods for UDA-OD perform well in some cases, they cannot learn a proper node set for the graph. In addition, these methods build the graph solely based on the visual features and do not consider the linguistic knowledge carried by the semantic prototypes, e.g., dataset labels. To overcome these problems, we propose a cross-modality graph reasoning adaptation (CIGAR) method to take advantage of both visual and linguistic knowledge. Specifically, our method performs cross-modality graph reasoning between the linguistic modality graph and visual modality graphs to enhance their representations. We also propose a discriminative feature selector to find the most discriminative features and take them as the nodes of the visual graph for both efficiency and effectiveness. In addition, we employ the linguistic graph matching loss to regulate the update of linguistic graphs and maintain their semantic representation during the training process. Comprehensive experiments validate the effectiveness of our proposed CIGAR.

TriDet: Temporal Action Detection With Relative Boundary Modeling
Shi, Dingfeng and Zhong, Yujie and Cao, Qiong and Ma, Lin and Li, Jia and Tao, Dacheng



Research question: This paper addresses imprecise boundary predictions caused by ambiguous action boundaries in videos.
Motivation: Owing to this ambiguity, existing methods often predict inaccurate action boundaries.
Method: A one-stage framework, TriDet, is proposed for temporal action detection. A novel Trident-head models the action boundary via an estimated relative probability distribution around the boundary, and a Scalable-Granularity Perception (SGP) layer in TriDet's feature pyramid aggregates information across different temporal granularities, more efficiently than recent transformer-based feature pyramids.
Results: Benefiting from the Trident-head and the SGP-based feature pyramid, TriDet achieves state-of-the-art performance on three challenging benchmarks (THUMOS14, HACS, and EPIC-KITCHEN 100) with lower computational cost. For example, TriDet reaches an average mAP of 69.3% on THUMOS14, outperforming the previous best by 2.5% with only 74.6% of its latency.

In this paper, we present a one-stage framework TriDet for temporal action detection. Existing methods often suffer from imprecise boundary predictions due to the ambiguous action boundaries in videos. To alleviate this problem, we propose a novel Trident-head to model the action boundary via an estimated relative probability distribution around the boundary. In the feature pyramid of TriDet, we propose a Scalable-Granularity Perception (SGP) layer to aggregate information across different temporal granularities, which is much more efficient than the recent transformer-based feature pyramid. Benefiting from the Trident-head and the SGP-based feature pyramid, TriDet achieves state-of-the-art performance on three challenging benchmarks: THUMOS14, HACS and EPIC-KITCHEN 100, with lower computational costs, compared to previous methods. For example, TriDet hits an average mAP of 69.3% on THUMOS14, outperforming the previous best by 2.5%, but with only 74.6% of its latency.
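The Trident-head's boundary estimate can be illustrated as the expectation of a discrete probability distribution over relative offsets around an instant. A simplified, hypothetical sketch (in the real model these logits are predicted per instant by the network; the values here are made up):

```python
import math

def expected_boundary(center, bin_logits, stride=1.0):
    """Estimate a boundary as the expectation of a softmax distribution
    over discrete relative offsets from an instant (a simplified,
    hypothetical version of the Trident-head's relative modeling)."""
    m = max(bin_logits)
    exps = [math.exp(l - m) for l in bin_logits]
    z = sum(exps)
    offset = sum(i * e / z for i, e in enumerate(exps))  # expected bin index
    return center + stride * offset

# Hypothetical logits over 4 offset bins, with mass concentrated on bin 2:
start = expected_boundary(10.0, [0.0, 0.0, 4.0, 0.0])  # ≈ 11.97
```

Because the estimate is an expectation rather than a hard bin choice, it degrades gracefully when the boundary is ambiguous and the distribution spreads over neighboring bins.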

Joint Appearance and Motion Learning for Efficient Rolling Shutter Correction
Fan, Bin and Mao, Yuxin and Dai, Yuchao and Wan, Zhexiong and Liu, Qi



Research question: How to improve the efficiency and performance of rolling shutter correction (RSC).
Motivation: Existing RSC methods typically employ a two-stage network structure that ignores intrinsic information interactions and is slow at inference.
Method: A single-stage encoder-decoder network, JAMNet, is proposed for efficient RSC. It first extracts pyramid features from consecutive RS inputs, then simultaneously refines two complementary pieces of information (the global shutter appearance and the undistortion motion field) in a joint-learning decoder to achieve mutual promotion.
Results: The method surpasses state-of-the-art methods by a large margin on various benchmarks, notably with a 4.7 dB PSNR gain on real-world RSC.

Rolling shutter correction (RSC) is becoming increasingly popular for RS cameras that are widely used in commercial and industrial applications. Despite the promising performance, existing RSC methods typically employ a two-stage network structure that ignores intrinsic information interactions and hinders fast inference. In this paper, we propose a single-stage encoder-decoder-based network, named JAMNet, for efficient RSC. It first extracts pyramid features from consecutive RS inputs, and then simultaneously refines the two complementary information (i.e., global shutter appearance and undistortion motion field) to achieve mutual promotion in a joint learning decoder. To inject sufficient motion cues for guiding joint learning, we introduce a transformer-based motion embedding module and propose to pass hidden states across pyramid levels. Moreover, we present a new data augmentation strategy "vertical flip + inverse order" to release the potential of the RSC datasets. Experiments on various benchmarks show that our approach surpasses the state-of-the-art methods by a large margin, especially with a 4.7 dB PSNR leap on real-world RSC. Code is available at https://github.com/GitCVfb/JAMNet.

Selective Structured State-Spaces for Long-Form Video Understanding
Wang, Jue and Zhu, Wentao and Wang, Pichao and Yu, Xiang and Liu, Linda and Omar, Mohamed and Hamid, Raffay



Research question: How to effectively model complex spatiotemporal dependencies in long-form videos.
Motivation: Although the recently proposed Structured State-Space Sequence (S4) model handles this problem with linear complexity, treating all image tokens equally can hurt its efficiency and accuracy.
Method: We propose a novel Selective S4 (i.e., S5) model that uses a lightweight mask generator to adaptively select informative image tokens, enabling more efficient and accurate modeling of long-term spatiotemporal dependencies in videos.
Results: Extensive comparisons on three challenging long-form video understanding datasets (LVU, COIN, and Breakfast) show that our model outperforms the previous state-of-the-art S4 model by up to 9.6% accuracy while reducing its memory footprint by 23%.

Effective modeling of complex spatiotemporal dependencies in long-form videos remains an open problem. The recently proposed Structured State-Space Sequence (S4) model with its linear complexity offers a promising direction in this space. However, we demonstrate that treating all image-tokens equally as done by S4 model can adversely affect its efficiency and accuracy. To address this limitation, we present a novel Selective S4 (i.e., S5) model that employs a lightweight mask generator to adaptively select informative image tokens resulting in more efficient and accurate modeling of long-term spatiotemporal dependencies in videos. Unlike previous mask-based token reduction methods used in transformers, our S5 model avoids the dense self-attention calculation by making use of the guidance of the momentum-updated S4 model. This enables our model to efficiently discard less informative tokens and adapt to various long-form video understanding tasks more effectively. However, as is the case for most token reduction methods, the informative image tokens could be dropped incorrectly. To improve the robustness and the temporal horizon of our model, we propose a novel long-short masked contrastive learning (LSMCL) approach that enables our model to predict longer temporal context using shorter input videos. We present extensive comparative results using three challenging long-form video understanding datasets (LVU, COIN and Breakfast), demonstrating that our approach consistently outperforms the previous state-of-the-art S4 model by up to 9.6% accuracy while reducing its memory footprint by 23%.
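The token-selection idea can be illustrated as keeping only the top-scoring fraction of image tokens so that subsequent layers process a shorter sequence. A minimal stand-in for S5's learned mask generator (tokens and informativeness scores below are hypothetical):

```python
def select_tokens(tokens, scores, keep_ratio=0.5):
    """Keep only the highest-scoring fraction of image tokens, a stand-in
    for S5's learned mask generator; later layers then process a shorter
    sequence."""
    k = max(1, int(len(tokens) * keep_ratio))
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:k])  # keep original spatial order
    return [tokens[i] for i in keep]

# Hypothetical tokens and per-token informativeness scores.
toks = ["t0", "t1", "t2", "t3", "t4", "t5"]
kept = select_tokens(toks, [0.1, 0.9, 0.3, 0.8, 0.2, 0.7])  # → t1, t3, t5
```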

Motion Information Propagation for Neural Video Compression
Qi, Linfeng and Li, Jiahao and Li, Bin and Li, Houqiang and Lu, Yan



Research question: This paper addresses the uni-directional information flow in existing neural video codecs, where motion coding only provides motion vectors for frame coding.
Motivation: We argue that, through information interactions, synergy between motion coding and frame coding can be achieved.
Method: Bi-directional information interaction between motion coding and frame coding is introduced via Motion Information Propagation. When generating the temporal contexts for frame coding, the high-dimensional motion feature from the motion decoder serves as motion guidance to reduce alignment errors. Meanwhile, besides assisting frame coding at the current time step, the generated context feature is propagated as a motion condition when coding the subsequent motion latent. Through this cycle of interactions, feature propagation for motion coding is established, strengthening the capacity to exploit long-range temporal correlation. In addition, hybrid context generation is proposed to exploit multi-scale context features and provide a better motion condition.
Results: Experiments show that the method achieves 12.9% bit-rate savings over the previous state-of-the-art neural video codec.

In most existing neural video codecs, the information flow therein is uni-directional, where only motion coding provides motion vectors for frame coding. In this paper, we argue that, through information interactions, the synergy between motion coding and frame coding can be achieved. We effectively introduce bi-directional information interactions between motion coding and frame coding via our Motion Information Propagation. When generating the temporal contexts for frame coding, the high-dimension motion feature from the motion decoder serves as motion guidance to mitigate the alignment errors. Meanwhile, besides assisting frame coding at the current time step, the feature from context generation will be propagated as motion condition when coding the subsequent motion latent. Through the cycle of such interactions, feature propagation on motion coding is built, strengthening the capacity of exploiting long-range temporal correlation. In addition, we propose hybrid context generation to exploit the multi-scale context features and provide better motion condition. Experiments show that our method can achieve 12.9% bit rate saving over the previous SOTA neural video codec.

topic-2

Topic words :  view,  pose,  neural,  scene,  reconstruction,  estimation,  images,  depth

GFPose: Learning 3D Human Pose Prior With Gradient Fields
Ci, Hai and Wu, Mingdong and Zhu, Wentao and Ma, Xiaoxuan and Dong, Hao and Zhong, Fangwei and Wang, Yizhou



Research question: How to effectively learn a 3D human pose prior for human-centered AI.
Motivation: Learning a 3D human pose prior is essential for various applications, such as pose estimation, completion, and generation.
Method: A versatile framework, GFPose, is proposed. At its core is a time-dependent score network that estimates the gradient on each body joint and progressively denoises the perturbed 3D human pose to match a given task specification. During denoising, GFPose implicitly incorporates pose priors in the gradients and unifies various discriminative and generative tasks in an elegant framework.
Results: Experiments show that 1) as a multi-hypothesis pose estimator, GFPose outperforms the existing state of the art by 20% on the Human3.6M dataset; 2) as a single-hypothesis pose estimator, GFPose achieves results comparable to deterministic state-of-the-art methods, even with a vanilla backbone; 3) GFPose produces diverse and realistic samples in pose denoising, completion, and generation tasks.

Learning 3D human pose prior is essential to human-centered AI. Here, we present GFPose, a versatile framework to model plausible 3D human poses for various applications. At the core of GFPose is a time-dependent score network, which estimates the gradient on each body joint and progressively denoises the perturbed 3D human pose to match a given task specification. During the denoising process, GFPose implicitly incorporates pose priors in gradients and unifies various discriminative and generative tasks in an elegant framework. Despite the simplicity, GFPose demonstrates great potential in several downstream tasks. Our experiments empirically show that 1) as a multi-hypothesis pose estimator, GFPose outperforms existing SOTAs by 20% on Human3.6M dataset. 2) as a single-hypothesis pose estimator, GFPose achieves comparable results to deterministic SOTAs, even with a vanilla backbone. 3) GFPose is able to produce diverse and realistic samples in pose denoising, completion and generation tasks.

CCuantuMM: Cycle-Consistent Quantum-Hybrid Matching of Multiple Shapes
Bhatia, Harshil and Tretschk, Edith and Lähner, Zorah and Benkner, Marcel Seelbach and Moeller, Michael and Theobalt, Christian and Golyanik, Vladislav



Research question: How to effectively match multiple non-rigidly deformed 3D shapes while guaranteeing cycle consistency of the matches.
Motivation: Existing quantum shape-matching methods do not support multi-shape matching, let alone guarantee cycle consistency.
Method: This paper proposes the first quantum-hybrid method for cycle-consistent multi-shape matching. The N-shape problem is reduced to a sequence of three-shape matchings, lowering its complexity, and quantum annealing retrieves high-quality, low-energy solutions for the intermediate NP-hard objectives.
Results: On benchmark datasets, the method significantly outperforms multi-shape extensions of a previous quantum-hybrid two-shape matching method and is on par with classical multi-matching methods.

Jointly matching multiple, non-rigidly deformed 3D shapes is a challenging, NP-hard problem. A perfect matching is necessarily cycle-consistent: Following the pairwise point correspondences along several shapes must end up at the starting vertex of the original shape. Unfortunately, existing quantum shape-matching methods do not support multiple shapes and even less cycle consistency. This paper addresses the open challenges and introduces the first quantum-hybrid approach for 3D shape multi-matching; in addition, it is also cycle-consistent. Its iterative formulation is admissible to modern adiabatic quantum hardware and scales linearly with the total number of input shapes. Both these characteristics are achieved by reducing the N-shape case to a sequence of three-shape matchings, the derivation of which is our main technical contribution. Thanks to quantum annealing, high-quality solutions with low energy are retrieved for the intermediate NP-hard objectives. On benchmark datasets, the proposed approach significantly outperforms extensions to multi-shape matching of a previous quantum-hybrid two-shape matching method and is on-par with classical multi-matching methods. Our source code is available at 4dqv.mpi-inf.mpg.de/CCuantuMM/
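Cycle consistency, as defined above, means that composing the pairwise point correspondences around a loop of shapes returns every vertex to its starting point. A minimal sketch with hypothetical point maps among three 4-vertex shapes:

```python
def compose(p, q):
    """Composition of point maps: (q after p)[i] = q[p[i]]."""
    return [q[i] for i in p]

def is_cycle_consistent(maps):
    """Follow the pairwise correspondences around a cycle of shapes and
    check that every vertex ends up back at itself (the property the
    method guarantees by construction)."""
    n = len(maps[0])
    out = list(range(n))
    for m in maps:
        out = compose(out, m)
    return out == list(range(n))

# Hypothetical correspondences among three 4-vertex shapes A, B, C.
a2b = [1, 0, 3, 2]          # vertex i of A matches vertex a2b[i] of B
b2c = [2, 3, 0, 1]
c2a = [3, 2, 1, 0]          # chosen so that the A -> B -> C -> A loop closes
consistent = is_cycle_consistent([a2b, b2c, c2a])  # → True
```

Replacing `c2a` with any other map would break the loop, which is exactly the failure mode of independent pairwise matching that the paper avoids.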

Painting 3D Nature in 2D: View Synthesis of Natural Scenes From a Single Semantic Mask
Zhang, Shangzhan and Peng, Sida and Chen, Tianrun and Mou, Linzhan and Lin, Haotong and Yu, Kaicheng and Liao, Yiyi and Zhou, Xiaowei



Research question: How to synthesize multi-view consistent color images of natural scenes from a single semantic mask.
Motivation: Existing 3D-aware image synthesis methods require multi-view supervision or category-level priors for specific object classes, which are inapplicable to natural scenes.
Method: A semantic field is used as the intermediate representation; it is reconstructed from the input semantic mask and then translated to a radiance field with the assistance of off-the-shelf semantic image synthesis models.
Results: Experiments show that the method outperforms baseline methods and produces photorealistic, multi-view consistent videos of a variety of natural scenes.

We introduce a novel approach that takes a single semantic mask as input to synthesize multi-view consistent color images of natural scenes, trained with a collection of single images from the Internet. Prior works on 3D-aware image synthesis either require multi-view supervision or learning category-level prior for specific classes of objects, which are inapplicable to natural scenes. Our key idea to solve this challenge is to use a semantic field as the intermediate representation, which is easier to reconstruct from an input semantic mask and then translated to a radiance field with the assistance of off-the-shelf semantic image synthesis models. Experiments show that our method outperforms baseline methods and produces photorealistic and multi-view consistent videos of a variety of natural scenes. The project website is https://zju3dv.github.io/paintingnature/.

Shape, Pose, and Appearance From a Single Image via Bootstrapped Radiance Field Inversion
Pavllo, Dario and Tan, David Joseph and Rakotosaona, Marie-Julie and Tombari, Federico



Research question: How to effectively reconstruct 3D from a single view, particularly when accurate ground-truth poses are unavailable.
Motivation: Although Neural Radiance Fields (NeRF) and generative adversarial networks (GANs) excel on synthetic datasets, prior work has overlooked pose estimation, which is important for downstream applications such as augmented reality (AR) and robotics.
Method: We propose an end-to-end reconstruction framework for natural images that recovers an SDF-parameterized 3D shape, pose, and appearance from a single image of an object, without exploiting multiple views during training.
Results: Our framework can de-render an image in as few as 10 steps, enabling its use in practical scenarios, and demonstrates state-of-the-art results on a variety of real and synthetic benchmarks.

Neural Radiance Fields (NeRF) coupled with GANs represent a promising direction in the area of 3D reconstruction from a single view, owing to their ability to efficiently model arbitrary topologies. Recent work in this area, however, has mostly focused on synthetic datasets where exact ground-truth poses are known, and has overlooked pose estimation, which is important for certain downstream applications such as augmented reality (AR) and robotics. We introduce a principled end-to-end reconstruction framework for natural images, where accurate ground-truth poses are not available. Our approach recovers an SDF-parameterized 3D shape, pose, and appearance from a single image of an object, without exploiting multiple views during training. More specifically, we leverage an unconditional 3D-aware generator, to which we apply a hybrid inversion scheme where a model produces a first guess of the solution which is then refined via optimization. Our framework can de-render an image in as few as 10 steps, enabling its use in practical scenarios. We demonstrate state-of-the-art results on a variety of real and synthetic benchmarks.

NoPe-NeRF: Optimising Neural Radiance Field With No Pose Prior
Bian, Wenjing and Wang, Zirui and Li, Kejie and Bian, Jia-Wang and Prisacariu, Victor Adrian



Research question: Training a Neural Radiance Field (NeRF) without pre-computed camera poses.
Motivation: Although recent work shows that a NeRF and camera poses can be jointly optimized in forward-facing scenes, such methods still struggle under dramatic camera motion.
Method: Undistorted monocular depth priors are incorporated to tackle this difficulty. These priors are generated by correcting scale and shift parameters during training and are then used to constrain the relative poses between consecutive frames.
Results: Experiments on real-world indoor and outdoor scenes show that the method handles challenging camera trajectories and outperforms existing methods in novel view rendering quality and pose estimation accuracy.

Training a Neural Radiance Field (NeRF) without pre-computed camera poses is challenging. Recent advances in this direction demonstrate the possibility of jointly optimising a NeRF and camera poses in forward-facing scenes. However, these methods still face difficulties during dramatic camera movement. We tackle this challenging problem by incorporating undistorted monocular depth priors. These priors are generated by correcting scale and shift parameters during training, with which we are then able to constrain the relative poses between consecutive frames. This constraint is achieved using our proposed novel loss functions. Experiments on real-world indoor and outdoor scenes show that our method can handle challenging camera trajectories and outperforms existing methods in terms of novel view rendering quality and pose estimation accuracy. Our project page is https://nope-nerf.active.vision.

3D Shape Reconstruction of Semi-Transparent Worms
Ilett, Thomas P. and Yuval, Omer and Ranner, Thomas and Cohen, Netta and Hogg, David C.



Research question: How to reconstruct the 3D shape of a subject that is semi-transparent and constantly moving in and out of focus.
Motivation: Conventional multi-image feature-identification approaches do not work for semi-transparent, motion-blurred subjects.
Method: These challenges are overcome by rendering a candidate shape with adaptive blurring and transparency for comparison with the images. A novel differentiable renderer constructs images from 2D projections and compares them against the raw images to generate a pixel-wise error, with gradient descent jointly updating the curve, camera, and renderer parameters.
Results: The method is robust to interference such as bubbles and dirt trapped in the fluid, stays consistent through complex posture sequences, recovers reliable estimates from blurry images, and provides a significant improvement in tracking the 3D shape of C. elegans.

3D shape reconstruction typically requires identifying object features or textures in multiple images of a subject. This approach is not viable when the subject is semi-transparent and moving in and out of focus. Here we overcome these challenges by rendering a candidate shape with adaptive blurring and transparency for comparison with the images. We use the microscopic nematode Caenorhabditis elegans as a case study as it freely explores a 3D complex fluid with constantly changing optical properties. We model the slender worm as a 3D curve using an intrinsic parametrisation that naturally admits biologically-informed constraints and regularisation. To account for the changing optics we develop a novel differentiable renderer to construct images from 2D projections and compare against raw images to generate a pixel-wise error to jointly update the curve, camera and renderer parameters using gradient descent. The method is robust to interference such as bubbles and dirt trapped in the fluid, stays consistent through complex sequences of postures, recovers reliable estimates from blurry images and provides a significant improvement on previous attempts to track C. elegans in 3D. Our results demonstrate the potential of direct approaches to shape estimation in complex physical environments in the absence of ground-truth data.
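The render-compare-descend loop can be illustrated in one dimension: render a candidate, compute a pixel-wise error against the target image, and descend the gradient of that error. In this toy sketch a Gaussian-blob "renderer" and a finite-difference gradient stand in for the paper's differentiable renderer and autodiff; all values are hypothetical:

```python
import math

def render(c, xs, w=1.0):
    # Toy 1-D "renderer": a blurred blob centred at c.
    return [math.exp(-((x - c) / w) ** 2) for x in xs]

def pixel_loss(c, target, xs):
    # Pixel-wise squared error between rendering and target image.
    return sum((a - b) ** 2 for a, b in zip(render(c, xs), target))

# Recover the blob centre from a rendered "image" by gradient descent
# on the pixel-wise error, using a central finite-difference gradient.
xs = [i * 0.25 - 5.0 for i in range(41)]
target = render(2.0, xs)          # ground-truth centre is 2.0
c, lr, eps = 0.5, 0.05, 1e-4      # poor initial guess
for _ in range(300):
    grad = (pixel_loss(c + eps, target, xs)
            - pixel_loss(c - eps, target, xs)) / (2 * eps)
    c -= lr * grad                # c converges toward 2.0
```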

Swept-Angle Synthetic Wavelength Interferometry
Kotwal, Alankar and Levin, Anat and Gkioulekas, Ioannis



Research question: This paper presents swept-angle synthetic wavelength interferometry, a new imaging technique for full-field micron-scale 3D sensing.
Motivation: Conventional synthetic wavelength interferometry uses light of two narrowly separated optical wavelengths whose phase encodes scene depth, but it is vulnerable to aberrations and (sub)surface scattering.
Method: The proposed technique emulates spatially incoherent illumination, making the interferometric measurements insensitive to aberrations and (sub)surface scattering, and combines the robustness of scanning interferometric setups with the speed of full-field interferometric setups.
Results: Experiments demonstrate that the technique can recover full-frame depth at a lateral and axial resolution of 5 microns, at frame rates of 5 Hz, even under strong ambient light.

We present a new imaging technique, swept-angle synthetic wavelength interferometry, for full-field micron-scale 3D sensing. As in conventional synthetic wavelength interferometry, our technique uses light consisting of two narrowly-separated optical wavelengths, resulting in per-pixel interferometric measurements whose phase encodes scene depth. Our technique additionally uses a new type of light source that, by emulating spatially-incoherent illumination, makes interferometric measurements insensitive to aberrations and (sub)surface scattering, effects that corrupt phase measurements. The resulting technique combines the robustness to such corruptions of scanning interferometric setups, with the speed of full-field interferometric setups. Overall, our technique can recover full-frame depth at a lateral and axial resolution of 5 microns, at frame rates of 5 Hz, even under strong ambient light. We build an experimental prototype, and use it to demonstrate these capabilities by scanning a variety of objects, including objects representative of applications in inspection and fabrication, and objects that contain challenging light scattering effects.

Multi-Space Neural Radiance Fields
Yin, Ze-Xin and Qiu, Jiaxiong and Cheng, Ming-Ming and Ren, Bo



Research question: Existing NeRF-based methods often produce blurry or distorted renderings of reflective objects.
Motivation: A multi-space neural radiance field (MS-NeRF) is proposed that represents the scene via feature fields in parallel sub-spaces, giving the neural network a better understanding of reflective and refractive objects.
Method: Instead of computing a single radiance field, MS-NeRF represents the scene as a group of feature fields in parallel sub-spaces. The scheme works as an enhancement to existing NeRF methods, requiring only small additional training and inference overheads for the extra-space outputs.
Results: Comparisons on three representative NeRF-based models show that MS-NeRF significantly outperforms existing single-space NeRF methods for high-quality rendering of complex light paths through mirror-like objects.

Neural Radiance Fields (NeRF) and its variants have reached state-of-the-art performance in many novel-view-synthesis-related tasks. However, current NeRF-based methods still suffer from the existence of reflective objects, often resulting in blurry or distorted rendering. Instead of calculating a single radiance field, we propose a multi-space neural radiance field (MS-NeRF) that represents the scene using a group of feature fields in parallel sub-spaces, which leads to a better understanding of the neural network toward the existence of reflective and refractive objects. Our multi-space scheme works as an enhancement to existing NeRF methods, with only small computational overheads needed for training and inferring the extra-space outputs. We demonstrate the superiority and compatibility of our approach using three representative NeRF-based models, i.e., NeRF, Mip-NeRF, and Mip-NeRF 360. Comparisons are performed on a newly constructed dataset consisting of 25 synthetic scenes and 7 real captured scenes with complex reflection and refraction, all having 360-degree viewpoints. Extensive experiments show that our approach significantly outperforms the existing single-space NeRF methods for rendering high-quality scenes concerned with complex light paths through mirror-like objects.

Two-View Geometry Scoring Without Correspondences
Barroso-Laguna, AxelandBrachmann, EricandPrisacariu, VictorAdrianandBrostow, GabrielJ.andTurmukhambetov, Daniyar



Research question: Camera pose estimation for two-view geometry traditionally relies on RANSAC, whose inlier-count scoring favors poor models under certain circumstances.
Motivation: To remedy this, we propose the Fundamental Scoring Network (FSNet), which predicts the pose error of two overlapping images through an epipolar attention mechanism instead of relying on sparse correspondences.
Method: FSNet needs no sparse correspondences; it embodies a two-view geometry model via an epipolar attention mechanism and can be incorporated into a traditional RANSAC loop.
Results: Evaluated on fundamental and essential matrix estimation on indoor and outdoor datasets, FSNet successfully identifies good poses for image pairs with few or unreliable correspondences; combining FSNet with the MAGSAC++ scoring approach achieves state-of-the-art results.

Camera pose estimation for two-view geometry traditionally relies on RANSAC. Normally, a multitude of image correspondences leads to a pool of proposed hypotheses, which are then scored to find a winning model. The inlier count is generally regarded as a reliable indicator of "consensus". We examine this scoring heuristic, and find that it favors disappointing models under certain circumstances. As a remedy, we propose the Fundamental Scoring Network (FSNet), which infers a score for a pair of overlapping images and any proposed fundamental matrix. It does not rely on sparse correspondences, but rather embodies a two-view geometry model through an epipolar attention mechanism that predicts the pose error of the two images. FSNet can be incorporated into traditional RANSAC loops. We evaluate FSNet on fundamental and essential matrix estimation on indoor and outdoor datasets, and establish that FSNet can successfully identify good poses for pairs of images with few or unreliable correspondences. Besides, we show that naively combining FSNet with MAGSAC++ scoring approach achieves state of the art results.
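
The hypothesis-selection step the paper targets can be sketched in a few lines: a RANSAC-style loop needs only *some* callable that rates each proposed fundamental matrix, so an inlier count can be swapped for a learned scorer such as FSNet. The toy scorer and data below are stand-ins, not the real pipeline.

```python
import numpy as np

def ransac_select(hypotheses, score_fn):
    """Return the hypothesis with the best (lowest) predicted pose error."""
    scores = [score_fn(F) for F in hypotheses]
    return hypotheses[int(np.argmin(scores))]

# Toy example: the scorer prefers the F closest to a "ground truth" matrix,
# standing in for a network that regresses pose error from the image pair.
F_true = np.eye(3)
rng = np.random.default_rng(0)
hyps = [F_true + 0.5 * rng.normal(size=(3, 3)), F_true]
best = ransac_select(hyps, score_fn=lambda F: np.linalg.norm(F - F_true))
```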

Panoptic Lifting for 3D Scene Understanding With Neural Fields
Siddiqui, YawarandPorzi, LorenzoandBulò, SamuelRotaandMüller, NormanandNießner, Matthias



Research question: How to learn panoptic 3D volumetric representations from images of in-the-wild scenes.
Motivation: Existing approaches use 3D input directly or indirectly, whereas our method requires only 2D panoptic segmentation masks inferred from a pre-trained network.
Method: We propose a novel panoptic lifting scheme based on a neural field representation that generates a unified, multi-view-consistent 3D panoptic representation of the scene. To handle inconsistent 2D instance identifiers across views, we solve a linear assignment problem whose cost is based on the model's current predictions and the machine-generated segmentation masks, thereby lifting 2D instances to 3D in a consistent way.
Results: Validated on the challenging Hypersim, Replica, and ScanNet datasets, improving scene-level PQ over the state of the art by 8.4%, 13.8%, and 10.6%, respectively.

We propose Panoptic Lifting, a novel approach for learning panoptic 3D volumetric representations from images of in-the-wild scenes. Once trained, our model can render color images together with 3D-consistent panoptic segmentation from novel viewpoints. Unlike existing approaches which use 3D input directly or indirectly, our method requires only machine-generated 2D panoptic segmentation masks inferred from a pre-trained network. Our core contribution is a panoptic lifting scheme based on a neural field representation that generates a unified and multi-view consistent, 3D panoptic representation of the scene. To account for inconsistencies of 2D instance identifiers across views, we solve a linear assignment with a cost based on the model's current predictions and the machine-generated segmentation masks, thus enabling us to lift 2D instances to 3D in a consistent way. We further propose and ablate contributions that make our method more robust to noisy, machine-generated labels, including test-time augmentations for confidence estimates, segment consistency loss, bounded segmentation fields, and gradient stopping. Experimental results validate our approach on the challenging Hypersim, Replica, and ScanNet datasets, improving by 8.4, 13.8, and 10.6% in scene-level PQ over state of the art.
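
The per-frame matching step can be made concrete: 2D instance ids from the segmentation network are linked to the model's persistent 3D instances by solving a linear assignment over a cost matrix. The brute-force permutation search and the cost values below are for illustration only (a real implementation would use the Hungarian algorithm and model-derived costs).

```python
import itertools
import numpy as np

def match_instances(cost):
    """cost[i, j]: cost of mapping 2D instance i to 3D instance j.
    Returns the assignment minimizing total cost (brute force, small n)."""
    n = cost.shape[0]
    best = min(itertools.permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    return list(best)   # best[i] = 3D id assigned to 2D instance i

# Invented cost matrix: low cost on the diagonal -> identity assignment.
cost = np.array([[0.1, 0.9, 0.8],
                 [0.9, 0.2, 0.7],
                 [0.8, 0.9, 0.1]])
assignment = match_instances(cost)
```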

Single View Scene Scale Estimation Using Scale Field
Lee, Byeong-UkandZhang, JianmingandHold-Geoffroy, YannickandKweon, InSo



Research question: This paper proposes a single-image scale estimation method based on a novel scale field representation.
Motivation: Camera parameters are inherently ambiguous; a simple yet effective way is needed to collect scale annotations on arbitrary images.
Method: By training the model on calibrated panoramic image data and in-the-wild human-annotated data, it generates robust scale fields on a variety of images.
Results: The method can be used in various 3D understanding and scale-aware image editing applications.

In this paper, we propose a single image scale estimation method based on a novel scale field representation. A scale field defines the local pixel-to-metric conversion ratio along the gravity direction on all the ground pixels. This representation resolves the ambiguity in camera parameters, allowing us to use a simple yet effective way to collect scale annotations on arbitrary images from human annotators. By training our model on calibrated panoramic image data and the in-the-wild human annotated data, our single image scene scale estimation network generates robust scale field on a variety of image, which can be utilized in various 3D understanding and scale-aware image editing applications.
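
The scale field's definition translates directly into a measurement rule: each ground pixel stores a local pixel-to-metric ratio along the gravity direction, so an object's metric height is its pixel height times the scale at its foot point. The field values here are invented for illustration.

```python
def metric_height(scale_field, foot_pixel, pixel_height):
    """scale_field: mapping from ground pixel -> meters-per-pixel along
    gravity at that point (a dense per-pixel map in the real method)."""
    return scale_field[foot_pixel] * pixel_height

# Hypothetical field value: 1 px corresponds to 1 cm at this ground pixel.
scale_field = {(120, 340): 0.01}
h = metric_height(scale_field, (120, 340), 180.0)   # 180 px tall -> 1.8 m
```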

SCADE: NeRFs from Space Carving With Ambiguity-Aware Depth Estimates
Uy, MikaelaAngelinaandMartin-Brualla, RicardoandGuibas, LeonidasandLi, Ke



Research question: Existing neural radiance fields (NeRFs) perform poorly under a small number of views, because volumetric rendering enforces insufficient constraints.
Motivation: To address this, we propose SCADE, a novel method that improves NeRF reconstruction quality on sparse, unconstrained input views.
Method: We constrain NeRF reconstruction with per-view depth estimates produced by state-of-the-art monocular depth estimation models as geometric priors. We further propose predicting, for each view, a continuous multimodal distribution of depth estimates via conditional Implicit Maximum Likelihood Estimation (cIMLE), and introduce a novel space carving loss that fuses the multiple hypothesized depth maps from each view and distills a common geometry consistent with all views.
Results: Experiments show that our method enables higher-fidelity novel view synthesis from sparse views.

Neural radiance fields (NeRFs) have enabled high fidelity 3D reconstruction from multiple 2D input views. However, a well-known drawback of NeRFs is the less-than-ideal performance under a small number of views, due to insufficient constraints enforced by volumetric rendering. To address this issue, we introduce SCADE, a novel technique that improves NeRF reconstruction quality on sparse, unconstrained input views for in-the-wild indoor scenes. To constrain NeRF reconstruction, we leverage geometric priors in the form of per-view depth estimates produced with state-of-the-art monocular depth estimation models, which can generalize across scenes. A key challenge is that monocular depth estimation is an ill-posed problem, with inherent ambiguities. To handle this issue, we propose a new method that learns to predict, for each view, a continuous, multimodal distribution of depth estimates using conditional Implicit Maximum Likelihood Estimation (cIMLE). In order to disambiguate exploiting multiple views, we introduce an original space carving loss that guides the NeRF representation to fuse multiple hypothesized depth maps from each view and distill from them a common geometry that is consistent with all views. Experiments show that our approach enables higher fidelity novel view synthesis from sparse views. Our project page can be found at https://scade-spacecarving-nerfs.github.io.

CloSET: Modeling Clothed Humans on Continuous Surface With Explicit Template Decomposition
Zhang, HongwenandLin, SiyouandShao, RuizhiandZhang, YuxiangandZheng, ZerongandHuang, HanandGuo, YandongandLiu, Yebin



Research question: Creating animatable avatars from static scans requires modeling clothing deformations in different poses.
Motivation: Existing learning-based methods typically add pose-dependent deformations on top of a minimally-clothed mesh template or a learned implicit template, which limits detail capture or hinders end-to-end learning.
Method: We revisit point-based solutions and propose to decompose explicit garment-related templates and then add pose-dependent wrinkles on top of them. Clothing deformations are thus disentangled so that pose-dependent wrinkles can be better learned and applied to unseen poses. In addition, to tackle the seam artifacts of recent state-of-the-art point-based methods, we propose learning point features on the body surface, establishing a continuous and compact feature space that captures fine-grained, pose-dependent clothing geometry.
Results: Our approach is validated on two existing datasets and our newly introduced dataset, showing better clothing deformation results in unseen poses. The project page with code and dataset can be found at https://www.liuyebin.com/closet.

Creating animatable avatars from static scans requires the modeling of clothing deformations in different poses. Existing learning-based methods typically add pose-dependent deformations upon a minimally-clothed mesh template or a learned implicit template, which have limitations in capturing details or hinder end-to-end learning. In this paper, we revisit point-based solutions and propose to decompose explicit garment-related templates and then add pose-dependent wrinkles to them. In this way, the clothing deformations are disentangled such that the pose-dependent wrinkles can be better learned and applied to unseen poses. Additionally, to tackle the seam artifact issues in recent state-of-the-art point-based methods, we propose to learn point features on a body surface, which establishes a continuous and compact feature space to capture the fine-grained and pose-dependent clothing geometry. To facilitate the research in this field, we also introduce a high-quality scan dataset of humans in real-world clothing. Our approach is validated on two existing datasets and our newly introduced dataset, showing better clothing deformation results in unseen poses. The project page with code and dataset can be found at https://www.liuyebin.com/closet.

BUOL: A Bottom-Up Framework With Occupancy-Aware Lifting for Panoptic 3D Scene Reconstruction From a Single Image
Chu, TaoandZhang, PanandLiu, QiongandWang, Jiaqi



Research question: How to understand and reconstruct a 3D scene from a single image, performing 3D reconstruction and 3D panoptic segmentation jointly.
Motivation: Existing methods focus only on top-down approaches that fill 2D instances into 3D voxels according to estimated depth, and suffer from two problems: instance-channel ambiguity and voxel-reconstruction ambiguity.
Method: We propose BUOL, a bottom-up framework with occupancy-aware lifting, to address both issues. For instance-channel ambiguity, the bottom-up framework lifts 2D information to 3D voxels based on deterministic semantic assignments, then refines the voxels and groups them into 3D instances according to predicted 2D instance centers. For voxel-reconstruction ambiguity, the estimated multi-plane occupancy is used together with depth to fill the whole regions of things and stuff.
Results: The method shows a significant performance advantage over state-of-the-art methods on the synthetic 3D-Front dataset and the real-world Matterport3D dataset.

Understanding and modeling the 3D scene from a single image is a practical problem. A recent advance proposes a panoptic 3D scene reconstruction task that performs both 3D reconstruction and 3D panoptic segmentation from a single image. Although having made substantial progress, recent works only focus on top-down approaches that fill 2D instances into 3D voxels according to estimated depth, which hinders their performance by two ambiguities. (1) instance-channel ambiguity: The variable ids of instances in each scene lead to ambiguity during filling voxel channels with 2D information, confusing the following 3D refinement. (2) voxel-reconstruction ambiguity: 2D-to-3D lifting with estimated single view depth only propagates 2D information onto the surface of 3D regions, leading to ambiguity during the reconstruction of regions behind the frontal view surface. In this paper, we propose BUOL, a Bottom-Up framework with Occupancy-aware Lifting to address the two issues for panoptic 3D scene reconstruction from a single image. For instance-channel ambiguity, a bottom-up framework lifts 2D information to 3D voxels based on deterministic semantic assignments rather than arbitrary instance id assignments. The 3D voxels are then refined and grouped into 3D instances according to the predicted 2D instance centers. For voxel-reconstruction ambiguity, the estimated multi-plane occupancy is leveraged together with depth to fill the whole regions of things and stuff. Our method shows a tremendous performance advantage over state-of-the-art methods on synthetic dataset 3D-Front and real-world dataset Matterport3D, respectively. Code and models will be released.

SparseFusion: Distilling View-Conditioned Diffusion for 3D Reconstruction
Zhou, ZhizhuoandTulsiani, Shubham



Research question: Propose SparseFusion, a sparse-view 3D reconstruction approach that unifies recent advances in neural rendering and probabilistic image generation.
Motivation: Existing methods typically build on neural rendering with re-projected features but fail to generate unseen regions or handle uncertainty under large viewpoint changes. Alternate methods treat the task as (probabilistic) 2D synthesis; while they can generate plausible 2D images, they do not infer a consistent underlying 3D.
Method: By distilling a 3D-consistent scene representation from a view-conditioned latent diffusion model, the approach recovers a plausible, realistic 3D representation whose renderings are both accurate and realistic.
Results: Evaluated across 51 categories in the CO3D dataset, the method outperforms existing approaches in both distortion and perception metrics for sparse-view novel view synthesis.

We propose SparseFusion, a sparse view 3D reconstruction approach that unifies recent advances in neural rendering and probabilistic image generation. Existing approaches typically build on neural rendering with re-projected features but fail to generate unseen regions or handle uncertainty under large viewpoint changes. Alternate methods treat this as a (probabilistic) 2D synthesis task, and while they can generate plausible 2D images, they do not infer a consistent underlying 3D. However, we find that this trade-off between 3D consistency and probabilistic image generation does not need to exist. In fact, we show that geometric consistency and generative inference can be complementary in a mode seeking behavior. By distilling a 3D consistent scene representation from a view-conditioned latent diffusion model, we are able to recover a plausible 3D representation whose renderings are both accurate and realistic. We evaluate our approach across 51 categories in the CO3D dataset and show that it outperforms existing methods, in both distortion and perception metrics, for sparse view novel view synthesis.

Differentiable Shadow Mapping for Efficient Inverse Graphics
Worchel, MarkusandAlexa, Marc



Research question: How can shadows be generated efficiently in differentiable rendering of triangle meshes?
Motivation: Existing shadow approximation techniques demand substantial computation and deliver unsatisfactory results.
Method: Combine pre-filtered shadow mapping with existing differentiable rasterizers to yield differentiable visibility information.
Results: On several inverse graphics problems, differentiable shadow maps are orders of magnitude faster than differentiable light transport simulation of similar accuracy, while differentiable rasterization without shadows often fails to converge.

We show how shadows can be efficiently generated in differentiable rendering of triangle meshes. Our central observation is that pre-filtered shadow mapping, a technique for approximating shadows based on rendering from the perspective of a light, can be combined with existing differentiable rasterizers to yield differentiable visibility information. We demonstrate at several inverse graphics problems that differentiable shadow maps are orders of magnitude faster than differentiable light transport simulation with similar accuracy -- while differentiable rasterization without shadows often fails to converge.
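
One classic pre-filtered scheme, variance shadow maps, illustrates how filtering the shadow map yields a smooth (and hence differentiable) visibility term: the map stores the mean and variance of light-space depth, and Chebyshev's inequality bounds the fraction of occluders in front of a receiver. This is a sketch of the family of techniques, not necessarily the paper's exact filter.

```python
import numpy as np

def soft_visibility(mean_z, var_z, d, eps=1e-6):
    """Chebyshev upper bound on P(occluder depth z >= receiver depth d),
    given pre-filtered depth moments (mean_z, var_z) from the light's view."""
    var = np.maximum(var_z, eps)
    p = var / (var + (d - mean_z) ** 2)
    return np.where(d <= mean_z, 1.0, p)   # fully lit in front of occluders

lit = soft_visibility(mean_z=5.0, var_z=0.01, d=4.0)      # receiver in front
shadowed = soft_visibility(mean_z=5.0, var_z=0.01, d=8.0) # deep in shadow
```

Because `soft_visibility` is smooth in `d` behind the occluder, gradients flow through the shadow test, which is exactly what a hard depth comparison would prevent.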

A Practical Stereo Depth System for Smart Glasses
Wang, JialiangandScharstein, DanielandBapat, AkashandBlackburn-Matzen, KevinandYu, MatthewandLehman, JonathanandAlsisan, SuhibandWang, YanghanandTsai, SamandFrahm, Jan-MichaelandHe, ZijianandVajda, PeterandCohen, MichaelF.andUyttendaele, Matt



Research question: Design an end-to-end stereo depth sensing system that performs pre-processing, online stereo rectification, and stereo depth estimation, falling back to monocular depth estimation when rectification is unreliable.
Motivation: All of these steps must run on-device on a smartphone's stringent compute budget, handle failure cases and less-than-ideal input data, and meet the memory and latency constraints required for a smooth user experience.
Method: The system handles unforeseen calibration changes (e.g., due to heat), robustly supports depth estimation in the wild, and keeps its design hardware-agnostic rather than depending on a particular ML accelerator such as a smartphone GPU; its output feeds a novel-view-generation pipeline for 3D computational photography effects.
Results: The trained models are fast, running in under 1 s on the CPU of a six-year-old Samsung Galaxy S8; they generalize well to unseen data and achieve good results on Middlebury and on in-the-wild images captured from the smart glasses.

We present the design of a productionized end-to-end stereo depth sensing system that does pre-processing, online stereo rectification, and stereo depth estimation with a fallback to monocular depth estimation when rectification is unreliable. The output of our depth sensing system is then used in a novel view generation pipeline to create 3D computational photography effects using point-of-view images captured by smart glasses. All these steps are executed on-device on the stringent compute budget of a mobile phone, and because we expect the users can use a wide range of smartphones, our design needs to be general and cannot be dependent on a particular hardware or ML accelerator such as a smartphone GPU. Although each of these steps is well studied, a description of a practical system is still lacking. For such a system, all these steps need to work in tandem with one another and fallback gracefully on failures within the system or less than ideal input data. We show how we handle unforeseen changes to calibration, e.g., due to heat, robustly support depth estimation in the wild, and still abide by the memory and latency constraints required for a smooth user experience. We show that our trained models are fast, and run in less than 1s on a six-year-old Samsung Galaxy S8 phone's CPU. Our models generalize well to unseen data and achieve good results on Middlebury and in-the-wild images captured from the smart glasses.
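
The graceful-fallback behavior described above reduces to a simple control flow: run online rectification, check a quality measure, and switch to monocular depth when rectification is unreliable. The threshold and the estimator stubs below are illustrative assumptions, not the production system's values.

```python
# Hypothetical vertical-disparity budget, in pixels, for trusting stereo.
RECTIFICATION_ERROR_MAX = 1.5

def estimate_depth(pair, rectify, stereo_depth, mono_depth):
    """Return (depth, mode): stereo when rectification succeeds, else mono."""
    rectified, residual_px = rectify(pair)
    if residual_px <= RECTIFICATION_ERROR_MAX:
        return stereo_depth(rectified), "stereo"
    return mono_depth(pair[0]), "monocular"

# Toy stubs: rectification "fails" with a 4 px residual -> monocular fallback.
depth, mode = estimate_depth(
    pair=("left", "right"),
    rectify=lambda p: (p, 4.0),
    stereo_depth=lambda p: "stereo-depth",
    mono_depth=lambda img: "mono-depth",
)
```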

Instant Volumetric Head Avatars
Zielonka, WojciechandBolkart, TimoandThies, Justus



Research question: How to reconstruct photo-realistic digital avatars instantaneously?
Motivation: State-of-the-art methods take up to several days to train an avatar; our goal is to cut training time to under 10 minutes on modern GPU hardware.
Method: We propose INSTA, which models a dynamic neural radiance field based on neural graphics primitives embedded around a parametric face model. The pipeline is trained on a single monocular RGB portrait video observing the subject under different expressions and views.
Results: Experiments show that INSTA outperforms state-of-the-art methods in rendering quality and training time, and extrapolates to unseen poses.

We present Instant Volumetric Head Avatars (INSTA), a novel approach for reconstructing photo-realistic digital avatars instantaneously. INSTA models a dynamic neural radiance field based on neural graphics primitives embedded around a parametric face model. Our pipeline is trained on a single monocular RGB portrait video that observes the subject under different expressions and views. While state-of-the-art methods take up to several days to train an avatar, our method can reconstruct a digital avatar in less than 10 minutes on modern GPU hardware, which is orders of magnitude faster than previous solutions. In addition, it allows for the interactive rendering of novel poses and expressions. By leveraging the geometry prior of the underlying parametric face model, we demonstrate that INSTA extrapolates to unseen poses. In quantitative and qualitative studies on various subjects, INSTA outperforms state-of-the-art methods regarding rendering quality and training time. Project website: https://zielon.github.io/insta/

HARP: Personalized Hand Reconstruction From a Monocular RGB Video
Karunratanakul, KorraweandProkudin, SergeyandHilliges, OtmarandTang, Siyu



Research question: How to reconstruct a high-fidelity hand avatar from a monocular RGB video?
Motivation: Most current hand avatar creation methods adopt neural implicit representations; in contrast, HARP proposes an explicit representation built on a mesh-based parametric hand model, a vertex displacement map, a normal map, and an albedo, without any neural components.
Method: HARP is optimized via gradient descent directly from a short sequence captured with a hand-held mobile phone, and a carefully designed shadow-aware differentiable rendering scheme enables photo-realistic and real-time rendering.
Results: Experiments validate HARP's superior fidelity and scalability on appearance reconstruction, novel view and novel pose synthesis, and 3D hand pose refinement, making it an AR/VR-ready personalized hand representation.

We present HARP (HAnd Reconstruction and Personalization), a personalized hand avatar creation approach that takes a short monocular RGB video of a human hand as input and reconstructs a faithful hand avatar exhibiting a high-fidelity appearance and geometry. In contrast to the major trend of neural implicit representations, HARP models a hand with a mesh-based parametric hand model, a vertex displacement map, a normal map, and an albedo without any neural components. The explicit nature of our representation enables a truly scalable, robust, and efficient approach to hand avatar creation as validated by our experiments. HARP is optimized via gradient descent from a short sequence captured by a hand-held mobile phone and can be directly used in AR/VR applications with real-time rendering capability. To enable this, we carefully design and implement a shadow-aware differentiable rendering scheme that is robust to high degree articulations and self-shadowing regularly present in hand motions, as well as challenging lighting conditions. It also generalizes to unseen poses and novel viewpoints, producing photo-realistic renderings of hand animations. Furthermore, the learned HARP representation can be used for improving 3D hand pose estimation quality in challenging viewpoints. The key advantages of HARP are validated by the in-depth analyses on appearance reconstruction, novel view and novel pose synthesis, and 3D hand pose refinement. It is an AR/VR-ready personalized hand representation that shows superior fidelity and scalability.

DBARF: Deep Bundle-Adjusting Generalizable Neural Radiance Fields
Chen, YuandLee, GimHee



Research question: How to optimize camera poses for generalizable NeRFs (GeNeRFs) built on complicated 3D CNN or transformer architectures.
Motivation: Although methods such as BARF and GARF can bundle-adjust camera poses in NeRFs, they cannot be applied to GeNeRFs, which require image feature extraction.
Method: We propose DBARF, which bundle-adjusts camera poses by taking a cost feature map as an implicit cost function, and which can be jointly trained with GeNeRFs in a self-supervised manner.
Results: Experiments on real-world datasets show the effectiveness and generalization ability of DBARF; it generalizes across scenes and requires no good pose initialization.

Recent works such as BARF and GARF can bundle adjust camera poses with neural radiance fields (NeRF) which is based on coordinate-MLPs. Despite the impressive results, these methods cannot be applied to Generalizable NeRFs (GeNeRFs) which require image feature extractions that are often based on more complicated 3D CNN or transformer architectures. In this work, we first analyze the difficulties of jointly optimizing camera poses with GeNeRFs, and then further propose our DBARF to tackle these issues. Our DBARF which bundle adjusts camera poses by taking a cost feature map as an implicit cost function can be jointly trained with GeNeRFs in a self-supervised manner. Unlike BARF and its follow-up works, which can only be applied to per-scene optimized NeRFs and need accurate initial camera poses with the exception of forward-facing scenes, our method can generalize across scenes and does not require any good initialization. Experiments show the effectiveness and generalization ability of our DBARF when evaluated on real-world datasets. Our code is available at https://aibluefisher.github.io/dbarf.

Connecting the Dots: Floorplan Reconstruction Using Two-Level Queries
Yue, YuanwenandKontogianni, TheodoraandSchindler, KonradandEngelmann, Francis



Research question: This paper addresses reconstructing 2D floorplans from 3D scans.
Motivation: Existing approaches typically employ heuristically designed multi-stage pipelines; we instead cast the task as a single-stage structured prediction problem.
Method: We develop a novel Transformer architecture that generates the polygons of multiple rooms in parallel, without hand-crafted intermediate stages.
Results: The method sets a new state of the art on two challenging datasets, Structured3D and SceneCAD, with significantly faster inference than previous methods, and readily extends to predicting additional information such as semantic room types and architectural elements like doors and windows.

We address 2D floorplan reconstruction from 3D scans. Existing approaches typically employ heuristically designed multi-stage pipelines. Instead, we formulate floorplan reconstruction as a single-stage structured prediction task: find a variable-size set of polygons, which in turn are variable-length sequences of ordered vertices. To solve it we develop a novel Transformer architecture that generates polygons of multiple rooms in parallel, in a holistic manner without hand-crafted intermediate stages. The model features two-level queries for polygons and corners, and includes polygon matching to make the network end-to-end trainable. Our method achieves a new state-of-the-art for two challenging datasets, Structured3D and SceneCAD, along with significantly faster inference than previous methods. Moreover, it can readily be extended to predict additional information, i.e., semantic room types and architectural elements like doors and windows. Our code and models are available at: https://github.com/ywyue/RoomFormer.

Analyzing and Diagnosing Pose Estimation With Attributions
He, QiyuanandYang, LinlinandGu, KeruiandLin, QiuxiaandYao, Angela



Research question: Design an interpretability technique for pose estimation.
Motivation: Understanding the impact of different pose frameworks requires a technique that can generate pixel-level attribution maps.
Method: Propose Pose Integrated Gradient (PoseIG), which unifies different pose outputs into a common output space and uses a likelihood approximation function for gradient back-propagation.
Results: With these tools, we systematically compare pose estimation frameworks, revealing a shortcut on the knuckles in hand pose estimation and an under-explored inversion error for keypoints in body pose estimation.

We present Pose Integrated Gradient (PoseIG), the first interpretability technique designed for pose estimation. We extend the concept of integrated gradients for pose estimation to generate pixel-level attribution maps. To enable comparison across different pose frameworks, we unify different pose outputs into a common output space, along with a likelihood approximation function for gradient back-propagation. To complement the qualitative insight from the attribution maps, we propose three indices for quantitative analysis. With these tools, we systematically compare different pose estimation frameworks to understand the impacts of network design, backbone and auxiliary tasks. Our analysis reveals an interesting shortcut of the knuckles (MCP joints) for hand pose estimation and an under-explored inversion error for keypoints in body pose estimation. Project page: https://qy-h00.github.io/poseig/.
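
The attribution rule PoseIG extends is standard integrated gradients: the attribution of each input dimension is (input minus baseline) times the gradient averaged along the straight path from baseline to input. The sketch below uses an analytic function with a known gradient rather than a pose network, so the completeness property (attributions sum to the output difference) can be checked exactly.

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=100):
    """Path-integral attribution: (x - baseline) * mean gradient along the
    straight-line path from baseline to x (midpoint quadrature)."""
    alphas = (np.arange(steps) + 0.5) / steps            # midpoint rule
    pts = baseline + alphas[:, None] * (x - baseline)    # points on the path
    avg_grad = grad_f(pts).mean(axis=0)
    return (x - baseline) * avg_grad

# For f(x) = sum(x^2), grad f = 2x; attributions must sum to f(x) - f(0).
x = np.array([1.0, 2.0])
baseline = np.zeros(2)
attr = integrated_gradients(lambda p: 2 * p, x, baseline)
```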

Scalable, Detailed and Mask-Free Universal Photometric Stereo
Ikehata, Satoshi



Research question: This paper develops SDM-UniPS, a universal photometric stereo network that recovers intricate surface normal maps under unknown, spatially-varying lighting conditions.
Motivation: Existing photometric stereo networks are limited in uncontrolled environments, especially under unknown, spatially-varying lighting.
Method: We extend previous universal photometric stereo networks to extract spatial-light features, exploiting all available information in high-resolution input images and accounting for non-local interactions among surface points. We also create a new synthetic training dataset covering the diverse shapes, materials, and illumination scenarios found in real-world scenes.
Results: Experiments show that our method not only surpasses calibrated, lighting-specific techniques on public benchmarks, but also performs well with a significantly smaller number of input images, even without object masks.

In this paper, we introduce SDM-UniPS, a groundbreaking Scalable, Detailed, Mask-free, and Universal Photometric Stereo network. Our approach can recover astonishingly intricate surface normal maps, rivaling the quality of 3D scanners, even when images are captured under unknown, spatially-varying lighting conditions in uncontrolled environments. We have extended previous universal photometric stereo networks to extract spatial-light features, utilizing all available information in high-resolution input images and accounting for non-local interactions among surface points. Moreover, we present a new synthetic training dataset that encompasses a diverse range of shapes, materials, and illumination scenarios found in real-world scenes. Through extensive evaluation, we demonstrate that our method not only surpasses calibrated, lighting-specific techniques on public benchmarks, but also excels with a significantly smaller number of input images even without object masks.

Persistent Nature: A Generative Model of Unbounded 3D Worlds
Chai, LucyandTucker, RichardandLi, ZhengqiandIsola, PhillipandSnavely, Noah



Research question: This paper tackles the limitation that current 3D image generative models operate on 3D volumes of fixed extent with limited camera motion.
Motivation: Despite increasingly realistic image quality, recent 3D image generative models typically operate within a fixed 3D volume and restrict camera movement.
Method: We propose unconditionally synthesizing unbounded nature scenes, enabling arbitrarily large camera motion while maintaining a persistent 3D world model. The scene representation consists of an extendable planar scene layout grid, rendered from arbitrary camera poses via a 3D decoder and volume rendering, together with a panoramic skydome. On top of this representation, a generative world model is learned solely from single-view internet photos.
Results: The approach enables scene extrapolation beyond the fixed bounds of current 3D generative models, while supporting a persistent, camera-independent world representation, in contrast to auto-regressive 3D prediction models.

Despite increasingly realistic image quality, recent 3D image generative models often operate on 3D volumes of fixed extent with limited camera motions. We investigate the task of unconditionally synthesizing unbounded nature scenes, enabling arbitrarily large camera motion while maintaining a persistent 3D world model. Our scene representation consists of an extendable, planar scene layout grid, which can be rendered from arbitrary camera poses via a 3D decoder and volume rendering, and a panoramic skydome. Based on this representation, we learn a generative world model solely from single-view internet photos. Our method enables simulating long flights through 3D landscapes, while maintaining global scene consistency---for instance, returning to the starting point yields the same view of the scene. Our approach enables scene extrapolation beyond the fixed bounds of current 3D generative models, while also supporting a persistent, camera-independent world representation that stands in contrast to auto-regressive 3D prediction models. Our project page: https://chail.github.io/persistent-nature/.

Behind the Scenes: Density Fields for Single View Reconstruction
Wimbauer, FelixandYang, NanandRupprecht, ChristianandCremers, Daniel



Research question: How to infer a meaningful geometric scene representation from a single image.
Motivation: Traditional depth map prediction can only reason about areas visible in the image, while neural radiance fields (NeRFs) capture true 3D including color but are too complex to generate from a single image.
Method: Propose predicting an implicit density field from a single image, mapping every location in the image frustum to a volumetric density. By sampling color directly from the available views instead of storing it in the density field, the scene representation becomes far simpler than a NeRF, and a neural network can predict it in a single forward pass.
Results: Experiments show the method predicts meaningful geometry for regions occluded in the input image, and demonstrate its potential for depth prediction and novel view synthesis on three datasets.

Inferring a meaningful geometric scene representation from a single image is a fundamental problem in computer vision. Approaches based on traditional depth map prediction can only reason about areas that are visible in the image. Currently, neural radiance fields (NeRFs) can capture true 3D including color, but are too complex to be generated from a single image. As an alternative, we propose to predict an implicit density field from a single image. It maps every location in the frustum of the image to volumetric density. By directly sampling color from the available views instead of storing color in the density field, our scene representation becomes significantly less complex compared to NeRFs, and a neural network can predict it in a single forward pass. The network is trained through self-supervision from only video data. Our formulation allows volume rendering to perform both depth prediction and novel view synthesis. Through experiments, we show that our method is able to predict meaningful geometry for regions that are occluded in the input image. Additionally, we demonstrate the potential of our approach on three datasets for depth prediction and novel-view synthesis.
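
Even though the field stores only density, rendering still uses the standard volume rendering quadrature: per-sample opacity alpha_i = 1 - exp(-sigma_i * delta_i), transmittance as the running product of (1 - alpha), and the colors multiplied in come from re-sampled input views rather than the field itself. The values below are toy samples along one ray.

```python
import numpy as np

def composite(sigmas, deltas, colors):
    """Standard volume-rendering quadrature along one ray.
    sigmas, deltas: (N,) densities and segment lengths;
    colors: (N, 3) colors sampled externally (e.g., from input views)."""
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return weights @ colors, weights.sum()   # pixel color, accumulated opacity

sigmas = np.array([0.0, 10.0, 0.0])          # one dense sample mid-ray
deltas = np.full(3, 1.0)
colors = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
rgb, opacity = composite(sigmas, deltas, colors)   # ~green, nearly opaque
```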

Sphere-Guided Training of Neural Implicit Surfaces
Dogaru, AndreeaandArdelean, Andrei-TimoteiandIgnatyev, SavvaandZakharov, EgorandBurnaev, Evgeny



Research question: Neural distance functions trained via volumetric ray marching have been widely adopted for multi-view 3D reconstruction, but these methods march rays through the entire scene volume, reducing sampling efficiency and thus reconstruction quality in areas of high-frequency detail.
Motivation: We address this by jointly training the implicit function with a new coarse sphere-based surface reconstruction, using the coarse representation to efficiently exclude the scene's empty volume from ray marching without additional forward passes of the neural surface network, which improves reconstruction fidelity over the base systems.
Method: We incorporate this approach into the training procedures of several implicit surface modeling methods and evaluate it on synthetic and real-world datasets.
Results: Experiments show consistent improvements across all datasets. Our codebase can be accessed via the project page.

In recent years, neural distance functions trained via volumetric ray marching have been widely adopted for multi-view 3D reconstruction. These methods, however, apply the ray marching procedure for the entire scene volume, leading to reduced sampling efficiency and, as a result, lower reconstruction quality in the areas of high-frequency details. In this work, we address this problem via joint training of the implicit function and our new coarse sphere-based surface reconstruction. We use the coarse representation to efficiently exclude the empty volume of the scene from the volumetric ray marching procedure without additional forward passes of the neural surface network, which leads to an increased fidelity of the reconstructions compared to the base systems. We evaluate our approach by incorporating it into the training procedures of several implicit surface modeling methods and observe uniform improvements across both synthetic and real-world datasets. Our codebase can be accessed via the project page.
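
The sphere-guided culling step can be sketched as a simple membership test: ray samples are kept only if they fall inside at least one coarse bounding sphere, so the ray marcher skips empty space before any neural-network evaluation. Sphere centers and radii here are invented for illustration.

```python
import numpy as np

def keep_mask(points, centers, radii):
    """points: (N, 3) ray samples; centers: (M, 3), radii: (M,) spheres.
    Returns a boolean mask of samples inside at least one sphere."""
    d2 = ((points[:, None, :] - centers[None]) ** 2).sum(-1)   # (N, M)
    return (d2 <= radii[None] ** 2).any(axis=1)

points = np.array([[0.0, 0, 0], [5.0, 0, 0], [0.4, 0, 0]])
centers = np.array([[0.0, 0.0, 0.0]])   # one coarse sphere near the surface
radii = np.array([0.5])
mask = keep_mask(points, centers, radii)   # the far-away sample is culled
```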

BAD-NeRF: Bundle Adjusted Deblur Neural Radiance Fields
Wang, PengandZhao, LingzheandMa, RuijieandLiu, Peidong



Research question: How to make neural radiance fields (NeRF) robust to degraded real-world images, such as motion-blurred images captured in low light.
Motivation: Existing methods usually assume the input images are of good quality, but image degradation (e.g., motion blur in low-light conditions) is common in the real world and hurts NeRF's rendering quality.
Method: We present a novel bundle-adjusted deblur Neural Radiance Fields (BAD-NeRF), which is robust to severely motion-blurred images and inaccurate camera poses. Our approach models the physical image formation process of a motion-blurred image, jointly learning the NeRF parameters and recovering the camera motion trajectory during exposure time.
Results: Experiments show that by directly modeling the real physical image formation process, BAD-NeRF outperforms prior works on both synthetic and real datasets.

Neural Radiance Fields (NeRF) have received considerable attention recently, due to its impressive capability in photo-realistic 3D reconstruction and novel view synthesis, given a set of posed camera images. Earlier work usually assumes the input images are of good quality. However, image degradation (e.g. image motion blur in low-light conditions) can easily happen in real-world scenarios, which would further affect the rendering quality of NeRF. In this paper, we present a novel bundle adjusted deblur Neural Radiance Fields (BAD-NeRF), which can be robust to severe motion blurred images and inaccurate camera poses. Our approach models the physical image formation process of a motion blurred image, and jointly learns the parameters of NeRF and recovers the camera motion trajectories during exposure time. In experiments, we show that by directly modeling the real physical image formation process, BAD-NeRF achieves superior performance over prior works on both synthetic and real datasets. Code and data are available at https://github.com/WU-CVGL/BAD-NeRF.
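
The forward model at the heart of BAD-NeRF is compact: a motion-blurred image is (approximately) the average of sharp renderings at virtual camera poses sampled along the trajectory within the exposure. The "renderer" below is a toy function of a 1-D pose, purely to illustrate the averaging.

```python
import numpy as np

def blurred_image(render, poses):
    """Synthesize a blurred image by averaging sharp renders at the
    virtual poses sampled within the exposure interval."""
    return np.mean([render(p) for p in poses], axis=0)

# Toy stand-in renderer: every pixel equals the (scalar) pose value,
# so averaging the renders averages the poses.
render = lambda pose: np.full((2, 2), pose)
poses = np.linspace(0.0, 1.0, 5)          # virtual poses during exposure
blur = blurred_image(render, poses)       # every pixel -> mean pose, 0.5
```

In the real method, both the NeRF and the trajectory endpoints are optimized so that these synthesized blurred images match the captured ones.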

Multiscale Tensor Decomposition and Rendering Equation Encoding for View Synthesis
Han, KangandXiang, Wei



Research question: How to further improve the quality of novel views rendered from captured multi-view images.
Motivation: View rendering has advanced considerably since the emergence of the neural radiance field, but quality still leaves room for improvement.
Method: We propose the neural radiance feature field (NRFF). First, a multiscale tensor decomposition scheme organizes learnable features to represent scenes from coarse to fine scales. Then, instead of encoding view directions to model view-dependent effects, we encode the rendering equation in feature space using an anisotropic spherical Gaussian mixture predicted from the multiscale representation.
Results: NRFF improves state-of-the-art rendering results by over 1 dB in PSNR on both the NeRF and NSVF synthetic datasets, with a significant improvement also observed on the real-world Tanks & Temples dataset.

Rendering novel views from captured multi-view images has made considerable progress since the emergence of the neural radiance field. This paper aims to further advance the quality of view rendering by proposing a novel approach dubbed the neural radiance feature field (NRFF). We first propose a multiscale tensor decomposition scheme to organize learnable features so as to represent scenes from coarse to fine scales. We demonstrate many benefits of the proposed multiscale representation, including more accurate scene shape and appearance reconstruction, and faster convergence compared with the single-scale representation. Instead of encoding view directions to model view-dependent effects, we further propose to encode the rendering equation in the feature space by employing the anisotropic spherical Gaussian mixture predicted from the proposed multiscale representation. The proposed NRFF improves state-of-the-art rendering results by over 1 dB in PSNR on both the NeRF and NSVF synthetic datasets. A significant improvement has also been observed on the real-world Tanks & Temples dataset. Code can be found at https://github.com/imkanghan/nrff.

Learning Accurate 3D Shape Based on Stereo Polarimetric Imaging
Huang, TianyuandLi, HaoangandHe, KejingandSui, CongyingandLi, BinandLiu, Yun-Hui



Research question: This paper addresses the two main problems of existing Shape-from-Polarization (SfP) methods for recovering surface normals: false normal estimates caused by the ambiguity of polarization cues, and the overly idealized orthographic projection assumption.
Motivation: To solve these problems, the authors propose the first approach that combines deep learning with stereo polarization information, recovering not only normals but also disparity.
Method: For the ambiguity problem, a Shape Consistency-based Mask Prediction module exploits the inherent consistency between normals and disparity to identify areas with false normal estimates, replacing the unreliable features in those areas with new features extracted by a global attention mechanism. For the orthographic projection problem, a novel viewing-direction-aided positional encoding strategy enables the network to handle non-orthographic projection.
Results: Experiments show the approach is more accurate than existing SfP methods and more robust to illumination variation.

Shape from Polarization (SfP) aims to recover surface normal using the polarization cues of light. The accuracy of existing SfP methods is affected by two main problems. First, the ambiguity of polarization cues partially results in false normal estimation. Second, the widely-used assumption about orthographic projection is too ideal. To solve these problems, we propose the first approach that combines deep learning and stereo polarization information to recover not only normal but also disparity. Specifically, for the ambiguity problem, we design a Shape Consistency-based Mask Prediction (SCMP) module. It exploits the inherent consistency between normal and disparity to identify the areas with false normal estimation. We replace the unreliable features enclosed by these areas with new features extracted by global attention mechanism. As to the orthographic projection problem, we propose a novel Viewing Direction-aided Positional Encoding (VDPE) strategy. This strategy is based on the unique pixel-viewing direction encoding, and thus enables our neural network to handle the non-orthographic projection. In addition, we establish a real-world stereo SfP dataset that contains various object categories and illumination conditions. Experiments showed that compared with existing SfP methods, our approach is more accurate. Moreover, our approach shows higher robustness to light variation.

GANmouflage: 3D Object Nondetection With Texture Fields
Guo, Rui and Collins, Jasmine and de Lima, Oscar and Owens, Andrew



Research question: Propose a method for camouflaging 3D objects within a scene.
Motivation: To reproduce scene textures accurately across viewpoints while handling the highly conflicting constraints they impose, a model based on texture fields and adversarial learning is proposed.
Method: The model learns to camouflage a variety of object shapes from randomly sampled locations and viewpoints within the input scene, and is the first to address hiding complex object shapes.
Results: A human visual search study shows the estimated textures conceal objects better than previous methods.

We propose a method that learns to camouflage 3D objects within scenes. Given an object's shape and a distribution of viewpoints from which it will be seen, we estimate a texture that will make it difficult to detect. Successfully solving this task requires a model that can accurately reproduce textures from the scene, while simultaneously dealing with the highly conflicting constraints imposed by each viewpoint. We address these challenges with a model based on texture fields and adversarial learning. Our model learns to camouflage a variety of object shapes from randomly sampled locations and viewpoints within the input scene, and is the first to address the problem of hiding complex object shapes. Using a human visual search study, we find that our estimated textures conceal objects significantly better than previous methods.

OReX: Object Reconstruction From Planar Cross-Sections Using Neural Fields
Sawdayee, Haim and Vaxman, Amir and Bermano, Amit H.



Research question: How to reconstruct 3D shapes from planar cross-sections alone.
Motivation: To tackle this sparse and ill-posed problem, and to improve on prior methods that either produce low-quality results or rely on additional priors such as target topology, appearance information, or input normal directions.
Method: Proposes OReX, which reconstructs 3D shapes using a neural field as the interpolation prior: a modest neural network trained on the input planes estimates inside/outside for a given 3D coordinate, yielding a powerful prior that induces smoothness and self-similarity.
Results: Extensive qualitative and quantitative experiments show the method is robust, accurate, and scales well with input size, achieving state-of-the-art results compared with previous approaches and recent potential solutions.

Reconstructing 3D shapes from planar cross-sections is a challenge inspired by downstream applications like medical imaging and geographic informatics. The input is an in/out indicator function fully defined on a sparse collection of planes in space, and the output is an interpolation of the indicator function to the entire volume. Previous works addressing this sparse and ill-posed problem either produce low quality results, or rely on additional priors such as target topology, appearance information, or input normal directions. In this paper, we present OReX, a method for 3D shape reconstruction from slices alone, featuring a Neural Field as the interpolation prior. A modest neural network is trained on the input planes to return an inside/outside estimate for a given 3D coordinate, yielding a powerful prior that induces smoothness and self-similarities. The main challenge for this approach is high-frequency details, as the neural prior is overly smoothing. To alleviate this, we offer an iterative estimation architecture and a hierarchical input sampling scheme that encourage coarse-to-fine training, allowing the training process to focus on high frequencies at later stages. In addition, we identify and analyze a ripple-like effect stemming from the mesh extraction step. We mitigate it by regularizing the spatial gradients of the indicator function around input in/out boundaries during network training, tackling the problem at the root. Through extensive qualitative and quantitative experimentation, we demonstrate our method is robust, accurate, and scales well with the size of the input. We report state-of-the-art results compared to previous approaches and recent potential solutions, and demonstrate the benefit of our individual contributions through analysis and ablation studies.

SPIn-NeRF: Multiview Segmentation and Perceptual Inpainting With Neural Radiance Fields
Mirzaei, Ashkan and Aumentado-Armstrong, Tristan and Derpanis, Konstantinos G. and Kelly, Jonathan and Brubaker, Marcus A. and Gilitschenski, Igor and Levinshtein, Alex



Research question: How to effectively remove unwanted objects from a 3D scene while keeping the replaced region visually plausible and consistent with its context.
Motivation: Existing methods still struggle with 3D scene editing tasks such as removing unwanted objects.
Method: Proposes a novel 3D inpainting method: given a small set of posed images and sparse annotations, it first rapidly obtains a 3D segmentation mask for the target object, then distills the knowledge of learned 2D image inpainters into 3D space while ensuring view consistency.
Results: Experiments show state-of-the-art performance on both multiview segmentation and 3D scene inpainting.

Neural Radiance Fields (NeRFs) have emerged as a popular approach for novel view synthesis. While NeRFs are quickly being adapted for a wider set of applications, intuitively editing NeRF scenes is still an open challenge. One important editing task is the removal of unwanted objects from a 3D scene, such that the replaced region is visually plausible and consistent with its context. We refer to this task as 3D inpainting. In 3D, solutions must be both consistent across multiple views and geometrically valid. In this paper, we propose a novel 3D inpainting method that addresses these challenges. Given a small set of posed images and sparse annotations in a single input image, our framework first rapidly obtains a 3D segmentation mask for a target object. Using the mask, a perceptual optimization-based approach is then introduced that leverages learned 2D image inpainters, distilling their information into 3D space, while ensuring view consistency. We also address the lack of a diverse benchmark for evaluating 3D scene inpainting methods by introducing a dataset comprised of challenging real-world scenes. In particular, our dataset contains views of the same scene with and without a target object, enabling more principled benchmarking of the 3D inpainting task. We first demonstrate the superiority of our approach on multiview segmentation, comparing to NeRF-based methods and 2D segmentation approaches. We then evaluate on the task of 3D inpainting, establishing state-of-the-art performance against other NeRF manipulation algorithms, as well as a strong 2D image inpainter baseline.

Patch-Based 3D Natural Scene Generation From a Single Example
Li, Weiyu and Chen, Xuelin and Wang, Jue and Chen, Baoquan



Research question: To develop a 3D generative model for natural scenes, which are typically unique and intricate; the lack of sufficient training data and the difficulty of ad hoc designs under varying scene characteristics make existing setups intractable.
Motivation: Inspired by classical patch-based image models, the authors advocate synthesizing 3D scenes at the patch level given a single example; the core of the work lies in algorithmic designs that address the unique challenges of lifting the classical 2D patch-based framework to 3D generation.
Method: A novel, effective, and efficient model built around the scene representation and a generative patch nearest-neighbor module, which generates high-quality general natural scenes with realistic geometric structure and visual appearance, demonstrated on a wide variety of exemplar scenes.
Results: Experiments show the model generates high-quality general natural scenes in large quantities and varieties, with both realistic geometric structure and visual appearance.

We target a 3D generative model for general natural scenes that are typically unique and intricate. Lacking the necessary volumes of training data, along with the difficulties of having ad hoc designs in presence of varying scene characteristics, renders existing setups intractable. Inspired by classical patch-based image models, we advocate for synthesizing 3D scenes at the patch level, given a single example. At the core of this work lies important algorithmic designs w.r.t the scene representation and generative patch nearest-neighbor module, that address unique challenges arising from lifting classical 2D patch-based framework to 3D generation. These design choices, on a collective level, contribute to a robust, effective, and efficient model that can generate high-quality general natural scenes with both realistic geometric structure and visual appearance, in large quantities and varieties, as demonstrated upon a variety of exemplar scenes. Data and code can be found at http://wyysf-98.github.io/Sin3DGen.
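The generative patch nearest-neighbor module is built on exact nearest-neighbor lookup over patches; a minimal sketch of that primitive, assuming flattened patch feature vectors (illustrative only, not the paper's implementation):

```python
import numpy as np

def patch_nearest_neighbor(queries, exemplars):
    """Exact nearest-neighbor lookup over flattened patches.

    queries:   (M, D) array of query patch features
    exemplars: (N, D) array of exemplar patch features
    Returns the index of the closest exemplar for each query (L2 distance).
    """
    # (M, N) squared-distance matrix via broadcasting
    d2 = ((queries[:, None, :] - exemplars[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

exemplars = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
queries = np.array([[0.9, 1.1], [2.1, -0.1]])
print(patch_nearest_neighbor(queries, exemplars))  # -> [1 2]
```

In practice, synthesis repeats such lookups coarse-to-fine so that every output patch is borrowed from the single exemplar scene.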

Efficient View Synthesis and 3D-Based Multi-Frame Denoising With Multiplane Feature Representations
Tanay, Thomas and Leonardis, Aleš



Research question: How to perform multi-frame denoising with a volumetric 3D scene representation instead of the 2D alignment used by current multi-frame restoration methods.
Motivation: Advances in novel view synthesis are paving the way for a paradigm that combines information from multiple input images via volumetric scene representations rather than 2D alignment techniques.
Method: Extends the multiplane image (MPI) framework with a learnable encoder-renderer pair operating on multiplane representations in feature space: the encoder fuses information across views depth-wise, while the renderer fuses information across depths view-wise; trained end-to-end, the pair learns to separate depths in an unsupervised way, giving rise to Multiplane Feature (MPF) representations.
Results: The resulting first 3D-based multi-frame denoising method significantly outperforms its 2D-based counterparts at lower computational cost; experiments on the Spaces and Real Forward-Facing datasets and on raw burst data validate the approach for view synthesis, multi-frame denoising, and view synthesis under noisy conditions.

While current multi-frame restoration methods combine information from multiple input images using 2D alignment techniques, recent advances in novel view synthesis are paving the way for a new paradigm relying on volumetric scene representations. In this work, we introduce the first 3D-based multi-frame denoising method that significantly outperforms its 2D-based counterparts with lower computational requirements. Our method extends the multiplane image (MPI) framework for novel view synthesis by introducing a learnable encoder-renderer pair manipulating multiplane representations in feature space. The encoder fuses information across views and operates in a depth-wise manner while the renderer fuses information across depths and operates in a view-wise manner. The two modules are trained end-to-end and learn to separate depths in an unsupervised way, giving rise to Multiplane Feature (MPF) representations. Experiments on the Spaces and Real Forward-Facing datasets as well as on raw burst data validate our approach for view synthesis, multi-frame denoising, and view synthesis under noisy conditions.
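The MPI framework the method extends renders a view by alpha-compositing fronto-parallel planes back to front. A minimal numpy sketch of the standard "over" compositing step (illustrative; the paper operates on learned features rather than raw RGBA):

```python
import numpy as np

def composite_mpi(colors, alphas):
    """Front-to-back "over" compositing of multiplane image layers.

    colors: (D, H, W, 3) per-plane RGB, ordered front to back
    alphas: (D, H, W)    per-plane alpha
    Each plane's contribution is attenuated by the transmittance of
    all planes in front of it.
    """
    out = np.zeros(colors.shape[1:])
    transmittance = np.ones(alphas.shape[1:])
    for c, a in zip(colors, alphas):          # front to back
        out += transmittance[..., None] * a[..., None] * c
        transmittance *= (1.0 - a)
    return out

# A fully opaque front plane hides the plane behind it.
colors = np.stack([np.ones((1, 1, 3)) * 0.2, np.ones((1, 1, 3)) * 0.9])
alphas = np.stack([np.ones((1, 1)), np.ones((1, 1))])
print(composite_mpi(colors, alphas)[0, 0])  # -> [0.2 0.2 0.2]
```

The MPF variant replaces the fixed RGBA planes with feature planes that the learned renderer fuses across depths.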

HairStep: Transfer Synthetic to Real Using Strand and Depth Maps for Single-View 3D Hair Modeling
Zheng, Yujian and Jin, Zirong and Li, Moran and Huang, Haibin and Ma, Chongyang and Cui, Shuguang and Han, Xiaoguang



Research question: Tackling the challenging problem of learning-based single-view 3D hair modeling.
Motivation: Because paired real images and 3D hair data are difficult to collect, using synthetic data to provide priors for the real domain is the leading solution, but this introduces a domain gap.
Method: Proposes a novel intermediate representation, HairStep, consisting of a strand map and a depth map, and designs a learning framework that transfers real images to strand and depth maps.
Results: Experiments show HairStep narrows the synthetic-to-real domain gap and achieves state-of-the-art performance on single-view 3D hair reconstruction.

In this work, we tackle the challenging problem of learning-based single-view 3D hair modeling. Due to the great difficulty of collecting paired real image and 3D hair data, using synthetic data to provide prior knowledge for real domain becomes a leading solution. This unfortunately introduces the challenge of domain gap. Due to the inherent difficulty of realistic hair rendering, existing methods typically use orientation maps instead of hair images as input to bridge the gap. We firmly think an intermediate representation is essential, but we argue that orientation map using the dominant filtering-based methods is sensitive to uncertain noise and far from a competent representation. Thus, we first raise this issue up and propose a novel intermediate representation, termed as HairStep, which consists of a strand map and a depth map. It is found that HairStep not only provides sufficient information for accurate 3D hair modeling, but also is feasible to be inferred from real images. Specifically, we collect a dataset of 1,250 portrait images with two types of annotations. A learning framework is further designed to transfer real images to the strand map and depth map. It is noted that, an extra bonus of our new dataset is the first quantitative metric for 3D hair modeling. Our experiments show that HairStep narrows the domain gap between synthetic and real and achieves state-of-the-art performance on single-view 3D hair reconstruction.

Complete 3D Human Reconstruction From a Single Incomplete Image
Wang, Junying and Yoon, Jae Shin and Wang, Tuanfeng Y. and Singh, Krishna Kumar and Neumann, Ulrich



Research question: How to reconstruct complete human geometry and texture from an image in which only part of the body (e.g., a torso) is visible.
Motivation: Due to occlusion, many existing single-view human reconstruction methods cannot handle the invisible parts, leading to missing data in 3D.
Method: Proposes a novel coarse-to-fine human reconstruction framework. For coarse reconstruction, volumetric features are learned to generate a complete human geometry with a 3D convolutional neural network conditioned on a 3D body model and the style features of the visible parts. An implicit network then combines the learned 3D features with high-quality surface normals enhanced from multiview to produce fine local details such as high-frequency wrinkles. Finally, progressive texture inpainting reconstructs the person's complete appearance in a view-consistent way, which would be impossible without a complete geometry.
Results: Experiments show the method reconstructs high-quality 3D humans and is robust to occlusion.

This paper presents a method to reconstruct a complete human geometry and texture from an image of a person with only partial body observed, e.g., a torso. The core challenge arises from the occlusion: there exists no pixel to reconstruct where many existing single-view human reconstruction methods are not designed to handle such invisible parts, leading to missing data in 3D. To address this challenge, we introduce a novel coarse-to-fine human reconstruction framework. For coarse reconstruction, explicit volumetric features are learned to generate a complete human geometry with 3D convolutional neural networks conditioned by a 3D body model and the style features from visible parts. An implicit network combines the learned 3D features with the high-quality surface normals enhanced from multiview to produce fine local details, e.g., high-frequency wrinkles. Finally, we perform progressive texture inpainting to reconstruct a complete appearance of the person in a view-consistent way, which is not possible without the reconstruction of a complete geometry. In experiments, we demonstrate that our method can reconstruct high-quality 3D humans, which is robust to occlusion.

Reconstructing Animatable Categories From Videos
Yang, Gengshan and Wang, Chaoyang and Reddy, N. Dinesh and Ramanan, Deva



Research question: How to build category-level 3D models from monocular videos, disentangling variation across instances from motion over time.
Motivation: Existing approaches require 3D scans, laborious registration, and manual rigging, while differentiable-rendering-based methods are limited to rigid categories or single instances.
Method: Proposes RAC, which solves the problem with three key ideas: specializing a category-level skeleton to each instance; a latent-space regularization that encourages shared structure across the category while preserving instance details; and a 3D background model that disentangles objects from the background.
Results: Successfully builds 3D models of humans, cats, and dogs from monocular videos.

Building animatable 3D models is challenging due to the need for 3D scans, laborious registration, and manual rigging. Recently, differentiable rendering provides a pathway to obtain high-quality 3D models from monocular videos, but these are limited to rigid categories or single instances. We present RAC, a method to build category-level 3D models from monocular videos, disentangling variations over instances and motion over time. Three key ideas are introduced to solve this problem: (1) specializing a category-level skeleton to instances, (2) a method for latent space regularization that encourages shared structure across a category while maintaining instance details, and (3) using 3D background models to disentangle objects from the background. We build 3D models for humans, cats, and dogs given monocular videos. Project page: gengshan-y.github.io/rac-www/

High-Fidelity 3D Human Digitization From Single 2K Resolution Images
Han, Sang-Hun and Park, Min-Gyu and Yoon, Ju Hong and Kang, Ju-Mi and Park, Young-Jae and Jeon, Hae-Gon



Research question: How to effectively exploit high-resolution input images for high-quality 3D human reconstruction.
Motivation: Existing 3D human reconstruction methods require large-scale training data and an appropriate network design to fully exploit high-resolution input images.
Method: Proposes a simple yet effective 3D human digitization method, 2K2K, which constructs a large-scale 2K human dataset and infers 3D human models from 2K-resolution images. The method recovers the global shape of the body and its details separately: a low-resolution depth network predicts the global structure from a low-resolution image; a part-wise image-to-normal network predicts the detailed 3D body structure; a high-resolution depth network merges the global shape and the details to infer high-resolution front and back depth maps; finally, an off-the-shelf mesh generator reconstructs the full 3D human model.
Results: Experiments demonstrate competitive performance against recent works on various datasets, and 2,050 3D human models with texture maps, 3D joints, and SMPL parameters are released for research purposes.

High-quality 3D human body reconstruction requires high-fidelity and large-scale training data and appropriate network design that effectively exploits the high-resolution input images. To tackle these problems, we propose a simple yet effective 3D human digitization method called 2K2K, which constructs a large-scale 2K human dataset and infers 3D human models from 2K resolution images. The proposed method separately recovers the global shape of a human and its details. The low-resolution depth network predicts the global structure from a low-resolution image, and the part-wise image-to-normal network predicts the details of the 3D human body structure. The high-resolution depth network merges the global 3D shape and the detailed structures to infer the high-resolution front and back side depth maps. Finally, an off-the-shelf mesh generator reconstructs the full 3D human model, which is available at https://github.com/SangHunHan92/2K2K. In addition, we also provide 2,050 3D human models, including texture maps, 3D joints, and SMPL parameters for research purposes. In experiments, we demonstrate competitive performance over the recent works on various datasets.

NeFII: Inverse Rendering for Reflectance Decomposition With Near-Field Indirect Illumination
Wu, Haoqian and Hu, Zhipeng and Li, Lincheng and Zhang, Yongqiang and Fan, Changjie and Yu, Xin



Research question: Estimating geometry, materials, and illumination from multi-view RGB images.
Motivation: Existing inverse rendering methods model indirect illumination with Spherical Gaussians (SG), which tends to blur high-frequency reflection details.
Method: Proposes an end-to-end inverse rendering pipeline that decomposes materials and illumination from multi-view images while accounting for near-field indirect illumination. Monte Carlo path tracing is introduced and indirect illumination is cached as neural radiance, yielding a physics-faithful and easy-to-optimize method; for efficiency and practicality, smooth environment illumination is represented with SGs and importance sampling is applied. To supervise indirect illumination from unobserved directions, a novel radiance consistency constraint is developed and jointly optimized with materials and illumination, significantly improving decomposition performance.
Results: Extensive experiments show the method outperforms the state of the art on multiple synthetic and real datasets, especially for inter-reflection decomposition.

Inverse rendering methods aim to estimate geometry, materials and illumination from multi-view RGB images. In order to achieve better decomposition, recent approaches attempt to model indirect illuminations reflected from different materials via Spherical Gaussians (SG), which, however, tends to blur the high-frequency reflection details. In this paper, we propose an end-to-end inverse rendering pipeline that decomposes materials and illumination from multi-view images, while considering near-field indirect illumination. In a nutshell, we introduce the Monte Carlo sampling based path tracing and cache the indirect illumination as neural radiance, enabling a physics-faithful and easy-to-optimize inverse rendering method. To enhance efficiency and practicality, we leverage SG to represent the smooth environment illuminations and apply importance sampling techniques. To supervise indirect illuminations from unobserved directions, we develop a novel radiance consistency constraint between implicit neural radiance and path tracing results of unobserved rays along with the joint optimization of materials and illuminations, thus significantly improving the decomposition performance. Extensive experiments demonstrate that our method outperforms the state-of-the-art on multiple synthetic and real datasets, especially in terms of inter-reflection decomposition.
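A spherical Gaussian lobe, as used here to represent smooth environment illumination, is a simple closed-form function of direction; a minimal sketch of the standard SG definition (not code from the paper):

```python
import numpy as np

def spherical_gaussian(v, axis, sharpness, amplitude):
    """Evaluate a spherical Gaussian lobe G(v) = mu * exp(lam * (v.axis - 1)).

    v, axis: unit 3-vectors; sharpness (lam) controls the lobe width and
    amplitude (mu) its peak value. Mixtures of such lobes give a smooth,
    differentiable model of environment illumination.
    """
    return amplitude * np.exp(sharpness * (np.dot(v, axis) - 1.0))

axis = np.array([0.0, 0.0, 1.0])
side = np.array([1.0, 0.0, 0.0])
# Peak value on the lobe axis, smooth decay away from it.
assert np.isclose(spherical_gaussian(axis, axis, 10.0, 2.0), 2.0)
assert spherical_gaussian(side, axis, 10.0, 2.0) < 2.0
```

The smoothness that makes SGs easy to optimize is exactly why they blur sharp inter-reflections, motivating the path-traced near-field term above.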

Fully Self-Supervised Depth Estimation From Defocus Clue
Si, Haozhe and Zhao, Bin and Wang, Dong and Gao, Yunpeng and Chen, Mulin and Wang, Zhigang and Li, Xuelong



Research question: How to estimate depth purely from a sparse focal stack, overcoming the unavailability of depth and all-in-focus (AIF) ground truth in real-world scenarios.
Motivation: Existing defocus-based depth estimation methods depend on depth or AIF ground truth, which cannot be captured in practice.
Method: Proposes a fully self-supervised framework in which a neural model predicts the depth and AIF image, and an optical model validates and refines the predictions.
Results: Verified on three benchmark datasets with rendered and real focal stacks; the results show the method provides a strong baseline for self-supervised defocus-based depth estimation.

Depth-from-defocus (DFD), modeling the relationship between depth and defocus pattern in images, has demonstrated promising performance in depth estimation. Recently, several self-supervised works try to overcome the difficulties in acquiring accurate depth ground-truth. However, they depend on the all-in-focus (AIF) images, which cannot be captured in real-world scenarios. Such limitation discourages the applications of DFD methods. To tackle this issue, we propose a completely self-supervised framework that estimates depth purely from a sparse focal stack. We show that our framework circumvents the needs for the depth and AIF image ground-truth, and receives superior predictions, thus closing the gap between the theoretical success of DFD works and their applications in the real world. In particular, we propose (i) a more realistic setting for DFD tasks, where no depth or AIF image ground-truth is available; (ii) a novel self-supervision framework that provides reliable predictions of depth and AIF image under the challenging setting. The proposed framework uses a neural model to predict the depth and AIF image, and utilizes an optical model to validate and refine the prediction. We verify our framework on three benchmark datasets with rendered focal stacks and real focal stacks. Qualitative and quantitative evaluations show that our method provides a strong baseline for self-supervised DFD tasks. The source code is publicly available at https://github.com/Ehzoahis/DEReD.
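The optical model underlying depth-from-defocus can be illustrated with the standard thin-lens circle-of-confusion formula (a generic sketch; the paper's exact optical model may differ):

```python
import numpy as np

def circle_of_confusion(depth, focus_dist, focal_len, aperture):
    """Thin-lens circle-of-confusion diameter (same length units throughout).

    The defocus blur at each pixel depends on scene depth, so a focal
    stack captured at several focus distances constrains depth -- the
    relationship a DFD framework can use to validate its predictions.
    """
    return aperture * focal_len * np.abs(depth - focus_dist) / (
        depth * (focus_dist - focal_len))

# A point at the focus distance is perfectly sharp; blur grows with
# the distance from the focal plane.
assert np.isclose(circle_of_confusion(2.0, 2.0, 0.05, 0.025), 0.0)
assert circle_of_confusion(4.0, 2.0, 0.05, 0.025) > 0.0
```

Rendering a predicted AIF image through such a model and comparing against the captured focal stack is what makes the fully self-supervised loop possible.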

Deep Stereo Video Inpainting
Wu, Zhiliang and Sun, Changchang and Xuan, Hanyu and Yan, Yan



Research question: How to simultaneously fill the missing regions of both the left and right views of a stereo video with plausible content.
Motivation: While single-video inpainting has achieved remarkable results, stereo video inpainting remains underexplored; its core challenge is maintaining stereo consistency between the left and right views to alleviate viewers' 3D fatigue.
Method: Proposes SVINet, a novel deep stereo video inpainting network and the first attempt to tackle the task with deep convolutional neural networks. A self-supervised flow-guided deformable temporal alignment module first aligns the features of the left- and right-view branches; the aligned features are then fed into a shared adaptive feature aggregation module to generate each branch's missing content; finally, a parallax attention module (PAM) uses cross-view information to model the significant stereo correlation and fuse the completed features of the two views. A stereo consistency loss further regularizes training so the model yields high-quality results with better stereo consistency.
Results: Experiments show SVINet outperforms state-of-the-art single-video inpainting models.

Stereo video inpainting aims to fill the missing regions on the left and right views of the stereo video with plausible content simultaneously. Compared with the single video inpainting that has achieved promising results using deep convolutional neural networks, inpainting the missing regions of stereo video has not been thoroughly explored. In essence, apart from the spatial and temporal consistency that single video inpainting needs to achieve, another key challenge for stereo video inpainting is to maintain the stereo consistency between left and right views and hence alleviate the 3D fatigue for viewers. In this paper, we propose a novel deep stereo video inpainting network named SVINet, which is the first attempt for stereo video inpainting task utilizing deep convolutional neural networks. SVINet first utilizes a self-supervised flow-guided deformable temporal alignment module to align the features on the left and right view branches, respectively. Then, the aligned features are fed into a shared adaptive feature aggregation module to generate missing contents of their respective branches. Finally, the parallax attention module (PAM) that uses the cross-view information to consider the significant stereo correlation is introduced to fuse the completed features of left and right views. Furthermore, we develop a stereo consistency loss to regularize the trained parameters, so that our model is able to yield high-quality stereo video inpainting results with better stereo consistency. Experimental results demonstrate that our SVINet outperforms state-of-the-art single video inpainting models.

CLOTH4D: A Dataset for Clothed Human Reconstruction
Zou, Xingxing and Han, Xintong and Wong, Waikeung



Research question: This paper addresses clothed human reconstruction for the virtual world, aiming to raise the quality of recovered avatars.
Motivation: High-quality clothed human reconstruction is the cornerstone of creating virtual worlds, and its quality largely decides whether the Metaverse is a passing fad.
Method: Introduces CLOTH4D, a clothed human dataset containing 1,000 subjects, 1,000 3D outfits, and over 100,000 clothed meshes paired with unclothed humans, filling the gap in large-scale, high-quality 4D clothing data.
Results: Upon CLOTH4D, a series of temporally-aware metrics is newly designed to evaluate the temporal stability of generated 3D human meshes, which had previously been overlooked. By assessing and retraining current state-of-the-art clothed human reconstruction methods, the authors reveal insights, present improved performance, and propose future research directions, confirming the dataset's advancement. The dataset is available at www.github.com/AemikaChow/AiDLab-fAshIon-Data.

Clothed human reconstruction is the cornerstone for creating the virtual world. To a great extent, the quality of recovered avatars decides whether the Metaverse is a passing fad. In this work, we introduce CLOTH4D, a clothed human dataset containing 1,000 subjects with varied appearances, 1,000 3D outfits, and over 100,000 clothed meshes with paired unclothed humans, to fill the gap in large-scale and high-quality 4D clothing data. It enjoys appealing characteristics: 1) Accurate and detailed clothing textured meshes---all clothing items are manually created and then simulated in professional software, strictly following the general standard in fashion design. 2) Separated textured clothing and under-clothing body meshes, closer to the physical world than single-layer raw scans. 3) Clothed human motion sequences simulated given a set of 289 actions, covering fundamental and complicated dynamics. Upon CLOTH4D, we novelly designed a series of temporally-aware metrics to evaluate the temporal stability of the generated 3D human meshes, which has been overlooked previously. Moreover, by assessing and retraining current state-of-the-art clothed human reconstruction methods, we reveal insights, present improved performance, and propose potential future research directions, confirming our dataset's advancement. The dataset is available at www.github.com/AemikaChow/AiDLab-fAshIon-Data
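A temporal-stability measure for mesh sequences can be as simple as mean per-frame vertex displacement; the sketch below is purely illustrative and is not one of the metrics defined by CLOTH4D:

```python
import numpy as np

def temporal_jitter(vertex_seq):
    """Mean per-frame vertex displacement of a mesh sequence.

    vertex_seq: (T, V, 3) vertex positions over T frames.
    A simple temporal-stability proxy: lower means smoother motion.
    (Illustrative only -- not a metric from the CLOTH4D paper.)
    """
    diffs = np.diff(vertex_seq, axis=0)            # (T-1, V, 3)
    return np.linalg.norm(diffs, axis=-1).mean()

still = np.zeros((3, 4, 3))
jittery = still + np.random.default_rng(0).normal(size=still.shape)
assert temporal_jitter(still) == 0.0
assert temporal_jitter(jittery) > 0.0
```

Metrics of this family penalize frame-to-frame flicker that per-frame reconstruction scores cannot see, which is the gap the paper's temporally-aware evaluation targets.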

Learning To Render Novel Views From Wide-Baseline Stereo Pairs
Du, Yilun and Smith, Cameron and Tewari, Ayush and Sitzmann, Vincent



Research question: Novel view synthesis from only a single wide-baseline stereo image pair.
Motivation: Existing sparse-observation view synthesis methods fail because they recover incorrect 3D geometry, and the high cost of differentiable rendering precludes scaling to large-scale training.
Method: Proposes a multi-view transformer encoder, an efficient image-space epipolar line sampling scheme to assemble image features for a target ray, and a lightweight cross-attention-based renderer.
Results: Extensive comparisons on two real-world datasets show the method significantly outperforms prior sparse-observation view synthesis work and achieves multi-view-consistent novel view synthesis.

We introduce a method for novel view synthesis given only a single wide-baseline stereo image pair. In this challenging regime, 3D scene points are regularly observed only once, requiring prior-based reconstruction of scene geometry and appearance. We find that existing approaches to novel view synthesis from sparse observations fail due to recovering incorrect 3D geometry and the high cost of differentiable rendering that precludes their scaling to large-scale training. We take a step towards resolving these shortcomings by formulating a multi-view transformer encoder, proposing an efficient, image-space epipolar line sampling scheme to assemble image features for a target ray, and a lightweight cross-attention-based renderer. Our contributions enable training of our method on a large-scale real-world dataset of indoor and outdoor scenes. In several ablation studies, we demonstrate that our contributions enable learning of powerful multi-view geometry priors while reducing both rendering time and memory footprint. We conduct extensive comparisons on held-out test scenes across two real-world datasets, significantly outperforming prior work on novel view synthesis from sparse image observations and achieving multi-view-consistent novel view synthesis.
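Image-space epipolar sampling rests on the fact that points along a target ray project onto a line in the source image; a minimal pinhole-camera sketch of that projection (illustrative geometry only, not the paper's sampler):

```python
import numpy as np

def epipolar_samples(K, R, t, ray_o, ray_d, depths):
    """Project points along a target ray into a source view.

    The pixel locations of ray_o + d * ray_d for candidate depths d all
    lie on the target ray's epipolar line in the source image -- the
    line along which features can be gathered for that ray.
    K: (3,3) source intrinsics; R, t: world-to-source rotation/translation.
    """
    pts = ray_o[None] + depths[:, None] * ray_d[None]      # (D, 3) world points
    cam = pts @ R.T + t                                     # source camera frame
    pix = cam @ K.T
    return pix[:, :2] / pix[:, 2:3]                         # (D, 2) pixel coords

K = np.array([[100.0, 0, 50], [0, 100.0, 50], [0, 0, 1]])
R, t = np.eye(3), np.array([0.0, 0.0, 0.0])
ray_o, ray_d = np.zeros(3), np.array([0.0, 0.0, 1.0])
# Points on the optical axis all project to the principal point.
pix = epipolar_samples(K, R, t, ray_o, ray_d, np.array([1.0, 2.0, 4.0]))
assert np.allclose(pix, [[50.0, 50.0]] * 3)
```

Sampling features only along this line, rather than over the whole image, is what makes the scheme cheap enough for large-scale training.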

Multi-Object Manipulation via Object-Centric Neural Scattering Functions
Tian, Stephen and Cai, Yancheng and Yu, Hong-Xing and Zakharov, Sergey and Liu, Katherine and Gaidon, Adrien and Li, Yunzhu and Wu, Jiajun



Research question: How best to represent scenes involving multi-object interactions.
Motivation: Current methods decompose a scene into discrete objects, but they struggle with precise modeling and manipulation under challenging lighting conditions because they only encode appearance tied to specific illuminations.
Method: Proposes object-centric neural scattering functions (OSFs) as object representations in a model-predictive control framework. OSFs model per-object light transport, enabling compositional scene re-rendering under object rearrangement and varying lighting; the approach is combined with inverse parameter estimation and graph-based neural dynamics models.
Results: Demonstrates improved model-predictive control performance and generalization in compositional multi-object environments, even in previously unseen scenarios and under harsh lighting conditions.

Learned visual dynamics models have proven effective for robotic manipulation tasks. Yet, it remains unclear how best to represent scenes involving multi-object interactions. Current methods decompose a scene into discrete objects, yet they struggle with precise modeling and manipulation amid challenging lighting conditions since they only encode appearance tied with specific illuminations. In this work, we propose using object-centric neural scattering functions (OSFs) as object representations in a model-predictive control framework. OSFs model per-object light transport, enabling compositional scene re-rendering under object rearrangement and varying lighting conditions. By combining this approach with inverse parameter estimation and graph-based neural dynamics models, we demonstrate improved model-predictive control performance and generalization in compositional multi-object environments, even in previously unseen scenarios and harsh lighting conditions.

Invertible Neural Skinning
Kant, Yash and Siarohin, Aliaksandr and Guler, Riza Alp and Chai, Menglei and Ren, Jian and Tulyakov, Sergey and Gilitschenski, Igor



Research question: How to build animatable and editable models of clothed humans from raw 3D scans and poses.
Motivation: Existing reposing methods are limited by the expressiveness of Linear Blend Skinning (LBS), require costly mesh extraction to generate each new pose, and typically do not preserve surface correspondences across poses.
Method: Introduces Invertible Neural Skinning (INS), which maintains correspondences by learning additional pose-varying deformations with a Pose-conditioned Invertible Network (PIN); combined with a differentiable LBS module, this forms an expressive, end-to-end INS pipeline.
Results: Outperforms state-of-the-art reposing techniques on clothed humans while preserving surface correspondences and running an order of magnitude faster; qualitative results show INS rectifies artefacts introduced by LBS.

Building animatable and editable models of clothed humans from raw 3D scans and poses is a challenging problem. Existing reposing methods suffer from the limited expressiveness of Linear Blend Skinning (LBS), require costly mesh extraction to generate each new pose, and typically do not preserve surface correspondences across different poses. In this work, we introduce Invertible Neural Skinning (INS) to address these shortcomings. To maintain correspondences, we propose a Pose-conditioned Invertible Network (PIN) architecture, which extends the LBS process by learning additional pose-varying deformations. Next, we combine PIN with a differentiable LBS module to build an expressive and end-to-end Invertible Neural Skinning (INS) pipeline. We demonstrate the strong performance of our method by outperforming the state-of-the-art reposing techniques on clothed humans and preserving surface correspondences, while being an order of magnitude faster. We also perform an ablation study, which shows the usefulness of our pose-conditioning formulation, and our qualitative results display that INS can rectify artefacts introduced by LBS well.
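Linear Blend Skinning, whose limited expressiveness motivates INS, is the weighted blend of per-bone rigid transforms; a minimal sketch of the classic formulation:

```python
import numpy as np

def linear_blend_skinning(verts, weights, rotations, translations):
    """Classic LBS: v' = sum_k w_k * (R_k @ v + t_k).

    verts: (V, 3), weights: (V, K), rotations: (K, 3, 3),
    translations: (K, 3). INS augments this fixed linear map with
    learned pose-varying deformations to recover what LBS cannot
    (e.g., cloth bulging and candy-wrapper artefacts).
    """
    # (V, K, 3): each vertex transformed by every bone
    per_bone = np.einsum('kij,vj->vki', rotations, verts) + translations[None]
    return np.einsum('vk,vki->vi', weights, per_bone)

# One identity bone and one bone translating by +1 in x, blended 50/50:
verts = np.array([[1.0, 0.0, 0.0]])
weights = np.array([[0.5, 0.5]])
R = np.stack([np.eye(3), np.eye(3)])
t = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
assert np.allclose(linear_blend_skinning(verts, weights, R, t), [[1.5, 0.0, 0.0]])
```

Because the blend of rigid transforms is itself not rigid, LBS alone collapses volume at joints; PIN's learned invertible deformations correct this while keeping the map bijective, which preserves correspondences.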

Seeing a Rose in Five Thousand Ways
Zhang, Yunzhi and Wu, Shangzhe and Snavely, Noah and Wu, Jiajun



Research question: How to learn and capture an object's intrinsics (the distributions of geometry, texture, and material) from a single image.
Motivation: Existing models handling multi-instance objects tend to ignore object intrinsics, so the generated images lack realism.
Method: Proposes a generative model that learns object intrinsics from the many instances in a single image, enabling rendering of the object at different sizes and shapes, in different poses, and under different lighting.
Results: Experiments show the method successfully captures the intrinsics of a wide range of objects and achieves superior results on downstream tasks including intrinsic image decomposition, shape and image generation, view synthesis, and relighting.

What is a rose, visually? A rose comprises its intrinsics, including the distribution of geometry, texture, and material specific to its object category. With knowledge of these intrinsic properties, we may render roses of different sizes and shapes, in different poses, and under different lighting conditions. In this work, we build a generative model that learns to capture such object intrinsics from a single image, such as a photo of a bouquet. Such an image includes multiple instances of an object type. These instances all share the same intrinsics, but appear different due to a combination of variance within these intrinsics and differences in extrinsic factors, such as pose and illumination. Experiments show that our model successfully learns object intrinsics (distribution of geometry, texture, and material) for a wide range of objects, each from a single Internet image. Our method achieves superior results on multiple downstream tasks, including intrinsic image decomposition, shape and image generation, view synthesis, and relighting.

Neural Residual Radiance Fields for Streamably Free-Viewpoint Videos
Wang, Liao and Hu, Qiang and He, Qihan and Wang, Ziyu and Yu, Jingyi and Tuytelaars, Tinne and Xu, Lan and Wu, Minye



Research question: How to achieve real-time free-viewpoint video (FVV) rendering of dynamic scenes with neural rendering.
Motivation: Current neural rendering techniques for FVV are restricted to offline rendering or can only process brief sequences with minimal motion.
Method: Proposes the Residual Radiance Field (ReRF), a highly compact neural representation for real-time FVV rendering of long-duration dynamic scenes. ReRF explicitly models the residual information between adjacent timestamps in the spatial-temporal feature space, with a global coordinate-based tiny MLP as the feature decoder.
Results: Experiments show this strategy handles large motions without sacrificing quality. Based on ReRF, a dedicated FVV codec achieves a three-orders-of-magnitude compression rate, and a companion ReRF player supports online streaming of long-duration dynamic-scene FVVs.

The success of the Neural Radiance Fields (NeRFs) for modeling and free-view rendering static objects has inspired numerous attempts on dynamic scenes. Current techniques that utilize neural rendering for facilitating free-view videos (FVVs) are restricted to either offline rendering or are capable of processing only brief sequences with minimal motion. In this paper, we present a novel technique, Residual Radiance Field or ReRF, as a highly compact neural representation to achieve real-time FVV rendering on long-duration dynamic scenes. ReRF explicitly models the residual information between adjacent timestamps in the spatial-temporal feature space, with a global coordinate-based tiny MLP as the feature decoder. Specifically, ReRF employs a compact motion grid along with a residual feature grid to exploit inter-frame feature similarities. We show such a strategy can handle large motions without sacrificing quality. We further present a sequential training scheme to maintain the smoothness and the sparsity of the motion/residual grids. Based on ReRF, we design a special FVV codec that achieves three orders of magnitudes compression rate and provides a companion ReRF player to support online streaming of long-duration FVVs of dynamic scenes. Extensive experiments demonstrate the effectiveness of ReRF for compactly representing dynamic radiance fields, enabling an unprecedented free-viewpoint viewing experience in speed and quality.
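The inter-frame residual idea can be sketched in a few lines: each frame's feature grid is a motion-compensated copy of the previous grid plus a compact residual (an illustrative toy with integer motion vectors, not the paper's implementation):

```python
import numpy as np

def rerf_frame(prev_feat, motion, residual):
    """Reconstruct the feature grid of frame t from frame t-1.

    prev_feat: (H, W, C) feature grid of the previous frame
    motion:    (H, W, 2) integer offsets into the previous grid
    residual:  (H, W, C) compact per-frame correction
    Each cell fetches the previous-frame feature its motion vector
    points at, then adds the residual -- exploiting inter-frame
    similarity so only motion + residual need to be stored.
    """
    H, W, _ = prev_feat.shape
    ys = np.clip(np.arange(H)[:, None] + motion[..., 0], 0, H - 1)
    xs = np.clip(np.arange(W)[None, :] + motion[..., 1], 0, W - 1)
    return prev_feat[ys, xs] + residual

prev = np.arange(8.0).reshape(2, 2, 2)
motion = np.zeros((2, 2, 2), dtype=int)     # static scene: pure copy
residual = np.full((2, 2, 2), 0.5)
assert np.allclose(rerf_frame(prev, motion, residual), prev + 0.5)
```

Because the motion grid is coarse and the residual grid sparse, the per-frame payload stays small, which is what enables the streaming codec built on top.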

NeRFVS: Neural Radiance Fields for Free View Synthesis via Geometry Scaffolds
Yang, Chen and Li, Peihao and Zhou, Zanwei and Yuan, Shanxin and Liu, Bingbing and Yang, Xiaokang and Qiu, Weichao and Shen, Wei



Research question: How to make NeRF render well even for novel views that differ significantly from the training views.
Motivation: Current NeRF methods excel at novel views similar to the training views but perform poorly on views that differ significantly from them.
Method: Proposes NeRFVS, which uses holistic priors from neural reconstruction, namely pseudo depth maps and view coverage information, to guide the learning of implicit neural representations of 3D indoor scenes. An off-the-shelf neural reconstruction method first generates a geometry scaffold; two loss functions based on the holistic priors then improve NeRF learning: (1) a robust depth loss that tolerates errors in the pseudo depth map while guiding NeRF's geometry learning; (2) a variance loss that regularizes the variance of the implicit neural representation to reduce geometry and color ambiguity. Both losses are modulated during NeRF optimization according to the view coverage information, reducing the negative influence of view coverage imbalance.
Results: Extensive experiments show NeRFVS outperforms state-of-the-art view synthesis methods quantitatively and qualitatively on indoor scenes, achieving high-fidelity free-navigation results.

We present NeRFVS, a novel neural radiance fields (NeRF) based method to enable free navigation in a room. NeRF achieves impressive performance in rendering images for novel views similar to the input views while suffering for novel views that are significantly different from the training views. To address this issue, we utilize the holistic priors, including pseudo depth maps and view coverage information, from neural reconstruction to guide the learning of implicit neural representations of 3D indoor scenes. Concretely, an off-the-shelf neural reconstruction method is leveraged to generate a geometry scaffold. Then, two loss functions based on the holistic priors are proposed to improve the learning of NeRF: 1) A robust depth loss that can tolerate the error of the pseudo depth map to guide the geometry learning of NeRF; 2) A variance loss to regularize the variance of implicit neural representations to reduce the geometry and color ambiguity in the learning procedure. These two loss functions are modulated during NeRF optimization according to the view coverage information to reduce the negative influence brought by the view coverage imbalance. Extensive results demonstrate that our NeRFVS outperforms state-of-the-art view synthesis methods quantitatively and qualitatively on indoor scenes, achieving high-fidelity free navigation results.
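The abstract does not give the robust depth loss in closed form, but losses that tolerate pseudo-depth errors are typically Huber-like: quadratic for small residuals, linear for large ones, so outliers in the pseudo depth map do not dominate the gradient. A hypothetical sketch (generic robust loss, not the paper's exact formulation):

```python
import numpy as np

def robust_depth_loss(pred, pseudo, delta=0.2):
    """Huber-style depth loss: quadratic for small residuals, linear for
    large ones, so occasional large errors in the pseudo depth map do
    not dominate the gradient.
    """
    r = np.abs(pred - pseudo)
    quad = 0.5 * r ** 2
    lin = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quad, lin).mean()

pred = np.array([1.0, 2.0, 3.0])
pseudo = np.array([1.1, 2.0, 9.0])   # last value: a bad pseudo-depth
# The outlier contributes linearly rather than quadratically.
print(robust_depth_loss(pred, pseudo))
```

Weighting such a loss by per-view coverage, as the paper describes, would further down-weight poorly observed regions.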

A Unified Spatial-Angular Structured Light for Single-View Acquisition of Shape and Reflectance
Xu, Xianmin and Lin, Yuxin and Zhou, Haoyang and Zeng, Chong and Yu, Yaxin and Zhou, Kun and Wu, Hongzhi



Research question: Design a unified structured light, consisting of an LED array and an LCD mask, to acquire high-quality shape and reflectance from a single view.
Motivation: To overcome the shortcomings of traditional methods in acquiring object shape and reflectance, a method combining structured light and deep learning is proposed.
Method: One LED projects a set of learned mask patterns to accurately encode spatial information, and the decoded results from multiple LEDs are aggregated into a final depth map. For appearance, learned light patterns are cast through the transparent mask to efficiently probe angularly varying reflectance. BRDF parameters are optimized and stored in texture maps as the final reflectance. A differentiable pipeline for the joint capture automatically optimizes both the mask and light patterns to improve acquisition quality.
Results: The effectiveness of the method is demonstrated on a wide variety of physical objects, and the results compare favorably with state-of-the-art techniques.

We propose a unified structured light, consisting of an LED array and an LCD mask, for high-quality acquisition of both shape and reflectance from a single view. For geometry, one LED projects a set of learned mask patterns to accurately encode spatial information; the decoded results from multiple LEDs are then aggregated to produce a final depth map. For appearance, learned light patterns are cast through a transparent mask to efficiently probe angularly-varying reflectance. Per-point BRDF parameters are differentiably optimized with respect to corresponding measurements, and stored in texture maps as the final reflectance. We establish a differentiable pipeline for the joint capture to automatically optimize both the mask and light patterns towards optimal acquisition quality. The effectiveness of our light is demonstrated with a wide variety of physical objects. Our results compare favorably with state-of-the-art techniques.

On the Importance of Accurate Geometry Data for Dense 3D Vision Tasks
Jung, HyunJun and Ruhkamp, Patrick and Zhai, Guangyao and Brasch, Nikolas and Li, Yitong and Verdie, Yannick and Song, Jifei and Zhou, Yiren and Armagan, Anil and Ilic, Slobodan and Leonardis, Aleš



Research question: Existing learning-based methods for dense 3D vision typically train on 3D sensor data, but the advantages and drawbacks of the underlying distance-measurement principles are neither compared nor discussed.
Motivation: Due to the lack of multi-modal datasets, texture-less regions are problematic for structure from motion and stereo, reflective materials interfere with active sensing, and distances to translucent objects are hard to measure accurately with existing hardware. Training on inaccurate or corrupt data induces model bias and hampers generalization.
Method: This paper investigates the effect of sensor errors on dense 3D vision tasks such as depth estimation and reconstruction, rigorously showing the significant impact of sensor characteristics on learned predictions and noting generalization issues arising from various technologies in everyday household environments.
Results: We introduce a carefully designed dataset comprising measurements from commodity sensors (D-ToF, I-ToF, passive/active stereo, and monocular RGB+P). The study quantifies the considerable impact of sensor noise and paves the way to improved dense vision estimates and targeted data fusion.

Learning-based methods to solve dense 3D vision problems typically train on 3D sensor data. The respectively used principle of measuring distances provides advantages and drawbacks. These are typically not compared nor discussed in the literature due to a lack of multi-modal datasets. Texture-less regions are problematic for structure from motion and stereo, reflective material poses issues for active sensing, and distances for translucent objects are intricate to measure with existing hardware. Training on inaccurate or corrupt data induces model bias and hampers generalisation capabilities. These effects remain unnoticed if the sensor measurement is considered as ground truth during the evaluation. This paper investigates the effect of sensor errors for the dense 3D vision tasks of depth estimation and reconstruction. We rigorously show the significant impact of sensor characteristics on the learned predictions and notice generalisation issues arising from various technologies in everyday household environments. For evaluation, we introduce a carefully designed dataset comprising measurements from commodity sensors, namely D-ToF, I-ToF, passive/active stereo, and monocular RGB+P. Our study quantifies the considerable sensor noise impact and paves the way to improved dense vision estimates and targeted data fusion.

K-Planes: Explicit Radiance Fields in Space, Time, and Appearance
Fridovich-Keil, Sara and Meanti, Giacomo and Warburg, Frederik Rahbæk



Research question: How to represent radiance fields in arbitrary dimensions effectively?
Motivation: Current models struggle with dynamic scenes; a method that transitions seamlessly between static and dynamic scenes is needed.
Method: The k-planes model uses d-choose-2 planes to represent a d-dimensional scene. The planar factorization makes it easy to add dimension-specific priors, such as temporal smoothness and multi-resolution spatial structure, and naturally decomposes the static and dynamic components of a scene.
Results: Across a range of synthetic and real scenes with static and dynamic content and fixed and varying appearance, k-planes yields competitive and often state-of-the-art reconstruction fidelity with low memory usage, achieving 1000x compression over a full 4D grid and fast optimization with a pure PyTorch implementation.

We introduce k-planes, a white-box model for radiance fields in arbitrary dimensions. Our model uses d-choose-2 planes to represent a d-dimensional scene, providing a seamless way to go from static (d=3) to dynamic (d=4) scenes. This planar factorization makes adding dimension-specific priors easy, e.g. temporal smoothness and multi-resolution spatial structure, and induces a natural decomposition of static and dynamic components of a scene. We use a linear feature decoder with a learned color basis that yields similar performance as a nonlinear black-box MLP decoder. Across a range of synthetic and real, static and dynamic, fixed and varying appearance scenes, k-planes yields competitive and often state-of-the-art reconstruction fidelity with low memory usage, achieving 1000x compression over a full 4D grid, and fast optimization with a pure PyTorch implementation. For video results and code, please see sarafridov.github.io/K-Planes.
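The d-choose-2 factorization can be illustrated with a toy lookup: a point is projected onto every axis-aligned coordinate pair, and the per-plane features are combined by elementwise product. This sketch uses nearest-neighbor lookup and a single resolution for brevity (the paper uses bilinear interpolation and multi-resolution grids); all names are hypothetical.

```python
import numpy as np
from itertools import combinations

def kplanes_feature(point, planes, resolution):
    """Toy k-planes lookup for a d-dim point in [0, 1)^d.
    planes: dict mapping an axis pair (i, j) -> (R, R, C) feature grid.
    Returns the elementwise product of the features from all d-choose-2 planes.
    """
    d = len(point)
    feat = None
    for i, j in combinations(range(d), 2):  # d-choose-2 axis-aligned planes
        u = int(point[i] * resolution)      # nearest-neighbor cell index
        v = int(point[j] * resolution)
        f = planes[(i, j)][u, v]
        feat = f if feat is None else feat * f
    return feat
```

For a dynamic scene (d=4: x, y, z, t) this uses 6 planes; for a static scene (d=3) it reduces to 3, which is why the same model covers both cases.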

Viewpoint Equivariance for Multi-View 3D Object Detection
Chen, Dian and Li, Jie and Guizilini, Vitor and Ambrus, Rares Andrei and Gaidon, Adrien



Research question: How to perform 3D object detection from visual sensors, a cornerstone capability of robotic systems.
Motivation: Multi-view consistency plays an integral role in 3D scene understanding and geometric learning; we aim to exploit it to improve object localization.
Method: We propose VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry to improve localization through viewpoint awareness and equivariance.
Results: We achieve state-of-the-art performance on the nuScenes benchmark; code and models are available at https://github.com/TRI-ML/VEDet.

3D object detection from visual sensors is a cornerstone capability of robotic systems. State-of-the-art methods focus on reasoning and decoding object bounding boxes from multi-view camera input. In this work we gain intuition from the integral role of multi-view consistency in 3D scene understanding and geometric learning. To this end, we introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry to improve localization through viewpoint awareness and equivariance. VEDet leverages a query-based transformer architecture and encodes the 3D scene by augmenting image features with positional encodings from their 3D perspective geometry. We design view-conditioned queries at the output level, which enables the generation of multiple virtual frames during training to learn viewpoint equivariance by enforcing multi-view consistency. The multi-view geometry injected at the input level as positional encodings and regularized at the loss level provides rich geometric cues for 3D object detection, leading to state-of-the-art performance on the nuScenes benchmark. The code and model are made available at https://github.com/TRI-ML/VEDet.

Putting People in Their Place: Affordance-Aware Human Insertion Into Scenes
Kulal, Sumith and Brooks, Tim and Aiken, Alex and Wu, Jiajun and Yang, Jimei and Lu, Jingwan and Efros, Alexei A. and Singh, Krishna Kumar



Research question: Inferring scene affordances by inserting people into scenes.
Motivation: Existing methods cannot realistically insert people into scenes while respecting the scene's affordances.
Method: We propose a method that infers a set of realistic poses given the scene context, re-poses a reference person, and harmonizes the composition.
Results: When prompted without conditioning, our model can also hallucinate realistic people and scenes, and it enables interactive editing. Experiments show that, compared to prior work, our method synthesizes more realistic human appearance and more natural human-scene interactions.

We study the problem of inferring scene affordances by presenting a method for realistically inserting people into scenes. Given a scene image with a marked region and an image of a person, we insert the person into the scene while respecting the scene affordances. Our model can infer the set of realistic poses given the scene context, re-pose the reference person, and harmonize the composition. We set up the task in a self-supervised fashion by learning to re-pose humans in video clips. We train a large-scale diffusion model on a dataset of 2.4M video clips that produces diverse plausible poses while respecting the scene context. Given the learned human-scene composition, our model can also hallucinate realistic people and scenes when prompted without conditioning and also enables interactive editing. We conduct quantitative evaluation and show that our method synthesizes more realistic human appearance and more natural human-scene interactions when compared to prior work.

3D Neural Field Generation Using Triplane Diffusion
Shue, J. Ryan and Chan, Eric Ryan and Po, Ryan and Ankner, Zachary and Wu, Jiajun and Wetzstein, Gordon



Research question: This paper aims to propose an effective diffusion-based method for 3D-aware generation of neural fields.
Motivation: Existing 3D-aware neural field generation methods are unsatisfactory; a new approach is needed.
Method: Training data such as ShapeNet meshes are converted into continuous occupancy fields and factored into a set of axis-aligned triplane feature representations, so every 3D training scene is represented by 2D feature planes; existing 2D diffusion models are then trained directly on these planes to generate high-quality, diverse 3D neural fields.
Results: Experiments show state-of-the-art results on 3D generation for several ShapeNet object classes, outperforming alternative 3D-aware generation approaches.

Diffusion models have emerged as the state-of-the-art for image generation, among other tasks. Here, we present an efficient diffusion-based model for 3D-aware generation of neural fields. Our approach pre-processes training data, such as ShapeNet meshes, by converting them to continuous occupancy fields and factoring them into a set of axis-aligned triplane feature representations. Thus, our 3D training scenes are all represented by 2D feature planes, and we can directly train existing 2D diffusion models on these representations to generate 3D neural fields with high quality and diversity, outperforming alternative approaches to 3D-aware generation. Our approach requires essential modifications to existing triplane factorization pipelines to make the resulting features easy to learn for the diffusion model. We demonstrate state-of-the-art results on 3D generation on several object classes from ShapeNet.

Semantic Scene Completion With Cleaner Self
Wang, Fengyun and Zhang, Dong and Zhang, Hanwang and Tang, Jinhui and Sun, Qianru



Research question: How to mitigate the incomplete volumetric predictions and confused semantic labels in Semantic Scene Completion (SSC) caused by the sensory imperfection of depth cameras.
Motivation: Existing methods based on the noisy TSDF estimated from depth values suffer from incomplete predictions and confused semantic labels.
Method: Ground-truth 3D voxels are used to generate a perfect visible surface (TSDF-CAD), a "cleaner" SSC model is trained on it, and this "cleaner" knowledge is then distilled into another model whose input is the noisy TSDF.
Results: Experiments validate that the method improves the noisy counterpart by 3.1% IoU and 2.2% mIoU and achieves new state-of-the-art accuracy on the popular NYU dataset.

Semantic Scene Completion (SSC) transforms an image of single-view depth and/or RGB 2D pixels into 3D voxels, each of whose semantic labels are predicted. SSC is a well-known ill-posed problem as the prediction model has to "imagine" what is behind the visible surface, which is usually represented by Truncated Signed Distance Function (TSDF). Due to the sensory imperfection of the depth camera, most existing methods based on the noisy TSDF estimated from depth values suffer from 1) incomplete volumetric predictions and 2) confused semantic labels. To this end, we use the ground-truth 3D voxels to generate a perfect visible surface, called TSDF-CAD, and then train a "cleaner" SSC model. As the model is noise-free, it is expected to focus more on the "imagination" of unseen voxels. Then, we propose to distill the intermediate "cleaner" knowledge into another model with noisy TSDF input. In particular, we use the 3D occupancy feature and the semantic relations of the "cleaner self" to supervise the counterparts of the "noisy self" to respectively address the above two incorrect predictions. Experimental results validate that the proposed method improves the noisy counterparts with 3.1% IoU and 2.2% mIoU for measuring scene completion and SSC, and also achieves new state-of-the-art accuracy on the popular NYU dataset. The code is available at https://github.com/fereenwong/CleanerS.
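For readers unfamiliar with the input representation, the TSDF values such methods consume can be sketched with a simple per-ray truncation: the signed distance between the observed depth and a voxel's camera-space depth, clamped and normalized. This is illustrative only, not the paper's pipeline, and the truncation band is an assumed parameter.

```python
import numpy as np

def tsdf_from_depth(depth, voxel_cam_z, trunc=0.1):
    """Toy per-ray TSDF: signed distance between the depth observed along a
    ray and a voxel's camera-space depth, truncated to [-trunc, trunc] and
    normalized to [-1, 1]. Positive = in front of the surface (visible space),
    negative = behind it (the region the SSC model must "imagine").
    """
    sdf = depth - voxel_cam_z
    return np.clip(sdf, -trunc, trunc) / trunc
```

Noise in `depth` propagates directly into the TSDF, which is exactly the corruption the "cleaner self" distillation is designed to counteract.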

3D Human Mesh Estimation From Virtual Markers
Ma, Xiaoxuan and Su, Jiajun and Wang, Chunyu and Zhu, Wentao and Wang, Yizhou



Research question: How to recover intact 3D human meshes with realistic shapes from marker-less wild images.
Motivation: Existing methods lose body shape information when extracting skeletons, leading to mediocre performance. Advanced motion capture systems solve this by placing dense physical markers on the body surface, but they cannot be applied to marker-less wild images.
Method: An intermediate representation named virtual markers learns 64 landmark keypoints on the body surface from large-scale mocap data in a generative style, mimicking the effects of physical markers. The virtual markers can be accurately detected from wild images, and the intact mesh with realistic shape can be reconstructed by simple interpolation.
Results: The approach outperforms state-of-the-art methods on three datasets; in particular, it surpasses existing methods by a notable margin on the SURREAL dataset, which has diverse body shapes.

Inspired by the success of volumetric 3D pose estimation, some recent human mesh estimators propose to estimate 3D skeletons as intermediate representations, from which, the dense 3D meshes are regressed by exploiting the mesh topology. However, body shape information is lost in extracting skeletons, leading to mediocre performance. The advanced motion capture systems solve the problem by placing dense physical markers on the body surface, which allows to extract realistic meshes from their non-rigid motions. However, they cannot be applied to wild images without markers. In this work, we present an intermediate representation, named virtual markers, which learns 64 landmark keypoints on the body surface based on the large-scale mocap data in a generative style, mimicking the effects of physical markers. The virtual markers can be accurately detected from wild images and can reconstruct the intact meshes with realistic shapes by simple interpolation. Our approach outperforms the state-of-the-art methods on three datasets. In particular, it surpasses the existing methods by a notable margin on the SURREAL dataset, which has diverse body shapes. Code is available at https://github.com/ShirleyMaxx/VirtualMarker.
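The final interpolation step can be sketched as a fixed convex combination of the detected marker positions per mesh vertex. The weight matrix is learned in the paper; the shapes and names below are assumptions for illustration.

```python
import numpy as np

def mesh_from_markers(markers, weights):
    """Toy virtual-marker interpolation: each mesh vertex is a convex
    combination of the K detected landmark positions (K = 64 in the paper).
    markers: (K, 3) 3D marker positions
    weights: (V, K) non-negative rows summing to 1 (learned in the paper)
    Returns (V, 3) mesh vertex positions.
    """
    assert np.allclose(weights.sum(axis=1), 1.0), "rows must be convex weights"
    return weights @ markers
```

Because the vertices are convex combinations of surface markers rather than regressed from a skeleton, body-shape information carried by the markers survives into the reconstructed mesh.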

High Fidelity 3D Hand Shape Reconstruction via Scalable Graph Frequency Decomposition
Luan, Tianyu and Zhai, Yuanhao and Meng, Jingjing and Li, Zhong and Chen, Zhang and Xu, Yi and Yuan, Junsong



Research question: Despite their impressive performance, recent single-image hand modeling techniques lack the capability to capture sufficient detail of the 3D hand mesh.
Motivation: This deficiency greatly limits applications that require high-fidelity hand modeling, such as personalized hand modeling.
Method: A frequency split network generates the 3D hand mesh in a coarse-to-fine manner using different frequency bands. To capture high-frequency personalized details, the 3D mesh is transformed into the frequency domain, and a novel frequency decomposition loss supervises each frequency component. This coarse-to-fine scheme preserves hand details corresponding to the higher-frequency domain. The network is also scalable: inference can stop at any resolution level to accommodate hardware with varying computational power.
Results: Extensive experiments show that the method generates fine-grained details for high-fidelity 3D hand reconstruction, and the proposed evaluation metric measures mesh details more effectively than traditional metrics.

Despite the impressive performance obtained by recent single-image hand modeling techniques, they lack the capability to capture sufficient details of the 3D hand mesh. This deficiency greatly limits their applications when high fidelity hand modeling is required, e.g., personalized hand modeling. To address this problem, we design a frequency split network to generate 3D hand mesh using different frequency bands in a coarse-to-fine manner. To capture high-frequency personalized details, we transform the 3D mesh into the frequency domain, and propose a novel frequency decomposition loss to supervise each frequency component. By leveraging such a coarse-to-fine scheme, hand details that correspond to the higher frequency domain can be preserved. In addition, the proposed network is scalable, and can stop the inference at any resolution level to accommodate different hardwares with varying computational powers. To quantitatively evaluate the performance of our method in terms of recovering personalized shape details, we introduce a new evaluation metric named Mean Signal-to-Noise Ratio (MSNR) to measure the signal-to-noise ratio of each mesh frequency component. Extensive experiments demonstrate that our approach generates fine-grained details for high fidelity 3D hand reconstruction, and our evaluation metric is more effective for measuring mesh details compared with traditional metrics.
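A toy version of the graph-frequency decomposition behind such a loss: project vertex coordinates onto the eigenbasis of the mesh graph's Laplacian and split the spectrum into bands, then penalize each band separately. The paper's scalable scheme avoids a full dense eigendecomposition; the names and band splits here are illustrative.

```python
import numpy as np

def frequency_bands(vertices, laplacian, splits):
    """Toy graph-frequency decomposition of a mesh: project vertex coordinates
    onto graph-Laplacian eigenvectors (columns ordered low -> high frequency)
    and split the spectral coefficients into bands. Low bands capture coarse
    shape; high bands capture fine, personalized detail.
    vertices:  (V, 3), laplacian: (V, V) symmetric, splits: list of (lo, hi).
    """
    _, basis = np.linalg.eigh(laplacian)   # eigh sorts eigenvalues ascending
    spectrum = basis.T @ vertices          # (V, 3) spectral coefficients
    return [spectrum[lo:hi] for lo, hi in splits]

def band_loss(pred_bands, gt_bands):
    """Sum of per-band squared errors, supervising each component separately."""
    return sum(np.sum((p - g) ** 2) for p, g in zip(pred_bands, gt_bands))
```

Supervising bands separately keeps the loss on high-frequency coefficients from being swamped by the much larger low-frequency (coarse shape) error.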

Neural Scene Chronology
Lin, Haotong and Wang, Qianqian and Cai, Ruojin and Peng, Sida and Averbuch-Elor, Hadar and Zhou, Xiaowei and Snavely, Noah



Research question: Reconstruct a time-varying 3D model from Internet photos of large-scale landmarks, capable of photo-realistic rendering with independent control of viewpoint, illumination, and time.
Motivation: The main challenges are that different types of temporal changes in the imagery (such as illumination and changes to the scene itself) are entangled, and that scene-level temporal changes are typically discrete and sporadic rather than continuous.
Method: A new scene representation equipped with a novel temporal step function encoding models discrete scene-level content changes as piecewise-constant functions over time. Specifically, the scene is represented as a space-time radiance field with a per-image illumination embedding, where temporally varying scene changes are encoded with a set of learned step functions.
Results: On a new dataset of four scenes exhibiting various changes over time, the method achieves state-of-the-art view synthesis results while providing independent control of viewpoint, time, and illumination.

In this work, we aim to reconstruct a time-varying 3D model, capable of rendering photo-realistic renderings with independent control of viewpoint, illumination, and time, from Internet photos of large-scale landmarks. The core challenges are twofold. First, different types of temporal changes, such as illumination and changes to the underlying scene itself (such as replacing one graffiti artwork with another) are entangled together in the imagery. Second, scene-level temporal changes are often discrete and sporadic over time, rather than continuous. To tackle these problems, we propose a new scene representation equipped with a novel temporal step function encoding method that can model discrete scene-level content changes as piece-wise constant functions over time. Specifically, we represent the scene as a space-time radiance field with a per-image illumination embedding, where temporally-varying scene changes are encoded using a set of learned step functions. To facilitate our task of chronology reconstruction from Internet imagery, we also collect a new dataset of four scenes that exhibit various changes over time. We demonstrate that our method exhibits state-of-the-art view synthesis results on this dataset, while achieving independent control of viewpoint, time, and illumination. Code and data are available at https://zju3dv.github.io/NeuSC/.
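The piecewise-constant temporal encoding can be sketched as a sum of hard steps with learned levels: an attribute's value at time t is the sum of every step whose transition time has passed. The paper learns smooth, trainable step functions; this hard-threshold version and its parameter names are a simplification.

```python
import numpy as np

def step_encoding(t, transition_times, levels):
    """Toy piecewise-constant temporal encoding: the encoded value at time t
    is the sum of the offsets of all steps whose transition time <= t,
    modeling discrete, sporadic scene-level content changes.
    transition_times: (S,) sorted step locations in time
    levels:           (S,) per-step offsets (learned in the paper)
    """
    steps = (np.asarray(t)[..., None] >= transition_times).astype(float)
    return steps @ levels
```

Between transitions the encoding is exactly constant, so continuous variation (like illumination) must be explained by the per-image embedding instead, which is what disentangles the two kinds of change.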

Light Source Separation and Intrinsic Image Decomposition Under AC Illumination
Yoshida, Yusaku and Kawahara, Ryo and Okabe, Takahiro



Research question: How to exploit the flicker of AC-powered illumination to extract scene information and perform intrinsic image decomposition.
Motivation: The flicker under AC illumination carries rich scene information that is useful for intrinsic image decomposition.
Method: The ambiguities in blind light source separation via matrix factorization and in intrinsic image decomposition under the Lambert model are analyzed and resolved, and light source separation followed by intrinsic image decomposition is performed.
Results: Experiments confirm that the method recovers the colors of the light sources, the diffuse reflectance values, and the diffuse and specular intensities under each light source, and that intrinsic image decomposition under AC illumination is effective for auto white balancing.

Artificial light sources are often powered by an electric grid, and then their intensities rapidly oscillate in response to the grid's alternating current (AC). Interestingly, the flickers of scene radiance values due to AC illumination are useful for extracting rich information on a scene of interest. In this paper, we show that the flickers due to AC illumination is useful for intrinsic image decomposition (IID). Our proposed method conducts the light source separation (LSS) followed by the IID under AC illumination. In particular, we reveal the ambiguity in the blind LSS via matrix factorization and the ambiguity in the IID assuming the Lambert model, and then show why and how those ambiguities can be resolved. We experimentally confirmed that our method can recover the colors of the light sources, the diffuse reflectance values, and the diffuse and specular intensities (shadings) under each of the light sources, and that the IID under AC illumination is effective for application to auto white balancing.

Plateau-Reduced Differentiable Path Tracing
Fischer, Michael and Ritschel, Tobias



Research question: Current differentiable renderers can encounter regions of zero gradient (plateaus) during optimization, preventing convergence.
Motivation: To address this, we propose a new optimization approach that removes zero-gradient regions by blurring the parameter space.
Method: The high-dimensional rendering function, which maps scene parameters to images, is convolved with a kernel that blurs the parameter space, yielding a plateau-free objective. Two efficient Monte Carlo estimators compute the plateau-free gradients.
Results: The approach is a straightforward extension to both black-box and differentiable renderers, and it successfully optimizes problems with intricate light transport, such as caustics or global illumination, on which existing differentiable path tracers fail to converge.

Current differentiable renderers provide light transport gradients with respect to arbitrary scene parameters. However, the mere existence of these gradients does not guarantee useful update steps in an optimization. Instead, inverse rendering might not converge due to inherent plateaus, i.e., regions of zero gradient, in the objective function. We propose to alleviate this by convolving the high-dimensional rendering function that maps scene parameters to images with an additional kernel that blurs the parameter space. We describe two Monte Carlo estimators to compute plateau-free gradients efficiently, i.e., with low variance, and show that these translate into net-gains in optimization error and runtime performance. Our approach is a straightforward extension to both black-box and differentiable renderers and enables the successful optimization of problems with intricate light transport, such as caustics or global illumination, that existing differentiable path tracers do not converge on. Our code is at github.com/mfischer-ucl/prdpt.
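The blurred-objective idea can be demonstrated in one dimension with a score-function Monte Carlo estimator, one standard way to differentiate a Gaussian-smoothed objective; the paper's two estimators and kernel choices differ, so treat this as an illustrative sketch with assumed names.

```python
import numpy as np

def smoothed_grad(f, theta, sigma=0.5, n=20000, rng=None):
    """Monte Carlo gradient of the Gaussian-blurred objective
    E[f(theta + sigma * eps)], eps ~ N(0, 1), via the score-function identity
        d/dtheta E[f] = E[ f(theta + sigma * eps) * eps / sigma ].
    This is nonzero even where f itself is locally flat (a plateau).
    """
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(n)
    return np.mean(f(theta + sigma * eps) * eps / sigma)

# A step function has zero gradient almost everywhere, yet the smoothed
# gradient at theta = 1 still points toward the transition at 0.
step = lambda x: (x > 0).astype(float)
```

For this example the smoothed objective is the Gaussian CDF, so the true gradient at theta = 1 with sigma = 0.5 is pdf(2)/0.5, roughly 0.108, and the estimator recovers it.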

ECON: Explicit Clothed Humans Optimized via Normal Integration
Xiu, Yuliang and Yang, Jinlong and Cao, Xu and Tzionas, Dimitrios and Black, Michael J.



Research question: How to create detailed, clothed 3D human models from images.
Motivation: Existing methods produce disembodied limbs or degenerate shapes when handling novel poses or clothing.
Method: A new method, ECON, combines deep learning, artist-curated scans, and implicit functions (IF): it first infers detailed 2D normal maps of the clothed person, then recovers 2.5D front and back surfaces from them, and finally "inpaints" the missing geometry.
Results: Experiments show that ECON infers high-fidelity 3D humans even in loose clothing and challenging poses, going beyond previous methods.

The combination of deep learning, artist-curated scans, and Implicit Functions (IF), is enabling the creation of detailed, clothed, 3D humans from images. However, existing methods are far from perfect. IF-based methods recover free-form geometry, but produce disembodied limbs or degenerate shapes for novel poses or clothes. To increase robustness for these cases, existing work uses an explicit parametric body model to constrain surface reconstruction, but this limits the recovery of free-form surfaces such as loose clothing that deviates from the body. What we want is a method that combines the best properties of implicit representation and explicit body regularization. To this end, we make two key observations: (1) current networks are better at inferring detailed 2D maps than full-3D surfaces, and (2) a parametric model can be seen as a "canvas" for stitching together detailed surface patches. Based on these, our method, ECON, has three main steps: (1) It infers detailed 2D normal maps for the front and back side of a clothed person. (2) From these, it recovers 2.5D front and back surfaces, called d-BiNI, that are equally detailed, yet incomplete, and registers these w.r.t. each other with the help of a SMPL-X body mesh recovered from the image. (3) It "inpaints" the missing geometry between d-BiNI surfaces. If the face and hands are noisy, they can optionally be replaced with the ones of SMPL-X. As a result, ECON infers high-fidelity 3D humans even in loose clothes and challenging poses. This goes beyond previous methods, according to the quantitative evaluation on the CAPE and Renderpeople datasets. Perceptual studies also show that ECON's perceived realism is better by a large margin. Code and models are available for research purposes at econ.is.tue.mpg.de

F2-NeRF: Fast Neural Radiance Field Training With Free Camera Trajectories
Wang, Peng and Liu, Yuan and Chen, Zhaoxi and Liu, Lingjie and Liu, Ziwei and Komura, Taku and Theobalt, Christian and Wang, Wenping



Research question: This paper proposes F^2-NeRF, a novel grid-based NeRF for novel view synthesis that handles arbitrary input camera trajectories and trains in only a few minutes.
Motivation: Existing fast grid-based NeRF training frameworks such as Instant-NGP, Plenoxels, DVGO, and TensoRF are mainly designed for bounded scenes and rely on space warping to handle unbounded ones; the two widely used space-warping methods handle only forward-facing or 360-degree object-centric trajectories, not arbitrary ones.
Method: The paper analyzes in depth the mechanism of space warping for unbounded scenes and, based on this analysis, proposes a novel space-warping method called perspective warping that handles arbitrary trajectories within the grid-based NeRF framework.
Results: Extensive experiments show that F^2-NeRF, using the same perspective warping, renders high-quality images on two standard datasets and on a newly collected free-trajectory dataset.

This paper presents a novel grid-based NeRF called F^2-NeRF (Fast-Free-NeRF) for novel view synthesis, which enables arbitrary input camera trajectories and only costs a few minutes for training. Existing fast grid-based NeRF training frameworks, like Instant-NGP, Plenoxels, DVGO, or TensoRF, are mainly designed for bounded scenes and rely on space warping to handle unbounded scenes. The two existing widely-used space-warping methods are designed only for the forward-facing trajectory or the 360° object-centric trajectory and cannot process arbitrary trajectories. In this paper, we delve deep into the mechanism of space warping to handle unbounded scenes. Based on our analysis, we further propose a novel space-warping method called perspective warping, which allows us to handle arbitrary trajectories in the grid-based NeRF framework. Extensive experiments demonstrate that F^2-NeRF is able to use the same perspective warping to render high-quality images on two standard datasets and a new free trajectory dataset collected by us.

Balanced Spherical Grid for Egocentric View Synthesis
Choi, Changwoon and Kim, Sang Min and Kim, Young Min



Research question: How to efficiently reconstruct large-scale real-world environments as VR assets.
Motivation: Existing methods are inefficient for large-scale unbounded scenes and suffer from singularities.
Method: The proposed EgoNeRF adopts a spherical parameterization instead of conventional Cartesian coordinates; combining two balanced grids resolves both the irregularities at the two poles and the inability to represent unbounded scenes, while a resampling technique increases the number of valid samples for training the NeRF volume.
Results: Extensive evaluation on newly introduced synthetic and real-world egocentric 360-degree video datasets shows that EgoNeRF consistently achieves state-of-the-art performance.

We present EgoNeRF, a practical solution to reconstruct large-scale real-world environments for VR assets. Given a few seconds of casually captured 360 video, EgoNeRF can efficiently build neural radiance fields which enable high-quality rendering from novel viewpoints. Motivated by the recent acceleration of NeRF using feature grids, we adopt spherical coordinate instead of conventional Cartesian coordinate. Cartesian feature grid is inefficient to represent large-scale unbounded scenes because it has a spatially uniform resolution, regardless of distance from viewers. The spherical parameterization better aligns with the rays of egocentric images, and yet enables factorization for performance enhancement. However, the naive spherical grid suffers from irregularities at two poles, and also cannot represent unbounded scenes. To avoid singularities near poles, we combine two balanced grids, which results in a quasi-uniform angular grid. We also partition the radial grid exponentially and place an environment map at infinity to represent unbounded scenes. Furthermore, with our resampling technique for grid-based methods, we can increase the number of valid samples to train NeRF volume. We extensively evaluate our method in our newly introduced synthetic and real-world egocentric 360 video datasets, and it consistently achieves state-of-the-art performance.
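The exponential radial partition can be sketched as a bin lookup in which each successive radial shell doubles in extent, so distant geometry is stored at coarser resolution; the base radius, growth factor, bin count, and far-field handling below are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def radial_index(r, r0=1.0, growth=2.0, n_bins=8):
    """Toy exponential radial partition: bin 0 covers [0, r0), and bin i >= 1
    covers [r0 * growth**(i-1), r0 * growth**i). Cells grow geometrically with
    distance from the viewer; radii past the last bin would be handled by an
    environment map at infinity in an EgoNeRF-style setup.
    """
    if r < r0:
        return 0
    i = int(np.floor(np.log(r / r0) / np.log(growth))) + 1
    return min(i, n_bins - 1)
```

A uniform Cartesian grid spends the same resolution everywhere regardless of distance; the geometric shells concentrate capacity near the egocentric viewer, where the 360° rays are densest.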

Unsupervised 3D Shape Reconstruction by Part Retrieval and Assembly
Xu, Xianghao and Guerrero, Paul and Fisher, Matthew and Chaudhuri, Siddhartha and Ritchie, Daniel



Research question: How to effectively represent and decompose 3D shapes.
Motivation: Existing methods either use simple parametric primitives or learn a generative shape space of parts; both have limitations.
Method: We propose to decompose shapes using a user-provided library of 3D parts, giving the user full control over the choice of parts. The method works by retrieving parts from the library and optimizing their placements.
Results: The method achieves higher reconstruction accuracy and more desirable decompositions than existing approaches, and the decomposition can be controlled by reconstructing the same shape with different part libraries.

Representing a 3D shape with a set of primitives can aid perception of structure, improve robotic object manipulation, and enable editing, stylization, and compression of 3D shapes. Existing methods either use simple parametric primitives or learn a generative shape space of parts. Both have limitations: parametric primitives lead to coarse approximations, while learned parts offer too little control over the decomposition. We instead propose to decompose shapes using a library of 3D parts provided by the user, giving full control over the choice of parts. The library can contain parts with high-quality geometry that are suitable for a given category, resulting in meaningful decompositions with clean geometry. The type of decomposition can also be controlled through the choice of parts in the library. Our method works via an unsupervised approach that iteratively retrieves parts from the library and refines their placements. We show that this approach gives higher reconstruction accuracy and more desirable decompositions than existing approaches. Additionally, we show how the decomposition can be controlled through the part library by using different part libraries to reconstruct the same shapes.

Instant-NVR: Instant Neural Volumetric Rendering for Human-Object Interactions From Monocular RGBD Stream
Jiang, Yuheng and Yao, Kaixin and Su, Zhuo and Shen, Zhehao and Luo, Haimin and Xu, Lan



Research question: How to perform instant volumetric tracking and rendering of human-object interactions from a monocular RGBD camera.
Motivation: Bridging traditional non-rigid tracking with recent instant radiance field techniques can address monocular tracking and rendering of complex interaction scenarios.
Method: The proposed Instant-NVR is a neural approach that bridges traditional non-rigid tracking with instant radiance field techniques via a multi-thread tracking-rendering mechanism. The tracking front-end adopts a robust human-object capture scheme to provide sufficient motion priors. A novel hybrid deformation module handles the instant neural representation of the interacting scene; an on-the-fly reconstruction scheme for the dynamic/static radiance fields builds on efficient motion-prior searching; and an online key frame selection scheme with a rendering-aware refinement strategy significantly improves the appearance details of online novel-view synthesis.
Results: Extensive experiments demonstrate that the approach generates human-object radiance fields on the fly, achieving real-time photo-realistic novel view synthesis even under complex human-object interactions.

Convenient 4D modeling of human-object interactions is essential for numerous applications. However, monocular tracking and rendering of complex interaction scenarios remain challenging. In this paper, we propose Instant-NVR, a neural approach for instant volumetric human-object tracking and rendering using a single RGBD camera. It bridges traditional non-rigid tracking with recent instant radiance field techniques via a multi-thread tracking-rendering mechanism. In the tracking front-end, we adopt a robust human-object capture scheme to provide sufficient motion priors. We further introduce a separated instant neural representation with a novel hybrid deformation module for the interacting scene. We also provide an on-the-fly reconstruction scheme of the dynamic/static radiance fields via efficient motion-prior searching. Moreover, we introduce an online key frame selection scheme and a rendering-aware refinement strategy to significantly improve the appearance details for online novel-view synthesis. Extensive experiments demonstrate the effectiveness and efficiency of our approach for the instant generation of human-object radiance fields on the fly, notably achieving real-time photo-realistic novel view synthesis under complex human-object interactions.

DINER: Depth-Aware Image-Based NEural Radiance Fields
Prinzler, Malte and Hilliges, Otmar and Thies, Justus



Research question: How to predict depth and feature maps from sparse RGB input views in order to reconstruct a volumetric scene representation and render 3D objects under novel views.
Motivation: The previous state of the art yields lower synthesis quality for input views with large disparity and would require changes to the capture hardware.
Method: The proposed DINER, a depth-aware image-based neural radiance field, predicts depth and feature maps and incorporates the depth information into feature fusion and efficient scene sampling.
Results: Compared to the previous state of the art, DINER achieves significantly improved synthesis quality and perceptual metrics when synthesizing novel views of both human heads and general objects.

We present Depth-aware Image-based NEural Radiance fields (DINER). Given a sparse set of RGB input views, we predict depth and feature maps to guide the reconstruction of a volumetric scene representation that allows us to render 3D objects under novel views. Specifically, we propose novel techniques to incorporate depth information into feature fusion and efficient scene sampling. In comparison to the previous state of the art, DINER achieves higher synthesis quality and can process input views with greater disparity. This allows us to capture scenes more completely without changing capturing hardware requirements and ultimately enables larger viewpoint changes during novel view synthesis. We evaluate our method by synthesizing novel views, both for human heads and for general objects, and observe significantly improved qualitative results and increased perceptual metrics compared to the previous state of the art.

AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction
Chatziagapi, Aggelina and Samaras, Dimitris



Research question: Reconstructing 4D faces from monocular videos.
Motivation: 3D face reconstruction from 2D images is an under-constrained problem due to depth ambiguity. State-of-the-art methods leverage visual information from a single image or video, whereas 3D mesh animation approaches rely more on audio. In most cases (e.g., AR/VR applications), however, videos contain both visual and speech information. We propose AVFace, which incorporates both modalities and accurately reconstructs the 4D facial and lip motion of any speaker without requiring any 3D ground truth for training.
Method: A coarse stage estimates the per-frame parameters of a 3D morphable model, followed by a lip refinement; a fine stage then recovers facial geometric details. Because transformer-based modules capture temporal audio and video information, the method is robust when either modality is insufficient (e.g., under face occlusion).
Results: Extensive qualitative and quantitative evaluation shows that the method outperforms the current state of the art.

In this work, we present a multimodal solution to the problem of 4D face reconstruction from monocular videos. 3D face reconstruction from 2D images is an under-constrained problem due to the ambiguity of depth. State-of-the-art methods try to solve this problem by leveraging visual information from a single image or video, whereas 3D mesh animation approaches rely more on audio. However, in most cases (e.g. AR/VR applications), videos include both visual and speech information. We propose AVFace that incorporates both modalities and accurately reconstructs the 4D facial and lip motion of any speaker, without requiring any 3D ground truth for training. A coarse stage estimates the per-frame parameters of a 3D morphable model, followed by a lip refinement, and then a fine stage recovers facial geometric details. Due to the temporal audio and video information captured by transformer-based modules, our method is robust in cases when either modality is insufficient (e.g. face occlusions). Extensive qualitative and quantitative evaluation demonstrates the superiority of our method over the current state-of-the-art.

A Characteristic Function-Based Method for Bottom-Up Human Pose Estimation
Qu, Haoxuan and Cai, Yujun and Foo, Lin Geng and Kumar, Ajay and Liu, Jun



Research question: In human pose estimation, how to optimize the heatmap prediction so that body joints in different sub-regions are localized more accurately.
Motivation: Existing methods optimize the heatmap prediction with an overall L2 loss, which may not be optimal in bottom-up human pose estimation, where each heatmap contains multiple body joints.
Method: A new bottom-up human pose estimation method optimizes the heatmap prediction by minimizing the distance between two characteristic functions constructed from the predicted heatmap and the ground-truth heatmap, respectively.
Results: Experiments show that the method performs well on both the COCO and CrowdPose datasets and localizes body joints in different sub-regions of the predicted heatmap more accurately.

Most recent methods formulate the task of human pose estimation as a heatmap estimation problem, and use the overall L2 loss computed from the entire heatmap to optimize the heatmap prediction. In this paper, we show that in bottom-up human pose estimation where each heatmap often contains multiple body joints, using the overall L2 loss to optimize the heatmap prediction may not be the optimal choice. This is because, minimizing the overall L2 loss cannot always lead the model to locate all the body joints across different sub-regions of the heatmap more accurately. To cope with this problem, from a novel perspective, we propose a new bottom-up human pose estimation method that optimizes the heatmap prediction via minimizing the distance between two characteristic functions respectively constructed from the predicted heatmap and the groundtruth heatmap. Our analysis presented in this paper indicates that the distance between these two characteristic functions is essentially the upper bound of the L2 losses w.r.t. sub-regions of the predicted heatmap. Therefore, via minimizing the distance between the two characteristic functions, we can optimize the model to provide a more accurate localization result for the body joints in different sub-regions of the predicted heatmap. We show the effectiveness of our proposed method through extensive experiments on the COCO dataset and the CrowdPose dataset.
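The quantity being minimized can be sketched by treating each normalized heatmap as a 2D probability distribution and comparing empirical characteristic functions at a set of sampled frequencies. The paper's exact construction and frequency sampling differ; this is a toy version with assumed names.

```python
import numpy as np

def heatmap_cf(heatmap, freqs):
    """Characteristic function of the 2D distribution defined by a normalized
    heatmap: phi(w) = sum_x p(x) * exp(i * <w, x>), evaluated at each row of
    freqs.  heatmap: (H, W) non-negative;  freqs: (F, 2) frequency samples.
    """
    H, W = heatmap.shape
    p = heatmap / heatmap.sum()
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)   # (H*W, 2) positions
    phase = freqs @ coords.T                               # (F, H*W)
    return np.exp(1j * phase) @ p.ravel()                  # (F,) complex

def cf_distance(pred, gt, freqs):
    """Distance between the characteristic functions of two heatmaps."""
    return np.abs(heatmap_cf(pred, freqs) - heatmap_cf(gt, freqs)).sum()
```

Unlike a single global L2 term, the characteristic function retains the full distribution of mass over the heatmap, which is why its distance can upper-bound the sub-region L2 losses as the paper argues.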

RefSR-NeRF: Towards High Fidelity and Super Resolution View Synthesis
Huang, Xudong and Li, Wei and Hu, Jie and Chen, Hanting and Wang, Yunhe



Research question: How to resolve the blur NeRF exhibits in high-resolution rendering, and how to exploit a high-resolution reference image for super-resolution reconstruction.
Motivation: Despite NeRF's extraordinary success in neural rendering, its inherent multilayer perceptron struggles to learn high-frequency details at high resolution, and computation explodes as resolution increases.
Method: An end-to-end RefSR-NeRF framework that first learns a low-resolution NeRF representation and then reconstructs high-frequency details with the help of a high-resolution reference image. Since directly introducing pre-trained models from the literature produces unsatisfactory artifacts, a novel lightweight RefSR model is designed to learn the inverse degradation process from NeRF renderings to the target HR images.
Results: Extensive experiments on multiple benchmarks show an impressive trade-off among rendering quality, speed, and memory usage, outperforming or matching NeRF and its variants with a 52x speedup and only a minor increase in memory usage.

We present Reference-guided Super-Resolution Neural Radiance Field (RefSR-NeRF) that extends NeRF to super resolution and photorealistic novel view synthesis. Despite NeRF's extraordinary success in the neural rendering field, it suffers from blur in high resolution rendering because its inherent multilayer perceptron struggles to learn high frequency details and incurs a computational explosion as resolution increases. Therefore, we propose RefSR-NeRF, an end-to-end framework that first learns a low resolution NeRF representation, and then reconstructs the high frequency details with the help of a high resolution reference image. We observe that simply introducing the pre-trained models from the literature tends to produce unsatisfied artifacts due to the divergence in the degradation model. To this end, we design a novel lightweight RefSR model to learn the inverse degradation process from NeRF renderings to target HR ones. Extensive experiments on multiple benchmarks demonstrate that our method exhibits an impressive trade-off among rendering quality, speed, and memory usage, outperforming or on par with NeRF and its variants while being 52x speedup with minor extra memory usage.

Polarimetric iToF: Measuring High-Fidelity Depth Through Scattering Media
Jeon, Daniel S. and Meuleman, Andréas and Baek, Seung-Hwan and Kim, Min H.



Research question: How to overcome the multipath interference (MPI) that severely degrades the depth accuracy of indirect time-of-flight (iToF) imaging in scattering media.
Motivation: iToF imaging often suffers from MPI in scattering media, causing severe depth-accuracy degradation; for example, depth cannot be measured accurately through fog.
Method: A polarimetric iToF imaging method that captures depth robustly through scattering media. Observations on the principle of indirect ToF imaging and the polarization of light lead to a novel computational model that corrects MPI errors.
Results: The method is validated on an experimental setup with a customized off-the-shelf iToF camera; with the proposed scattering model and polarimetric phase measurements, it improves significantly over baseline methods.

Indirect time-of-flight (iToF) imaging allows us to capture dense depth information at a low cost. However, iToF imaging often suffers from multipath interference (MPI) artifacts in the presence of scattering media, resulting in severe depth-accuracy degradation. For instance, iToF cameras cannot measure depth accurately through fog because ToF active illumination scatters back to the sensor before reaching the farther target surface. In this work, we propose a polarimetric iToF imaging method that can capture depth information robustly through scattering media. Our observations on the principle of indirect ToF imaging and polarization of light allow us to formulate a novel computational model of scattering-aware polarimetric phase measurements that enables us to correct MPI errors. We first devise a scattering-aware polarimetric iToF model that can estimate the phase of unpolarized backscattered light. We then combine the optical filtering of polarization and our computational modeling of unpolarized backscattered light via scattering analysis of phase and amplitude. This allows us to tackle the MPI problem by estimating the scattering energy through the participating media. We validate our method on an experimental setup using a customized off-the-shelf iToF camera. Our method outperforms baseline methods by a significant margin by means of our scattering model and polarimetric phase measurements.
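As background for the phase measurements the paper builds on, the ideal single-frequency, four-tap iToF depth computation (no scattering or MPI, i.e. the case the paper's model corrects) can be sketched as follows. Tap ordering and sign conventions vary between sensors; the ones below are assumptions chosen for the simulation in the usage note.

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def itof_depth(a0, a90, a180, a270, f_mod):
    """Depth from four correlation samples at 0/90/180/270 degree shifts.

    f_mod is the modulation frequency in Hz. This is the ideal
    single-bounce case: phase is recovered by arctangent demodulation,
    then converted to metric depth (round-trip, hence the factor 4*pi).
    """
    phase = np.arctan2(a90 - a270, a0 - a180) % (2 * np.pi)
    return C * phase / (4 * np.pi * f_mod)
```

A target at 2 m with a 20 MHz modulation frequency produces a phase of about 1.68 rad, well inside the unambiguous range; MPI in scattering media corrupts exactly this phase, which is what the polarimetric model compensates.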

Self-Supervised Super-Plane for Neural 3D Reconstruction
Ye, Botao and Liu, Sifei and Li, Xueting and Yang, Ming-Hsuan



Research question: Existing neural implicit surface representations struggle with the texture-less planar regions that are ubiquitous in indoor scenes.
Motivation: To address this, the paper proposes a self-supervised super-plane constraint that exploits free geometry cues from the predicted surface to further regularize the reconstruction of planar regions, without any additional ground-truth annotations.
Method: An iterative training scheme comprising (i) grouping pixels to form super-planes (analogous to super-pixels) and (ii) optimizing the scene reconstruction network via the super-plane constraint.
Results: Models trained with super-planes outperform those using conventionally annotated planes, since an individual super-plane statistically covers a larger area and yields more stable training. Extensive experiments further show that the self-supervised super-plane constraint significantly improves 3D reconstruction quality, even beyond using ground-truth plane segmentation, and the resulting plane reconstructions can be used to auto-label other vision tasks.

Neural implicit surface representation methods show impressive reconstruction results but struggle to handle texture-less planar regions that widely exist in indoor scenes. Existing approaches addressing this leverage image prior that requires assistive networks trained with large-scale annotated datasets. In this work, we introduce a self-supervised super-plane constraint by exploring the free geometry cues from the predicted surface, which can further regularize the reconstruction of plane regions without any other ground truth annotations. Specifically, we introduce an iterative training scheme, where (i) grouping of pixels to formulate a super-plane (analogous to super-pixels), and (ii) optimizing of the scene reconstruction network via a super-plane constraint, are progressively conducted. We demonstrate that the model trained with super-planes surprisingly outperforms the one using conventional annotated planes, as individual super-plane statistically occupies a larger area and leads to more stable training. Extensive experiments show that our self-supervised super-plane constraint significantly improves 3D reconstruction quality even better than using ground truth plane segmentation. Additionally, the plane reconstruction results from our model can be used for auto-labeling for other vision tasks. The code and models are available at https://github.com/botaoye/S3PRecon.
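The pixel-grouping step (i) can be caricatured as clustering per-pixel plane parameters by normal and offset similarity. The greedy scheme, the thresholds, and the input format below are illustrative assumptions, not the paper's actual grouping algorithm.

```python
import numpy as np

def group_super_planes(normals, offsets, angle_thresh=0.95, offset_thresh=0.05):
    """Greedy grouping of per-pixel plane parameters into super-planes.

    normals: (N, 3) unit normals derived from the predicted surface;
    offsets: (N,) plane offsets (n . x). A pixel joins an existing group
    when its normal and offset are close to that group's running average,
    otherwise it seeds a new group. Returns an (N,) label array.
    """
    groups = []                       # list of (sum_normal, sum_offset, count)
    labels = np.empty(len(normals), dtype=int)
    for i, (n, d) in enumerate(zip(normals, offsets)):
        for g, (sn, sd, c) in enumerate(groups):
            mean_n = sn / np.linalg.norm(sn)
            if n @ mean_n > angle_thresh and abs(d - sd / c) < offset_thresh:
                groups[g] = (sn + n, sd + d, c + 1)   # update running sums
                labels[i] = g
                break
        else:
            groups.append((n.copy(), d, 1))
            labels[i] = len(groups) - 1
    return labels
```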

GM-NeRF: Learning Generalizable Model-Based Neural Radiance Fields From Multi-View Images
Chen, Jianchuan and Yi, Wentao and Ma, Liqian and Jia, Xu and Lu, Huchuan



Research question: How to synthesize high-fidelity novel views of arbitrary human performers from sparse multi-view images.
Motivation: The task is challenging due to the large variation among articulated body poses and heavy self-occlusions.
Method: An effective generalizable framework, Generalizable Model-based Neural Radiance Fields (GM-NeRF), for free-viewpoint image synthesis. A geometry-guided attention mechanism registers appearance codes from multi-view 2D images onto a geometry proxy, alleviating the misalignment between the inaccurate geometry prior and pixel space. Neural rendering and partial gradient backpropagation are further employed for efficient perceptual supervision and improved perceptual quality of synthesis.
Results: Experiments on the synthetic datasets THuman2.0 and Multi-garment and the real-world datasets Genebody and ZJUMocap show the approach outperforms state-of-the-art methods in novel view synthesis and geometric reconstruction.

In this work, we focus on synthesizing high-fidelity novel view images for arbitrary human performers, given a set of sparse multi-view images. It is a challenging task due to the large variation among articulated body poses and heavy self-occlusions. To alleviate this, we introduce an effective generalizable framework Generalizable Model-based Neural Radiance Fields (GM-NeRF) to synthesize free-viewpoint images. Specifically, we propose a geometry-guided attention mechanism to register the appearance code from multi-view 2D images to a geometry proxy which can alleviate the misalignment between inaccurate geometry prior and pixel space. On top of that, we further conduct neural rendering and partial gradient backpropagation for efficient perceptual supervision and improvement of the perceptual quality of synthesis. To evaluate our method, we conduct experiments on synthesized datasets THuman2.0 and Multi-garment, and real-world datasets Genebody and ZJUMocap. The results demonstrate that our approach outperforms state-of-the-art methods in terms of novel view synthesis and geometric reconstruction.

VDN-NeRF: Resolving Shape-Radiance Ambiguity via View-Dependence Normalization
Zhu, Bingfan and Yang, Yanchao and Wang, Xulong and Zheng, Youyi and Guibas, Leonidas



Research question: How to train neural radiance fields (NeRFs) to recover better geometry under non-Lambertian surfaces and dynamic lighting.
Motivation: Under such conditions, the radiance of a point varies significantly with viewing angle, so existing methods recover poor geometry.
Method: VDN-NeRF normalizes the view-dependence by distilling invariant information already encoded in the learned NeRFs, then jointly trains NeRFs for view synthesis to attain quality geometry.
Results: Experiments show that, although shape-radiance ambiguity is inevitable, the proposed normalization minimizes its effect on geometry, effectively accounting for view-dependent variations. The method applies to various baselines and significantly improves geometry without changing the volume rendering pipeline, even for data captured under a moving light source.

We propose VDN-NeRF, a method to train neural radiance fields (NeRFs) for better geometry under non-Lambertian surface and dynamic lighting conditions that cause significant variation in the radiance of a point when viewed from different angles. Instead of explicitly modeling the underlying factors that result in the view-dependent phenomenon, which could be complex yet not inclusive, we develop a simple and effective technique that normalizes the view-dependence by distilling invariant information already encoded in the learned NeRFs. We then jointly train NeRFs for view synthesis with view-dependence normalization to attain quality geometry. Our experiments show that even though shape-radiance ambiguity is inevitable, the proposed normalization can minimize its effect on geometry, which essentially aligns the optimal capacity needed for explaining view-dependent variations. Our method applies to various baselines and significantly improves geometry without changing the volume rendering pipeline, even if the data is captured under a moving light source. Code is available at: https://github.com/BoifZ/VDN-NeRF.

Perspective Fields for Single Image Camera Calibration
Jin, Linyi and Zhang, Jianming and Hold-Geoffroy, Yannick and Wang, Oliver and Blackburn-Matzen, Kevin and Sticha, Matthew and Fouhey, David F.



Research question: How to model the local perspective properties of an image for more accurate camera calibration.
Motivation: Traditional camera calibration methods make many assumptions about the camera model and are neither invariant nor equivariant to common image-editing operations such as cropping, warping, and rotation.
Method: Perspective fields are proposed as a representation of an image's perspective properties, containing per-pixel information about the camera view parameterized as an up vector and a latitude value. A neural network is trained to predict perspective fields, and the predicted fields are easily converted to calibration parameters.
Results: The method is more robust than calibration-based approaches across various scenarios and performs well in applications such as image compositing.

Geometric camera calibration is often required for applications that understand the perspective of the image. We propose perspective fields as a representation that models the local perspective properties of an image. Perspective Fields contain per-pixel information about the camera view, parameterized as an up vector and a latitude value. This representation has a number of advantages as it makes minimal assumptions about the camera model and is invariant or equivariant to common image editing operations like cropping, warping, and rotation. It is also more interpretable and aligned with human perception. We train a neural network to predict Perspective Fields and the predicted Perspective Fields can be converted to calibration parameters easily. We demonstrate the robustness of our approach under various scenarios compared with camera calibration-based methods and show example applications in image compositing. Project page: https://jinlinyi.github.io/PerspectiveFields/
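For intuition, the latitude component of a perspective field has a closed form for an ideal pinhole camera. The sketch below assumes zero roll and a simple pitch-only camera (with +z forward, +y down, positive pitch looking up); the up-vector component and the learned prediction are omitted. It is an illustration of the representation, not the paper's code.

```python
import numpy as np

def perspective_field(h, w, f, pitch):
    """Latitude map (radians) for an ideal pinhole camera with given pitch.

    Latitude of a pixel is the elevation of its viewing ray above the
    world horizon (positive = above the horizon). f is the focal length
    in pixels; roll is assumed zero.
    """
    ys, xs = np.mgrid[0:h, 0:w] + 0.5
    rays = np.stack([(xs - w / 2) / f, (ys - h / 2) / f, np.ones((h, w))], -1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # rotate rays into the world frame (pitch about the x-axis; +y is down)
    c, s = np.cos(pitch), np.sin(pitch)
    ry, rz = rays[..., 1], rays[..., 2]
    world_y = c * ry - s * rz
    return np.arcsin(-world_y)        # elevation above the horizon
```

With zero pitch the horizon passes through the image center (latitude 0 there, positive above, negative below); pitching the camera up shifts the whole field upward, which is exactly the per-pixel cue the network learns to predict from a single image.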

Iterative Geometry Encoding Volume for Stereo Matching
Xu, Gangwei and Wang, Xianqi and Ding, Xiaohuan and Yang, Xin



Research question: In matching tasks, all-pairs correlations lack non-local geometry knowledge and struggle with local ambiguities in ill-posed regions.
Motivation: To address these problems, a new deep network architecture, Iterative Geometry Encoding Volume (IGEV-Stereo), is proposed for stereo matching.
Method: A combined geometry encoding volume encodes geometry and context information as well as local matching details, and is iteratively indexed to update the disparity map. The GEV is also exploited to regress an accurate starting point for the ConvGRU iterations, speeding up convergence.
Results: IGEV-Stereo ranks first on KITTI 2015 and 2012 (Reflective) among all published methods and is the fastest of the top 10, with strong cross-dataset generalization and high inference efficiency. Extended to multi-view stereo (MVS) as IGEV-MVS, it achieves competitive accuracy on the DTU benchmark. Code is available at https://github.com/gangweiX/IGEV.

Recurrent All-Pairs Field Transforms (RAFT) has shown great potential in matching tasks. However, all-pairs correlations lack non-local geometry knowledge and have difficulties tackling local ambiguities in ill-posed regions. In this paper, we propose Iterative Geometry Encoding Volume (IGEV-Stereo), a new deep network architecture for stereo matching. The proposed IGEV-Stereo builds a combined geometry encoding volume that encodes geometry and context information as well as local matching details, and iteratively indexes it to update the disparity map. To speed up the convergence, we exploit GEV to regress an accurate starting point for ConvGRUs iterations. Our IGEV-Stereo ranks first on KITTI 2015 and 2012 (Reflective) among all published methods and is the fastest among the top 10 methods. In addition, IGEV-Stereo has strong cross-dataset generalization as well as high inference efficiency. We also extend our IGEV to multi-view stereo (MVS), i.e. IGEV-MVS, which achieves competitive accuracy on DTU benchmark. Code is available at https://github.com/gangweiX/IGEV.
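The matching-cost volumes this family of architectures iteratively indexes can be illustrated with a plain scanline correlation volume. This is a simplification: IGEV's combined geometry encoding volume additionally aggregates geometry and context, which this sketch does not model.

```python
import numpy as np

def correlation_volume(feat_l, feat_r, max_disp):
    """Scanline correlation volume for rectified stereo features.

    feat_l, feat_r: (C, H, W) feature maps. Returns a (max_disp, H, W)
    volume with cost[d, y, x] = <feat_l[:, y, x], feat_r[:, y, x - d]>;
    entries where x - d < 0 are left at zero.
    """
    C, H, W = feat_l.shape
    vol = np.zeros((max_disp, H, W))
    for d in range(max_disp):
        vol[d, :, d:] = (feat_l[:, :, d:] * feat_r[:, :, :W - d]).sum(0)
    return vol
```

A recurrent update operator (ConvGRU in IGEV-Stereo) then repeatedly looks up this volume around the current disparity estimate instead of committing to a one-shot argmax.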

Enhanced Stable View Synthesis
Jain, Nishant and Kumar, Suryansh and Van Gool, Luc



Research question: How to improve novel view synthesis from images captured by a freely moving camera.
Motivation: The existing stable view synthesis (SVS) method struggles to recover an accurate geometric scaffold and camera poses in outdoor scenes, leading to inferior results.
Method: An approach grounded in the basics of multiple view geometry. By leveraging the complementary behavior of MVS and monocular depth, better per-view scene depth is obtained for nearby and far points, respectively. Camera poses are jointly refined with image-based rendering via multiple rotation averaging graph optimization; the recovered scene depth and camera poses enable better view-dependent on-surface feature aggregation over the entire scene.
Results: Extensive evaluation on popular benchmarks such as Tanks and Temples shows substantial improvement over the prior art, e.g. a 1.5 dB PSNR gain on Tanks and Temples; similar gains are observed on FVS, Mip-NeRF 360, and DTU.

We introduce an approach to enhance the novel view synthesis from images taken from a freely moving camera. The introduced approach focuses on outdoor scenes where recovering accurate geometric scaffold and camera pose is challenging, leading to inferior results using the state-of-the-art stable view synthesis (SVS) method. SVS and related methods fail for outdoor scenes primarily due to (i) over-relying on the multiview stereo (MVS) for geometric scaffold recovery and (ii) assuming COLMAP computed camera poses as the best possible estimates, despite it being well-studied that MVS 3D reconstruction accuracy is limited to scene disparity and camera-pose accuracy is sensitive to key-point correspondence selection. This work proposes a principled way to enhance novel view synthesis solutions drawing inspiration from the basics of multiple view geometry. By leveraging the complementary behavior of MVS and monocular depth, we arrive at a better scene depth per view for nearby and far points, respectively. Moreover, our approach jointly refines camera poses with image-based rendering via multiple rotation averaging graph optimization. The recovered scene depth and the camera-pose help better view-dependent on-surface feature aggregation of the entire scene. Extensive evaluation of our approach on the popular benchmark dataset, such as Tanks and Temples, shows substantial improvement in view synthesis results compared to the prior art. For instance, our method shows 1.5 dB of PSNR improvement on the Tank and Temples. Similar statistics are observed when tested on other benchmark datasets such as FVS, Mip-NeRF 360, and DTU.
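One building block of the pose-refinement step, rotation averaging, has a closed form under the chordal L2 metric. The sketch below shows only the single-average case; the paper performs multiple rotation averaging over a graph of relative rotations, which this does not implement.

```python
import numpy as np

def chordal_mean(rotations):
    """Single rotation average under the chordal L2 metric.

    Takes an iterable of 3x3 rotation matrices and projects their
    element-wise mean back onto SO(3) via SVD, which is the closed-form
    minimizer of the summed squared Frobenius distances.
    """
    m = np.mean(rotations, axis=0)
    u, _, vt = np.linalg.svd(m)
    r = u @ vt
    if np.linalg.det(r) < 0:          # keep a proper rotation (det = +1)
        r = u @ np.diag([1.0, 1.0, -1.0]) @ vt
    return r
```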

Biomechanics-Guided Facial Action Unit Detection Through Force Modeling
Cui, Zijun and Kuang, Chenyi and Gao, Tian and Talamadupula, Kartik and Ji, Qiang



Research question: Existing facial action unit (AU) detection algorithms rely mainly on appearance information extracted from 2D images; the well-established facial biomechanics governing 3D facial skin deformation is rarely considered.
Motivation: A biomechanics-guided AU detection approach is proposed that models facial muscle activation forces and uses them to predict AU activation.
Method: The model consists of two branches: a 3D physics branch and a 2D image branch. The 3D physics branch derives the Euler-Lagrange equation governing facial deformation, represents it as an ordinary differential equation (ODE), and embeds it in a differentiable ODE solver; muscle activation forces and other physics parameters are first regressed and then used to simulate 3D deformation by solving the ODE. The 2D image branch compensates the physics branch with additional appearance information from 2D images, and both the estimated forces and the appearance features are used for AU detection.
Results: The approach achieves competitive AU detection performance on two benchmark datasets and, by leveraging biomechanics, achieves outstanding performance with reduced training data.

Existing AU detection algorithms are mainly based on appearance information extracted from 2D images, and well-established facial biomechanics that governs 3D facial skin deformation is rarely considered. In this paper, we propose a biomechanics-guided AU detection approach, where facial muscle activation forces are modelled, and are employed to predict AU activation. Specifically, our model consists of two branches: 3D physics branch and 2D image branch. In 3D physics branch, we first derive the Euler-Lagrange equation governing facial deformation. The Euler-Lagrange equation represented as an ordinary differential equation (ODE) is embedded into a differentiable ODE solver. Muscle activation forces together with other physics parameters are firstly regressed, and then are utilized to simulate 3D deformation by solving the ODE. By leveraging facial biomechanics, we obtain physically plausible facial muscle activation forces. 2D image branch compensates 3D physics branch by employing additional appearance information from 2D images. Both estimated forces and appearance features are employed for AU detection. The proposed approach achieves competitive AU detection performance on two benchmark datasets. Furthermore, by leveraging biomechanics, our approach achieves outstanding performance with reduced training data.

Clothed Human Performance Capture With a Double-Layer Neural Radiance Fields
Wang, Kangkan and Zhang, Guofeng and Cong, Suxu and Yang, Jian



Research question: How to capture the performance of clothed humans from sparse-view or monocular videos.
Motivation: Existing methods either capture full-body performance with a personalized template or recover garments from a single frame with static poses; one-piece templates make it inconvenient to extract cloth semantics and capture clothing motion, while single-frame methods can suffer from unstable tracking across videos.
Method: A novel method that captures human performance by tracking clothing and body motion separately with double-layer neural radiance fields (NeRFs). Specifically, a double-layer NeRF is built for the body and garments, and the densely deforming templates of clothing and body are tracked by jointly optimizing the deformation fields and the canonical double-layer NeRFs.
Results: Compared with existing methods, the approach is fully differentiable and robustly captures both body and clothing motion from dynamic videos. Representing clothing with an independent NeRF also makes it feasible to model implicit fields of general clothes. Experimental evaluations validate its effectiveness on real multi-view and monocular videos.

This paper addresses the challenge of capturing performance for the clothed humans from sparse-view or monocular videos. Previous methods capture the performance of full humans with a personalized template or recover the garments from a single frame with static human poses. However, it is inconvenient to extract cloth semantics and capture clothing motion with one-piece template, while single frame-based methods may suffer from instable tracking across videos. To address these problems, we propose a novel method for human performance capture by tracking clothing and human body motion separately with a double-layer neural radiance fields (NeRFs). Specifically, we propose a double-layer NeRFs for the body and garments, and track the densely deforming template of the clothing and body by jointly optimizing the deformation fields and the canonical double-layer NeRFs. In the optimization, we introduce a physics-aware cloth simulation network which can help generate physically plausible cloth dynamics and body-cloth interactions. Compared with existing methods, our method is fully differentiable and can capture both the body and clothing motion robustly from dynamic videos. Also, our method represents the clothing with an independent NeRFs, allowing us to model implicit fields of general clothes feasibly. The experimental evaluations validate its effectiveness on real multi-view or monocular videos.

NeuFace: Realistic 3D Neural Face Rendering From Multi-View Images
Zheng, Mingwu and Zhang, Haiyu and Yang, Hongyu and Huang, Di



Research question: How to recover faithful and efficient 3D facial representations from multi-view images.
Motivation: The complex spatially-varying reflectance and geometry of faces make recovering 3D facial representations challenging in current studies.
Method: A novel 3D face rendering model, NeuFace, learns accurate and physically-meaningful underlying 3D representations via neural rendering techniques, naturally incorporating neural BRDFs into physically based rendering to capture sophisticated facial geometry and appearance cues in a collaborative manner.
Results: Extensive experiments demonstrate NeuFace's superiority in human face rendering, along with decent generalization to common objects.

Realistic face rendering from multi-view images is beneficial to various computer vision and graphics applications. Due to the complex spatially-varying reflectance properties and geometry characteristics of faces, however, it remains challenging to recover 3D facial representations both faithfully and efficiently in the current studies. This paper presents a novel 3D face rendering model, namely NeuFace, to learn accurate and physically-meaningful underlying 3D representations by neural rendering techniques. It naturally incorporates the neural BRDFs into physically based rendering, capturing sophisticated facial geometry and appearance clues in a collaborative manner. Specifically, we introduce an approximated BRDF integration and a simple yet new low-rank prior, which effectively lower the ambiguities and boost the performance of the facial BRDFs. Extensive experiments demonstrate the superiority of NeuFace in human face rendering, along with a decent generalization ability to common objects. Code is released at https://github.com/aejion/NeuFace.

Cross-Guided Optimization of Radiance Fields With Multi-View Image Super-Resolution for High-Resolution Novel View Synthesis
Yoon, Youngho and Yoon, Kuk-Jin



Research question: How to improve high-resolution novel view synthesis (HRNVS), where the spectral characteristics of coordinate-based networks limit the performance of radiance fields.
Motivation: NeRF can synthesize novel views at arbitrary resolution thanks to its continuous volumetric representation, but its heavy reliance on the spectral characteristics of coordinate-based networks caps the achievable HRNVS quality.
Method: A novel framework with cross-guided optimization of single-image super-resolution (SISR) and radiance fields. Multi-view image super-resolution (MVSR) is performed on train-view images during radiance-field optimization: updated super-resolution results are derived by fusing feature maps from SISR with voxel-based uncertainty fields generated from the integrated errors of train-view images.
Results: Experiments on various benchmark datasets show the method significantly surpasses existing methods on both HRNVS and MVSR.

Novel View Synthesis (NVS) aims at synthesizing an image from an arbitrary viewpoint using multi-view images and camera poses. Among the methods for NVS, Neural Radiance Fields (NeRF) is capable of NVS for an arbitrary resolution as it learns a continuous volumetric representation. However, radiance fields rely heavily on the spectral characteristics of coordinate-based networks. Thus, there is a limit to improving the performance of high-resolution novel view synthesis (HRNVS). To solve this problem, we propose a novel framework using cross-guided optimization of the single-image super-resolution (SISR) and radiance fields. We perform multi-view image super-resolution (MVSR) on train-view images during the radiance fields optimization process. It derives the updated SR result by fusing the feature map obtained from SISR and voxel-based uncertainty fields generated by integrated errors of train-view images. By repeating the updates during radiance fields optimization, train-view images for radiance fields optimization have multi-view consistency and high-frequency details simultaneously, ultimately improving the performance of HRNVS. Experiments of HRNVS and MVSR on various benchmark datasets show that the proposed method significantly surpasses existing methods.

SMOC-Net: Leveraging Camera Pose for Self-Supervised Monocular Object Pose Estimation
Tan, Tao and Dong, Qiulei



Research question: How to exploit un-annotated real images for self-supervised monocular object pose estimation, overcoming the gap between rendered and real images and the high computational cost of training.
Motivation: Existing methods use a time-consuming differentiable renderer for object pose prediction at the training stage, so their performance on real images is limited by the render-to-real gap, and their training process is computationally expensive.
Method: A novel network, SMOC-Net, performs self-supervised monocular object pose estimation by utilizing camera poses predicted from un-annotated real images, within a knowledge distillation framework of a teacher and a student model. The teacher contains a backbone estimation module for initial object pose estimation and an object pose refiner that refines the initial poses using a geometric constraint derived from relative camera poses (the relative-pose constraint); the student gains pose-estimation knowledge from the teacher by imposing the same constraint.
Results: Experiments on two public datasets show SMOC-Net outperforms several state-of-the-art methods while requiring much less training time than differentiable-renderer-based methods.

Recently, self-supervised 6D object pose estimation, where synthetic images with object poses (sometimes jointly with un-annotated real images) are used for training, has attracted much attention in computer vision. Some typical works in literature employ a time-consuming differentiable renderer for object pose prediction at the training stage, so that (i) their performances on real images are generally limited due to the gap between their rendered images and real images and (ii) their training process is computationally expensive. To address the two problems, we propose a novel Network for Self-supervised Monocular Object pose estimation by utilizing the predicted Camera poses from un-annotated real images, called SMOC-Net. The proposed network is explored under a knowledge distillation framework, consisting of a teacher model and a student model. The teacher model contains a backbone estimation module for initial object pose estimation, and an object pose refiner for refining the initial object poses using a geometric constraint (called relative-pose constraint) derived from relative camera poses. The student model gains knowledge for object pose estimation from the teacher model by imposing the relative-pose constraint. Thanks to the relative-pose constraint, SMOC-Net could not only narrow the domain gap between synthetic and real data but also reduce the training cost. Experimental results on two public datasets demonstrate that SMOC-Net outperforms several state-of-the-art methods by a large margin while requiring much less training time than the differentiable-renderer-based methods.
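The relative-pose constraint can be written down directly: for a static object, per-frame object-pose predictions must be consistent under the relative camera motion. A minimal residual is sketched below, assuming 4x4 homogeneous transforms and a world-to-camera convention; the actual loss and parameterization in SMOC-Net may differ.

```python
import numpy as np

def relative_pose_residual(T_obj_c1, T_obj_c2, T_w2c1, T_w2c2):
    """Consistency residual between per-frame object-pose predictions.

    For a static object, the pose predicted in camera 2 should equal the
    camera-1 prediction transported by the relative camera motion:
        T_obj_c2 ~= (T_w2c2 @ inv(T_w2c1)) @ T_obj_c1.
    Returns the Frobenius norm of the discrepancy.
    """
    T_rel = T_w2c2 @ np.linalg.inv(T_w2c1)
    return np.linalg.norm(T_rel @ T_obj_c1 - T_obj_c2)
```

Driving this residual toward zero couples the two frames' predictions without needing any object-pose annotation, which is the essence of the constraint.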

Learning Human Mesh Recovery in 3D Scenes
Shen, Zehong and Cen, Zhi and Peng, Sida and Shuai, Qing and Bao, Hujun and Zhou, Xiaowei



Research question: How to recover the absolute pose and shape of a human in a pre-scanned scene from a single image.
Motivation: Unlike previous methods that perform scene-aware mesh optimization, the idea is to first estimate the absolute position and dense scene contacts with a sparse 3D CNN, and then enhance a pre-trained human mesh recovery network via cross-attention with the derived scene cues.
Method: Joint learning on images and scene geometry reduces the ambiguity caused by depth and occlusion, yielding more reasonable global postures and contacts. Encoding scene-aware cues in the network also makes the method optimization-free, opening up real-time applications.
Results: Experiments show the method recovers accurate and physically-plausible meshes in a single forward pass and outperforms state-of-the-art methods in both accuracy and speed.

We present a novel method for recovering the absolute pose and shape of a human in a pre-scanned scene given a single image. Unlike previous methods that perform sceneaware mesh optimization, we propose to first estimate absolute position and dense scene contacts with a sparse 3D CNN, and later enhance a pretrained human mesh recovery network by cross-attention with the derived 3D scene cues. Joint learning on images and scene geometry enables our method to reduce the ambiguity caused by depth and occlusion, resulting in more reasonable global postures and contacts. Encoding scene-aware cues in the network also allows the proposed method to be optimization-free, and opens up the opportunity for real-time applications. The experiments show that the proposed network is capable of recovering accurate and physically-plausible meshes by a single forward pass and outperforms state-of-the-art methods in terms of both accuracy and speed. Code is available on our project page: https://zju3dv.github.io/sahmr/.
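The cross-attention used above to inject 3D scene cues into the mesh-recovery network is, at its core, standard scaled dot-product attention. A minimal single-head sketch follows; the token shapes and their roles (image tokens as queries, scene-cue tokens as keys/values) are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (single head).

    queries: (Nq, D), e.g. features from the mesh-recovery network;
    keys, values: (Nk, D), e.g. 3D scene-cue tokens. Each query attends
    over all scene cues and returns their softmax-weighted sum.
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ values
```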

Learning Locally Editable Virtual Humans
Ho, Hsuan-I and Xue, Lixin and Song, Jie and Hilliges, Otmar



Research question: This paper proposes a novel hybrid representation and an end-to-end trainable network architecture for fully editable and customizable neural avatars.
Motivation: Existing neural avatar models are cumbersome to use and lack consistency, motivating a representation that combines the modeling power of neural fields with the ease of use and inherent 3D consistency of skinned meshes.
Method: A trainable feature codebook stores local geometry and texture features on the vertices of a deformable body model, exploiting its consistent topology under articulation. This representation is then employed in a generative auto-decoder architecture that fits unseen scans and samples realistic avatars with varied appearances and geometries; it also allows local editing by swapping local features between 3D assets.
Results: Quantitative and qualitative experiments show the method generates diverse, detailed avatars and achieves better model-fitting performance than state-of-the-art methods.

In this paper, we propose a novel hybrid representation and end-to-end trainable network architecture to model fully editable and customizable neural avatars. At the core of our work lies a representation that combines the modeling power of neural fields with the ease of use and inherent 3D consistency of skinned meshes. To this end, we construct a trainable feature codebook to store local geometry and texture features on the vertices of a deformable body model, thus exploiting its consistent topology under articulation. This representation is then employed in a generative auto-decoder architecture that admits fitting to unseen scans and sampling of realistic avatars with varied appearances and geometries. Furthermore, our representation allows local editing by swapping local features between 3D assets. To verify our method for avatar creation and editing, we contribute a new high-quality dataset, dubbed CustomHumans, for training and evaluation. Our experiments quantitatively and qualitatively show that our method generates diverse detailed avatars and achieves better model fitting performance compared to state-of-the-art methods. Our code and dataset are available at https://ait.ethz.ch/custom-humans.

Co-SLAM: Joint Coordinate and Sparse Parametric Encodings for Neural Real-Time SLAM
Wang, Hengyi and Wang, Jingwen and Agapito, Lourdes



Research question: To develop a neural RGB-D SLAM system based on a hybrid representation that performs robust real-time camera tracking and high-fidelity surface reconstruction.
Motivation: Current neural SLAM systems struggle to represent high-frequency local features, maintain surface coherence, and complete unobserved regions.
Method: Co-SLAM represents the scene as a multi-resolution hash grid, exploiting its fast convergence and ability to represent high-frequency local features, and incorporates one-blob encoding to encourage surface coherence and completion in unobserved areas. Its ray sampling strategy further enables global bundle adjustment over all keyframes, without the keyframe selection that competing neural SLAM methods need to maintain a small set of active keyframes.
Results: Experimental results show Co-SLAM runs at 10-17 Hz and achieves state-of-the-art scene reconstruction and competitive tracking performance on various datasets and benchmarks (ScanNet, TUM, Replica, Synthetic RGBD).

We present Co-SLAM, a neural RGB-D SLAM system based on a hybrid representation, that performs robust camera tracking and high-fidelity surface reconstruction in real time. Co-SLAM represents the scene as a multi-resolution hash-grid to exploit its high convergence speed and ability to represent high-frequency local features. In addition, Co-SLAM incorporates one-blob encoding, to encourage surface coherence and completion in unobserved areas. This joint parametric-coordinate encoding enables real-time and robust performance by bringing the best of both worlds: fast convergence and surface hole filling. Moreover, our ray sampling strategy allows Co-SLAM to perform global bundle adjustment over all keyframes instead of requiring keyframe selection to maintain a small number of active keyframes as competing neural SLAM approaches do. Experimental results show that Co-SLAM runs at 10-17Hz and achieves state-of-the-art scene reconstruction results, and competitive tracking performance in various datasets and benchmarks (ScanNet, TUM, Replica, Synthetic RGBD). Project page: https://hengyiwang.github.io/projects/CoSLAM
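One-blob encoding, mentioned above, replaces a hard one-hot binning of a scalar input with a Gaussian "blob" of activation spread over neighboring bins, giving the network a smooth yet localized input. A minimal sketch (the bin count and bandwidth defaults are illustrative, not Co-SLAM's settings):

```python
import numpy as np

def one_blob(x, bins=16, sigma=None):
    """One-blob encoding of a scalar x in [0, 1].

    Returns a (bins,) vector where bins near x are strongly activated
    and activation falls off as a Gaussian, unlike a one-hot encoding
    that activates exactly one bin.
    """
    sigma = sigma or 1.0 / bins
    centers = (np.arange(bins) + 0.5) / bins
    return np.exp(-0.5 * ((x - centers) / sigma) ** 2)
```

The smooth spill-over to neighboring bins is what encourages coherent surfaces and plausible completion in regions the camera never observed.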

Incremental 3D Semantic Scene Graph Prediction From RGB Sequences
Wu, Shun-Cheng and Tateno, Keisuke and Navab, Nassir and Tombari, Federico



Research question: How to incrementally build a consistent 3D semantic scene graph of a scene from an RGB image sequence in real time.
Motivation: 3D semantic scene graphs are a compact, holistic representation that describes individual objects and their relations and enables many scene-reasoning tasks, but existing 3D estimation methods mostly rely on dense inputs.
Method: A real-time framework consisting of a novel incremental entity estimation pipeline and a scene graph prediction network: the pipeline simultaneously reconstructs a sparse point map and fuses entity estimates from the input images, while the network estimates the 3D semantic scene graph via iterative message passing over multi-view and geometric features extracted from the scene entities.
Results: Extensive experiments on the 3RScan dataset show the effectiveness of the method on this challenging task, outperforming state-of-the-art approaches.

3D semantic scene graphs are a powerful holistic representation as they describe the individual objects and depict the relation between them. They are compact high-level graphs that enable many tasks requiring scene reasoning. In real-world settings, existing 3D estimation methods produce robust predictions that mostly rely on dense inputs. In this work, we propose a real-time framework that incrementally builds a consistent 3D semantic scene graph of a scene given an RGB image sequence. Our method consists of a novel incremental entity estimation pipeline and a scene graph prediction network. The proposed pipeline simultaneously reconstructs a sparse point map and fuses entity estimation from the input images. The proposed network estimates 3D semantic scene graphs with iterative message passing using multi-view and geometric features extracted from the scene entities. Extensive experiments on the 3RScan dataset show the effectiveness of the proposed method in this challenging task, outperforming state-of-the-art approaches.

TexPose: Neural Texture Learning for Self-Supervised 6D Object Pose Estimation
Chen, Hanzhi and Manhardt, Fabian and Navab, Nassir and Busam, Benjamin



Research question: Neural texture learning for 6D object pose estimation from synthetic data and a few unlabelled real images.
Motivation: Existing methods depend strongly on co-modalities or additional refinement, which were previously necessary to provide convergent training signals.
Method: A new learning scheme splits the problem into two sub-optimizations, texture learning and pose learning: realistic object textures are predicted from real image collections, while pose estimation is learned from pixel-perfect synthetic data. Combining the two capabilities allows photorealistic novel views to be synthesized that supervise the pose estimator with accurate geometry. A surfel-based adversarial training loss together with texture regularization from synthetic data mitigates pose noise and segmentation imperfection during the texture learning phase.
Results: Experiments show the approach significantly outperforms recent state-of-the-art methods without ground-truth pose annotations and generalizes substantially better to unseen scenes; it also markedly improves the adopted pose estimators even when initialized with much inferior performance.

In this paper, we introduce neural texture learning for 6D object pose estimation from synthetic data and a few unlabelled real images. Our major contribution is a novel learning scheme which removes the drawbacks of previous works, namely the strong dependency on co-modalities or additional refinement. These have been previously necessary to provide training signals for convergence. We formulate such a scheme as two sub-optimisation problems on texture learning and pose learning. We separately learn to predict realistic texture of objects from real image collections and learn pose estimation from pixel-perfect synthetic data. Combining these two capabilities allows then to synthesise photorealistic novel views to supervise the pose estimator with accurate geometry. To alleviate pose noise and segmentation imperfection present during the texture learning phase, we propose a surfel-based adversarial training loss together with texture regularisation from synthetic data. We demonstrate that the proposed approach significantly outperforms the recent state-of-the-art methods without ground-truth pose annotations and demonstrates substantial generalisation improvements towards unseen scenes. Remarkably, our scheme improves the adopted pose estimators substantially even when initialised with much inferior performance.

DynIBaR: Neural Dynamic Image-Based Rendering
Li, Zhengqi and Wang, Qianqian and Cole, Forrester and Tucker, Richard and Snavely, Noah



Research question: How to synthesize novel views of complex dynamic scenes from monocular video.
Motivation: Methods based on time-varying neural radiance fields (i.e. dynamic NeRFs) perform well on this task, but for long videos with complex object motion and uncontrolled camera trajectories they can produce blurry or inaccurate renderings, limiting real-world use.
Method: Instead of encoding the entire dynamic scene in the weights of MLPs, a new approach adopts a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views in a scene-motion-aware manner.
Results: The system retains prior methods' ability to model complex scenes and view-dependent effects while also synthesizing photo-realistic novel views from long videos with complex scene dynamics and unconstrained camera trajectories. It significantly improves over existing methods on dynamic scene datasets and handles in-the-wild videos with challenging camera and object motion, where existing methods fail to produce high-quality renderings.

We address the problem of synthesizing novel views from a monocular video depicting a complex dynamic scene. State-of-the-art methods based on temporally varying Neural Radiance Fields (aka dynamic NeRFs) have shown impressive results on this task. However, for long videos with complex object motions and uncontrolled camera trajectories, these methods can produce blurry or inaccurate renderings, hampering their use in real-world applications. Instead of encoding the entire dynamic scene within the weights of MLPs, we present a new approach that addresses these limitations by adopting a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views in a scene motion-aware manner. Our system retains the advantages of prior methods in its ability to model complex scenes and view-dependent effects, but also enables synthesizing photo-realistic novel views from long videos featuring complex scene dynamics with unconstrained camera trajectories. We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets, and also apply our approach to in-the-wild videos with challenging camera and object motion, where prior methods fail to produce high-quality renderings.

Efficient Second-Order Plane Adjustment
Zhou, Lipu



Research question: In 3D reconstruction, how to optimize planes and sensor poses to minimize point-to-plane distances.
Motivation: 3D reconstruction commonly uses depth sensors such as RGB-D cameras and LiDARs, and requires estimating the optimal planes and sensor poses that minimize point-to-plane distances; this is the counterpart of bundle adjustment in visual reconstruction.
Method: This paper adopts Newton's method to solve the plane adjustment (PA) problem efficiently. Specifically, given the poses, the optimal planes have a closed-form solution, so planes can be eliminated from the cost function, greatly reducing the number of variables. Moreover, since the optimal planes are functions of the poses, the formulation guarantees that the optimal planes for the current pose estimates are obtained at each iteration, which benefits convergence. The difficulty lies in efficiently computing the Hessian matrix and the gradient of the resulting cost; this paper provides an efficient solution.
Results: Experiments show that our algorithm outperforms state-of-the-art algorithms.

Planes are generally used in 3D reconstruction for depth sensors, such as RGB-D cameras and LiDARs. This paper focuses on the problem of estimating the optimal planes and sensor poses to minimize the point-to-plane distance. The resulting least-squares problem is referred to as plane adjustment (PA) in the literature, which is the counterpart of bundle adjustment (BA) in visual reconstruction. Iterative methods are adopted to solve these least-squares problems. Typically, Newton's method is rarely used for a large-scale least-squares problem, due to the high computational complexity of the Hessian matrix. Instead, methods using an approximation of the Hessian matrix, such as the Levenberg-Marquardt (LM) method, are generally adopted. This paper adopts Newton's method to efficiently solve the PA problem. Specifically, given the poses, the optimal planes have a closed-form solution. Thus, we can eliminate the planes from the cost function, which significantly reduces the number of variables. Furthermore, as the optimal planes are functions of the poses, this method actually ensures that the optimal planes for the current estimated poses can be obtained at each iteration, which benefits the convergence. The difficulty lies in how to efficiently compute the Hessian matrix and the gradient of the resulting cost. This paper provides an efficient solution. Empirical evaluation shows that our algorithm outperforms the state-of-the-art algorithms.
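The closed-form step the abstract relies on, that given the sensor poses the best-fit plane of a point set has an analytic solution, is the classical least-squares plane fit: the plane passes through the centroid, and its normal is the smallest-eigenvalue eigenvector of the scatter matrix. A minimal sketch (function names are ours, not from the paper):

```python
import numpy as np

def optimal_plane(points):
    """Closed-form least-squares plane for an (N, 3) point set: it passes
    through the centroid, and its normal is the eigenvector of the scatter
    matrix with the smallest eigenvalue. Returns (normal, d) with the
    plane given by normal . x + d = 0."""
    centroid = points.mean(axis=0)
    centered = points - centroid
    _, vecs = np.linalg.eigh(centered.T @ centered)  # ascending eigenvalues
    normal = vecs[:, 0]
    return normal, float(-normal @ centroid)

def point_to_plane_rmse(points, normal, d):
    """Root-mean-square point-to-plane distance, the quantity PA minimizes."""
    return float(np.sqrt(np.mean((points @ normal + d) ** 2)))
```

Because this solution is exact for any fixed poses, substituting it back into the cost leaves only the pose variables, which is what makes Newton's method tractable here.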

Fresnel Microfacet BRDF: Unification of Polari-Radiometric Surface-Body Reflection
Ichikawa, Tomoki and Fukao, Yoshiki and Nobuhara, Shohei and Nishino, Ko



Research question: In computer-vision applications, existing reflection models cannot accurately represent reflected radiance due to physical incompatibility and limited applicability.
Motivation: To address these shortcomings, this paper proposes a new reflectance model, the Fresnel Microfacet BRDF.
Method: Model the Fresnel reflection and transmission of the surface microgeometry with a collection of oriented mirror facets, accounting for both body and surface reflection.
Results: Experiments demonstrate the model's effectiveness in accuracy, expressive power, image-based estimation, and geometry recovery.

Computer vision applications have heavily relied on the linear combination of Lambertian diffuse and microfacet specular reflection models for representing reflected radiance, which turns out to be physically incompatible and limited in applicability. In this paper, we derive a novel analytical reflectance model, which we refer to as Fresnel Microfacet BRDF model, that is physically accurate and generalizes to various real-world surfaces. Our key idea is to model the Fresnel reflection and transmission of the surface microgeometry with a collection of oriented mirror facets, both for body and surface reflections. We carefully derive the Fresnel reflection and transmission for each microfacet as well as the light transport between them in the subsurface. This physically-grounded modeling also allows us to express the polarimetric behavior of reflected light in addition to its radiometric behavior. That is, FMBRDF unifies not only body and surface reflections but also light reflection in radiometry and polarization and represents them in a single model. Experimental results demonstrate its effectiveness in accuracy, expressive power, image-based estimation, and geometry recovery.

DiffusioNeRF: Regularizing Neural Radiance Fields With Denoising Diffusion Models
Wynn, Jamie and Turmukhambetov, Daniyar



Research question: With too few training views, the scene geometry and color fields of Neural Radiance Fields (NeRFs) are severely under-constrained, which can lead to artifacts.
Motivation: To alleviate this problem, we learn a prior over scene geometry and color using a denoising diffusion model (DDM).
Method: We train the DDM on the synthetic Hypersim dataset so that it predicts the gradient of the logarithm of the joint probability distribution of color and depth patches. During NeRF training, random RGBD patches are rendered, and the estimated log-likelihood gradient is backpropagated to the color and density fields.
Results: Experiments show that the learned prior improves both the reconstructed geometry and generalization to novel views. Evaluations on LLFF, the most relevant dataset, and on DTU show better reconstruction quality than other NeRF methods.

Under good conditions, Neural Radiance Fields (NeRFs) have shown impressive results on novel view synthesis tasks. NeRFs learn a scene's color and density fields by minimizing the photometric discrepancy between training views and differentiable renderings of the scene. Once trained from a sufficient set of views, NeRFs can generate novel views from arbitrary camera positions. However, the scene geometry and color fields are severely under-constrained, which can lead to artifacts, especially when trained with few input views. To alleviate this problem we learn a prior over scene geometry and color, using a denoising diffusion model (DDM). Our DDM is trained on RGBD patches of the synthetic Hypersim dataset and can be used to predict the gradient of the logarithm of a joint probability distribution of color and depth patches. We show that these gradients of logarithms of RGBD patch priors serve to regularize the geometry and color of a scene. During NeRF training, random RGBD patches are rendered and the estimated gradient of the log-likelihood is backpropagated to the color and density fields. Evaluations on LLFF, the most relevant dataset, show that our learned prior achieves improved quality in the reconstructed geometry and improved generalization to novel views. Evaluations on DTU show improved reconstruction quality among NeRF methods.

Learning Neural Parametric Head Models
Giebenhain, Simon and Kirschstein, Tobias and Georgopoulos, Markos and Rünz, Martin and Agapito, Lourdes and Nießner, Matthias



Research question: Propose a novel 3D morphable head model based on hybrid neural fields.
Motivation: Existing models cannot disentangle identity and expression into separate latent spaces; our goal is to address this with a new model.
Method: At the core of our model is a neural parametric representation that represents a person's identity and expression in disjoint latent spaces. We capture identity with a signed distance field (SDF) and model facial expressions with a neural deformation field. In addition, we introduce an ensemble of local fields centered around facial anchor points to achieve high-fidelity local detail.
Results: We train our model on a newly captured dataset of over 3,700 head scans from 203 different identities; the dataset significantly exceeds existing datasets in both quality and geometric completeness. Experiments show that our method outperforms state-of-the-art methods in fitting error and reconstruction quality.

We propose a novel 3D morphable model for complete human heads based on hybrid neural fields. At the core of our model lies a neural parametric representation that disentangles identity and expressions in disjoint latent spaces. To this end, we capture a person's identity in a canonical space as a signed distance field (SDF), and model facial expressions with a neural deformation field. In addition, our representation achieves high-fidelity local detail by introducing an ensemble of local fields centered around facial anchor points. To facilitate generalization, we train our model on a newly-captured dataset of over 3700 head scans from 203 different identities using a custom high-end 3D scanning setup. Our dataset significantly exceeds comparable existing datasets, both with respect to quality and completeness of geometry, averaging around 3.5M mesh faces per scan. Finally, we demonstrate that our approach outperforms state-of-the-art methods in terms of fitting error and reconstruction quality.

Removing Objects From Neural Radiance Fields
Weder, Silvan and Garcia-Hernando, Guillermo and Monszpart, Áron and Pollefeys, Marc and Brostow, Gabriel J. and Firman, Michael and Vicente, Sara



Research question: How to effectively remove personal or unsightly objects from a NeRF scene representation.
Motivation: Before sharing a NeRF, it may be desirable to remove personal information or unsightly objects it contains, which is difficult with current NeRF editing frameworks.
Method: We propose a NeRF inpainting method guided by a user-provided mask that leverages recent advances in 2D image inpainting and uses a confidence-based view-selection procedure to ensure the inpainted NeRF is consistent in 3D.
Results: Validated on a newly proposed, challenging dataset, the method synthesizes plausible, multi-view-consistent NeRF inpaintings, outperforming competing methods.

Neural Radiance Fields (NeRFs) are emerging as a ubiquitous scene representation that allows for novel view synthesis. Increasingly, NeRFs will be shareable with other people. Before sharing a NeRF, though, it might be desirable to remove personal information or unsightly objects. Such removal is not easily achieved with the current NeRF editing frameworks. We propose a framework to remove objects from a NeRF representation created from an RGB-D sequence. Our NeRF inpainting method leverages recent work in 2D image inpainting and is guided by a user-provided mask. Our algorithm is underpinned by a confidence based view selection procedure. It chooses which of the individual 2D inpainted images to use in the creation of the NeRF, so that the resulting inpainted NeRF is 3D consistent. We show that our method for NeRF editing is effective for synthesizing plausible inpaintings in a multi-view coherent manner, outperforming competing methods. We validate our approach by proposing a new and still-challenging dataset for the task of NeRF inpainting.

Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction
Zhang, Mingfang and Wang, Jinglu and Li, Xiao and Huang, Yifei and Sato, Yoichi and Lu, Yan



Research question: How to synthesize views effectively from sparse inputs, particularly overcoming the performance limits for surfaces imaged at oblique angles.
Motivation: The Multiplane Image (MPI) is an effective and efficient representation, but its fixed structure limits performance, especially for surfaces imaged at oblique angles.
Method: We introduce the Structural MPI (S-MPI), whose plane structure approximates 3D scenes concisely. By conveying RGBA contexts with geometrically faithful structures, the S-MPI directly bridges view synthesis and 3D reconstruction.
Results: Despite being intuitive and in demand, applying S-MPI introduces major challenges, such as high-fidelity approximation of both RGBA layers and plane poses, multi-view consistency, modeling of non-planar regions, and efficient rendering with intersecting planes. Accordingly, we propose a transformer-based network built on a segmentation model that predicts compact, expressive S-MPI layers with their corresponding masks, poses, and RGBA contexts. Non-planar regions are handled as a special case within our unified framework, and multi-view consistency is ensured by sharing global proxy embeddings. Extensive experiments show that our method outperforms both previous state-of-the-art MPI-based view-synthesis methods and planar reconstruction methods.

The Multiplane Image (MPI), containing a set of fronto-parallel RGBA layers, is an effective and efficient representation for view synthesis from sparse inputs. Yet, its fixed structure limits the performance, especially for surfaces imaged at oblique angles. We introduce the Structural MPI (S-MPI), where the plane structure approximates 3D scenes concisely. Conveying RGBA contexts with geometrically-faithful structures, the S-MPI directly bridges view synthesis and 3D reconstruction. It not only overcomes the critical limitations of MPI, i.e., discretization artifacts from sloped surfaces and abuse of redundant layers, but can also achieve planar 3D reconstruction. Despite the intuition and demand of applying S-MPI, great challenges are introduced, e.g., high-fidelity approximation for both RGBA layers and plane poses, multi-view consistency, non-planar regions modeling, and efficient rendering with intersected planes. Accordingly, we propose a transformer-based network based on a segmentation model. It predicts compact and expressive S-MPI layers with their corresponding masks, poses, and RGBA contexts. Non-planar regions are inclusively handled as a special case in our unified framework. Multi-view consistency is ensured by sharing global proxy embeddings, which encode plane-level features covering the complete 3D scenes with aligned coordinates. Intensive experiments show that our method outperforms both previous state-of-the-art MPI-based view synthesis methods and planar reconstruction methods.

3D Human Pose Estimation via Intuitive Physics
Tripathi, Shashank and Müller, Lea and Huang, Chun-Hao P. and Taheri, Omid and Black, Michael J. and Tzionas, Dimitrios



Research question: Estimating 3D humans from images often produces implausible bodies that lean, float, or penetrate the floor.
Motivation: Current methods ignore the fact that bodies are typically supported by the scene. A physics engine could enforce physical plausibility, but such engines are not differentiable, rely on unrealistic proxy bodies, and are difficult to integrate into existing optimization and learning frameworks.
Method: We exploit intuitive-physics (IP) terms that can be inferred from a 3D SMPL body interacting with the scene. Inspired by biomechanics, we infer the pressure heatmap on the body, the Center of Pressure (CoP) from the heatmap, and the SMPL body's Center of Mass (CoM). With these, we develop IPMAN, which estimates a 3D body in a "stable" configuration from a color image by encouraging plausible floor contact and overlapping CoP and CoM.
Results: We evaluate IPMAN on standard datasets and on MoYo, a new dataset with synchronized multi-view images, ground-truth 3D bodies with complex poses, body-floor contact, CoM, and pressure. IPMAN produces more plausible results than the state of the art, improving accuracy for static poses without hurting dynamic ones. Code and data are available at https://ipman.is.tue.mpg.de/.

Estimating 3D humans from images often produces implausible bodies that lean, float, or penetrate the floor. Such methods ignore the fact that bodies are typically supported by the scene. A physics engine can be used to enforce physical plausibility, but these are not differentiable, rely on unrealistic proxy bodies, and are difficult to integrate into existing optimization and learning frameworks. In contrast, we exploit novel intuitive-physics (IP) terms that can be inferred from a 3D SMPL body interacting with the scene. Inspired by biomechanics, we infer the pressure heatmap on the body, the Center of Pressure (CoP) from the heatmap, and the SMPL body's Center of Mass (CoM). With these, we develop IPMAN, to estimate a 3D body from a color image in a "stable" configuration by encouraging plausible floor contact and overlapping CoP and CoM. Our IP terms are intuitive, easy to implement, fast to compute, differentiable, and can be integrated into existing optimization and regression methods. We evaluate IPMAN on standard datasets and MoYo, a new dataset with synchronized multi-view images, ground-truth 3D bodies with complex poses, body-floor contact, CoM and pressure. IPMAN produces more plausible results than the state of the art, improving accuracy for static poses, while not hurting dynamic ones. Code and data are available for research at https://ipman.is.tue.mpg.de/.
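The "stable configuration" objective, encouraging the Center of Pressure and the Center of Mass to overlap, can be illustrated with a toy penalty on their horizontal offset. This is only an illustration of the idea under a y-up floor assumption, not the actual IPMAN loss:

```python
import numpy as np

def stability_loss(com, cop):
    """Horizontal (floor-plane) distance between the Center of Mass and the
    Center of Pressure; zero when the CoM sits directly above the CoP.
    Assumes a y-up world, so the floor plane is spanned by x and z."""
    return float(np.linalg.norm(com[[0, 2]] - cop[[0, 2]]))
```

Being a plain Euclidean distance, a term like this is differentiable and drops directly into optimization- or regression-based pipelines, which is the integration advantage the abstract highlights over physics engines.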

Learning To Predict Scene-Level Implicit 3D From Posed RGBD Data
Kulkarni, Nilesh and Jin, Linyi and Johnson, Justin and Fouhey, David F.



Research question: How to learn to predict scene-level implicit functions for 3D reconstruction from posed RGBD data.
Motivation: Implicit functions for 3D reconstruction have typically been tied to meshes, but the authors show that one can be trained using only a set of posed RGBD images.
Method: The system learns scene-level implicit functions from posed RGBD data, mapping a previously unseen RGB image to a 3D reconstruction of the scene.
Results: The method matches and sometimes outperforms current mesh-supervised methods and shows better robustness to sparse data.

We introduce a method that can learn to predict scene-level implicit functions for 3D reconstruction from posed RGBD data. At test time, our system maps a previously unseen RGB image to a 3D reconstruction of a scene via implicit functions. While implicit functions for 3D reconstruction have often been tied to meshes, we show that we can train one using only a set of posed RGBD images. This setting may help 3D reconstruction unlock the sea of accelerometer+RGBD data that is coming with new phones. Our system, D2-DRDF, can match and sometimes outperform current methods that use mesh supervision and shows better robustness to sparse data.

Level-S2fM
Xiao, Yuxi and Xue, Nan and Wu, Tianfu and Xia, Gui-Song



Research question: How to estimate camera poses and scene geometry from a set of uncalibrated images.
Motivation: Existing incremental Structure-from-Motion (SfM) methods struggle with the inevitable two-view and few-view configurations, which complicate the optimization of coordinate MLPs for volumetric neural rendering.
Method: We propose Level-S2fM, which estimates camera poses and scene geometry by learning coordinate MLPs for implicit surfaces and radiance fields from the established keypoint correspondences.
Results: Experiments show that Level-S2fM not only achieves promising results on camera pose estimation and scene geometry reconstruction, but also offers a promising route to neural implicit rendering without knowing the camera extrinsics beforehand.

This paper presents a neural incremental Structure-from-Motion (SfM) approach, Level-S2fM, which estimates the camera poses and scene geometry from a set of uncalibrated images by learning coordinate MLPs for the implicit surfaces and the radiance fields from the established keypoint correspondences. Our novel formulation poses some new challenges due to inevitable two-view and few-view configurations in the incremental SfM pipeline, which complicates the optimization of coordinate MLPs for volumetric neural rendering with unknown camera poses. Nevertheless, we demonstrate that the strong inductive bias conveyed in the 2D correspondences is promising to tackle those challenges by exploiting the relationship between the ray sampling schemes. Based on this, we revisit the pipeline of incremental SfM and renew the key components, including two-view geometry initialization, the camera poses registration, the 3D points triangulation, and Bundle Adjustment, with a fresh perspective based on neural implicit surfaces. By unifying the scene geometry in small MLP networks through coordinate MLPs, our Level-S2fM treats the zero-level set of the implicit surface as an informative top-down regularization to manage the reconstructed 3D points, reject the outliers in correspondences via querying SDF, and refine the estimated geometries by NBA (Neural BA). Not only does our Level-S2fM lead to promising results on camera pose estimation and scene geometry reconstruction, but it also shows a promising way for neural implicit rendering without knowing the camera extrinsics beforehand.

MEGANE: Morphable Eyeglass and Avatar Network
Li, Junxuan and Saito, Shunsuke and Simon, Tomas and Lombardi, Stephen and Li, Hongdong and Saragih, Jason



Research question: How to capture the geometric and appearance interactions between eyeglasses and faces so that glasses can be modeled accurately in virtual face representations.
Motivation: Most current models treat eyeglasses and faces independently and cannot accurately capture their physical interactions; attempts that treat the interaction as a 2D image-synthesis problem suffer from view and temporal inconsistencies.
Method: We propose a 3D compositional morphable eyeglass model that combines surface geometry and a volumetric representation to efficiently support large variations in eyeglass topology. The model naturally retains correspondences across glasses, greatly simplifying explicit geometric edits such as lens insertion and frame deformation. It is also relightable under point lights and natural illumination, supporting high-fidelity rendering of various frame materials.
Results: Comparisons with state-of-the-art methods demonstrate significant quality improvements.

Eyeglasses play an important role in the perception of identity. Authentic virtual representations of faces can benefit greatly from their inclusion. However, modeling the geometric and appearance interactions of glasses and the face of virtual representations of humans is challenging. Glasses and faces affect each other's geometry at their contact points, and also induce appearance changes due to light transport. Most existing approaches do not capture these physical interactions since they model eyeglasses and faces independently. Others attempt to resolve interactions as a 2D image synthesis problem and suffer from view and temporal inconsistencies. In this work, we propose a 3D compositional morphable model of eyeglasses that accurately incorporates high-fidelity geometric and photometric interaction effects. To support the large variation in eyeglass topology efficiently, we employ a hybrid representation that combines surface geometry and a volumetric representation. Unlike volumetric approaches, our model naturally retains correspondences across glasses, and hence explicit modification of geometry, such as lens insertion and frame deformation, is greatly simplified. In addition, our model is relightable under point lights and natural illumination, supporting high-fidelity rendering of various frame materials, including translucent plastic and metal within a single morphable model. Importantly, our approach models global light transport effects, such as casting shadows between faces and glasses. Our morphable model for eyeglasses can also be fit to novel glasses via inverse rendering. We compare our approach to state-of-the-art methods and demonstrate significant quality improvements.

Rethinking the Approximation Error in 3D Surface Fitting for Point Cloud Normal Estimation
Du, Hang and Yan, Xuejun and Wang, Jingjing and Xie, Di and Pu, Shiliang



Research question: Existing point-cloud normal-estimation methods compute normals by locally fitting geometric surfaces, but they overlook the approximation error of the fitting problem, resulting in less accurate fitted surfaces.
Motivation: To address this, we carry out an in-depth analysis of the approximation error in the surface-fitting problem and design two basic principles to improve normal-estimation accuracy.
Method: Our approach has two main steps. First, we apply a Z-direction Transform to rotate local patches for a better surface fit with lower approximation error. Second, we model the normal-estimation error as a learnable term. Both principles are implemented with deep neural networks and integrated seamlessly with state-of-the-art normal-estimation methods.
Results: Extensive experiments verify the advantages of our approach for point-cloud normal estimation, pushing state-of-the-art performance on both synthetic and real-world datasets.

Most existing approaches for point cloud normal estimation aim to locally fit a geometric surface and calculate the normal from the fitted surface. Recently, learning-based methods have adopted a routine of predicting point-wise weights to solve the weighted least-squares surface fitting problem. Despite achieving remarkable progress, these methods overlook the approximation error of the fitting problem, resulting in a less accurate fitted surface. In this paper, we first carry out in-depth analysis of the approximation error in the surface fitting problem. Then, in order to bridge the gap between estimated and precise surface normals, we present two basic design principles: 1) applies the Z-direction Transform to rotate local patches for a better surface fitting with a lower approximation error; 2) models the error of the normal estimation as a learnable term. We implement these two principles using deep neural networks, and integrate them with the state-of-the-art (SOTA) normal estimation methods in a plug-and-play manner. Extensive experiments verify our approaches bring benefits to point cloud normal estimation and push the frontier of state-of-the-art performance on both synthetic and real-world datasets. The code is available at https://github.com/hikvision-research/3DVision.
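The first design principle, rotating local patches toward the Z direction before fitting, can be sketched with a classical PCA normal and a Rodrigues rotation that sends that normal to +Z. The paper's actual Z-direction Transform is learned jointly with the network; this is a hand-rolled approximation:

```python
import numpy as np

def pca_normal(patch):
    """Unoriented patch normal: eigenvector of the scatter matrix with the
    smallest eigenvalue (np.linalg.eigh returns ascending eigenvalues)."""
    c = patch - patch.mean(axis=0)
    _, vecs = np.linalg.eigh(c.T @ c)
    return vecs[:, 0]

def z_align(patch):
    """Rotate an (N, 3) local patch so its PCA normal points along +Z,
    reducing the approximation error of a subsequent height-field fit."""
    n = pca_normal(patch)
    if n[2] < 0:                       # fix the sign ambiguity
        n = -n
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(n, z)                 # rotation axis, scaled by sin(angle)
    s, c = np.linalg.norm(v), float(n @ z)
    if s < 1e-12:                      # already aligned
        return patch.copy()
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    R = np.eye(3) + vx + vx @ vx * ((1.0 - c) / s**2)  # Rodrigues' formula
    return patch @ R.T
```

After alignment, the patch can be fit as a height field z = f(x, y), where a tilted patch would otherwise incur a larger fitting error.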

VolRecon: Volume Rendering of Signed Ray Distance Functions for Generalizable Multi-View Reconstruction
Ren, Yufan and Wang, Fangjinhua and Zhang, Tong and Pollefeys, Marc and Süsstrunk, Sabine



Research question: Existing neural implicit reconstruction methods lack generalizability to new scenes.
Motivation: To address this, the authors propose VolRecon, a generalizable implicit reconstruction method with a Signed Ray Distance Function (SRDF).
Method: VolRecon combines projection features aggregated from multi-view features with volume features interpolated from a coarse global feature volume, reconstructing scenes with fine detail and little noise. A ray transformer computes SRDF values at sampled points, after which color and depth are rendered.
Results: On the DTU dataset, VolRecon outperforms SparseNeuS by about 30% in sparse-view reconstruction and achieves accuracy comparable to MVSNet in full-view reconstruction. It also generalizes well on the large-scale ETH3D benchmark.

The success of the Neural Radiance Fields (NeRF) in novel view synthesis has inspired researchers to propose neural implicit scene reconstruction. However, most existing neural implicit reconstruction methods optimize per-scene parameters and therefore lack generalizability to new scenes. We introduce VolRecon, a novel generalizable implicit reconstruction method with Signed Ray Distance Function (SRDF). To reconstruct the scene with fine details and little noise, VolRecon combines projection features aggregated from multi-view features, and volume features interpolated from a coarse global feature volume. Using a ray transformer, we compute SRDF values of sampled points on a ray and then render color and depth. On DTU dataset, VolRecon outperforms SparseNeuS by about 30% in sparse view reconstruction and achieves comparable accuracy as MVSNet in full view reconstruction. Furthermore, our approach exhibits good generalization performance on the large-scale ETH3D benchmark.

CutMIB: Boosting Light Field Super-Resolution via Multi-View Image Blending
Xiao, Zeyu and Liu, Yutong and Gao, Ruisheng and Xiong, Zhiwei



Research question: How to improve the performance of deep neural networks for light field super-resolution through data augmentation.
Motivation: Existing data-augmentation strategies have proven useful for single-image super-resolution, but little research addresses augmentation for light field super-resolution, where multi-view information must be exploited.
Method: For the first time in light field super-resolution, we propose an effective data-augmentation strategy, CutMIB: cut low-resolution patches from each view at the same location, blend them into a blended patch, and paste the blended patch into the corresponding regions of the high-resolution light field views, improving existing light field SR networks while keeping their structures unchanged.
Results: Experiments show that CutMIB improves both the reconstruction performance and the angular consistency of existing light field SR networks; its effectiveness is further verified on real-world light field SR and light field denoising.

Data augmentation (DA) is an efficient strategy for improving the performance of deep neural networks. Recent DA strategies have demonstrated utility in single image super-resolution (SR). Little research has, however, focused on the DA strategy for light field SR, in which multi-view information utilization is required. For the first time in light field SR, we propose a potent DA strategy called CutMIB to improve the performance of existing light field SR networks while keeping their structures unchanged. Specifically, CutMIB first cuts low-resolution (LR) patches from each view at the same location. Then CutMIB blends all LR patches to generate the blended patch and finally pastes the blended patch to the corresponding regions of high-resolution light field views, and vice versa. By doing so, CutMIB enables light field SR networks to learn from implicit geometric information during the training stage. Experimental results demonstrate that CutMIB can improve the reconstruction performance and the angular consistency of existing light field SR networks. We further verify the effectiveness of CutMIB on real-world light field SR and light field denoising. The implementation code is available at https://github.com/zeyuxiao1997/CutMIB.
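The cut-blend-paste step can be sketched in a few lines of NumPy, assuming, as a simplification, that the LR views have already been upsampled to the HR resolution so patches can be pasted directly (the paper also applies the operation in the reverse direction):

```python
import numpy as np

def cutmib(lr_views, hr_views, top, left, size):
    """Sketch of the CutMIB blend-and-paste (simplified): cut the patch at
    the same location from every LR view, average-blend the patches, and
    paste the blended patch into the corresponding region of every HR view.
    Assumes lr_views and hr_views are (num_views, H, W, C) arrays at the
    same resolution (i.e. the LR views were upsampled beforehand)."""
    y, x, p = top, left, size
    out = hr_views.copy()
    # Blend the co-located LR patches across all views into one patch.
    blended = lr_views[:, y:y + p, x:x + p].mean(axis=0)
    # Paste the same blended patch into every HR view.
    out[:, y:y + p, x:x + p] = blended
    return out
```

Because every view receives the same blended patch, the network sees averaged multi-view content at that region and is pushed to exploit implicit geometric (disparity) information during training.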

Energy-Efficient Adaptive 3D Sensing
Tilmon, Brevin and Sun, Zhanghao and Koppal, Sanjeev J. and Wu, Yicheng and Evangelidis, Georgios and Zahreddine, Ramzi and Krishnan, Gurunandan and Ma, Sizhuo and Wang, Jian



Research question: How to optimize depth sensing while jointly addressing its limited range and eye safety.
Motivation: Active depth sensing achieves robust depth estimation but is usually range-limited. Naively increasing the optical power extends the range but raises eye-safety concerns for many applications, including autonomous robots and augmented reality.
Method: We propose an adaptive active depth sensor that jointly optimizes range, power consumption, and eye safety. The key observation is that light patterns need not be projected onto the entire scene, but only onto the small regions where depth is necessary for the application and passive stereo depth estimation fails.
Results: Experiments show that, to reach the same maximum sensing distance, the method consumes the least power and has the shortest (best) eye-safety distance compared with other sensing strategies such as full-frame projection, line scanning, and point scanning. The adaptive sensing scheme is implemented with two hardware prototypes: one with a phase-only spatial light modulator (SLM) and one with a micro-electro-mechanical (MEMS) mirror and diffractive optical elements (DOE).

Active depth sensing achieves robust depth estimation but is usually limited by the sensing range. Naively increasing the optical power can improve sensing range but induces eye-safety concerns for many applications, including autonomous robots and augmented reality. In this paper, we propose an adaptive active depth sensor that jointly optimizes range, power consumption, and eye-safety. The main observation is that we need not project light patterns to the entire scene but only to small regions of interest where depth is necessary for the application and passive stereo depth estimation fails. We theoretically compare this adaptive sensing scheme with other sensing strategies, such as full-frame projection, line scanning, and point scanning. We show that, to achieve the same maximum sensing distance, the proposed method consumes the least power while having the shortest (best) eye-safety distance. We implement this adaptive sensing scheme with two hardware prototypes, one with a phase-only spatial light modulator (SLM) and the other with a micro-electro-mechanical (MEMS) mirror and diffractive optical elements (DOE). Experimental results validate the advantage of our method and demonstrate its capability of acquiring higher quality geometry adaptively.

Spring: A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo
Mehl, Lukas and Schmalfuss, Jenny and Jahedi, Azin and Nalivayko, Yaroslava and Bruhn, Andrés



Research question: Existing benchmarks and evaluation methodologies do not adequately reflect the highly detailed structures recovered by motion and stereo estimation.
Motivation: To address this, we introduce Spring, a large, high-resolution, high-detail, computer-generated benchmark for scene flow, optical flow, and stereo.
Method: Based on rendered scenes from the open-source Blender movie "Spring", we provide photo-realistic HD datasets with state-of-the-art visual effects and ground-truth training data, together with a website for uploading, analyzing, and comparing results.
Results: The Spring benchmark can assess the quality of fine structures and provides detailed performance statistics for different image regions. Initial results show that estimating fine details is indeed challenging, with substantial room for improvement in accuracy.

While recent methods for motion and stereo estimation recover an unprecedented amount of details, such highly detailed structures are neither adequately reflected in the data of existing benchmarks nor their evaluation methodology. Hence, we introduce Spring -- a large, high-resolution, high-detail, computer-generated benchmark for scene flow, optical flow, and stereo. Based on rendered scenes from the open-source Blender movie "Spring", it provides photo-realistic HD datasets with state-of-the-art visual effects and ground truth training data. Furthermore, we provide a website to upload, analyze and compare results. Using a novel evaluation methodology based on a super-resolved UHD ground truth, our Spring benchmark can assess the quality of fine structures and provides further detailed performance statistics on different image regions. Regarding the number of ground truth frames, Spring is 60x larger than the only scene flow benchmark, KITTI 2015, and 15x larger than the well-established MPI Sintel optical flow benchmark. Initial results for recent methods on our benchmark show that estimating fine details is indeed challenging, as their accuracy leaves significant room for improvement. The Spring benchmark and the corresponding datasets are available at http://spring-benchmark.org.

BEDLAM: A Synthetic Dataset of Bodies Exhibiting Detailed Lifelike Animated Motion
Black, Michael J. and Patel, Priyanka and Tesch, Joachim and Yang, Jinlong



Research question: Training neural networks on synthetic data alone to achieve strong 3D human pose and shape (HPS) estimation from real images.
Motivation: Previous synthetic datasets have been small, unrealistic, or lacked realistic clothing, and achieving sufficient realism is non-trivial.
Method: We create BEDLAM, a dataset of monocular RGB videos with ground-truth 3D bodies in SMPL-X format. It covers a diversity of body shapes, motions, skin tones, hair, and clothing, with the clothing realistically simulated on the moving bodies using commercial clothing-physics simulation.
Results: HPS regressors trained with BEDLAM achieve state-of-the-art accuracy on real-image benchmarks despite training only on synthetic data.

We show, for the first time, that neural networks trained only on synthetic data achieve state-of-the-art accuracy on the problem of 3D human pose and shape (HPS) estimation from real images. Previous synthetic datasets have been small, unrealistic, or lacked realistic clothing. Achieving sufficient realism is non-trivial and we show how to do this for full bodies in motion. Specifically, our BEDLAM dataset contains monocular RGB videos with ground-truth 3D bodies in SMPL-X format. It includes a diversity of body shapes, motions, skin tones, hair, and clothing. The clothing is realistically simulated on the moving bodies using commercial clothing physics simulation. We render varying numbers of people in realistic scenes with varied lighting and camera motions. We then train various HPS regressors using BEDLAM and achieve state-of-the-art accuracy on real-image benchmarks despite training with synthetic data. We use BEDLAM to gain insights into what model design choices are important for accuracy. With good synthetic training data, we find that a basic method like HMR approaches the accuracy of the current SOTA method (CLIFF). BEDLAM is useful for a variety of tasks and all images, ground truth bodies, 3D clothing, support code, and more are available for research purposes. Additionally, we provide detailed information about our synthetic data generation pipeline, enabling others to generate their own datasets. See the project page: https://bedlam.is.tue.mpg.de/.

Accidental Light Probes
Yu, Hong-Xing and Agarwala, Samir and Herrmann, Charles and Szeliski, Richard and Snavely, Noah and Wu, Jiajun and Sun, Deqing



Research question: How to recover scene lighting from a single image.
Motivation: A mirror-ball light probe can capture omnidirectional lighting, but light probes are generally unavailable in everyday images.
Method: We propose a physically based approach to model accidental light probes (ALPs) and estimate lighting from their appearance. The main idea is to model the appearance of ALPs with photogrammetrically principled shading and invert this process via differentiable rendering to recover the incident illumination.
Results: Placing an ALP in a scene enables high-fidelity lighting estimation; the method can also recover lighting for existing images that happen to contain an ALP.

Recovering lighting in a scene from a single image is a fundamental problem in computer vision. While a mirror ball light probe can capture omnidirectional lighting, light probes are generally unavailable in everyday images. In this work, we study recovering lighting from accidental light probes (ALPs)---common, shiny objects like Coke cans, which often accidentally appear in daily scenes. We propose a physically-based approach to model ALPs and estimate lighting from their appearances in single images. The main idea is to model the appearance of ALPs by photogrammetrically principled shading and to invert this process via differentiable rendering to recover incidental illumination. We demonstrate that we can put an ALP into a scene to allow high-fidelity lighting estimation. Our model can also recover lighting for existing images that happen to contain an ALP.

HexPlane: A Fast Representation for Dynamic Scenes
Cao, Ang and Johnson, Justin



Research question: Modeling and re-rendering dynamic 3D scenes is a challenging task in 3D vision.
Motivation: Existing methods build on Neural Radiance Fields (NeRF) and implicit representations, which are slow in practice because they require many MLP evaluations.
Method: We propose HexPlane, which computes features for points in spacetime from learned feature planes, enabling efficient dynamic 3D scene modeling.
Results: Experiments show that HexPlane achieves impressive results for novel view synthesis on dynamic scenes, matching the image quality of prior work while reducing training time by more than 100x.

Modeling and re-rendering dynamic 3D scenes is a challenging task in 3D vision. Prior approaches build on NeRF and rely on implicit representations. This is slow since it requires many MLP evaluations, constraining real-world applications. We show that dynamic 3D scenes can be explicitly represented by six planes of learned features, leading to an elegant solution we call HexPlane. A HexPlane computes features for points in spacetime by fusing vectors extracted from each plane, which is highly efficient. Pairing a HexPlane with a tiny MLP to regress output colors and training via volume rendering gives impressive results for novel view synthesis on dynamic scenes, matching the image quality of prior work but reducing training time by more than 100x. Extensive ablations confirm our HexPlane design and show that it is robust to different feature fusion mechanisms, coordinate systems, and decoding mechanisms. HexPlane is a simple and effective solution for representing 4D volumes, and we hope they can broadly contribute to modeling spacetime for dynamic 3D scenes.
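The six-plane lookup can be illustrated with a toy version that fetches a feature vector from each plane at a 4D point's projected coordinates and fuses each spatial plane with its complementary spatio-temporal plane by elementwise product, concatenating the three products. Real HexPlane uses bilinearly interpolated, learned plane grids; the integer grids here just keep the sketch short:

```python
import numpy as np

def hexplane_features(planes, x, y, z, t):
    """Toy HexPlane lookup: project a 4D point (x, y, z, t) onto the six
    axis-aligned planes, fetch a feature vector from each (R, R, C) grid,
    and fuse spatial/spatio-temporal pairs by elementwise product."""
    coords = {"xy": (x, y), "xz": (x, z), "yz": (y, z),
              "xt": (x, t), "yt": (y, t), "zt": (z, t)}
    feat = lambda k: planes[k][coords[k]]
    # Each spatial plane is paired with the plane over the two missing axes.
    return np.concatenate([feat("xy") * feat("zt"),
                           feat("xz") * feat("yt"),
                           feat("yz") * feat("xt")])
```

The fused vector is then fed to a tiny MLP for color and density, which is why evaluation is so much cheaper than querying a large implicit MLP per sample.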

Novel-View Acoustic Synthesis
Chen, Changan and Richard, Alexander and Shapovalov, Roman and Ithapu, Vamsi Krishna and Neverova, Natalia and Grauman, Kristen and Vedaldi, Andrea



Research question: We introduce a new sound-synthesis task, novel-view acoustic synthesis (NVAS): given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint?
Motivation: No prior formulation, dataset, or method exists for this task, which has exciting potential applications ranging from AR/VR to art and design.
Method: We propose a neural rendering approach, the Visually-Guided Acoustic Synthesis (ViGAS) network, which learns to synthesize the sound at an arbitrary point in space by analyzing the input audio-visual cues.
Results: We collect two first-of-their-kind large-scale multi-view audio-visual datasets for benchmarking. Our model successfully reasons about spatial cues and synthesizes faithful audio on both datasets. We believe this work represents the first formulation, dataset, and approach for the novel-view acoustic synthesis task.

We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint? We propose a neural rendering approach: Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize the sound of an arbitrary point in space by analyzing the input audio-visual cues. To benchmark this task, we collect two first-of-their-kind large-scale multi-view audio-visual datasets, one synthetic and one real. We show that our model successfully reasons about the spatial cues and synthesizes faithful audio on both datasets. To our knowledge, this work represents the very first formulation, dataset, and approach to solve the novel-view acoustic synthesis task, which has exciting potential applications ranging from AR/VR to art and design. Unlocked by this work, we believe that the future of novel-view synthesis is in multi-modal learning from videos.

ReLight My NeRF: A Dataset for Novel View Synthesis and Relighting of Real World Objects
Toschi, Marco and De Matteo, Riccardo and Spezialetti, Riccardo and De Gregorio, Daniele and Di Stefano, Luigi and Salti, Samuele



Research question: How to render novel views from a Neural Radiance Field (NeRF) under unobserved light conditions.
Motivation: To this end, we introduce a new dataset, ReNe, framing real-world objects under one-light-at-a-time (OLAT) conditions, annotated with accurate camera and light poses.
Method: Our acquisition pipeline uses two robotic arms holding, respectively, a camera and an omnidirectional point-wise light source. We release a total of 20 scenes depicting a variety of objects with complex geometry and challenging materials. Each scene includes 2,000 images, acquired from 50 different viewpoints under 40 different OLAT conditions.
Results: Using the dataset, we perform an ablation study on the relighting capability of variants of the vanilla NeRF architecture and identify a lightweight architecture that can render novel views of an object under novel light conditions, which we use to establish a non-trivial baseline for the dataset. Dataset and benchmark are available at https://eyecan-ai.github.io/rene.

In this paper, we focus on the problem of rendering novel views from a Neural Radiance Field (NeRF) under unobserved light conditions. To this end, we introduce a novel dataset, dubbed ReNe (Relighting NeRF), framing real world objects under one-light-at-time (OLAT) conditions, annotated with accurate ground-truth camera and light poses. Our acquisition pipeline leverages two robotic arms holding, respectively, a camera and an omni-directional point-wise light source. We release a total of 20 scenes depicting a variety of objects with complex geometry and challenging materials. Each scene includes 2000 images, acquired from 50 different points of views under 40 different OLAT conditions. By leveraging the dataset, we perform an ablation study on the relighting capability of variants of the vanilla NeRF architecture and identify a lightweight architecture that can render novel views of an object under novel light conditions, which we use to establish a non-trivial baseline for the dataset. Dataset and benchmark are available at https://eyecan-ai.github.io/rene.

ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation
Fan, Zicong and Taheri, Omid and Tzionas, Dimitrios and Kocabas, Muhammed and Kaufmann, Manuel and Black, Michael J. and Hilliges, Otmar



Research question: How to make machines understand that the motion of an object is caused by human manipulation rather than by the object moving on its own.
Motivation: Machines do not yet have this understanding, in part because no datasets with accurate 3D annotations exist for studying physically consistent, synchronized motion of hands and articulated objects.
Method: We introduce ARCTIC, a dataset containing 2.1M video frames paired with accurate 3D hand and object meshes and detailed dynamic contact information, and propose two novel articulated hand-object interaction tasks: consistent motion reconstruction and interaction field estimation.
Results: Evaluations on ARCTIC show that the proposed baselines effectively help machines understand object motion caused by human manipulation.

Humans intuitively understand that inanimate objects do not move by themselves, but that state changes are typically caused by human manipulation (e.g., the opening of a book). This is not yet the case for machines. In part this is because there exist no datasets with ground-truth 3D annotations for the study of physically consistent and synchronised motion of hands and articulated objects. To this end, we introduce ARCTIC -- a dataset of two hands that dexterously manipulate objects, containing 2.1M video frames paired with accurate 3D hand and object meshes and detailed, dynamic contact information. It contains bi-manual articulation of objects such as scissors or laptops, where hand poses and object states evolve jointly in time. We propose two novel articulated hand-object interaction tasks: (1) Consistent motion reconstruction: Given a monocular video, the goal is to reconstruct two hands and articulated objects in 3D, so that their motions are spatio-temporally consistent. (2) Interaction field estimation: Dense relative hand-object distances must be estimated from images. We introduce two baselines ArcticNet and InterField, respectively and evaluate them qualitatively and quantitatively on ARCTIC. Our code and data are available at https://arctic.is.tue.mpg.de.

Hidden Gems: 4D Radar Scene Flow Learning Using Cross-Modal Supervision
Ding, Fangqiang and Palffy, Andras and Gavrila, Dariu M. and Lu, Chris Xiaoxuan



Research question: How to estimate 4D radar scene flow via cross-modal learning.
Motivation: The co-located sensing redundancy in modern autonomous vehicles provides various forms of supervision cues for radar scene flow estimation.
Method: A multi-task model architecture is proposed for the identified cross-modal learning problem, together with loss functions that exploit multiple cross-modal constraints for effective scene flow model training.
Results: Extensive experiments show state-of-the-art performance and demonstrate the effectiveness of cross-modal supervised learning for inferring more accurate 4D radar scene flow. Its usefulness is also shown on two subtasks, motion segmentation and ego-motion estimation.

This work proposes a novel approach to 4D radar-based scene flow estimation via cross-modal learning. Our approach is motivated by the co-located sensing redundancy in modern autonomous vehicles. Such redundancy implicitly provides various forms of supervision cues to the radar scene flow estimation. Specifically, we introduce a multi-task model architecture for the identified cross-modal learning problem and propose loss functions to opportunistically engage scene flow estimation using multiple cross-modal constraints for effective model training. Extensive experiments show the state-of-the-art performance of our method and demonstrate the effectiveness of cross-modal supervised learning to infer more accurate 4D radar scene flow. We also show its usefulness to two subtasks - motion segmentation and ego-motion estimation. Our source code will be available on https://github.com/Toytiny/CMFlow.

Omnimatte3D: Associating Objects and Their Effects in Unconstrained Monocular Video
Suhail, Mohammed and Lu, Erika and Li, Zhengqi and Snavely, Noah and Sigal, Leonid and Cole, Forrester



Research question: How to decompose a video into a background layer and a set of foreground layers that capture stationary elements as well as moving objects and their associated effects.
Motivation: Prior methods severely restrict the possible camera motion, whereas our approach leverages recent progress in monocular camera pose and depth estimation to create a full RGBD video layer for the background along with a video layer for each foreground object.
Method: The video is decomposed into a background layer capturing stationary elements and foreground layers capturing moving objects and their associated effects. To solve this underconstrained decomposition problem, we propose a new loss formulation based on multi-view consistency.
Results: We test the method on challenging videos with complex camera motion and show significant qualitative improvement over current approaches.

We propose a method to decompose a video into a background and a set of foreground layers, where the background captures stationary elements while the foreground layers capture moving objects along with their associated effects (e.g. shadows and reflections). Our approach is designed for unconstrained monocular videos, with arbitrary camera and object motion. Prior work that tackles this problem assumes that the video can be mapped onto a fixed 2D canvas, severely limiting the possible space of camera motion. Instead, our method applies recent progress in monocular camera pose and depth estimation to create a full, RGBD video layer for the background, along with a video layer for each foreground object. To solve the underconstrained decomposition problem, we propose a new loss formulation based on multi-view consistency. We test our method on challenging videos with complex camera motion and show significant qualitative improvement over current approaches.

High-Fidelity Clothed Avatar Reconstruction From a Single Image
Liao, Tingting and Zhang, Xiaomei and Xiu, Yuliang and Yi, Hongwei and Liu, Xudong and Qi, Guo-Jun and Zhang, Yong and Wang, Xuan and Zhu, Xiangyu and Lei, Zhen



Research question: This paper proposes an effective framework for 3D clothed avatar reconstruction.
Motivation: Combining the high accuracy of optimization-based methods with the efficiency of learning-based methods enables high-fidelity clothed avatar reconstruction from a single image.
Method: An implicit model first learns the coarse shape of the person in canonical space in a learning-based way; surface detail is then refined by estimating the non-rigid deformation in posed space via optimization. A hypernetwork generates a good initialization, greatly accelerating the convergence of the optimization.
Results: Experiments show that the method successfully reconstructs high-fidelity avatars for arbitrarily clothed humans in real scenes.

This paper presents a framework for efficient 3D clothed avatar reconstruction. By combining the advantages of the high accuracy of optimization-based methods and the efficiency of learning-based methods, we propose a coarse-to-fine way to realize a high-fidelity clothed avatar reconstruction (CAR) from a single image. At the first stage, we use an implicit model to learn the general shape in the canonical space of a person in a learning-based way, and at the second stage, we refine the surface detail by estimating the non-rigid deformation in the posed space in an optimization way. A hyper-network is utilized to generate a good initialization so that the convergence of the optimization process is greatly accelerated. Extensive experiments on various datasets show that the proposed CAR successfully produces high-fidelity avatars for arbitrarily clothed humans in real scenes. The codes will be released.

CAMS: CAnonicalized Manipulation Spaces for Category-Level Functional Hand-Object Manipulation Synthesis
Zheng, Juntian and Zheng, Qingyuan and Fang, Lixing and Liu, Yun and Yi, Li



Research question: This paper addresses a new task of category-level functional hand-object manipulation synthesis, covering both rigid and articulated object categories.
Motivation: Given an object geometry, an initial human hand pose, and a sparse control sequence of object poses, the goal is to generate a physically plausible hand-object manipulation sequence that performs like a human.
Method: We first design CAMS (CAnonicalized Manipulation Spaces), a two-level space hierarchy that canonicalizes hand poses in object-centric and contact-centric views, and then present a two-stage framework for synthesizing human-like manipulation animations.
Results: The framework achieves state-of-the-art performance on both rigid and articulated categories with impressive visual effects. Code and video results can be found at the project homepage: https://cams-hoi.github.io/.

In this work, we focus on a novel task of category-level functional hand-object manipulation synthesis covering both rigid and articulated object categories. Given an object geometry, an initial human hand pose as well as a sparse control sequence of object poses, our goal is to generate a physically reasonable hand-object manipulation sequence that performs like human beings. To address such a challenge, we first design CAnonicalized Manipulation Spaces (CAMS), a two-level space hierarchy that canonicalizes the hand poses in an object-centric and contact-centric view. Benefiting from the representation capability of CAMS, we then present a two-stage framework for synthesizing human-like manipulation animations. Our framework achieves state-of-the-art performance for both rigid and articulated categories with impressive visual effects. Codes and video results can be found at our project homepage: https://cams-hoi.github.io/.

Neural Lens Modeling
Xian, Wenqi and Bo\v{z



Research question: Current 3D reconstruction and rendering methods increasingly rely on end-to-end optimization of the entire image formation process, but this approach is limited by the difficulty of modeling the optical hardware stack, and lenses in particular, in a unified way.
Motivation: To improve the quality of camera calibration and the accuracy of 3D reconstruction results, this paper proposes NeuroLens, a neural lens model for distortion and vignetting.
Method: NeuroLens can be used for point projection and ray casting and can be optimized through both operations. It can therefore (optionally) perform pre-capture calibration with classical calibration targets, and can also be calibrated or refined during 3D reconstruction, e.g., while optimizing a radiance field.
Results: On a comprehensive dataset assembled from a multitude of lenses in the Lensfun database, as well as other real-world datasets, the proposed lens model outperforms standard packages and recent approaches while being much easier to use and extend. The model generalizes across many lens types and integrates easily into existing 3D reconstruction and rendering systems.

Recent methods for 3D reconstruction and rendering increasingly benefit from end-to-end optimization of the entire image formation process. However, this approach is currently limited: effects of the optical hardware stack and in particular lenses are hard to model in a unified way. This limits the quality that can be achieved for camera calibration and the fidelity of the results of 3D reconstruction. In this paper, we propose NeuroLens, a neural lens model for distortion and vignetting that can be used for point projection and ray casting and can be optimized through both operations. This means that it can (optionally) be used to perform pre-capture calibration using classical calibration targets, and can later be used to perform calibration or refinement during 3D reconstruction, e.g., while optimizing a radiance field. To evaluate the performance of our proposed model, we create a comprehensive dataset assembled from the Lensfun database with a multitude of lenses. Using this and other real-world datasets, we show that the quality of our proposed lens model outperforms standard packages as well as recent approaches while being much easier to use and extend. The model generalizes across many lens types and is trivial to integrate into existing 3D reconstruction and rendering systems. Visit our project website at: https://neural-lens.github.io.

Shape-Erased Feature Learning for Visible-Infrared Person Re-Identification
Feng, Jiawei and Wu, Ancong and Zheng, Wei-Shi



Research question: Due to the modality gap between visible and infrared images and their high visual ambiguity, learning diverse modality-shared semantic concepts for visible-infrared person re-identification (VI-ReID) remains challenging.
Motivation: Body shape is one of the significant modality-shared cues for VI-ReID. To mine more diverse modality-shared cues, we expect that erasing body-shape-related semantic concepts from the learned features will force the ReID model to extract other modality-shared features for identification.
Method: We propose a shape-erased feature learning paradigm that decorrelates modality-shared features into two orthogonal subspaces: shape-related features are learned jointly in one subspace, while shape-erased features are learned in its orthogonal complement. This maximizes the conditional mutual information between the shape-erased feature and identity, explicitly enhancing the diversity of the learned representation.
Results: Extensive experiments on the SYSU-MM01, RegDB, and HITSZ-VCM datasets demonstrate the effectiveness of the method.

Due to the modality gap between visible and infrared images with high visual ambiguity, learning diverse modality-shared semantic concepts for visible-infrared person re-identification (VI-ReID) remains a challenging problem. Body shape is one of the significant modality-shared cues for VI-ReID. To dig more diverse modality-shared cues, we expect that erasing body-shape-related semantic concepts in the learned features can force the ReID model to extract more and other modality-shared features for identification. To this end, we propose shape-erased feature learning paradigm that decorrelates modality-shared features in two orthogonal subspaces. Jointly learning shape-related feature in one subspace and shape-erased features in the orthogonal complement achieves a conditional mutual information maximization between shape-erased feature and identity discarding body shape information, thus enhancing the diversity of the learned representation explicitly. Extensive experiments on SYSU-MM01, RegDB, and HITSZ-VCM datasets demonstrate the effectiveness of our method.
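
The two-subspace idea can be sketched as a toy linear decomposition (NumPy; the basis `W`, the dimensions, and the function name are hypothetical illustrations, and the actual paradigm learns the subspaces jointly with ReID objectives rather than via a fixed projection):

```python
import numpy as np

def shape_erase(features, W):
    """Split feature vectors into a shape-related part (projection onto
    the column space of W, a hypothetical learned shape-subspace basis)
    and a shape-erased part (the orthogonal complement)."""
    Q, _ = np.linalg.qr(W)                 # orthonormal basis of the shape subspace
    shape_part = features @ Q @ Q.T        # component inside the subspace
    erased_part = features - shape_part    # component in the orthogonal complement
    return shape_part, erased_part
```

By construction the two parts sum back to the original feature and are mutually orthogonal, which is the geometric property the paradigm exploits to keep the erased features free of body-shape information.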

Neural Part Priors: Learning To Optimize Part-Based Object Completion in RGB-D Scans
Bokhovkin, Aleksei and Dai, Angela



Research question: This paper addresses the lack of global consistency in 3D scene understanding methods that make independent per-object predictions.
Motivation: Current 3D scene understanding largely focuses on independent per-object predictions, without considering global consistency.
Method: We propose to learn Neural Part Priors (NPPs), parametric spaces of objects and their parts, optimized to fit new input 3D scan geometry under global scene consistency constraints.
Results: Experiments show that NPPs significantly outperform the state of the art on the ScanNet dataset, achieving more accurate scene reconstruction and object completion.

3D scene understanding has seen significant advances in recent years, but has largely focused on object understanding in 3D scenes with independent per-object predictions. We thus propose to learn Neural Part Priors (NPPs), parametric spaces of objects and their parts, that enable optimizing to fit to a new input 3D scan geometry with global scene consistency constraints. The rich structure of our NPPs enables accurate, holistic scene reconstruction across similar objects in the scene. Both objects and their part geometries are characterized by coordinate field MLPs, facilitating optimization at test time to fit to input geometric observations as well as similar objects in the input scan. This enables more accurate reconstructions than independent per-object predictions as a single forward pass, while establishing global consistency within a scene. Experiments on the ScanNet dataset demonstrate that NPPs significantly outperforms the state-of-the-art in part decomposition and object completion in real-world scenes.

NeuralUDF: Learning Unsigned Distance Fields for Multi-View Reconstruction of Surfaces With Arbitrary Topologies
Long, Xiaoxiao and Lin, Cheng and Liu, Lingjie and Liu, Yuan and Wang, Peng and Theobalt, Christian and Komura, Taku and Wang, Wenping



Research question: How to reconstruct surfaces with arbitrary topologies from 2D images?
Motivation: Existing neural-rendering-based reconstruction methods are limited to closed surfaces, since they adopt the Signed Distance Function (SDF) as the surface representation, which requires dividing the target shape into inside and outside.
Method: We propose to represent surfaces as an Unsigned Distance Function (UDF) and develop a new volume rendering scheme to learn the neural UDF representation. Specifically, a new density function correlating the properties of the UDF with the volume rendering scheme is introduced for robust optimization of the UDF field.
Results: Experiments on the DTU and DeepFashion3D datasets show that the method not only enables high-quality reconstruction of non-closed shapes with complex topologies, but also achieves performance comparable to SDF-based methods on the reconstruction of closed surfaces.

We present a novel method, called NeuralUDF, for reconstructing surfaces with arbitrary topologies from 2D images via volume rendering. Recent advances in neural rendering based reconstruction have achieved compelling results. However, these methods are limited to objects with closed surfaces since they adopt Signed Distance Function (SDF) as surface representation which requires the target shape to be divided into inside and outside. In this paper, we propose to represent surfaces as the Unsigned Distance Function (UDF) and develop a new volume rendering scheme to learn the neural UDF representation. Specifically, a new density function that correlates the property of UDF with the volume rendering scheme is introduced for robust optimization of the UDF fields. Experiments on the DTU and DeepFashion3D datasets show that our method not only enables high-quality reconstruction of non-closed shapes with complex typologies, but also achieves comparable performance to the SDF based methods on the reconstruction of closed surfaces. Visit our project page at https://www.xxlong.site/NeuralUDF/.
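
As a rough illustration of how an unsigned distance can drive volume rendering, a toy density that peaks at the zero level set might look like the following (NumPy; this logistic bump is a simplification I am assuming for exposition only; the paper's actual density function is different and additionally handles visibility and the non-differentiability of the UDF at the surface):

```python
import numpy as np

def toy_udf_density(d, s=50.0):
    """Toy volume-rendering density for an unsigned distance d >= 0:
    a scaled logistic bump that is maximal at the surface (d == 0)
    and decays monotonically away from it. s controls sharpness."""
    sig = 1.0 / (1.0 + np.exp(-s * d))
    return 4.0 * s * sig * (1.0 - sig)  # equals s at d == 0, -> 0 as d grows
```

The key point such a density captures is that rendering weight concentrates where rays pass near the zero level set, without ever needing an inside/outside sign.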

Shape-Constraint Recurrent Flow for 6D Object Pose Estimation
Hai, Yang and Song, Rui and Li, Jiaojiao and Hu, Yinlin



Research question: Most recent 6D object pose estimation methods rely on 2D optical flow networks to refine their results, but these methods typically ignore the 3D shape of the target during matching, which hurts 6D object pose estimation.
Motivation: To address this, we propose a shape-constraint recurrent flow network for 6D object pose estimation that embeds the 3D shape information of the target into the matching procedure.
Method: We first introduce a flow-to-pose component that learns an intermediate pose from the current flow estimate, then impose a shape constraint from the current pose on the lookup space of the 4D correlation volume, which greatly reduces the matching space and makes it much easier to learn. Finally, flow and pose are optimized simultaneously in a recurrent manner until convergence.
Results: We evaluate the method on three challenging 6D object pose datasets and show that it outperforms the state of the art in both accuracy and efficiency.

Most recent 6D object pose estimation methods rely on 2D optical flow networks to refine their results. However, these optical flow methods typically do not consider any 3D shape information of the targets during matching, making them suffer in 6D object pose estimation. In this work, we propose a shape-constraint recurrent flow network for 6D object pose estimation, which embeds the 3D shape information of the targets into the matching procedure. We first introduce a flow-to-pose component to learn an intermediate pose from the current flow estimation, then impose a shape constraint from the current pose on the lookup space of the 4D correlation volume for flow estimation, which reduces the matching space significantly and is much easier to learn. Finally, we optimize the flow and pose simultaneously in a recurrent manner until convergence. We evaluate our method on three challenging 6D object pose datasets and show that it outperforms the state of the art in both accuracy and efficiency.

High-Fidelity Event-Radiance Recovery via Transient Event Frequency
Han, Jin and Asano, Yuta and Shi, Boxin and Zheng, Yinqiang and Sato, Imari



Research question: How to recover precise radiance values with event cameras, bio-inspired silicon sensors, to improve scene information reconstruction and understanding.
Motivation: Conventional cameras are limited in dynamic range, bit depth, and spectral response, whereas event cameras are sensitive to radiance changes and can be used to recover precise radiance values.
Method: This paper proposes an innovative method that converts the high temporal resolution of event signals into precise radiance values. Under active lighting conditions, the transient frequency of triggered event signals linearly reflects the radiance value.
Results: Experiments show that radiance values can be recovered solely from the transient event frequency (TEF), and this yields several capabilities in image analysis.

High-fidelity radiance recovery plays a crucial role in scene information reconstruction and understanding. Conventional cameras suffer from limited sensitivity in dynamic range, bit depth, and spectral response, etc. In this paper, we propose to use event cameras with bio-inspired silicon sensors, which are sensitive to radiance changes, to recover precise radiance values. We reveal that, under active lighting conditions, the transient frequency of event signals triggering linearly reflects the radiance value. We propose an innovative method to convert the high temporal resolution of event signals into precise radiance values. The precise radiance values yields several capabilities in image analysis. We demonstrate the feasibility of recovering radiance values solely from the transient event frequency (TEF) through multiple experiments.

NeMo: Learning 3D Neural Motion Fields From Multiple Video Instances of the Same Action
Wang, Kuan-Chieh and Weng, Zhenzhen and Xenochristou, Maria and Ara\'ujo, Jo\~ao Pedro and Gu, Jeffrey and Liu, Karen and Yeung, Serena



Research question: How to bridge the gap between monocular human mesh recovery (HMR) and multi-view motion capture (MoCap) systems by leveraging information from multiple video instances of the same action.
Motivation: Existing HMR methods degrade on videos containing challenging and dynamic motion, limiting their appeal for applications such as 3D motion recovery.
Method: We introduce the Neural Motion (NeMo) field, optimized to represent the underlying 3D motion across a set of videos of the same action.
Results: Experiments show that NeMo can recover 3D motion in sports using videos from the Penn Action dataset, outperforming existing HMR methods in 2D keypoint detection. On a small MoCap dataset collected to mimic actions in Penn Action, NeMo achieves better 3D reconstruction than various baselines.

The task of reconstructing 3D human motion has wide-ranging applications. The gold standard Motion capture (MoCap) systems are accurate but inaccessible to the general public due to their cost, hardware, and space constraints. In contrast, monocular human mesh recovery (HMR) methods are much more accessible than MoCap as they take single-view videos as inputs. Replacing the multi-view MoCap systems with a monocular HMR method would break the current barriers to collecting accurate 3D motion thus making exciting applications like motion analysis and motion-driven animation accessible to the general public. However, the performance of existing HMR methods degrades when the video contains challenging and dynamic motion that is not in existing MoCap datasets used for training. This reduces its appeal as dynamic motion is frequently the target in 3D motion recovery in the aforementioned applications. Our study aims to bridge the gap between monocular HMR and multi-view MoCap systems by leveraging information shared across multiple video instances of the same action. We introduce the Neural Motion (NeMo) field. It is optimized to represent the underlying 3D motions across a set of videos of the same action. Empirically, we show that NeMo can recover 3D motion in sports using videos from the Penn Action dataset, where NeMo outperforms existing HMR methods in terms of 2D keypoint detection. To further validate NeMo using 3D metrics, we collected a small MoCap dataset mimicking actions in Penn Action, and show that NeMo achieves better 3D reconstruction compared to various baselines.

Distilling Neural Fields for Real-Time Articulated Shape Reconstruction
Tan, Jeff and Yang, Gengshan and Ramanan, Deva



Research question: How to reconstruct articulated 3D models from videos in real time, without test-time optimization or manual 3D supervision at training time.
Motivation: Current methods either rely on pre-built deformable models (e.g., SMAL/SMPL) or use slow per-scene optimization through differentiable rendering (e.g., dynamic NeRFs); such methods fail to support arbitrary object categories or are unsuitable for real-time applications.
Method: We use off-the-shelf video-based dynamic NeRFs as 3D supervision to train a fast feed-forward network, turning 3D shape and motion prediction into a supervised distillation task. Our temporal-aware network represents arbitrary deformations with articulated bones and blend skinning, and is self-supervised on video datasets without requiring 3D shapes or viewpoints as input.
Results: Through distillation, the network learns to reconstruct unseen articulated objects at interactive frame rates, yields higher-fidelity 3D reconstructions than existing real-time methods, and renders realistic images at novel viewpoints and poses.

We present a method for reconstructing articulated 3D models from videos in real-time, without test-time optimization or manual 3D supervision at training time. Prior work often relies on pre-built deformable models (e.g. SMAL/SMPL), or slow per-scene optimization through differentiable rendering (e.g. dynamic NeRFs). Such methods fail to support arbitrary object categories, or are unsuitable for real-time applications. To address the challenge of collecting large-scale 3D training data for arbitrary deformable object categories, our key insight is to use off-the-shelf video-based dynamic NeRFs as 3D supervision to train a fast feed-forward network, turning 3D shape and motion prediction into a supervised distillation task. Our temporal-aware network uses articulated bones and blend skinning to represent arbitrary deformations, and is self-supervised on video datasets without requiring 3D shapes or viewpoints as input. Through distillation, our network learns to 3D-reconstruct unseen articulated objects at interactive frame rates. Our method yields higher-fidelity 3D reconstructions than prior real-time methods for animals, with the ability to render realistic images at novel viewpoints and poses.

AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation
Ohkawa, Takehiko and He, Kun and Sener, Fadime and Hodan, Tomas and Tran, Luan and Keskin, Cem



Research question: This paper develops AssemblyHands, a large-scale benchmark dataset to facilitate the study of egocentric activities with challenging hand-object interactions.
Motivation: Existing egocentric 3D hand pose estimation datasets are small and of limited quality, constraining progress in related research.
Method: AssemblyHands is created by sampling synchronized egocentric and exocentric images from the recent Assembly101 dataset. To obtain high-quality 3D hand pose annotations for the egocentric images, an efficient pipeline is developed that uses an initial set of manual annotations to train a model to automatically annotate a much larger set.
Results: AssemblyHands provides 3.0M annotated images, including 490K egocentric images, making it the largest existing benchmark dataset for egocentric 3D hand pose estimation. Using this data, a strong single-view baseline for 3D hand pose estimation from egocentric images is developed, and a novel action classification task is designed to evaluate predicted 3D hand poses. The study shows that higher-quality hand poses directly improve action recognition.

We present AssemblyHands, a large-scale benchmark dataset with accurate 3D hand pose annotations, to facilitate the study of egocentric activities with challenging hand-object interactions. The dataset includes synchronized egocentric and exocentric images sampled from the recent Assembly101 dataset, in which participants assemble and disassemble take-apart toys. To obtain high-quality 3D hand pose annotations for the egocentric images, we develop an efficient pipeline, where we use an initial set of manual annotations to train a model to automatically annotate a much larger dataset. Our annotation model uses multi-view feature fusion and an iterative refinement scheme, and achieves an average keypoint error of 4.20 mm, which is 85 % lower than the error of the original annotations in Assembly101. AssemblyHands provides 3.0M annotated images, including 490K egocentric images, making it the largest existing benchmark dataset for egocentric 3D hand pose estimation. Using this data, we develop a strong single-view baseline of 3D hand pose estimation from egocentric images. Furthermore, we design a novel action classification task to evaluate predicted 3D hand poses. Our study shows that having higher-quality hand poses directly improves the ability to recognize actions.

Scene-Aware Egocentric 3D Human Pose Estimation
Wang, Jian and Luvizon, Diogo and Xu, Weipeng and Liu, Lingjie and Sarkar, Kripasindhu and Theobalt, Christian



Research question: How to estimate egocentric 3D human pose with a single head-mounted fisheye camera, especially when the body is highly occluded or closely interacting with the scene.
Motivation: Existing methods still struggle in such challenging poses.
Method: A scene-aware egocentric pose estimation method is proposed that guides the prediction of the egocentric pose with scene constraints. Specifically, an egocentric depth estimation network predicts the scene depth map from the wide-view fisheye camera while a depth-inpainting network mitigates the occlusion caused by the human body; a scene-aware pose estimation network then projects the 2D image features and the estimated scene depth map into a voxel space and regresses the 3D pose with a V2V network.
Results: Experimental results show that the predicted 3D egocentric poses are accurate and physically plausible in terms of human-scene interaction, and that the method outperforms state-of-the-art methods both quantitatively and qualitatively.

Egocentric 3D human pose estimation with a single head-mounted fisheye camera has recently attracted attention due to its numerous applications in virtual and augmented reality. Existing methods still struggle in challenging poses where the human body is highly occluded or is closely interacting with the scene. To address this issue, we propose a scene-aware egocentric pose estimation method that guides the prediction of the egocentric pose with scene constraints. To this end, we propose an egocentric depth estimation network to predict the scene depth map from a wide-view egocentric fisheye camera while mitigating the occlusion of the human body with a depth-inpainting network. Next, we propose a scene-aware pose estimation network that projects the 2D image features and estimated depth map of the scene into a voxel space and regresses the 3D pose with a V2V network. The voxel-based feature representation provides the direct geometric connection between 2D image features and scene geometry, and further facilitates the V2V network to constrain the predicted pose based on the estimated scene geometry. To enable the training of the aforementioned networks, we also generated a synthetic dataset, called EgoGTA, and an in-the-wild dataset based on EgoPW, called EgoPW-Scene. The experimental results of our new evaluation sequences show that the predicted 3D egocentric poses are accurate and physically plausible in terms of human-scene interaction, demonstrating that our method outperforms the state-of-the-art methods both quantitatively and qualitatively.

Learning Geometry-Aware Representations by Sketching
Lee, Hyundo and Hwang, Inwoo and Go, Hyunsung and Choi, Won-Seok and Kim, Kibeom and Zhang, Byoung-Tak



Research question: How to incorporate geometric concepts, such as distance and shape, into the visual representation of a scene.
Motivation: Understanding geometric concepts is essential for understanding the real world and for many vision tasks.
Method: We propose to represent the scene by learning to sketch, which explicitly incorporates the geometric information of the scene in a single inference step without requiring a sketch dataset.
Results: Experimental results show that LBS substantially improves object attribute classification on the CLEVR dataset and domain transfer between CLEVR and STL-10, confirming that LBS provides rich geometric information.

Understanding geometric concepts, such as distance and shape, is essential for understanding the real world and also for many vision tasks. To incorporate such information into a visual representation of a scene, we propose learning to represent the scene by sketching, inspired by human behavior. Our method, coined Learning by Sketching (LBS), learns to convert an image into a set of colored strokes that explicitly incorporate the geometric information of the scene in a single inference step without requiring a sketch dataset. A sketch is then generated from the strokes where CLIP-based perceptual loss maintains a semantic similarity between the sketch and the image. We show theoretically that sketching is equivariant with respect to arbitrary affine transformations and thus provably preserves geometric information. Experimental results show that LBS substantially improves the performance of object attribute classification on the unlabeled CLEVR dataset, domain transfer between CLEVR and STL-10 datasets, and for diverse downstream tasks, confirming that LBS provides rich geometric information.

X-Avatar: Expressive Human Avatars
Shen, Kaiyue and Guo, Chen and Kaufmann, Manuel and Zarate, Juan Jose and Valentin, Julien and Song, Jie and Hilliges, Otmar



Research question: This paper develops X-Avatar, a novel avatar model that enables life-like experiences in telepresence, AR/VR, and beyond.
Motivation: Current avatar models cannot capture the full expressiveness of digital humans and thus fall short of life-like experiences.
Method: The method models bodies, hands, facial expressions, and appearance in a holistic fashion and can be learned from either full 3D scans or RGB-D data. To achieve this, a part-aware learned forward skinning module is proposed that can be driven by the parameter space of SMPL-X, allowing for expressive animation of X-Avatars.
Results: With novel part-aware sampling and initialization strategies, the method achieves higher-fidelity results, especially for smaller body parts, while maintaining efficient training. The geometry and deformation fields are further extended with a texture network conditioned on pose, facial expression, geometry, and the normals of the deformed surface, capturing avatar appearance with high-frequency details. Experiments show that the method outperforms strong baselines on the animation task both quantitatively and qualitatively. To facilitate future research on expressive avatars, a new dataset called X-Humans is contributed, containing 233 sequences of high-quality textured scans from 20 participants, totalling 35,500 data frames.

We present X-Avatar, a novel avatar model that captures the full expressiveness of digital humans to bring about life-like experiences in telepresence, AR/VR and beyond. Our method models bodies, hands, facial expressions and appearance in a holistic fashion and can be learned from either full 3D scans or RGB-D data. To achieve this, we propose a part-aware learned forward skinning module that can be driven by the parameter space of SMPL-X, allowing for expressive animation of X-Avatars. To efficiently learn the neural shape and deformation fields, we propose novel part-aware sampling and initialization strategies. This leads to higher fidelity results, especially for smaller body parts while maintaining efficient training despite increased number of articulated bones. To capture the appearance of the avatar with high-frequency details, we extend the geometry and deformation fields with a texture network that is conditioned on pose, facial expression, geometry and the normals of the deformed surface. We show experimentally that our method outperforms strong baselines both quantitatively and qualitatively on the animation task. To facilitate future research on expressive avatars we contribute a new dataset, called X-Humans, containing 233 sequences of high-quality textured scans from 20 participants, totalling 35,500 data frames.

Recovering 3D Hand Mesh Sequence From a Single Blurry Image: A New Dataset and Temporal Unfolding
Oh, Yeonguk and Park, JoonKyu and Kim, Jaeha and Moon, Gyeongsik and Lee, KyoungMu



Research question: Existing 3D hand mesh recovery methods focus mainly on sharp hand images and neglect blur, owing to the absence of datasets providing blurry hand images.
Motivation: To address this, we present BlurHand, the first dataset containing blurry hand images with 3D ground truth.
Method: We build BlurHandNet, a baseline network for accurate 3D hand mesh recovery from a blurry hand image. Unlike previous works that output a static single hand mesh from the blurry input, our network unfolds it into a 3D hand mesh sequence to exploit the temporal information contained in the blur.
Results: Experiments demonstrate the usefulness of BlurHand for 3D hand mesh recovery from blurry images. The proposed BlurHandNet produces much more robust results on blurry images while generalizing well to in-the-wild images.

Hands, one of the most dynamic parts of our body, suffer from blur due to their active movements. However, previous 3D hand mesh recovery methods have mainly focused on sharp hand images rather than considering blur due to the absence of datasets providing blurry hand images. We first present a novel dataset BlurHand, which contains blurry hand images with 3D groundtruths. The BlurHand is constructed by synthesizing motion blur from sequential sharp hand images, imitating realistic and natural motion blurs. In addition to the new dataset, we propose BlurHandNet, a baseline network for accurate 3D hand mesh recovery from a blurry hand image. Our BlurHandNet unfolds a blurry input image to a 3D hand mesh sequence to utilize temporal information in the blurry input image, while previous works output a static single hand mesh. We demonstrate the usefulness of BlurHand for the 3D hand mesh recovery from blurry images in our experiments. The proposed BlurHandNet produces much more robust results on blurry images while generalizing well to in-the-wild images. The training codes and BlurHand dataset are available at https://github.com/JaehaKim97/BlurHand_RELEASE.

Fast Monocular Scene Reconstruction With Global-Sparse Local-Dense Grids
Dong, Wei and Choy, Christopher and Loop, Charles and Litany, Or and Zhu, Yuke and Anandkumar, Anima



Research question: How to reconstruct indoor scenes from monocular images while accelerating both training and rendering.
Motivation: Current methods rely heavily on Multilayer Perceptrons (MLPs), which significantly limit training and rendering speed.
Method: We propose to directly use a signed distance function (SDF) stored in sparse voxel block grids for fast and accurate scene reconstruction without MLPs. The globally sparse and locally dense data structure exploits the spatial sparsity of surfaces, enables cache-friendly queries, and extends directly to multi-modal data such as color and semantic labels.
Results: Experiments show the approach is 10x faster in training and 100x faster in rendering while achieving accuracy comparable to state-of-the-art neural implicit methods.

Indoor scene reconstruction from monocular images has long been sought after by augmented reality and robotics developers. Recent advances in neural field representations and monocular priors have led to remarkable results in scene-level surface reconstructions. The reliance on Multilayer Perceptrons (MLP), however, significantly limits speed in training and rendering. In this work, we propose to directly use signed distance function (SDF) in sparse voxel block grids for fast and accurate scene reconstruction without MLPs. Our globally sparse and locally dense data structure exploits surfaces' spatial sparsity, enables cache-friendly queries, and allows direct extensions to multi-modal data such as color and semantic labels. To apply this representation to monocular scene reconstruction, we develop a scale calibration algorithm for fast geometric initialization from monocular depth priors. We apply differentiable volume rendering from this initialization to refine details with fast convergence. We also introduce efficient high-dimensional Continuous Random Fields (CRFs) to further exploit the semantic-geometry consistency between scene objects. Experiments show that our approach is 10x faster in training and 100x faster in rendering while achieving comparable accuracy to state-of-the-art neural implicit methods.
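
The globally sparse, locally dense idea can be sketched as a hash map from integer block coordinates to small dense SDF arrays (a hypothetical Python toy: the class name, block/voxel sizes, and API are assumptions; the actual system uses GPU-resident structures with cache-friendly layouts and differentiable volume rendering):

```python
import numpy as np

class SparseVoxelBlockGrid:
    """Globally sparse, locally dense SDF grid: dense voxel blocks are
    allocated only near surfaces, addressed by integer block coords."""

    def __init__(self, block_size=8, voxel_size=0.05):
        self.block_size = block_size
        self.voxel_size = voxel_size
        self.blocks = {}  # (bx, by, bz) -> dense (B, B, B) SDF array

    def _locate(self, p):
        """Map a world-space point to (block key, local voxel index)."""
        v = np.floor(np.asarray(p) / self.voxel_size).astype(int)
        return tuple(v // self.block_size), tuple(v % self.block_size)

    def set_sdf(self, p, value):
        """Write an SDF value, allocating the enclosing block on demand."""
        block, local = self._locate(p)
        if block not in self.blocks:
            self.blocks[block] = np.full((self.block_size,) * 3, np.inf)
        self.blocks[block][local] = value

    def query(self, p):
        """Read an SDF value; None means the region was never allocated
        (i.e., it is far from any observed surface)."""
        block, local = self._locate(p)
        if block not in self.blocks:
            return None
        return self.blocks[block][local]
```

Because only surface-adjacent blocks exist, memory scales with surface area rather than scene volume, and a query is two cheap index computations plus one dense array read.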

Thermal Spread Functions (TSF): Physics-Guided Material Classification
Dashpute, Aniket and Saragadam, Vishwanath and Alexander, Emma and Willomitzer, Florian and Katsaggelos, Aggelos and Veeraraghavan, Ashok and Cossairt, Oliver



Research question: How to perform robust, non-destructive material classification, a crucial first step in many vision applications.
Motivation: The rate at which an object heats and cools depends on intrinsic material properties, namely emissivity and diffusivity. We exploit this by gently heating objects in the scene with a low-power laser for a fixed duration and then turning it off, while a thermal camera records measurements during the heating and cooling process.
Method: We solve an inverse heat equation using a finite-differences approach, yielding spatially varying estimates of diffusivity and emissivity. These tuples are then used to train a classifier that produces a fine-grained material label at each spatial pixel.
Results: The method is extremely simple, requiring only a small light source (a low-power laser) and a thermal camera, yet produces robust classification results with 86% accuracy over 16 classes.

Robust and non-destructive material classification is a challenging but crucial first-step in numerous vision applications. We propose a physics-guided material classification framework that relies on thermal properties of the object. Our key observation is that the rate of heating and cooling of an object depends on the unique intrinsic properties of the material, namely the emissivity and diffusivity. We leverage this observation by gently heating the objects in the scene with a low-power laser for a fixed duration and then turning it off, while a thermal camera captures measurements during the heating and cooling process. We then take this spatial and temporal "thermal spread function" (TSF) to solve an inverse heat equation using the finite-differences approach, resulting in a spatially varying estimate of diffusivity and emissivity. These tuples are then used to train a classifier that produces a fine-grained material label at each spatial pixel. Our approach is extremely simple requiring only a small light source (low power laser) and a thermal camera, and produces robust classification results with 86% accuracy over 16 classes.
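The inverse-heat-equation step can be illustrated on a 1D toy problem: simulate the thermal spread function with an explicit finite-difference scheme, then recover diffusivity by matching simulated cooling curves to the observation. This is a hedged sketch of the general idea, not the paper's solver; the grid search, grid sizes, and time steps are all assumptions:

```python
import numpy as np

def simulate_cooling(diffusivity, n=50, steps=200, dt=0.01, dx=0.1):
    """Explicit finite-difference solve of u_t = D * u_xx with a hot spot."""
    u = np.zeros(n)
    u[n // 2] = 1.0                      # laser heating, then switched off
    r = diffusivity * dt / dx**2         # explicit stability needs r <= 0.5
    trace = []
    for _ in range(steps):
        u[1:-1] += r * (u[2:] - 2 * u[1:-1] + u[:-2])
        trace.append(u[n // 2])          # "thermal camera" sample at the spot
    return np.array(trace)

# Synthetic measurement with the (unknown) true diffusivity.
observed = simulate_cooling(diffusivity=0.3)

# Invert by matching simulated TSFs against the observation.
candidates = np.linspace(0.05, 0.45, 9)
errors = [np.sum((simulate_cooling(d) - observed) ** 2) for d in candidates]
estimate = candidates[int(np.argmin(errors))]
print(estimate)  # the candidate whose simulated TSF best matches
```

The paper additionally estimates emissivity and works per-pixel in 2D; the 1D grid search above only shows why a measured cooling curve constrains diffusivity.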

ESLAM: Efficient Dense SLAM System Based on Hybrid Representation of Signed Distance Fields
Johari, Mohammad Mahdi and Carta, Camilla and Fleuret, Fran\c{c}ois



Research question: How to build an efficient implicit neural representation method for Simultaneous Localization and Mapping (SLAM).
Motivation: Existing neural SLAM systems require extensive pre-training and run inefficiently.
Method: We present ESLAM, which reads RGB-D frames with unknown camera poses sequentially and incrementally reconstructs the scene representation while estimating the current camera position in the scene. It incorporates recent advances in Neural Radiance Fields (NeRF) into a SLAM system, yielding an efficient and accurate dense visual SLAM method.
Results: Extensive experiments on three standard datasets (Replica, ScanNet, and TUM RGB-D) show that ESLAM improves the 3D reconstruction and camera localization accuracy of state-of-the-art dense visual SLAM methods by more than 50%, runs up to 10x faster, and requires no pre-training.

We present ESLAM, an efficient implicit neural representation method for Simultaneous Localization and Mapping (SLAM). ESLAM reads RGB-D frames with unknown camera poses in a sequential manner and incrementally reconstructs the scene representation while estimating the current camera position in the scene. We incorporate the latest advances in Neural Radiance Fields (NeRF) into a SLAM system, resulting in an efficient and accurate dense visual SLAM method. Our scene representation consists of multi-scale axis-aligned perpendicular feature planes and shallow decoders that, for each point in the continuous space, decode the interpolated features into Truncated Signed Distance Field (TSDF) and RGB values. Our extensive experiments on three standard datasets, Replica, ScanNet, and TUM RGB-D show that ESLAM improves the accuracy of 3D reconstruction and camera localization of state-of-the-art dense visual SLAM methods by more than 50%, while it runs up to 10 times faster and does not require any pre-training. Project page: https://www.idiap.ch/paper/eslam

iDisc: Internal Discretization for Monocular Depth Estimation
Piccinelli, Luigi and Sakaridis, Christos and Yu, Fisher



Research question: Monocular depth estimation underpins 3D scene understanding and downstream applications, yet remains challenging and ill-posed even in the supervised setting due to the lack of geometric constraints.
Motivation: Although a scene may consist of millions of pixels, it exhibits far fewer high-level patterns. Based on this observation, we propose iDisc to learn internal discretized representations of those patterns.
Method: We introduce a new module, Internal Discretization (ID), which implements a continuous-discrete-continuous bottleneck to learn these concepts without supervision. Unlike state-of-the-art methods, our model imposes no explicit constraints or priors on the depth output.
Results: Our model achieves significant improvements on NYU-Depth v2 and KITTI and outperforms all published methods on the official KITTI benchmark. iDisc also attains state-of-the-art results on surface normal estimation. We further probe the model's generalization ability via zero-shot testing.

Monocular depth estimation is fundamental for 3D scene understanding and downstream applications. However, even under the supervised setup, it is still challenging and ill-posed due to the lack of geometric constraints. We observe that although a scene can consist of millions of pixels, there are much fewer high-level patterns. We propose iDisc to learn those patterns with internal discretized representations. The method implicitly partitions the scene into a set of high-level concepts. In particular, our new module, Internal Discretization (ID), implements a continuous-discrete-continuous bottleneck to learn those concepts without supervision. In contrast to state-of-the-art methods, the proposed model does not enforce any explicit constraints or priors on the depth output. The whole network with the ID module can be trained in an end-to-end fashion thanks to the bottleneck module based on attention. Our method sets the new state of the art with significant improvements on NYU-Depth v2 and KITTI, outperforming all published methods on the official KITTI benchmark. iDisc can also achieve state-of-the-art results on surface normal estimation. Further, we explore the model generalization capability via zero-shot testing. From there, we observe the compelling need to promote diversification in the outdoor scenario and we introduce splits of two autonomous driving datasets, DDAD and Argoverse. Code is available at http://vis.xyz/pub/idisc/.

Sampling Is Matter: Point-Guided 3D Human Mesh Reconstruction
Kim, Jeonghwan and Gwon, Mi-Gyeong and Park, Hyunwoo and Kwon, Hyukmin and Um, Gi-Mun and Kim, Wonjun



Research question: How to reconstruct a 3D human mesh from a single RGB image.
Motivation: Despite notable progress in estimating non-local interactions among mesh vertices and in modeling relationships between body parts, it remains difficult to directly infer the relationship between features encoded from the 2D input image and the 3D coordinates of each vertex.
Method: We propose a simple feature sampling scheme that guides sampling in 2D space with the projections of the 3D mesh vertices (i.e., ground truth), encouraging the model to focus on vertex-relevant features and thereby reconstruct natural human poses.
Results: Experiments show the method effectively improves the performance of 3D human mesh reconstruction.

This paper presents a simple yet powerful method for 3D human mesh reconstruction from a single RGB image. Most recently, the non-local interactions of the whole mesh vertices have been effectively estimated in the transformer while the relationship between body parts also has begun to be handled via the graph model. Even though those approaches have shown the remarkable progress in 3D human mesh reconstruction, it is still difficult to directly infer the relationship between features, which are encoded from the 2D input image, and 3D coordinates of each vertex. To resolve this problem, we propose to design a simple feature sampling scheme. The key idea is to sample features in the embedded space by following the guide of points, which are estimated as projection results of 3D mesh vertices (i.e., ground truth). This helps the model to concentrate more on vertex-relevant features in the 2D space, thus leading to the reconstruction of the natural human pose. Furthermore, we apply progressive attention masking to precisely estimate local interactions between vertices even under severe occlusions. Experimental results on benchmark datasets show that the proposed method efficiently improves the performance of 3D human mesh reconstruction. The code and model are publicly available at: https://github.com/DCVL-3D/PointHMR_release.

Depth Estimation From Indoor Panoramas With Neural Scene Representation
Chang, Wenjie and Zhang, Yueyi and Xiong, Zhiwei



Research question: How to improve the accuracy and efficiency of depth estimation from indoor panoramas.
Motivation: Depth estimation from indoor panoramas is challenging because of the equirectangular distortion of panoramas and inaccurate matching.
Method: We propose a practical framework that leverages Neural Radiance Field technology to improve depth estimation from multi-view indoor panoramas. Two networks implicitly learn the Signed Distance Function for depth measurements and the radiance field of the panoramas. A novel spherical position embedding scheme achieves high accuracy; for better convergence, network weights are initialized based on the Manhattan World Assumption; and a geometric consistency loss based on surface normals further refines the depth estimates.
Results: Experiments demonstrate that the proposed method outperforms state-of-the-art works by a large margin in both quantitative and qualitative evaluations.

Depth estimation from indoor panoramas is challenging due to the equirectangular distortions of panoramas and inaccurate matching. In this paper, we propose a practical framework to improve the accuracy and efficiency of depth estimation from multi-view indoor panoramic images with the Neural Radiance Field technology. Specifically, we develop two networks to implicitly learn the Signed Distance Function for depth measurements and the radiance field from panoramas. We also introduce a novel spherical position embedding scheme to achieve high accuracy. For better convergence, we propose an initialization method for the network weights based on the Manhattan World Assumption. Furthermore, we devise a geometric consistency loss, leveraging the surface normal, to further refine the depth estimation. The experimental results demonstrate that our proposed method outperforms state-of-the-art works by a large margin in both quantitative and qualitative evaluations. Our source code is available at https://github.com/WJ-Chang-42/IndoorPanoDepth.

Single Image Depth Prediction Made Better: A Multivariate Gaussian Take
Liu, Ce and Kumar, Suryansh and Gu, Shuhang and Timofte, Radu and Van Gool, Luc



Research question: Single image depth prediction (SIDP) is a challenging task whose goal is to predict the scene's per-pixel depth at test time.
Motivation: Because the problem is inherently ill-posed, the fundamental goal is an approach that can reliably model scene depth from a set of training examples.
Method: We introduce an approach that models per-pixel depth continuously, allowing us to predict and reason about each pixel's depth and its distribution. To this end, we model per-pixel scene depth with a multivariate Gaussian distribution. Moreover, in contrast to existing uncertainty-modeling methods, we introduce per-pixel covariance modeling, which encodes each pixel's depth dependency with respect to all scene points.
Results: When tested on benchmark datasets such as KITTI, NYU, and SUN-RGB-D, the SIDP model obtained by optimizing our loss function achieves state-of-the-art results. Our method (named MG) ranks among the top entries on the KITTI depth-prediction benchmark leaderboard.

Neural-network-based single image depth prediction (SIDP) is a challenging task where the goal is to predict the scene's per-pixel depth at test time. Since the problem, by definition, is ill-posed, the fundamental goal is to come up with an approach that can reliably model the scene depth from a set of training examples. In the pursuit of perfect depth estimation, most existing state-of-the-art learning techniques predict a single scalar depth value per-pixel. Yet, it is well-known that the trained model has accuracy limits and can predict imprecise depth. Therefore, an SIDP approach must be mindful of the expected depth variations in the model's prediction at test time. Accordingly, we introduce an approach that performs continuous modeling of per-pixel depth, where we can predict and reason about the per-pixel depth and its distribution. To this end, we model per-pixel scene depth using a multivariate Gaussian distribution. Moreover, contrary to the existing uncertainty modeling methods---in the same spirit, where per-pixel depth is assumed to be independent, we introduce per-pixel covariance modeling that encodes its depth dependency w.r.t. all the scene points. Unfortunately, per-pixel depth covariance modeling leads to a computationally expensive continuous loss function, which we solve efficiently using the learned low-rank approximation of the overall covariance matrix. Notably, when tested on benchmark datasets such as KITTI, NYU, and SUN-RGB-D, the SIDP model obtained by optimizing our loss function shows state-of-the-art results. Our method's accuracy (named MG) is among the top on the KITTI depth-prediction benchmark leaderboard.
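The low-rank approximation of the covariance that the abstract mentions can be sketched with the Woodbury identity and the matrix-determinant lemma, which evaluate a multivariate Gaussian negative log-likelihood without ever forming the dense n x n covariance. A minimal illustration, not the paper's loss; the rank, sizes, and names are assumptions:

```python
import numpy as np

def lowrank_gaussian_nll(residual, diag, U):
    """NLL of N(0, Sigma) with Sigma = diag(diag) + U @ U.T,
    using the Woodbury identity and the matrix-determinant lemma,
    so the cost is O(n k^2) instead of O(n^3) for n pixels, rank k."""
    n, k = U.shape
    Dinv_r = residual / diag
    Dinv_U = U / diag[:, None]
    cap = np.eye(k) + U.T @ Dinv_U                 # k x k capacitance matrix
    tmp = np.linalg.solve(cap, U.T @ Dinv_r)
    quad = residual @ Dinv_r - Dinv_r @ U @ tmp    # r^T Sigma^-1 r
    logdet = np.sum(np.log(diag)) + np.linalg.slogdet(cap)[1]
    return 0.5 * (quad + logdet + n * np.log(2 * np.pi))

rng = np.random.default_rng(0)
n, k = 100, 3                                  # 100 "pixels", rank-3 covariance
diag = rng.uniform(0.5, 1.5, n)
U = 0.1 * rng.standard_normal((n, k))
r = rng.standard_normal(n)

# Agrees with the dense O(n^3) evaluation.
Sigma = np.diag(diag) + U @ U.T
dense = 0.5 * (r @ np.linalg.solve(Sigma, r)
               + np.linalg.slogdet(Sigma)[1] + n * np.log(2 * np.pi))
print(abs(lowrank_gaussian_nll(r, diag, U) - dense))
```

At image scale (n in the hundreds of thousands) the dense path is infeasible, which is why a learned low-rank-plus-diagonal structure makes such a loss tractable.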

UMat: Uncertainty-Aware Single Image High Resolution Material Capture
Rodriguez-Pardo, Carlos and Dom{\'\i



Research question: How to recover a material's normals, specularity, and roughness from a single diffuse image.
Motivation: Existing single-image methods tend to produce over-smooth outputs, operate at limited resolution, or train one model per material class, leaving little room for generalization.
Method: We propose a learning-based method that uses microgeometry appearance as the primary cue to recover normals, specularity, and roughness from a single diffuse image. It pairs a generative network with attention with a U-Net discriminator, integrating global information at reduced computational complexity.
Results: We showcase the method on a real dataset of digitized textile materials and show that a commodity flatbed scanner can produce the required input. Moreover, because the problem may be ill-posed (more than one diffuse image may be needed to disambiguate the specular reflection) or the training dataset may not be representative enough of the real distribution, we propose a novel framework to quantify the model's confidence in its predictions. Ours is the first method to model uncertainty in material digitization, increasing the trustworthiness of the process and enabling smarter dataset-creation strategies, as demonstrated with an active learning experiment.

We propose a learning-based method to recover normals, specularity, and roughness from a single diffuse image of a material, using microgeometry appearance as our primary cue. Previous methods that work on single images tend to produce over-smooth outputs with artifacts, operate at limited resolution, or train one model per class with little room for generalization. In contrast, in this work, we propose a novel capture approach that leverages a generative network with attention and a U-Net discriminator, which shows outstanding performance integrating global information at reduced computational complexity. We showcase the performance of our method with a real dataset of digitized textile materials and show that a commodity flatbed scanner can produce the type of diffuse illumination required as input to our method. Additionally, because the problem might be ill-posed --more than a single diffuse image might be needed to disambiguate the specular reflection-- or because the training dataset is not representative enough of the real distribution, we propose a novel framework to quantify the model's confidence about its prediction at test time. Our method is the first one to deal with the problem of modeling uncertainty in material digitization, increasing the trustworthiness of the process and enabling more intelligent strategies for dataset creation, as we demonstrate with an active learning experiment.

SCOOP: Self-Supervised Correspondence and Optimization-Based Scene Flow
Lang, Itai and Aiger, Dror and Cole, Forrester and Avidan, Shai and Rubinstein, Michael



Research question: This paper addresses scene flow estimation in computer vision, i.e., recovering the 3D motion of a scene from consecutive observations.
Motivation: Existing methods typically require large amounts of annotated training data, whereas the proposed method can be learned on a small amount of data without ground-truth flow supervision.
Method: The method first trains a pure correspondence model to learn point feature representations and initializes the flow as the difference between each source point and its softly corresponding target point. At run time, a flow refinement component is optimized directly with a self-supervised objective, yielding a coherent and accurate flow field between the point clouds.
Results: Experiments show performance gains over existing leading techniques while using only a fraction of the training data.

Scene flow estimation is a long-standing problem in computer vision, where the goal is to find the 3D motion of a scene from its consecutive observations. Recently, there have been efforts to compute the scene flow from 3D point clouds. A common approach is to train a regression model that consumes source and target point clouds and outputs the per-point translation vector. An alternative is to learn point matches between the point clouds concurrently with regressing a refinement of the initial correspondence flow. In both cases, the learning task is very challenging since the flow regression is done in the free 3D space, and a typical solution is to resort to a large annotated synthetic dataset. We introduce SCOOP, a new method for scene flow estimation that can be learned on a small amount of data without employing ground-truth flow supervision. In contrast to previous work, we train a pure correspondence model focused on learning point feature representation and initialize the flow as the difference between a source point and its softly corresponding target point. Then, in the run-time phase, we directly optimize a flow refinement component with a self-supervised objective, which leads to a coherent and accurate flow field between the point clouds. Experiments on widespread datasets demonstrate the performance gains achieved by our method compared to existing leading techniques while using a fraction of the training data. Our code is publicly available.
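The flow initialization as "the difference between a source point and its softly corresponding target point" can be sketched as a softmax over feature similarities. A simplified illustration rather than SCOOP's implementation; the temperature and the toy data are assumptions:

```python
import numpy as np

def init_flow(src_pts, tgt_pts, src_feat, tgt_feat, temperature=0.1):
    """Initialize scene flow as (softly corresponding target point) - source."""
    sim = src_feat @ tgt_feat.T                       # feature similarities
    w = np.exp((sim - sim.max(axis=1, keepdims=True)) / temperature)
    w /= w.sum(axis=1, keepdims=True)                 # per-source softmax
    soft_target = w @ tgt_pts                         # convex combo of targets
    return soft_target - src_pts

# Toy check: distinct per-point features and a rigid translation.
src = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0]])
shift = np.array([0.5, 0.0, 0.0])
tgt = src + shift
feat = np.eye(3)                 # hypothetical learned features
flow = init_flow(src, tgt, feat, feat)
print(np.allclose(flow, shift, atol=1e-3))  # near-perfect matches
```

Because the initialization is a convex combination of target points, it stays on (or near) the target surface; the self-supervised refinement then relaxes this constraint to smooth the field.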

RobustNeRF: Ignoring Distractors With Robust Losses
Sabour, Sara and Vora, Suhani and Duckworth, Daniel and Krasin, Ivan and Fleet, David J. and Tagliasacchi, Andrea



Research question: How to remove distractors (moving objects, lighting variations, shadows) from images of a static scene to improve novel view synthesis with Neural Radiance Fields (NeRF).
Motivation: Existing NeRF methods produce artifacts such as "floaters" when the captured static scene contains distractors.
Method: We propose a form of robust estimation for NeRF training, modeling distractors in the training data as outliers of an optimization problem and thereby removing them from the scene.
Results: The method successfully removes outliers from both synthetic and real-world static scenes and outperforms baseline methods. It is easy to integrate into modern NeRF frameworks and requires no prior knowledge of the distractor types.

Neural radiance fields (NeRF) excel at synthesizing new views given multi-view, calibrated images of a static scene. When scenes include distractors, which are not persistent during image capture (moving objects, lighting variations, shadows), artifacts appear as view-dependent effects or 'floaters'. To cope with distractors, we advocate a form of robust estimation for NeRF training, modeling distractors in training data as outliers of an optimization problem. Our method successfully removes outliers from a scene and improves upon our baselines, on synthetic and real-world scenes. Our technique is simple to incorporate in modern NeRF frameworks, with few hyper-parameters. It does not assume a priori knowledge of the types of distractors, and is instead focused on the optimization problem rather than pre-processing or modeling transient objects. More results on our page https://robustnerf.github.io/public.
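Treating distractors as outliers of the optimization can be illustrated with a simple trimmed photometric loss that masks out high-error pixels before averaging. This is a deliberately simplified stand-in for the paper's robust estimator; the kernel, threshold, and toy data are assumptions:

```python
import numpy as np

def robust_photometric_loss(pred, target, kappa=0.5):
    """Per-pixel residuals weighted by a (hypothetical) robust rule:
    pixels whose error exceeds the batch median by more than kappa are
    treated as distractor outliers and masked out of the loss."""
    err = np.abs(pred - target).mean(axis=-1)        # per-pixel L1
    thresh = kappa + np.median(err)                  # crude inlier scale
    weights = (err <= thresh).astype(float)          # hard trimming mask
    return (weights * err).sum() / max(weights.sum(), 1.0)

rng = np.random.default_rng(1)
target = rng.uniform(size=(64, 3))                   # 64 "pixels", RGB
pred = target + 0.01 * rng.standard_normal((64, 3))
pred[:4] += 5.0                     # 4 "distractor" pixels with huge error

plain = np.abs(pred - target).mean()
robust = robust_photometric_loss(pred, target)
print(robust < plain)  # outlier pixels no longer dominate the loss
```

The key property is that gradients from distractor pixels stop flowing into the scene representation, so the optimizer never tries to explain them with floaters.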

SCOTCH and SODA: A Transformer Video Shadow Detection Framework
Liu, Lihao and Prost, Jean and Zhu, Lei and Papadakis, Nicolas and Li\`o, Pietro and Sch\"onlieb, Carola-Bibiane and Aviles-Rivero, Angelica I.



Research question: Shadow detection in videos is difficult because of large shadow deformations between frames.
Motivation: Accounting for shadow deformation is essential when designing a video shadow detection method.
Method: We introduce the shadow deformation attention trajectory (SODA), a new video self-attention module specifically designed to handle large shadow deformations in videos, along with a new shadow contrastive learning mechanism (SCOTCH) that guides the network to learn a unified shadow representation from massive positive shadow pairs across different videos.
Results: Experiments show that the two contributions clearly outperform existing techniques on video shadow detection.

Shadows in videos are difficult to detect because of the large shadow deformation between frames. In this work, we argue that accounting for shadow deformation is essential when designing a video shadow detection method. To this end, we introduce the shadow deformation attention trajectory (SODA), a new type of video self-attention module, specially designed to handle the large shadow deformations in videos. Moreover, we present a new shadow contrastive learning mechanism (SCOTCH) which aims at guiding the network to learn a unified shadow representation from massive positive shadow pairs across different videos. We demonstrate empirically the effectiveness of our two contributions in an ablation study. Furthermore, we show that SCOTCH and SODA significantly outperforms existing techniques for video shadow detection. Code is available at the project page: https://lihaoliu-cambridge.github.io/scotch_and_soda/

Learning Visibility Field for Detailed 3D Human Reconstruction and Relighting
Zheng, Ruichen and Li, Peng and Wang, Haoqian and Yu, Tao



Research question: How to achieve detailed 3D reconstruction and photo-realistic relighting of digital humans.
Motivation: To resolve occlusion ambiguity in multi-view feature aggregation while also evaluating light attenuation for self-shadowing, we propose a novel sparse-view 3D human reconstruction framework.
Method: We tightly couple the occupancy and albedo fields with an additional visibility field, discretize visibility onto a fixed set of sample directions, and supply it with coupled geometric 3D depth features and local 2D image features. We further propose a novel rendering-inspired loss, TransferLoss, to implicitly enforce alignment between the visibility and occupancy fields, enabling end-to-end joint training.
Results: Results and extensive experiments show the method surpasses the state of the art in reconstruction accuracy while achieving relighting comparably accurate to ray-traced ground truth.

Detailed 3D reconstruction and photo-realistic relighting of digital humans are essential for various applications. To this end, we propose a novel sparse-view 3d human reconstruction framework that closely incorporates the occupancy field and albedo field with an additional visibility field--it not only resolves occlusion ambiguity in multiview feature aggregation, but can also be used to evaluate light attenuation for self-shadowed relighting. To enhance its training viability and efficiency, we discretize visibility onto a fixed set of sample directions and supply it with coupled geometric 3D depth feature and local 2D image feature. We further propose a novel rendering-inspired loss, namely TransferLoss, to implicitly enforce the alignment between visibility and occupancy field, enabling end-to-end joint training. Results and extensive experiments demonstrate the effectiveness of the proposed method, as it surpasses state-of-the-art in terms of reconstruction accuracy while achieving comparably accurate relighting to ray-traced ground truth.

Complementary Intrinsics From Neural Radiance Fields and CNNs for Outdoor Scene Relighting
Yang, Siqi and Cui, Xuanning and Zhu, Yongjie and Tang, Jiajun and Li, Si and Yu, Zhaofei and Shi, Boxin



Research question: How to edit the illumination of outdoor scenes with neural radiance fields (NeRFs) using weakly supervised labels from multi-view stereo, addressing the challenge of outdoor scene relighting.
Motivation: Relighting outdoor scenes is challenging because of their diverse illumination and salient cast shadows. Intrinsic image decomposition of outdoor photo collections with NeRFs can partly solve this problem.
Method: This paper proposes a complementary approach that combines intrinsic estimation from volume rendering with inverting the photometric image formation model using convolutional neural networks (CNNs). The former produces richer, more reliable pseudo labels (cast shadows and sky appearance in addition to albedo and normals) for training the latter, which predicts interpretable and editable lighting parameters through a single-image prediction pipeline.
Results: Our method shows advantages in both intrinsic image decomposition and relighting across various real outdoor scenes.

Relighting an outdoor scene is challenging due to the diverse illuminations and salient cast shadows. Intrinsic image decomposition on outdoor photo collections could partly solve this problem by weakly supervised labels with albedo and normal consistency from multi-view stereo. With neural radiance fields (NeRFs), editing the appearance code could produce more realistic results without explicitly interpreting the outdoor scene image formation. This paper proposes to complement the intrinsic estimation from volume rendering using NeRFs and from inversing the photometric image formation model using convolutional neural networks (CNNs). The former produces richer and more reliable pseudo labels (cast shadows and sky appearances in addition to albedo and normal) for training the latter to predict interpretable and editable lighting parameters via a single-image prediction pipeline. We demonstrate the advantages of our method for both intrinsic image decomposition and relighting for various real outdoor scenes.

High-Res Facial Appearance Capture From Polarized Smartphone Images
Azinovi\'c, Dejan and Maury, Olivier and Hery, Christophe and Nie{\ss



Research question: How to reconstruct high-quality facial textures from RGB images.
Motivation: A single smartphone equipped with an inexpensive polarization foil enables a novel capture routine for facial texture reconstruction.
Method: The flashlight is turned into a polarized light source and a polarization filter is added on top of the camera. In a dark environment, the modified smartphone captures the subject's face under different light polarizations; from these observations, an explicit surface mesh of the face is reconstructed using structure from motion. The facial textures are then optimized via analysis-by-synthesis with a differentiable renderer that exploits the co-location of camera and light.
Results: The optimized textures can be used in a standard rendering pipeline to synthesize high-quality, photo-realistic 3D digital humans in novel environments.

We propose a novel method for high-quality facial texture reconstruction from RGB images using a novel capturing routine based on a single smartphone which we equip with an inexpensive polarization foil. Specifically, we turn the flashlight into a polarized light source and add a polarization filter on top of the camera. Leveraging this setup, we capture the face of a subject with cross-polarized and parallel-polarized light. For each subject, we record two short sequences in a dark environment under flash illumination with different light polarization using the modified smartphone. Based on these observations, we reconstruct an explicit surface mesh of the face using structure from motion. We then exploit the camera and light co-location within a differentiable renderer to optimize the facial textures using an analysis-by-synthesis approach. Our method optimizes for high-resolution normal textures, diffuse albedo, and specular albedo using a coarse-to-fine optimization scheme. We show that the optimized textures can be used in a standard rendering pipeline to synthesize high-quality photo-realistic 3D digital humans in novel environments.
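The cross/parallel capture rests on a standard polarization relation: cross-polarized light suppresses the specular lobe, so the specular component is approximately the parallel-polarized image minus the cross-polarized one. A toy numeric illustration of this separation principle with synthetic values, not the paper's data or pipeline:

```python
import numpy as np

# Synthetic ground-truth components for a tiny 2x2 "image" (hypothetical).
diffuse = np.array([[0.4, 0.5], [0.6, 0.3]])
specular = np.array([[0.0, 0.3], [0.1, 0.0]])

# Cross-polarization blocks the (polarization-preserving) specular lobe,
# while parallel polarization keeps it.
cross = diffuse.copy()
parallel = diffuse + specular

# Recover the specular component by differencing the two captures.
est_specular = np.clip(parallel - cross, 0.0, None)
print(np.allclose(est_specular, specular))
```

In practice the separation is only approximate (sensor noise, imperfect foils), which is why the paper folds these observations into an analysis-by-synthesis optimization rather than differencing images directly.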

JAWS: Just a Wild Shot for Cinematic Transfer in Neural Radiance Fields
Wang, Xi and Courant, Robin and Shi, Jinglei and Marchand, Eric and Christie, Marc



Research question: How to robustly transfer the visual cinematic features of a reference in-the-wild video clip to a newly generated clip.
Motivation: To maximize the similarity of the generated clip's cinematic features to those of the reference, we adopt an optimization-driven approach.
Method: We use an implicit neural representation (INR) to compute a clip that shares the reference clip's cinematic features, and formulate a general camera optimization problem in an INR over camera parameters and timing. By exploiting the differentiability of neural representations, the designed cinematic losses are back-propagated through a NeRF network to the proposed cinematic parameters.
Results: Experiments show the system can replicate well-known camera sequences from movies, adapting the framing, camera parameters, and timing of the generated video clip to maximize similarity with the reference clip.

This paper presents JAWS, an optimization-driven approach that achieves the robust transfer of visual cinematic features from a reference in-the-wild video clip to a newly generated clip. To this end, we rely on an implicit-neural-representation (INR) in a way to compute a clip that shares the same cinematic features as the reference clip. We propose a general formulation of a camera optimization problem in an INR that computes extrinsic and intrinsic camera parameters as well as timing. By leveraging the differentiability of neural representations, we can back-propagate our designed cinematic losses measured on proxy estimators through a NeRF network to the proposed cinematic parameters directly. We also introduce specific enhancements such as guidance maps to improve the overall quality and efficiency. Results display the capacity of our system to replicate well known camera sequences from movies, adapting the framing, camera parameters and timing of the generated video clip to maximize the similarity with the reference clip.

Temporally Consistent Online Depth Estimation Using Point-Based Fusion
Khan, Numair and Penner, Eric and Lanman, Douglas and Xiao, Lei



Research question: This paper addresses temporal consistency in video depth estimation: how to estimate temporally consistent depth maps of video streams in an online setting, without access to future frames.
Motivation: When applied to videos, existing depth estimation methods lack temporal consistency, producing flickering and swimming artifacts. The unavailability of future frames and the presence of dynamic objects further complicate the problem.
Method: We propose a global point cloud that is dynamically updated each frame, together with a learned fusion approach in image space. This encourages consistency while still allowing updates that handle errors and dynamic objects.
Results: Qualitative and quantitative results show the method achieves state-of-the-art quality for consistent video depth estimation.

Depth estimation is an important step in many computer vision problems such as 3D reconstruction, novel view synthesis, and computational photography. Most existing work focuses on depth estimation from single frames. When applied to videos, the result lacks temporal consistency, showing flickering and swimming artifacts. In this paper we aim to estimate temporally consistent depth maps of video streams in an online setting. This is a difficult problem as future frames are not available and the method must choose between enforcing consistency and correcting errors from previous estimations. The presence of dynamic objects further complicates the problem. We propose to address these challenges by using a global point cloud that is dynamically updated each frame, along with a learned fusion approach in image space. Our approach encourages consistency while simultaneously allowing updates to handle errors and dynamic objects. Qualitative and quantitative results show that our method achieves state-of-the-art quality for consistent video depth estimation.

Passive Micron-Scale Time-of-Flight With Sunlight Interferometry
Kotwal, Alankar and Levin, Anat and Gkioulekas, Ioannis



Research question: This paper introduces an interferometric technique for passive time-of-flight imaging and depth sensing at micrometer axial resolutions.
Motivation: Existing techniques require complex light sources and scanning equipment, whereas our method uses sunlight as the only light source and acquires micrometer-resolution time-resolved responses through a simple axial scanning operation.
Method: We modify a full-field Michelson interferometer to use sunlight as the only light source. Sunlight's large spectral bandwidth enables micrometer-resolution time-resolved scene responses via a simple axial scan, and its angular bandwidth yields time-of-flight measurements insensitive to indirect illumination effects such as interreflections and subsurface scattering.
Results: We operate an experimental prototype outdoors, under direct sunlight, and in adverse conditions such as machine vibrations and vehicle traffic, demonstrating for the first time passive capabilities including micrometer-scale depth sensing robust to indirect illumination, direct-only imaging, and imaging through diffusers.

We introduce an interferometric technique for passive time-of-flight imaging and depth sensing at micrometer axial resolutions. Our technique uses a full-field Michelson interferometer, modified to use sunlight as the only light source. The large spectral bandwidth of sunlight makes it possible to acquire micrometer-resolution time-resolved scene responses, through a simple axial scanning operation. Additionally, the angular bandwidth of sunlight makes it possible to capture time-of-flight measurements insensitive to indirect illumination effects, such as interreflections and subsurface scattering. We build an experimental prototype that we operate outdoors, under direct sunlight, and in adverse environment conditions such as machine vibrations and vehicle traffic. We use this prototype to demonstrate, for the first time, passive imaging capabilities such as micrometer-scale depth sensing robust to indirect illumination, direct-only imaging, and imaging through diffusers.

Unsupervised Volumetric Animation
Siarohin, Aliaksandr and Menapace, Willi and Skorokhodov, Ivan and Olszewski, Kyle and Ren, Jian and Lee, Hsin-Ying and Chai, Menglei and Tulyakov, Sergey



Research question: This paper proposes a novel unsupervised approach for 3D animation of non-rigid deformable objects.
Motivation: Current 3D animation methods require large amounts of annotated data and complex computation, whereas our method learns an object's 3D structure and dynamics solely from single-view RGB videos and decomposes it into semantically meaningful parts that can be tracked and animated.
Method: We use a 3D autodecoder framework, paired with a keypoint estimator via a differentiable PnP algorithm, to learn the underlying object geometry and part decomposition in an entirely unsupervised manner. This enables 3D segmentation, 3D keypoint estimation, novel view synthesis, and animation.
Results: We evaluate mainly on two video datasets, VoxCeleb 256^2 and TEDXPeople 256^2. On the Cats 256^2 dataset, the method learns compelling 3D geometry even from raw image data. Finally, the model can obtain animatable 3D objects from a single or a few images.

We propose a novel approach for unsupervised 3D animation of non-rigid deformable objects. Our method learns the 3D structure and dynamics of objects solely from single-view RGB videos, and can decompose them into semantically meaningful parts that can be tracked and animated. Using a 3D autodecoder framework, paired with a keypoint estimator via a differentiable PnP algorithm, our model learns the underlying object geometry and parts decomposition in an entirely unsupervised manner. This allows it to perform 3D segmentation, 3D keypoint estimation, novel view synthesis, and animation. We primarily evaluate the framework on two video datasets: VoxCeleb 256^2 and TEDXPeople 256^2. In addition, on the Cats 256^2 dataset, we show that it learns compelling 3D geometry even from raw image data. Finally, we show that our model can obtain animatable 3D objects from a single or a few images.

PlaneDepth: Self-Supervised Depth Estimation via Orthogonal Planes
Wang, Ruoyu and Yu, Zehao and Gao, Shenghua



Research question: Depth representations based on multiple near fronto-parallel planes have achieved impressive results in self-supervised monocular depth estimation (MDE), but such representations make the ground discontinuous, harming the identification of drivable space in autonomous driving.
Motivation: To address this, we propose PlaneDepth, a novel representation based on orthogonal planes, comprising vertical planes and ground planes.
Method: PlaneDepth estimates the depth distribution with a Laplacian Mixture Model over orthogonal planes for an input image; the planes are used to synthesize a reference view that provides the self-supervision signal. We further find that the widely used resize-and-crop data augmentation breaks the orthogonality assumption, leading to inferior plane predictions, and address this by explicitly constructing the resize-crop transformation to rectify the predefined planes and the predicted camera pose. We also propose an augmented self-distillation loss, supervised with a bilateral occlusion mask, to make the orthogonal-plane representation robust to occlusions.
Results: Thanks to the orthogonal-plane representation, the ground plane can be extracted in an unsupervised manner, which is important for autonomous driving. Extensive experiments on the KITTI dataset demonstrate the effectiveness and efficiency of our method. Code is available at https://github.com/svip-lab/PlaneDepth.

Multiple near frontal-parallel planes based depth representation demonstrated impressive results in self-supervised monocular depth estimation (MDE). Whereas, such a representation would cause the discontinuity of the ground as it is perpendicular to the frontal-parallel planes, which is detrimental to the identification of drivable space in autonomous driving. In this paper, we propose the PlaneDepth, a novel orthogonal planes based presentation, including vertical planes and ground planes. PlaneDepth estimates the depth distribution using a Laplacian Mixture Model based on orthogonal planes for an input image. These planes are used to synthesize a reference view to provide the self-supervision signal. Further, we find that the widely used resizing and cropping data augmentation breaks the orthogonality assumptions, leading to inferior plane predictions. We address this problem by explicitly constructing the resizing cropping transformation to rectify the predefined planes and predicted camera pose. Moreover, we propose an augmented self-distillation loss supervised with a bilateral occlusion mask to boost the robustness of orthogonal planes representation for occlusions. Thanks to our orthogonal planes representation, we can extract the ground plane in an unsupervised manner, which is important for autonomous driving. Extensive experiments on the KITTI dataset demonstrate the effectiveness and efficiency of our method. The code is available at https://github.com/svip-lab/PlaneDepth.
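The Laplacian Mixture Model over plane depths can be sketched as a per-pixel mixture likelihood whose components are centred on candidate plane depths, with network logits selecting among them. A minimal illustration of the idea; the scale `b`, the logits, and the plane depths are hypothetical, not values from the paper:

```python
import numpy as np

def mixture_nll(depth_gt, plane_depths, logits, b=0.1):
    """Negative log-likelihood of a ground-truth depth under a per-pixel
    Laplacian mixture with components centred on candidate plane depths."""
    w = np.exp(logits - logits.max())
    w /= w.sum()                        # softmax mixture weights
    # Laplace density: (1 / 2b) * exp(-|d - mu| / b) per component
    dens = w / (2 * b) * np.exp(-np.abs(depth_gt - plane_depths) / b)
    return -np.log(dens.sum())

planes = np.array([2.0, 4.0, 8.0])      # candidate plane depths in metres
logits = np.array([0.0, 3.0, -1.0])     # the network favours the 4 m plane
print(mixture_nll(4.0, planes, logits) < mixture_nll(8.0, planes, logits))
```

A depth near the favoured plane has a much lower NLL than one near a down-weighted plane, which is the signal that lets view synthesis supervise the per-plane weights.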

MobileBrick: Building LEGO for 3D Reconstruction on Mobile Devices
Li, Kejie and Bian, Jia-Wang and Castle, Robert and Torr, Philip H. S. and Prisacariu, Victor Adrian



Research question: How to obtain high-quality 3D ground-truth shapes for evaluating 3D object reconstruction.
Motivation: It is difficult to create a replica of an object in reality, and even 3D reconstructions generated by 3D scanners have artefacts that bias evaluation, so a more precise approach is needed.
Method: We use LEGO models of known geometry as the 3D structures for image capture, acquiring high-resolution RGB images and low-resolution depth maps on a mobile device to obtain precise 3D ground-truth shapes.
Results: We present a novel multi-view RGBD dataset with highly precise 3D ground-truth annotations for 153 object models featuring diverse 3D structures, offering a unique opportunity for future research on high-fidelity 3D reconstruction. We also evaluate a range of 3D reconstruction algorithms on the dataset.

High-quality 3D ground-truth shapes are critical for 3D object reconstruction evaluation. However, it is difficult to create a replica of an object in reality, and even 3D reconstructions generated by 3D scanners have artefacts that cause biases in evaluation. To address this issue, we introduce a novel multi-view RGBD dataset captured using a mobile device, which includes highly precise 3D ground-truth annotations for 153 object models featuring a diverse set of 3D structures. We obtain precise 3D ground-truth shape without relying on high-end 3D scanners by utilising LEGO models with known geometry as the 3D structures for image capture. The distinct data modality offered by high-resolution RGB images and low-resolution depth maps captured on a mobile device, when combined with precise 3D geometry annotations, presents a unique opportunity for future research on high-fidelity 3D reconstruction. Furthermore, we evaluate a range of 3D reconstruction algorithms on the proposed dataset.

SteerNeRF: Accelerating NeRF Rendering via Smooth Viewpoint Trajectory
Li, Sicheng and Li, Hao and Wang, Yue and Liao, Yiyi and Yu, Lu



Research question: How to speed up Neural Radiance Field (NeRF) rendering while keeping memory consumption low.
Motivation: Existing acceleration methods speed up NeRF rendering at the cost of large memory consumption.
Method: Exploiting the fact that viewpoint changes are usually smooth and continuous under interactive viewpoint control, the method uses information from preceding viewpoints to reduce both the number of pixels to render and the number of sample points along the rays of the remaining pixels.
Results: The method achieves competitive rendering quality while significantly reducing rendering time and memory overhead, reaching 30 FPS at 1080P image resolution with a low memory footprint.

Neural Radiance Fields (NeRF) have demonstrated superior novel view synthesis performance but are slow at rendering. To speed up the volume rendering process, many acceleration methods have been proposed at the cost of large memory consumption. To push the frontier of the efficiency-memory trade-off, we explore a new perspective to accelerate NeRF rendering, leveraging a key fact that the viewpoint change is usually smooth and continuous in interactive viewpoint control. This allows us to leverage the information of preceding viewpoints to reduce the number of rendered pixels as well as the number of sampled points along the ray of the remaining pixels. In our pipeline, a low-resolution feature map is rendered first by volume rendering, then a lightweight 2D neural renderer is applied to generate the output image at target resolution leveraging the features of preceding and current frames. We show that the proposed method can achieve competitive rendering quality while reducing the rendering time with little memory overhead, enabling 30FPS at 1080P image resolution with a low memory footprint.

MonoHuman: Animatable Human Neural Field From Monocular Video
Yu, Zhengming and Cheng, Wei and Liu, Xian and Wu, Wayne and Lin, Kwan-Yee



Research question: How to exploit the representation power of neural radiance fields (NeRF) to reconstruct the human body from monocular video and enable free-viewpoint control of animatable avatars.
Motivation: Current avatar animation pipelines either rely on pose-dependent representations or lack motion coherency due to frame-independent optimization, making it difficult to generalize realistically to unseen pose sequences.
Method: Propose MonoHuman, a framework that models the deformation field with bi-directional constraints and explicitly leverages off-the-shelf keyframe information to reason about feature correlations for coherent results.
Results: Experiments show that MonoHuman renders high-fidelity, multi-view-consistent results under various challenging novel pose settings, outperforming existing methods.

Animating virtual avatars with free-view control is crucial for various applications like virtual reality and digital entertainment. Previous studies have attempted to utilize the representation power of the neural radiance field (NeRF) to reconstruct the human body from monocular videos. Recent works propose to graft a deformation network into the NeRF to further model the dynamics of the human neural field for animating vivid human motions. However, such pipelines either rely on pose-dependent representations or fall short of motion coherency due to frame-independent optimization, making it difficult to generalize to unseen pose sequences realistically. In this paper, we propose a novel framework MonoHuman, which robustly renders view-consistent and high-fidelity avatars under arbitrary novel poses. Our key insight is to model the deformation field with bi-directional constraints and explicitly leverage the off-the-peg keyframe information to reason the feature correlations for coherent results. Specifically, we first propose a Shared Bidirectional Deformation module, which creates a pose-independent generalizable deformation field by disentangling backward and forward deformation correspondences into shared skeletal motion weight and separate non-rigid motions. Then, we devise a Forward Correspondence Search module, which queries the correspondence feature of keyframes to guide the rendering network. The rendered results are thus multi-view consistent with high fidelity, even under challenging novel pose settings. Extensive experiments demonstrate the superiority of our proposed MonoHuman over state-of-the-art methods.

NeRFLix: High-Quality Neural View Synthesis by Learning a Degradation-Driven Inter-Viewpoint MiXer
Zhou, Kun and Li, Wenbo and Wang, Yi and Hu, Tao and Jiang, Nianjuan and Han, Xiaoguang and Lu, Jiangbo



Research question: For existing NeRF-based novel view synthesis methods, recovering high-quality details from the source images in real-world scenes remains challenging due to potentially imperfect calibration information and inaccurate scene representations.
Motivation: Even with high-quality training frames, novel-view frames synthesized by NeRF models still exhibit notable rendering artifacts such as noise and blur. To improve the synthesis quality of NeRF-based methods, we propose NeRFLiX, a general NeRF-agnostic restorer paradigm that learns a degradation-driven inter-viewpoint mixer.
Method: We design a NeRF-style degradation modeling approach and construct large-scale training data, making it possible for existing deep neural networks to effectively remove NeRF-native rendering artifacts. Beyond degradation removal, we further propose an inter-viewpoint aggregation framework that fuses highly related high-quality training images, pushing state-of-the-art NeRF models to entirely new performance levels and producing highly photo-realistic synthesized images.
Results: Experiments show that NeRFLiX substantially improves the synthesis quality of NeRF-based methods across a variety of real-world scenes, enhancing both the detail quality and the visual realism of rendered images.

Neural radiance fields (NeRF) show great success in novel-view synthesis. However, in real-world scenes, recovering high-quality details from the source images is still challenging for the existing NeRF-based approaches, due to the potential imperfect calibration information and scene representation inaccuracy. Even with high-quality training frames, the synthetic novel-view frames produced by NeRF models still suffer from notable rendering artifacts, such as noise, blur, etc. To improve the synthesis quality of NeRF-based approaches, we propose NeRFLiX, a general NeRF-agnostic restorer paradigm by learning a degradation-driven inter-viewpoint mixer. Specifically, we design a NeRF-style degradation modeling approach and construct large-scale training data, enabling the possibility of effectively removing those NeRF-native rendering artifacts for existing deep neural networks. Moreover, beyond the degradation removal, we propose an inter-viewpoint aggregation framework that is able to fuse highly related high-quality training images, pushing the performance of cutting-edge NeRF models to entirely new levels and producing highly photo-realistic synthetic images.

3D Human Keypoints Estimation From Point Clouds in the Wild Without Human Labels
Weng, Zhenzhen and Gorban, Alexander S. and Ji, Jingwei and Najibi, Mahyar and Zhou, Yin and Anguelov, Dragomir



Research question: How to train a 3D human keypoint detector from point clouds without large volumes of high-quality labels.
Motivation: While capturing large amounts of human point clouds is relatively easy, annotating 3D keypoints is expensive, subjective, and error-prone, especially for long-tail cases (pedestrians with rare poses, scooterists, etc.).
Method: We propose GC-KPL, a geometry-consistency-inspired keypoint learning approach that learns 3D human joint locations from point clouds through novel unsupervised loss formulations, without any human annotation.
Results: Trained at scale on the Waymo Open Dataset without human-annotated keypoints, the method achieves performance comparable to fully supervised approaches. It also benefits downstream few-shot keypoint learning: fine-tuning on only 10% of the labeled training data gives performance comparable to fine-tuning on the entire set. When trained on the full dataset, GC-KPL outperforms the state of the art by a large margin and effectively leverages large volumes of unlabeled data.

Training a 3D human keypoint detector from point clouds in a supervised manner requires large volumes of high quality labels. While it is relatively easy to capture large amounts of human point clouds, annotating 3D keypoints is expensive, subjective, error prone and especially difficult for long-tail cases (pedestrians with rare poses, scooterists, etc.). In this work, we propose GC-KPL - Geometry Consistency inspired Key Point Learning, an approach for learning 3D human joint locations from point clouds without human labels. We achieve this by our novel unsupervised loss formulations that account for the structure and movement of the human body. We show that by training on a large training set from Waymo Open Dataset without any human annotated keypoints, we are able to achieve reasonable performance as compared to the fully supervised approach. Further, the backbone benefits from the unsupervised training and is useful in downstream few-shot learning of keypoints, where fine-tuning on only 10 percent of the labeled training data gives comparable performance to fine-tuning on the entire set. We demonstrate that GC-KPL outperforms SoTA by a large margin when trained on the entire dataset and efficiently leverages large volumes of unlabeled data.
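One concrete geometry-consistency cue of the kind such unsupervised losses can exploit is that the bone lengths of a rigid skeleton should stay constant over time. The sketch below penalizes their variance across frames (an illustrative loss, not the paper's exact formulation):

```python
def bone_length_variance(joint_seq, bones):
    """Penalize variation of bone lengths over time: joint_seq is a list of
    frames, each frame a list of 3D joint positions; bones lists (i, j)
    joint-index pairs. Illustrative geometry-consistency loss."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    loss = 0.0
    for i, j in bones:
        lengths = [dist(frame[i], frame[j]) for frame in joint_seq]
        mean = sum(lengths) / len(lengths)
        # Variance of this bone's length across frames.
        loss += sum((l - mean) ** 2 for l in lengths) / len(lengths)
    return loss
```

A rigidly translating skeleton incurs zero loss, while a stretching one is penalized, giving a training signal without any keypoint labels.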

FLEX: Full-Body Grasping Without Full-Body Grasps
Tendulkar, Purva and Surí



Research question: How to generate realistic 3D human avatars that interact with a scene, particularly when grasping everyday objects.
Motivation: Existing methods require collecting large 3D datasets of humans interacting with objects for training, yet they fail to generalize to different object positions, orientations, and furniture in the scene, and the diversity of their generated full-body poses is limited.
Method: This paper proposes a new approach that leverages prior knowledge of full-body poses and hand grasps, composing them through 3D geometric constraints to synthesize full-body grasping poses.
Results: Experiments show that the method generates a variety of feasible human grasps that are superior to baselines both quantitatively and qualitatively.

Synthesizing 3D human avatars interacting realistically with a scene is an important problem with applications in AR/VR, video games, and robotics. Towards this goal, we address the task of generating a virtual human -- hands and full body -- grasping everyday objects. Existing methods approach this problem by collecting a 3D dataset of humans interacting with objects and training on this data. However, 1) these methods do not generalize to different object positions and orientations or to the presence of furniture in the scene, and 2) the diversity of their generated full-body poses is very limited. In this work, we address all the above challenges to generate realistic, diverse full-body grasps in everyday scenes without requiring any 3D full-body grasping data. Our key insight is to leverage the existence of both full-body pose and hand-grasping priors, composing them using 3D geometrical constraints to obtain full-body grasps. We empirically validate that these constraints can generate a variety of feasible human grasps that are superior to baselines both quantitatively and qualitatively.

IMP: Iterative Matching and Pose Estimation With Adaptive Pooling
Xue, Fei and Budvytis, Ignas and Cipolla, Roberto



Research question: Existing feature matching and pose estimation methods suffer from limited efficiency and accuracy.
Motivation: We propose an iterative matching and pose estimation framework (IMP) that exploits the geometric relationship between the two tasks: a few good matches suffice for a roughly accurate pose, and a roughly accurate pose in turn provides geometric constraints that guide the matching.
Method: We implement a geometry-aware recurrent module with transformers that jointly outputs sparse matches and camera poses. In each iteration, we first implicitly embed geometric information into the module via a pose-consistency loss, allowing it to progressively predict geometry-aware matches. Second, we introduce an efficient IMP (EIMP) that dynamically discards keypoints without potential matches, avoiding redundant updates and significantly reducing the quadratic time complexity of attention computation in transformers.
Results: Experiments on the YFCC100m, ScanNet, and Aachen Day-Night datasets show that the proposed method outperforms previous approaches in both accuracy and efficiency.

Previous methods solve feature matching and pose estimation using a two-stage process by first finding matches and then estimating the pose. As they ignore the geometric relationships between the two tasks, they focus on either improving the quality of matches or filtering potential outliers, leading to limited efficiency or accuracy. In contrast, we propose an iterative matching and pose estimation framework (IMP) leveraging the geometric connections between the two tasks: a few good matches are enough for a roughly accurate pose estimation; a roughly accurate pose can be used to guide the matching by providing geometric constraints. To this end, we implement a geometry-aware recurrent module with transformers which jointly outputs sparse matches and camera poses. Specifically, for each iteration, we first implicitly embed geometric information into the module via a pose-consistency loss, allowing it to predict geometry-aware matches progressively. Second, we introduce an efficient IMP (EIMP) to dynamically discard keypoints without potential matches, avoiding redundant updating and significantly reducing the quadratic time complexity of attention computation in transformers. Experiments on YFCC100m, Scannet, and Aachen Day-Night datasets demonstrate that the proposed method outperforms previous approaches in terms of accuracy and efficiency.
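The alternation at the heart of IMP, estimate a pose from the current matches and then prune matches inconsistent with that pose, can be illustrated with a toy 1-D translation problem (purely didactic; the actual method uses a transformer-based recurrent module over camera poses):

```python
def iterative_match_and_pose(src, dst, n_iters=3, tol=1.5):
    """Toy 1-D analogue of iterative matching and pose estimation: the
    "pose" is a translation t; matches inconsistent with the current pose
    estimate are pruned each iteration."""
    matches = list(range(len(src)))
    t = 0.0
    for _ in range(n_iters):
        if not matches:
            break
        # Estimate the pose (translation) from the surviving matches.
        t = sum(dst[i] - src[i] for i in matches) / len(matches)
        # Use the pose as a geometric constraint to filter matches.
        matches = [i for i in matches if abs((dst[i] - src[i]) - t) < tol]
    return t, matches
```

A gross outlier first biases the estimate, is then rejected by the pose constraint, and the next iteration recovers the correct translation from inliers only.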

Revisiting Rolling Shutter Bundle Adjustment: Toward Accurate and Fast Solution
Liao, Bangyan and Qu, Delin and Xue, Yifei and Zhang, Huiqing and Lao, Yizhen



Research question: Propose a robust and fast bundle adjustment solution based on rolling shutter camera measurements that estimates the camera's 6-DoF pose and the geometry of the environment.
Motivation: Address the limitations of existing methods, namely reliance on additional sensors, high-frame-rate video input, restrictive assumptions on camera motion and readout direction, and poor efficiency.
Method: We first investigate the influence of image-point normalization on RSBA performance and show that it better approximates real 6-DoF camera motion. We then present a novel analytical model of the visual residual covariance, used to standardize the reprojection error during optimization and thereby improve overall accuracy. More importantly, combining normalization and covariance-standardized weighting in RSBA (NW-RSBA) avoids the common planar degeneracy without constraining the filming manner. We further propose an acceleration strategy for NW-RSBA based on the sparsity of its Jacobian matrix and the Schur complement.
Results: Extensive synthetic and real-data experiments verify the effectiveness and efficiency of the proposed solution over the state of the art. We also show that the method can be easily implemented and plugged into the well-known GSSfM and GSSLAM systems as complete RSSfM and RSSLAM solutions.

We propose a robust and fast bundle adjustment solution that estimates the 6-DoF pose of the camera and the geometry of the environment based on measurements from a rolling shutter (RS) camera. This tackles the challenges in the existing works, namely relying on additional sensors, high frame rate video as input, restrictive assumptions on camera motion, readout direction, and poor efficiency. To this end, we first investigate the influence of image-point normalization on RSBA performance and show that it better approximates the real 6-DoF camera motion. Then we present a novel analytical model for the visual residual covariance, which can be used to standardize the reprojection error during the optimization, consequently improving the overall accuracy. More importantly, the combination of normalization and covariance standardization weighting in RSBA (NW-RSBA) can avoid common planar degeneracy without needing to constrain the filming manner. Besides, we propose an acceleration strategy for NW-RSBA based on the sparsity of its Jacobian matrix and Schur complement. The extensive synthetic and real data experiments verify the effectiveness and efficiency of the proposed solution over the state-of-the-art works. We also demonstrate that the proposed method can be easily implemented and plugged into the well-known GSSfM and GSSLAM systems as complete RSSfM and RSSLAM solutions.
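The covariance-standardization principle is ordinary weighted least squares: dividing each residual by its predicted standard deviation down-weights high-covariance measurements. A scalar sketch of the same idea (illustrative only, not the paper's RSBA solver):

```python
def covariance_weighted_estimate(observations, sigmas):
    """Weighted least-squares estimate of a scalar: each observation is
    weighted by the inverse of its predicted variance, the same principle
    NW-RSBA applies to standardize reprojection errors."""
    weights = [1.0 / s ** 2 for s in sigmas]
    return sum(w * y for w, y in zip(weights, observations)) / sum(weights)
```

An observation with a hundred-fold larger predicted standard deviation contributes essentially nothing to the estimate.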

Role of Transients in Two-Bounce Non-Line-of-Sight Imaging
Somasundaram, Siddharth and Dave, Akshat and Henley, Connor and Veeraraghavan, Ashok and Raskar, Ramesh



Research question: How can multiply scattered light be used to image occluded objects outside the camera's field of view?
Motivation: Recent work has demonstrated two-bounce (2B) NLOS imaging by scanning a laser and measuring the cast shadows of occluded objects in scenes with two relay surfaces.
Method: This work studies the role of time-of-flight (ToF) measurements, i.e., transients, in 2B-NLOS under multiplexed illumination. Specifically, we study how ToF information can reduce the number of measurements and the spatial resolution required for shape reconstruction.
Results: We present our findings in terms of tradeoffs between (1) temporal resolution, (2) spatial resolution, and (3) number of image captures, by studying SNR and recoverability as functions of system parameters. This yields formal mathematical constraints for designing future NLOS imaging systems, especially as ToF sensors become increasingly ubiquitous.

The goal of non-line-of-sight (NLOS) imaging is to image objects occluded from the camera's field of view using multiply scattered light. Recent works have demonstrated the feasibility of two-bounce (2B) NLOS imaging by scanning a laser and measuring cast shadows of occluded objects in scenes with two relay surfaces. In this work, we study the role of time-of-flight (ToF) measurements, i.e. transients, in 2B-NLOS under multiplexed illumination. Specifically, we study how ToF information can reduce the number of measurements and spatial resolution needed for shape reconstruction. We present our findings with respect to tradeoffs in (1) temporal resolution, (2) spatial resolution, and (3) number of image captures by studying SNR and recoverability as functions of system parameters. This leads to a formal definition of the mathematical constraints for 2B lidar. We believe that our work lays an analytical groundwork for design of future NLOS imaging systems, especially as ToF sensors become increasingly ubiquitous.

ObjectMatch: Robust Registration Using Canonical Object Correspondences
G\"umeli, CanandDai, AngelaandNie{\ss



Research question: How to improve semantic, object-centric camera pose estimation for RGB-D SLAM pipelines.
Motivation: Existing camera pose estimators rely on direct correspondences in the overlapping regions between frames and cannot align camera frames with little or no overlap.
Method: Obtain indirect correspondences via semantic object identification; for example, when an object is seen from the front in one frame and from the back in another, canonical object correspondences provide additional pose constraints. A neural network first predicts such correspondences at the pixel level; they are then combined with state-of-the-art keypoint matches in a joint Gauss-Newton optimization.
Results: In a pairwise setting, the method improves the registration recall of state-of-the-art feature matching, including from 24% to 45% on pairs with 10% or less inter-frame overlap. When registering RGB-D sequences, it outperforms cutting-edge SLAM baselines in challenging low-frame-rate scenarios, achieving more than 35% reduction in trajectory error in multiple scenes.

We present ObjectMatch, a semantic and object-centric camera pose estimator for RGB-D SLAM pipelines. Modern camera pose estimators rely on direct correspondences of overlapping regions between frames; however, they cannot align camera frames with little or no overlap. In this work, we propose to leverage indirect correspondences obtained via semantic object identification. For instance, when an object is seen from the front in one frame and from the back in another frame, we can provide additional pose constraints through canonical object correspondences. We first propose a neural network to predict such correspondences on a per-pixel level, which we then combine in our energy formulation with state-of-the-art keypoint matching solved with a joint Gauss-Newton optimization. In a pairwise setting, our method improves registration recall of state-of-the-art feature matching, including from 24% to 45% in pairs with 10% or less inter-frame overlap. In registering RGB-D sequences, our method outperforms cutting-edge SLAM baselines in challenging, low-frame-rate scenarios, achieving more than 35% reduction in trajectory error in multiple scenes.

Adaptive Patch Deformation for Textureless-Resilient Multi-View Stereo
Wang, Yuesong and Zeng, Zhaojie and Guan, Tao and Yang, Wei and Chen, Zhuo and Liu, Wenkai and Xu, Luoyuan and Luo, Yawei



Research question: How to improve performance on large-scale textureless regions in multi-view stereo while reducing memory consumption.
Motivation: Most learning-based multi-view stereo methods need to build cost volumes and enormously enlarge the receptive field to obtain satisfactory results on large textureless regions, leading to prohibitive memory consumption.
Method: Transplant the idea of deformable convolution from deep learning into the traditional PatchMatch method. For each pixel with matching ambiguity (termed an unreliable pixel), we adaptively deform the patch centered on it to extend the receptive field until it covers enough correlative reliable pixels (those without matching ambiguity) to serve as anchors. When performing PatchMatch, constrained by the anchor pixels, the matching cost of an unreliable pixel is guaranteed to reach its global minimum at the correct depth, significantly increasing the robustness of multi-view stereo. To detect more anchor pixels for better adaptive patch deformation, we propose to evaluate the matching ambiguity of a pixel by checking the convergence of its estimated depth as optimization proceeds.
Results: The method achieves state-of-the-art performance on ETH3D and Tanks and Temples while preserving low memory consumption.

In recent years, deep learning-based approaches have shown great strength in multi-view stereo because of their outstanding ability to extract robust visual features. However, most learning-based methods need to build the cost volume and increase the receptive field enormously to get a satisfactory result when dealing with large-scale textureless regions, consequently leading to prohibitive memory consumption. To be both memory-friendly and textureless-resilient, we innovatively transplant the spirit of deformable convolution from deep learning into the traditional PatchMatch-based method. Specifically, for each pixel with matching ambiguity (termed unreliable pixel), we adaptively deform the patch centered on it to extend the receptive field until covering enough correlative reliable pixels (without matching ambiguity) that serve as anchors. When performing PatchMatch, constrained by the anchor pixels, the matching cost of an unreliable pixel is guaranteed to reach the global minimum at the correct depth and therefore increases the robustness of multi-view stereo significantly. To detect more anchor pixels to ensure better adaptive patch deformation, we propose to evaluate the matching ambiguity of a certain pixel by checking the convergence of the estimated depth as optimization proceeds. As a result, our method achieves state-of-the-art performance on ETH3D and Tanks and Temples while preserving low memory consumption.
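The reliability test, checking whether a pixel's estimated depth has converged as optimization proceeds, can be sketched as follows (the window size and tolerance are assumed values, not the paper's):

```python
def is_reliable(depth_history, window=3, tol=0.01):
    """A pixel is deemed free of matching ambiguity (reliable, usable as an
    anchor) if its last `window` depth estimates vary by less than `tol`."""
    recent = depth_history[-window:]
    return len(recent) == window and max(recent) - min(recent) < tol
```

Pixels whose depth keeps oscillating are flagged as unreliable and get deformed patches anchored on their converged neighbors.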

SeaThru-NeRF: Neural Radiance Fields in Scattering Media
Levy, Deborah and Peleg, Amit and Pearl, Naama and Rosenbaum, Dan and Akkaynak, Derya and Korman, Simon and Treibitz, Tali



Research question: Existing neural radiance field (NeRF) models perform poorly on underwater or foggy scenes, where the medium strongly influences the appearance of objects.
Motivation: Because the NeRF framework is built on volumetric rendering, it has an inherent capability to account for medium effects, but no existing model handles this case properly.
Method: We develop a new rendering model for NeRFs in scattering media, based on the SeaThru image formation model, and propose a suitable architecture for learning both scene information and medium parameters.
Results: We demonstrate the effectiveness of the method on simulated and real-world scenes, correctly rendering novel photorealistic views underwater. Even more excitingly, we can render clear views of these scenes, removing the medium between the camera and the scene and reconstructing the appearance and depth of far objects severely occluded by the medium. Our code and unique datasets are available on the project's website.

Research on neural radiance fields (NeRFs) for novel view generation is exploding with new models and extensions. However, a question that remains unanswered is what happens in underwater or foggy scenes where the medium strongly influences the appearance of objects. Thus far, NeRF and its variants have ignored these cases. However, since the NeRF framework is based on volumetric rendering, it has inherent capability to account for the medium's effects, once modeled appropriately. We develop a new rendering model for NeRFs in scattering media, which is based on the SeaThru image formation model, and suggest a suitable architecture for learning both scene information and medium parameters. We demonstrate the strength of our method using simulated and real-world scenes, correctly rendering novel photorealistic views underwater. Even more excitingly, we can render clear views of these scenes, removing the medium between the camera and the scene and reconstructing the appearance and depth of far objects, which are severely occluded by the medium. Our code and unique datasets are available on the project's website.
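The SeaThru image formation model that the rendering builds on combines an attenuated direct signal with range-dependent backscatter, I_c = J_c e^(-beta_D z) + B_inf (1 - e^(-beta_B z)). A per-pixel sketch of this formula (parameter names follow the model; the NeRF integration itself is not shown):

```python
import math

def seathru_pixel(J, z, beta_D, beta_B, B_inf):
    """SeaThru image formation: scene radiance J attenuated over range z,
    plus backscatter that saturates toward the veiling light B_inf."""
    direct = J * math.exp(-beta_D * z)
    backscatter = B_inf * (1.0 - math.exp(-beta_B * z))
    return direct + backscatter
```

At zero range the observed value equals the scene radiance; at large range it converges to the veiling light, which is why distant objects vanish into the medium.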

Human Pose Estimation in Extremely Low-Light Conditions
Lee, Sohyun and Rim, Jaesung and Jeong, Boseung and Kim, Geonu and Woo, Byungju and Lee, Haechan and Cho, Sunghyun and Kwak, Suha



Research question: This paper studies human pose estimation under extremely low-light conditions.
Motivation: The task is challenging because collecting real low-light images with accurate labels is difficult, and severely corrupted inputs significantly degrade prediction quality.
Method: We develop a dedicated camera system and build a new dataset of real low-light images with accurate human pose labels. We further propose a new model and training strategy that fully exploit this privileged information to learn representations insensitive to lighting conditions.
Results: Our method performs outstandingly on real extremely low-light images, and extensive analyses validate that both our model and our dataset contribute to this success.

We study human pose estimation in extremely low-light images. This task is challenging due to the difficulty of collecting real low-light images with accurate labels, and severely corrupted inputs that degrade prediction quality significantly. To address the first issue, we develop a dedicated camera system and build a new dataset of real low-light images with accurate pose labels. Thanks to our camera system, each low-light image in our dataset is coupled with an aligned well-lit image, which enables accurate pose labeling and is used as privileged information during training. We also propose a new model and a new training strategy that fully exploit the privileged information to learn representation insensitive to lighting conditions. Our method demonstrates outstanding performance on real extremely low-light images, and extensive analyses validate that both of our model and dataset contribute to the success.
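A hedged sketch of how the privileged well-lit images could enter training: a task loss on the low-light input plus a consistency term pulling its features toward those of the aligned well-lit image (the weight `lam` and all names here are assumptions, not the paper's objective):

```python
def privileged_loss(feat_low, feat_well, pred, target, lam=0.5):
    """Hypothetical training objective: mean-squared pose error on the
    low-light input plus a feature-consistency term against the aligned
    well-lit image, which is available only at training time."""
    task = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    consist = sum((a - b) ** 2 for a, b in zip(feat_low, feat_well)) / len(feat_low)
    return task + lam * consist
```

At test time only the low-light branch is needed, so the privileged signal costs nothing at inference.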

EventNeRF: Neural Radiance Fields From a Single Colour Event Camera
Rudnev, Viktor and Elgharib, Mohamed and Theobalt, Christian and Golyanik, Vladislav



Research question: Existing event-based 3D reconstruction methods recover only sparse point clouds, which is a limiting factor for many computer vision and graphics applications.
Motivation: To address this, the paper proposes 3D-consistent, dense, and photorealistic novel view synthesis using a single colour event stream as input.
Method: At its core is a neural radiance field trained entirely in a self-supervised manner from events, while preserving the original resolution of the colour event channels. In addition, our ray sampling strategy is tailored to events and enables data-efficient training.
Results: At test time, the method produces results in RGB space at unprecedented quality. Qualitative and quantitative evaluations on several challenging synthetic and real scenes show that it produces significantly denser and more visually appealing renderings than existing methods. We also demonstrate robustness under fast motion and low lighting conditions.

Asynchronously operating event cameras find many applications due to their high dynamic range, vanishingly low motion blur, low latency and low data bandwidth. The field saw remarkable progress during the last few years, and existing event-based 3D reconstruction approaches recover sparse point clouds of the scene. However, such sparsity is a limiting factor in many cases, especially in computer vision and graphics, that has not been addressed satisfactorily so far. Accordingly, this paper proposes the first approach for 3D-consistent, dense and photorealistic novel view synthesis using just a single colour event stream as input. At its core is a neural radiance field trained entirely in a self-supervised manner from events while preserving the original resolution of the colour event channels. Next, our ray sampling strategy is tailored to events and allows for data-efficient training. At test time, our method produces results in the RGB space at unprecedented quality. We evaluate our method qualitatively and numerically on several challenging synthetic and real scenes and show that it produces significantly denser and more visually appealing renderings than the existing methods. We also demonstrate robustness in challenging scenarios with fast motion and under low lighting conditions. We release the newly recorded dataset and our source code to facilitate the research field, see https://4dqv.mpi-inf.mpg.de/EventNeRF.
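The event-camera measurement model underlying such self-supervision: an event fires whenever the log intensity drifts by more than a contrast threshold C from the level at the last event (a standard idealized model, not the paper's exact loss):

```python
def events_from_log_intensity(log_I, C=0.2):
    """Idealized event generation: emit (t, +1) / (t, -1) each time the log
    intensity crosses another contrast step of size C relative to the
    reference level set at the previous event."""
    events, ref = [], log_I[0]
    for t, x in enumerate(log_I[1:], start=1):
        while x - ref >= C:
            ref += C
            events.append((t, +1))
        while ref - x >= C:
            ref -= C
            events.append((t, -1))
    return events
```

Supervising a radiance field through this model lets rendered log-intensity differences be compared directly against observed event streams.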

Representing Volumetric Videos As Dynamic MLP Maps
Peng, Sida and Yan, Yunzhi and Shuai, Qing and Bao, Hujun and Zhou, Xiaowei



Research question: How to synthesize volumetric videos of dynamic scenes efficiently and in real time.
Motivation: Existing neural scene representations excel at modeling and rendering complex static scenes, but extending them to dynamic scenes is not straightforward due to slow rendering speed or high storage cost.
Method: Represent the radiance field of each frame as a set of shallow MLP networks whose parameters are stored in 2D grids called MLP maps and dynamically predicted by a 2D CNN decoder shared across all frames.
Results: Experiments show the approach achieves state-of-the-art rendering quality on the NHR and ZJU-MoCap datasets while supporting efficient real-time rendering, reaching 41.7 fps for 512 x 512 images on an RTX 3090 GPU.

This paper introduces a novel representation of volumetric videos for real-time view synthesis of dynamic scenes. Recent advances in neural scene representations demonstrate their remarkable capability to model and render complex static scenes, but extending them to represent dynamic scenes is not straightforward due to their slow rendering speed or high storage cost. To solve this problem, our key idea is to represent the radiance field of each frame as a set of shallow MLP networks whose parameters are stored in 2D grids, called MLP maps, and dynamically predicted by a 2D CNN decoder shared by all frames. Representing 3D scenes with shallow MLPs significantly improves the rendering speed, while dynamically predicting MLP parameters with a shared 2D CNN instead of explicitly storing them leads to low storage cost. Experiments show that the proposed approach achieves state-of-the-art rendering quality on the NHR and ZJU-MoCap datasets, while being efficient for real-time rendering with a speed of 41.7 fps for 512 x 512 images on an RTX 3090 GPU. The code is available at https://zju3dv.github.io/mlp_maps/.
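The representation can be pictured as a 2D grid where each cell holds the parameters of a tiny shallow MLP; querying a point looks up its cell and runs that MLP (a minimal sketch with an assumed parameter layout; in the paper the grid is predicted per frame by a shared 2D CNN decoder):

```python
def eval_mlp(params, x):
    # params = (W1, b1, W2, b2): one ReLU hidden layer, linear output.
    W1, b1, W2, b2 = params
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    return [sum(w * hi for w, hi in zip(row, h)) + b
            for row, b in zip(W2, b2)]

def query_mlp_map(mlp_map, u, v, x):
    """Look up the per-cell MLP parameters stored in the 2D grid (the
    "MLP map") at cell (u, v), then evaluate that MLP at point x."""
    return eval_mlp(mlp_map[v][u], x)
```

Shallow per-cell MLPs are cheap to evaluate, which is what makes real-time rendering feasible, while sharing the CNN decoder keeps storage low.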

ACR: Attention Collaboration-Based Regressor for Arbitrary Two-Hand Reconstruction
Yu, Zhengdi and Huang, Shaoli and Fang, Chen and Breckon, Toby P. and Wang, Jue



Research question: Reconstructing two hands from monocular RGB images is challenging due to frequent occlusion and mutual confusion.
Motivation: Existing methods mainly learn an entangled representation encoding the two interacting hands, which is extremely fragile to impaired interaction such as truncated hands, separate hands, or external occlusion.
Method: This paper presents ACR (Attention Collaboration-based Regressor), the first attempt to reconstruct hands in arbitrary scenarios. To achieve this, ACR explicitly mitigates interdependencies between hands and between parts by leveraging center- and part-based attention for feature extraction.
Results: We evaluate the method on various hand reconstruction datasets. It significantly outperforms the best interacting-hand approaches on the InterHand2.6M dataset while yielding performance comparable to state-of-the-art single-hand methods on the FreiHand dataset. Further qualitative results on in-the-wild and hand-object interaction datasets and on web images/videos demonstrate its effectiveness for arbitrary hand reconstruction.

Reconstructing two hands from monocular RGB images is challenging due to frequent occlusion and mutual confusion. Existing methods mainly learn an entangled representation to encode two interacting hands, which is incredibly fragile to impaired interaction, such as truncated hands, separate hands, or external occlusion. This paper presents ACR (Attention Collaboration-based Regressor), which makes the first attempt to reconstruct hands in arbitrary scenarios. To achieve this, ACR explicitly mitigates interdependencies between hands and between parts by leveraging center and part-based attention for feature extraction. However, reducing interdependence helps release the input constraint while weakening the mutual reasoning about reconstructing the interacting hands. Thus, based on center attention, ACR also learns a cross-hand prior that handles the interacting hands better. We evaluate our method on various types of hand reconstruction datasets. Our method significantly outperforms the best interacting-hand approaches on the InterHand2.6M dataset while yielding comparable performance with the state-of-the-art single-hand methods on the FreiHand dataset. More qualitative results on in-the-wild and hand-object interaction datasets and web images/videos further demonstrate the effectiveness of our approach for arbitrary hand reconstruction. Our code is available at https://github.com/ZhengdiYu/Arbitrary-Hands-3D-Reconstruction

Bringing Inputs to Shared Domains for 3D Interacting Hands Recovery in the Wild
Moon, Gyeongsik



Research question: Existing 3D interacting hands recovery methods perform well mainly in motion capture (MoCap) environments but poorly in in-the-wild (ITW) ones.
Motivation: Collecting 3D interacting hands data in the wild is extremely challenging, even for 2D data.
Method: We present InterWild, which brings MoCap and ITW samples to shared domains for robust 3D interacting hands recovery in the wild with a limited amount of ITW 2D/3D interacting hands data.
Results: By bringing the two sub-problems, per-hand 3D recovery and 3D relative translation recovery, into a shared 2D scale space and a shared appearance-invariant space respectively, InterWild achieves robust 3D interacting hands recovery on in-the-wild images despite the limited ITW data.

Despite recent achievements, existing 3D interacting hands recovery methods have shown results mainly on motion capture (MoCap) environments, not on in-the-wild (ITW) ones. This is because collecting 3D interacting hands data in the wild is extremely challenging, even for the 2D data. We present InterWild, which brings MoCap and ITW samples to shared domains for robust 3D interacting hands recovery in the wild with a limited amount of ITW 2D/3D interacting hands data. 3D interacting hands recovery consists of two sub-problems: 1) 3D recovery of each hand and 2) 3D relative translation recovery between two hands. For the first sub-problem, we bring MoCap and ITW samples to a shared 2D scale space. Although ITW datasets provide a limited amount of 2D/3D interacting hands, they contain large-scale 2D single hand data. Motivated by this, we use a single hand image as an input for the first sub-problem regardless of whether two hands are interacting. Hence, interacting hands of MoCap datasets are brought to the 2D scale space of single hands of ITW datasets. For the second sub-problem, we bring MoCap and ITW samples to a shared appearance-invariant space. Unlike the first sub-problem, 2D labels of ITW datasets are not helpful for the second sub-problem due to the 3D translation's ambiguity. Hence, instead of relying on ITW samples, we amplify the generalizability of MoCap samples by taking only a geometric feature without an image as an input for the second sub-problem. As the geometric feature is invariant to appearances, MoCap and ITW samples do not suffer from a huge appearance gap between the two datasets. The code is available in https://github.com/facebookresearch/InterWild.

NeRDi: Single-View NeRF Synthesis With Language-Guided Diffusion As General Image Priors
Deng, Congyue and Jiang, Chiyu



Research question: How to synthesize a NeRF from a single view by leveraging general image priors from 2D diffusion models.
Motivation: 2D-to-3D reconstruction is an ill-posed problem, yet humans solve it well thanks to prior knowledge of the 3D world developed over years; pretrained 2D diffusion models can supply such priors.
Method: Formulate single-view reconstruction as an image-conditioned 3D generation problem: optimize the NeRF representation by minimizing a diffusion loss on its arbitrary-view renderings with a pretrained image diffusion model under the input-view constraint, using two-section language guidance from off-the-shelf vision-language models as conditioning to improve multiview content coherence, plus a geometric loss based on estimated depth maps to regularize the underlying 3D geometry.
Results: On the DTU MVS dataset, the method synthesizes novel views of higher quality even compared to existing methods trained on that dataset, and it generalizes to zero-shot NeRF synthesis for in-the-wild images.

2D-to-3D reconstruction is an ill-posed problem, yet humans are good at solving this problem due to their prior knowledge of the 3D world developed over years. Driven by this observation, we propose NeRDi, a single-view NeRF synthesis framework with general image priors from 2D diffusion models. Formulating single-view reconstruction as an image-conditioned 3D generation problem, we optimize the NeRF representations by minimizing a diffusion loss on its arbitrary view renderings with a pretrained image diffusion model under the input-view constraint. We leverage off-the-shelf vision-language models and introduce a two-section language guidance as conditioning inputs to the diffusion model. This is essentially helpful for improving multiview content coherence as it narrows down the general image prior conditioned on the semantic and visual features of the single-view input image. Additionally, we introduce a geometric loss based on estimated depth maps to regularize the underlying 3D geometry of the NeRF. Experimental results on the DTU MVS dataset show that our method can synthesize novel views with higher quality even compared to existing methods trained on this dataset. We also demonstrate our generalizability in zero-shot NeRF synthesis for in-the-wild images.

TRACE: 5D Temporal Regression of Avatars With Dynamic Cameras in 3D Environments
Sun, Yu and Bao, Qian and Liu, Wu and Mei, Tao and Black, Michael J.



Research question: Current methods cannot reliably estimate the pose and shape of moving humans in global coordinates, especially when the camera is also moving.
Motivation: To address this, we adopt a novel 5D representation (space, time, and identity) that enables end-to-end reasoning about people in scenes.
Method: Our method, TRACE, introduces several novel architectural components. Most importantly, it uses two new "maps" to reason about the 3D trajectory of people over time in camera and world coordinates. An additional memory unit enables persistent tracking of people even through long occlusions.
Results: TRACE is the first one-stage method to jointly recover and track 3D humans in global coordinates from dynamic cameras. Trained end-to-end and using full image information, it achieves state-of-the-art performance on tracking and HPS benchmarks.

Although the estimation of 3D human pose and shape (HPS) is rapidly progressing, current methods still cannot reliably estimate moving humans in global coordinates, which is critical for many applications. This is particularly challenging when the camera is also moving, entangling human and camera motion. To address these issues, we adopt a novel 5D representation (space, time, and identity) that enables end-to-end reasoning about people in scenes. Our method, called TRACE, introduces several novel architectural components. Most importantly, it uses two new "maps" to reason about the 3D trajectory of people over time in camera, and world, coordinates. An additional memory unit enables persistent tracking of people even during long occlusions. TRACE is the first one-stage method to jointly recover and track 3D humans in global coordinates from dynamic cameras. By training it end-to-end, and using full image information, TRACE achieves state-of-the-art performance on tracking and HPS benchmarks. The code and dataset are released for research purposes.

Neural Kernel Surface Reconstruction
Huang, Jiahui and Gojcic, Zan and Atzmon, Matan and Litany, Or and Fidler, Sanja and Williams, Francis



Research question: How to reconstruct a 3D implicit surface from a large-scale, sparse, and noisy point cloud?
Motivation: The existing Neural Kernel Fields approach has limitations in scaling to large scenes, in robustness to noise, and in its training requirements.
Method: Propose a new method that scales to large scenes through compactly supported kernel functions, gains robustness to noise through a gradient fitting solve, and minimizes training requirements so that it can learn from any dataset of dense oriented points.
Results: The method can reconstruct millions of points in a few seconds and handle very large scenes in an out-of-core fashion. It achieves state-of-the-art results on reconstruction benchmarks of single objects (ShapeNet, ABC), indoor scenes (ScanNet, Matterport3D), and outdoor scenes (CARLA, Waymo).

We present a novel method for reconstructing a 3D implicit surface from a large-scale, sparse, and noisy point cloud. Our approach builds upon the recently introduced Neural Kernel Fields (NKF) representation. It enjoys similar generalization capabilities to NKF, while simultaneously addressing its main limitations: (a) We can scale to large scenes through compactly supported kernel functions, which enable the use of memory-efficient sparse linear solvers. (b) We are robust to noise, through a gradient fitting solve. (c) We minimize training requirements, enabling us to learn from any dataset of dense oriented points, and even mix training data consisting of objects and scenes at different scales. Our method is capable of reconstructing millions of points in a few seconds, and handling very large scenes in an out-of-core fashion. We achieve state-of-the-art results on reconstruction benchmarks consisting of single objects (ShapeNet, ABC), indoor scenes (ScanNet, Matterport3D), and outdoor scenes (CARLA, Waymo).
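A standard example of a compactly supported kernel of the kind that makes the Gram matrix sparse is the Wendland C2 kernel, which is exactly zero beyond its support radius (illustrative only; the paper's kernels are learned):

```python
def wendland(r, h):
    """Wendland C2 radial kernel with support radius h: positive inside the
    support, exactly zero for r >= h, so pairs of points farther apart than h
    contribute nothing to the linear system."""
    q = r / h
    if q >= 1.0:
        return 0.0
    return (1.0 - q) ** 4 * (4.0 * q + 1.0)
```

Exact zeros outside the support are what enable memory-efficient sparse linear solvers on large scenes.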

Learning 3D-Aware Image Synthesis With Unknown Pose Distribution
Shi, Zifan and Shen, Yujun and Xu, Yinghao and Peng, Sida and Liao, Yiyi and Guo, Sheng and Chen, Qifeng and Yeung, Dit-Yan



Research question: Existing 3D-aware image synthesis methods largely depend on a 3D pose distribution pre-estimated on the training set; inaccurate estimation may mislead the model into learning faulty geometry.
Motivation: This paper proposes PoF3D, which frees generative radiance fields from the requirement of 3D pose priors.
Method: First, we equip the generator with an efficient pose learner that infers a pose from a latent code, automatically approximating the underlying true pose distribution. We then assign the discriminator the task of learning the pose distribution under the supervision of the generator, and of distinguishing real from synthesized images with the predicted pose as the condition. The pose-free generator and the pose-aware discriminator are trained jointly in an adversarial manner.
Results: Extensive experiments on several datasets confirm that our method matches the state of the art in both image quality and geometry quality. To our knowledge, PoF3D is the first to demonstrate the feasibility of learning high-quality 3D-aware image synthesis without using 3D pose priors.

Existing methods for 3D-aware image synthesis largely depend on the 3D pose distribution pre-estimated on the training set. An inaccurate estimation may mislead the model into learning faulty geometry. This work proposes PoF3D that frees generative radiance fields from the requirements of 3D pose priors. We first equip the generator with an efficient pose learner, which is able to infer a pose from a latent code, to approximate the underlying true pose distribution automatically. We then assign the discriminator a task to learn pose distribution under the supervision of the generator and to differentiate real and synthesized images with the predicted pose as the condition. The pose-free generator and the pose-aware discriminator are jointly trained in an adversarial manner. Extensive results on a couple of datasets confirm that the performance of our approach, regarding both image quality and geometry quality, is on par with the state of the art. To our best knowledge, PoF3D demonstrates the feasibility of learning high-quality 3D-aware image synthesis without using 3D pose priors for the first time. Project page can be found at https://vivianszf.github.io/pof3d/.

CAPE: Camera View Position Embedding for Multi-View 3D Object Detection
Xiong, Kaixin and Gong, Shi and Ye, Xiaoqing and Tan, Xiao and Wan, Ji and Ding, Errui and Wang, Jingdong and Bai, Xiang



Research question: This paper addresses detecting 3D objects from multi-view images.
Motivation: Current query-based methods rely on global 3D position embeddings to learn the geometric correspondence between images and 3D space, but this makes the view transformation hard to learn because of varying camera extrinsics.
Method: We propose a new method based on CAmera view Position Embedding (CAPE), which forms the 3D position embeddings in the local camera-view coordinate system, so that the embedding is free of encoding camera extrinsic parameters. We further extend CAPE to temporal modeling by exploiting the object queries of previous frames and encoding ego motion, boosting 3D object detection.
Results: Experiments show that CAPE achieves state-of-the-art performance (61.0% NDS and 52.5% mAP) among LiDAR-free methods on the standard nuScenes dataset.

In this paper, we address the problem of detecting 3D objects from multi-view images. Current query-based methods rely on global 3D position embeddings (PE) to learn the geometric correspondence between images and 3D space. We claim that directly interacting 2D image features with global 3D PE could increase the difficulty of learning view transformation due to the variation of camera extrinsics. Thus we propose a novel method based on CAmera view Position Embedding, called CAPE. We form the 3D position embeddings under the local camera-view coordinate system instead of the global coordinate system, such that 3D position embedding is free of encoding camera extrinsic parameters. Furthermore, we extend our CAPE to temporal modeling by exploiting the object queries of previous frames and encoding the ego motion for boosting 3D object detection. CAPE achieves the state-of-the-art performance (61.0% NDS and 52.5% mAP) among all LiDAR-free methods on standard nuScenes dataset. Codes and models are available.

FlexNeRF: Photorealistic Free-Viewpoint Rendering of Moving Humans From Sparse Views
Jayasundara, Vinoj and Agrawal, Amit and Heron, Nicolas and Shrivastava, Abhinav and Davis, Larry S.



Research question: How to achieve photorealistic free-viewpoint rendering of moving humans from monocular videos.
Motivation: Sparse views are a challenging scenario when the subject exhibits fast or complex motion.
Method: We propose a novel approach that jointly optimizes a canonical time and pose configuration, with a pose-dependent motion field and pose-independent temporal deformations complementing each other.
Results: Thanks to our novel temporal and cyclic consistency constraints, along with additional losses on intermediate representations such as segmentation, our method provides high-quality outputs as the observed views become sparser. Experiments on public benchmark datasets and a self-captured fashion dataset show that it significantly outperforms the state of the art.

We present FlexNeRF, a method for photorealistic free-viewpoint rendering of humans in motion from monocular videos. Our approach works well with sparse views, which is a challenging scenario when the subject is exhibiting fast/complex motions. We propose a novel approach which jointly optimizes a canonical time and pose configuration, with a pose-dependent motion field and pose-independent temporal deformations complementing each other. Thanks to our novel temporal and cyclic consistency constraints along with additional losses on intermediate representation such as segmentation, our approach provides high quality outputs as the observed views become sparser. We empirically demonstrate that our method significantly outperforms the state-of-the-art on public benchmark datasets as well as a self-captured fashion dataset. The project page is available at: https://flex-nerf.github.io/.

DIFu: Depth-Guided Implicit Function for Clothed Human Reconstruction
Song, Dae-Young and Lee, HeeKyung and Seo, Jeongil and Cho, Donghyeon



Research question: An implicit-function (IF)-based method for clothed human reconstruction from a single image.
Motivation: Most existing methods rely on a volumetric 3D embedding branch, such as the SMPL model, to compensate for the lack of information in a single image.
Method: This paper proposes DIFu, a new IF-based method that utilizes a projected depth prior containing textured and non-parametric human 3D information. Specifically, DIFu consists of a generator, an occupancy prediction network, and a texture prediction network. The generator takes the frontal RGB image of a person as input and hallucinates the back-side image. Depth maps for the front/back images are then estimated and projected into 3D volume space. Finally, the occupancy prediction network extracts pixel-aligned and voxel-aligned features through a 2D encoder and a 3D encoder, respectively, and estimates occupancy from these features.
Results: Quantitative and qualitative comparisons with recent IF-based models demonstrate the effectiveness of DIFu.

Recently, implicit function (IF)-based methods for clothed human reconstruction using a single image have received a lot of attention. Most existing methods rely on a 3D embedding branch using volume such as the skinned multi-person linear (SMPL) model, to compensate for the lack of information in a single image. Beyond the SMPL, which provides skinned parametric human 3D information, in this paper, we propose a new IF-based method, DIFu, that utilizes a projected depth prior containing textured and non-parametric human 3D information. In particular, DIFu consists of a generator, an occupancy prediction network, and a texture prediction network. The generator takes an RGB image of the human front-side as input, and hallucinates the human back-side image. After that, depth maps for front/back images are estimated and projected into 3D volume space. Finally, the occupancy prediction network extracts a pixel-aligned feature and a voxel-aligned feature through a 2D encoder and a 3D encoder, respectively, and estimates occupancy using these features. Note that voxel-aligned features are obtained from the projected depth maps, thus it can contain detailed 3D information such as hair and cloths. Also, colors of each 3D point are also estimated with the texture inference branch. The effectiveness of DIFu is demonstrated by comparing to recent IF-based models quantitatively and qualitatively.
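Projecting the estimated front/back depth maps into 3D volume space, as described above, rests on standard pinhole unprojection. A minimal sketch, with made-up intrinsics rather than DIFu's actual camera setup:

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    """Lift a depth map (H, W) to camera-space 3D points (H, W, 3)
    with the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

# Toy constant-depth map with illustrative intrinsics.
depth = np.full((4, 4), 2.0)
pts = unproject_depth(depth, fx=100.0, fy=100.0, cx=2.0, cy=2.0)
```

The pixel at the principal point maps to a point on the optical axis; the resulting point cloud can then be voxelized to feed the 3D encoder.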

Towards Better Gradient Consistency for Neural Signed Distance Functions via Level Set Alignment
Ma, Baorui and Zhou, Junsheng and Liu, Yu-Shen and Han, Zhizhong



Research question: How to infer neural signed distance functions (SDFs) more accurately from point clouds or multi-view images.
Motivation: Although neural SDFs show a remarkable ability to represent geometric detail, inferring SDFs from point clouds or multi-view images with neural networks remains a challenge without signed distance supervision.
Method: We propose a level set alignment loss that evaluates the parallelism of level sets; minimizing it achieves better gradient consistency. Our novelty is that we can align all level sets to the zero level set by adaptively constraining the gradients at query points and at their projections on the zero level set.
Results: Numerical and visual comparisons show that our loss significantly improves the accuracy of SDFs inferred from point clouds or multi-view images across various benchmarks.

Neural signed distance functions (SDFs) have shown remarkable capability in representing geometry with details. However, without signed distance supervision, it is still a challenge to infer SDFs from point clouds or multi-view images using neural networks. In this paper, we claim that gradient consistency in the field, indicated by the parallelism of level sets, is the key factor affecting the inference accuracy. Hence, we propose a level set alignment loss to evaluate the parallelism of level sets, which can be minimized to achieve better gradient consistency. Our novelty lies in that we can align all level sets to the zero level set by constraining gradients at queries and their projections on the zero level set in an adaptive way. Our insight is to propagate the zero level set to everywhere in the field through consistent gradients to eliminate uncertainty in the field that is caused by the discreteness of 3D point clouds or the lack of observations from multi-view images. Our proposed loss is a general term which can be used upon different methods to infer SDFs from 3D point clouds and multi-view images. Our numerical and visual comparisons demonstrate that our loss can significantly improve the accuracy of SDFs inferred from point clouds or multi-view images under various benchmarks. Code and data are available at https://github.com/mabaorui/TowardsBetterGradient.

Vid2Avatar: 3D Avatar Reconstruction From Videos in the Wild via Self-Supervised Scene Decomposition
Guo, Chen and Jiang, Tianjian and Chen, Xu and Song, Jie and Hilliges, Otmar



Research question: How to learn human avatars from monocular in-the-wild videos.
Motivation: Reconstructing naturally moving humans is difficult: it requires accurately separating humans from arbitrary backgrounds and reconstructing detailed 3D surfaces from short video sequences.
Method: We propose Vid2Avatar, which models the human and the background in the scene jointly and solves scene decomposition and surface reconstruction directly in 3D, without any ground-truth supervision, priors extracted from large datasets of clothed human scans, or external segmentation modules.
Results: Evaluations show that the method outperforms prior art on publicly available datasets.

We present Vid2Avatar, a method to learn human avatars from monocular in-the-wild videos. Reconstructing humans that move naturally from monocular in-the-wild videos is difficult. Solving it requires accurately separating humans from arbitrary backgrounds. Moreover, it requires reconstructing detailed 3D surface from short video sequences, making it even more challenging. Despite these challenges, our method does not require any groundtruth supervision or priors extracted from large datasets of clothed human scans, nor do we rely on any external segmentation modules. Instead, it solves the tasks of scene decomposition and surface reconstruction directly in 3D by modeling both the human and the background in the scene jointly, parameterized via two separate neural fields. Specifically, we define a temporally consistent human representation in canonical space and formulate a global optimization over the background model, the canonical human shape and texture, and per-frame human pose parameters. A coarse-to-fine sampling strategy for volume rendering and novel objectives are introduced for a clean separation of dynamic human and static background, yielding detailed and robust 3D human reconstructions. The evaluation of our method shows improvements over prior art on publicly available datasets.

PixHt-Lab: Pixel Height Based Light Effect Generation for Image Compositing
Sheng, Yichen and Zhang, Jianming and Philip, Julien and Hold-Geoffroy, Yannick and Sun, Xin and Zhang, He and Ling, Lu and Benes, Bedrich



Research question: How to generate more realistic lighting effects, such as shadows and reflections, in composited images?
Motivation: Traditional computer graphics generates lighting effects with a physically-based renderer and 3D geometry, but in 2D image compositing the lack of geometry limits the quality of generated soft shadows and constrains reflections.
Method: We propose PixHt-Lab, a system that maps the pixel height representation to 3D space, reconstructs both the cutout and background geometry, and renders realistic, diverse lighting effects for image compositing. For surfaces with physically-based materials, it can render reflections with varying glossiness. To generate more realistic soft shadows, we further propose using 3D-aware buffer channels to guide a neural renderer.
Results: Experiments show that PixHt-Lab significantly improves soft shadow generation.

Lighting effects such as shadows or reflections are key in making synthetic images realistic and visually appealing. To generate such effects, traditional computer graphics uses a physically-based renderer along with 3D geometry. To compensate for the lack of geometry in 2D Image compositing, recent deep learning-based approaches introduced a pixel height representation to generate soft shadows and reflections. However, the lack of geometry limits the quality of the generated soft shadows and constrains reflections to pure specular ones. We introduce PixHt-Lab, a system leveraging an explicit mapping from pixel height representation to 3D space. Using this mapping, PixHt-Lab reconstructs both the cutout and background geometry and renders realistic, diverse, lighting effects for image compositing. Given a surface with physically-based materials, we can render reflections with varying glossiness. To generate more realistic soft shadows, we further propose to use 3D-aware buffer channels to guide a neural renderer. Both quantitative and qualitative evaluations demonstrate that PixHt-Lab significantly improves soft shadow generation.

vMAP: Vectorised Object Mapping for Neural Field SLAM
Kong, Xin and Liu, Shikun and Taher, Marwan and Davison, Andrew J.



Research question: How to build an efficient object-level dense SLAM system.
Motivation: Existing dense SLAM systems need 3D prior information; the goal here is efficient object modelling without such priors.
Method: We propose vMAP, an object-level dense SLAM system using neural field representations. Each object is represented by a small multi-layer perceptron (MLP), enabling efficient, watertight object modelling without the need for 3D priors.
Results: Experiments show that vMAP significantly improves both scene-level and object-level reconstruction quality compared with prior neural field SLAM systems.

We present vMAP, an object-level dense SLAM system using neural field representations. Each object is represented by a small MLP, enabling efficient, watertight object modelling without the need for 3D priors. As an RGB-D camera browses a scene with no prior information, vMAP detects object instances on-the-fly, and dynamically adds them to its map. Specifically, thanks to the power of vectorised training, vMAP can optimise as many as 50 individual objects in a single scene, with an extremely efficient training speed of 5Hz map update. We experimentally demonstrate significantly improved scene-level and object-level reconstruction quality compared to prior neural field SLAM systems. Project page: https://kxhit.github.io/vMAP.

Local-to-Global Registration for Bundle-Adjusting Neural Radiance Fields
Chen, Yue and Chen, Xingyu and Wang, Xuan and Zhang, Qi and Guo, Yu and Shan, Ying and Wang, Fei



Research question: Current Neural Radiance Field (NeRF) models achieve photorealistic novel view synthesis but require accurate camera poses, which limits their application.
Motivation: Although analysis-by-synthesis extensions exist that jointly learn neural 3D representations and register camera frames, they are prone to suboptimal solutions if poorly initialized.
Method: We propose L2G-NeRF, a local-to-global registration method for bundle-adjusting Neural Radiance Fields: first a pixel-wise flexible alignment, then a frame-wise constrained parametric alignment. The pixel-wise local alignment is learned in an unsupervised way by a deep network that optimizes photometric reconstruction errors. The frame-wise global alignment applies differentiable parameter-estimation solvers to the pixel-wise correspondences to find a global transformation.
Results: Experiments on synthetic and real-world data show that our method outperforms the current state of the art in high-fidelity reconstruction and in resolving large camera pose misalignment. Our module is an easy-to-use plugin applicable to NeRF variants and other neural field applications.

Neural Radiance Fields (NeRF) have achieved photorealistic novel views synthesis; however, the requirement of accurate camera poses limits its application. Despite analysis-by-synthesis extensions for jointly learning neural 3D representations and registering camera frames exist, they are susceptible to suboptimal solutions if poorly initialized. We propose L2G-NeRF, a Local-to-Global registration method for bundle-adjusting Neural Radiance Fields: first, a pixel-wise flexible alignment, followed by a frame-wise constrained parametric alignment. Pixel-wise local alignment is learned in an unsupervised way via a deep network which optimizes photometric reconstruction errors. Frame-wise global alignment is performed using differentiable parameter estimation solvers on the pixel-wise correspondences to find a global transformation. Experiments on synthetic and real-world data show that our method outperforms the current state-of-the-art in terms of high-fidelity reconstruction and resolving large camera pose misalignment. Our module is an easy-to-use plugin that can be applied to NeRF variants and other neural field applications.
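The frame-wise global alignment above applies a differentiable parameter-estimation solver to pixel-wise correspondences. A minimal stand-in for the rigid case is the closed-form Kabsch/Procrustes solve (assuming noise-free 3D correspondences and an SE(3) model; the paper's solver and transformation class may differ):

```python
import numpy as np

def fit_rigid(src, dst):
    """Least-squares rigid transform (R, t) with dst ~ R @ src + t,
    via SVD of the cross-covariance (Kabsch). Closed-form and
    differentiable, the kind of solver usable for frame-wise alignment."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: keep det(R) = +1.
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = mu_d - R @ mu_s
    return R, t

# Recover a known rotation/translation from synthetic correspondences.
rng = np.random.default_rng(2)
src = rng.normal(size=(50, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -1.0, 2.0])
dst = src @ R_true.T + t_true
R_est, t_est = fit_rigid(src, dst)
```

Because every step (means, SVD, products) is differentiable almost everywhere, gradients can flow through the solver back into the network producing the correspondences.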

DC2: Dual-Camera Defocus Control by Learning To Refocus
Alzayer, Hadi and Abuolaim, Abdullah and Chan, Leung Chun and Yang, Yang and Lou, Ying Chen and Huang, Jia-Bin and Kar, Abhishek



Research question: How to give smartphone cameras, which are approaching professional cameras in versatility and quality through hardware and software advances, control over defocus.
Motivation: Fixed aperture remains a key limitation, preventing users from controlling the depth of field (DoF) of captured images. At the same time, many smartphones now have multiple cameras with different fixed apertures.
Method: We propose DC^2, a system for synthetically varying camera aperture and focus distance and producing arbitrary defocus effects, by fusing information from such a dual-camera system.
Results: Quantitative and qualitative evaluations on real-world data show that our system outperforms the state of the art on defocus deblurring, bokeh rendering, and image refocus. Finally, we demonstrate creative post-capture defocus control enabled by our method, including tilt-shift and content-based defocus effects.

Smartphone cameras today are increasingly approaching the versatility and quality of professional cameras through a combination of hardware and software advancements. However, fixed aperture remains a key limitation, preventing users from controlling the depth of field (DoF) of captured images. At the same time, many smartphones now have multiple cameras with different fixed apertures - specifically, an ultra-wide camera with wider field of view and deeper DoF and a higher resolution primary camera with shallower DoF. In this work, we propose DC^2, a system for defocus control for synthetically varying camera aperture, focus distance and arbitrary defocus effects by fusing information from such a dual-camera system. Our key insight is to leverage real-world smartphone camera dataset by using image refocus as a proxy task for learning to control defocus. Quantitative and qualitative evaluations on real-world data demonstrate our system's efficacy where we outperform state-of-the-art on defocus deblurring, bokeh rendering, and image refocus. Finally, we demonstrate creative post-capture defocus control enabled by our method, including tilt-shift and content-based defocus effects.

Looking Through the Glass: Neural Surface Reconstruction Against High Specular Reflections
Qiu, Jiaxiong and Jiang, Peng-Tao and Zhu, Yifan and Yin, Ze-Xin and Cheng, Ming-Ming and Ren, Bo



Research question: Existing neural implicit methods reconstruct high-quality 3D object surfaces under slight specular highlights, but when target objects are captured through glass, high specular reflections (HSR) appear and corrupt the reconstruction.
Motivation: To address this, we propose NeuS-HSR, a novel surface reconstruction framework based on implicit neural rendering.
Method: In NeuS-HSR, the object surface is parameterized as an implicit signed distance function (SDF). To reduce the interference of HSR, we decompose the rendered image into two appearances: the target object and an auxiliary plane. We design a novel auxiliary plane module that combines physical assumptions and neural networks to generate the auxiliary plane appearance.
Results: Extensive experiments on synthetic and real-world datasets show that NeuS-HSR outperforms state-of-the-art methods in accurate and robust target surface reconstruction against HSR.

Neural implicit methods have achieved high-quality 3D object surfaces under slight specular highlights. However, high specular reflections (HSR) often appear in front of target objects when we capture them through glasses. The complex ambiguity in these scenes violates the multi-view consistency, then makes it challenging for recent methods to reconstruct target objects correctly. To remedy this issue, we present a novel surface reconstruction framework, NeuS-HSR, based on implicit neural rendering. In NeuS-HSR, the object surface is parameterized as an implicit signed distance function (SDF). To reduce the interference of HSR, we propose decomposing the rendered image into two appearances: the target object and the auxiliary plane. We design a novel auxiliary plane module by combining physical assumptions and neural networks to generate the auxiliary plane appearance. Extensive experiments on synthetic and real-world datasets demonstrate that NeuS-HSR outperforms state-of-the-art approaches for accurate and robust target surface reconstruction against HSR.

Decoupling Human and Camera Motion From Videos in the Wild
Ye, Vickie and Pavlakos, Georgios and Malik, Jitendra and Kanazawa, Angjoo



Research question: How to reconstruct global human trajectories from in-the-wild videos?
Motivation: Most existing methods do not model camera motion; methods that rely on background pixels to infer 3D human motion usually require a full scene reconstruction, which is often impossible for in-the-wild videos. Yet even when existing SLAM systems cannot recover accurate scene reconstructions, the motion of background pixels still provides enough signal to constrain the camera motion.
Method: We propose an optimization method that decouples camera and human motion, placing people in the same world coordinate frame. It leverages relative camera estimates together with data-driven human motion priors to resolve the scene scale ambiguity and recover global human trajectories.
Results: Our method robustly recovers the global 3D trajectories of people in challenging in-the-wild videos such as PoseTrack. We quantify the improvement over existing methods on the 3D human dataset EgoBody, and show that the recovered camera scale lets us reason about the motion of multiple people in a shared coordinate frame, improving downstream tracking on PoseTrack.

We propose a method to reconstruct global human trajectories from videos in the wild. Our optimization method decouples the camera and human motion, which allows us to place people in the same world coordinate frame. Most existing methods do not model the camera motion; methods that rely on the background pixels to infer 3D human motion usually require a full scene reconstruction, which is often not possible for in-the-wild videos. However, even when existing SLAM systems cannot recover accurate scene reconstructions, the background pixel motion still provides enough signal to constrain the camera motion. We show that relative camera estimates along with data-driven human motion priors can resolve the scene scale ambiguity and recover global human trajectories. Our method robustly recovers the global 3D trajectories of people in challenging in-the-wild videos, such as PoseTrack. We quantify our improvement over existing methods on 3D human dataset Egobody. We further demonstrate that our recovered camera scale allows us to reason about motion of multiple people in a shared coordinate frame, which improves performance of downstream tracking in PoseTrack. Code and additional results can be found at https://vye16.github.io/slahmr/.

LightedDepth: Video Depth Estimation in Light of Limited Inference View Angles
Zhu, Shengjie and Liu, Xiaoming



Research question: This paper addresses video depth estimation, i.e., inferring dense scene depth from neighboring video frames.
Motivation: Although recent works treat video depth estimation as a simplified structure-from-motion (SfM) problem, it differs from SfM in that far fewer view angles are available at inference. This setting, however, suits mono-depth and optical flow estimation.
Method: The paper decouples video depth estimation into two components: a normalized pose estimation over a flowmap, and a logged residual depth estimation over a mono-depth map. The two parts are unified by an efficient off-the-shelf scale alignment algorithm. Additionally, indoor two-view pose estimation is stabilized by adding extra projection constraints and ensuring sufficient camera translation.
Results: Although a two-view algorithm, it substantially outperforms multi-view iterative prior works on indoor and outdoor datasets, validating the benefit of the decoupling. Code and models are available at https://github.com/ShngJZ/LightedDepth.

Video depth estimation infers dense scene depth from immediately neighboring video frames. While recent works consider it a simplified structure-from-motion (SfM) problem, it still differs from SfM in that significantly fewer view angles are available at inference. This setting, however, suits mono-depth and optical flow estimation. This observation motivates us to decouple video depth estimation into two components: a normalized pose estimation over a flowmap, and a logged residual depth estimation over a mono-depth map. The two parts are unified with an efficient off-the-shelf scale alignment algorithm. Additionally, we stabilize indoor two-view pose estimation by including additional projection constraints and ensuring sufficient camera translation. Though a two-view algorithm, we validate the benefit of the decoupling with substantial performance improvements over multi-view iterative prior works on indoor and outdoor datasets. Codes and models are available at https://github.com/ShngJZ/LightedDepth.
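The off-the-shelf scale alignment algorithm is not specified in the abstract; a minimal robust stand-in is median-ratio alignment, which scales a relative mono-depth map onto metric two-view depth:

```python
import numpy as np

def align_scale(mono_depth, metric_depth, valid):
    """Scale a relative mono-depth map onto metric depth using the median
    ratio over valid pixels: a simple, outlier-robust stand-in for the
    paper's off-the-shelf scale alignment."""
    s = np.median(metric_depth[valid] / mono_depth[valid])
    return s * mono_depth, s

# Synthetic example: mono-depth is correct up to an unknown scale of 2.5.
rng = np.random.default_rng(3)
metric = rng.uniform(1.0, 10.0, size=(32, 32))   # "two-view" metric depth
mono = metric / 2.5                              # relative mono-depth
valid = np.ones_like(metric, dtype=bool)
aligned, scale = align_scale(mono, metric, valid)
```

The median makes the estimate insensitive to a minority of bad correspondences, which matters when the two-view depth is only trusted at sparse, well-matched pixels.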

SparsePose: Sparse-View Camera Pose Regression and Refinement
Sinha, Samarth and Zhang, Jason Y. and Tagliasacchi, Andrea and Gilitschenski, Igor and Lindell, David B.



Research question: How to estimate accurate camera poses from a sparse collection of images.
Motivation: Existing pose estimation methods often fail when only a few images are available, because they rely on robustly identifying and matching visual features between image pairs.
Method: We propose Sparse-View Camera Pose Regression and Refinement (SparsePose), which learns to recover accurate camera poses from sparse, wide-baseline images after training on a large-scale object dataset (Co3D).
Results: Experiments show that SparsePose significantly outperforms conventional and learning-based baselines in recovering accurate camera rotations and translations, and enables high-fidelity 3D reconstruction from only 5-9 images.

Camera pose estimation is a key step in standard 3D reconstruction pipelines that operates on a dense set of images of a single object or scene. However, methods for pose estimation often fail when there are only a few images available because they rely on the ability to robustly identify and match visual features between pairs of images. While these methods can work robustly with dense camera views, capturing a large set of images can be time consuming or impractical. Here, we propose Sparse-View Camera Pose Regression and Refinement (SparsePose) for recovering accurate camera poses given a sparse set of wide-baseline images (fewer than 10). The method learns to regress initial camera poses and then iteratively refine them after training on a large-scale dataset of objects (Co3D: Common Objects in 3D). SparsePose significantly outperforms conventional and learning-based baselines in recovering accurate camera rotations and translations. We also demonstrate our pipeline for high-fidelity 3D reconstruction using only 5-9 images of an object.

Flow Supervision for Deformable NeRF
Wang, Chaoyang and MacDonald, Lachlan Ewen and Jeni, László A. and Lucey, Simon



Research question: This paper proposes a new method for deformable NeRF that directly uses optical flow as supervision, addressing the computational inefficiency of existing flow constraints.
Motivation: Enforcing flow constraints on the backward deformation field used by deformable NeRFs requires computing scene flow, which is computationally inefficient.
Method: We propose a new method that does not need to invert the backward deformation function to compute scene flow between frames. This dramatically simplifies the problem, since one is no longer restricted to deformation functions that can be analytically inverted.
Results: Experiments on monocular novel view synthesis with rapid object motion show significant improvements over baselines without flow supervision.

In this paper we present a new method for deformable NeRF that can directly use optical flow as supervision. We overcome the major challenge with respect to the computationally inefficiency of enforcing the flow constraints to the backward deformation field, used by deformable NeRFs. Specifically, we show that inverting the backward deformation function is actually not needed for computing scene flows between frames. This insight dramatically simplifies the problem, as one is no longer constrained to deformation functions that can be analytically inverted. Instead, thanks to the weak assumptions required by our derivation based on the inverse function theorem, our approach can be extended to a broad class of commonly used backward deformation field. We present results on monocular novel view synthesis with rapid object motion, and demonstrate significant improvements over baselines without flow supervision.

MOVES: Manipulated Objects in Video Enable Segmentation
Higgins, Richard E. L. and Fouhey, David F.



Research question: How to learn, from manipulation, to understand the objects people hold and hand-object contact.
Motivation: Traditional image segmentation requires tedious manual annotation. We instead want to train a system by observing realistic video data, so that it automatically learns object grouping and hand-object contact.
Method: We train a system that takes a single RGB image and produces a pixel embedding that can answer grouping questions (do these two pixels go together?) and hand-association questions (is this hand holding that pixel?). Rather than painstakingly annotating segmentation masks, we observe realistic video data, pairing epipolar geometry with modern optical flow to produce simple and effective pseudo-labels for grouping. Given person segmentations, we can further associate pixels with hands to understand contact.
Results: Our system achieves competitive results on hand and hand-held object tasks.

We present a method that uses manipulation to learn to understand the objects people hold, as well as hand-object contact. We train a system that takes a single RGB image and produces a pixel embedding that can be used to answer grouping questions (do these two pixels go together?) as well as hand-association questions (is this hand holding that pixel?). Rather than painstakingly annotating segmentation masks, we observe people in realistic video data. We show that pairing epipolar geometry with modern optical flow produces simple and effective pseudo-labels for grouping. Given people segmentations, we can further associate pixels with hands to understand contact. Our system achieves competitive results on hand and hand-held object tasks.
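Pairing epipolar geometry with optical flow to obtain grouping pseudo-labels can be sketched via the epipolar residual: flow-tracked background pixels obey the epipolar constraint of the camera motion, while pixels on an independently manipulated object violate it. The fundamental matrix below encodes a toy sideways camera translation, and thresholds/robustification are omitted; this is an illustration of the principle, not the paper's pipeline:

```python
import numpy as np

def epipolar_residual(x1, x2, F):
    """Point-to-epipolar-line distance for correspondences x1 -> x2
    (homogeneous rows) under fundamental matrix F. Static-background
    pixels satisfy x2^T F x1 ~ 0; moving (manipulated) pixels do not."""
    l2 = x1 @ F.T                                  # epipolar lines in image 2
    num = np.abs(np.sum(x2 * l2, axis=-1))
    return num / np.hypot(l2[:, 0], l2[:, 1])

# F for a pure sideways translation: epipolar lines are horizontal.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
x1 = np.array([[10.0, 5.0, 1.0]])
x2_static = np.array([[14.0, 5.0, 1.0]])   # flow moved it along the epipolar line
x2_moving = np.array([[14.0, 9.0, 1.0]])   # off the line: candidate object pixel
r_static = epipolar_residual(x1, x2_static, F)
r_moving = epipolar_residual(x1, x2_moving, F)
```

Thresholding such residuals over a flow field separates pixels that move with the camera from pixels that move with the hand, yielding free grouping labels.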

ShadowNeuS: Neural SDF Reconstruction by Shadow Ray Supervision
Ling, Jingwang and Wang, Zhibo and Xu, Feng



Research question: How to reconstruct a neural scene representation from single-view images by supervising shadow rays in addition to camera rays.
Motivation: Current NeRF models consider only camera rays and ignore the shadow rays between the light source and the scene, which limits their performance under complex lighting conditions.
Method: We propose a novel shadow ray supervision scheme that optimizes both the samples along the ray and the ray location. By supervising shadow rays, we successfully reconstruct a neural signed distance field (SDF) of the scene from single-view images.
Results: On the challenging tasks of shape reconstruction from single-view binary shadows or RGB images, the method shows significant improvements over previous work.

By supervising camera rays between a scene and multi-view image planes, NeRF reconstructs a neural scene representation for the task of novel view synthesis. On the other hand, shadow rays between the light source and the scene have yet to be considered. Therefore, we propose a novel shadow ray supervision scheme that optimizes both the samples along the ray and the ray location. By supervising shadow rays, we successfully reconstruct a neural SDF of the scene from single-view images under multiple lighting conditions. Given single-view binary shadows, we train a neural network to reconstruct a complete scene not limited by the camera's line of sight. By further modeling the correlation between the image colors and the shadow rays, our technique can also be effectively extended to RGB inputs. We compare our method with previous works on challenging tasks of shape reconstruction from single-view binary shadow or RGB images and observe significant improvements. The code and data are available at https://github.com/gerwang/ShadowNeuS.

A Light Touch Approach to Teaching Transformers Multi-View Geometry
Bhalgat, Yash and Henriques, João F. and Zisserman, Andrew



Research question: How can visual Transformers handle multi-view geometry tasks while remaining flexible yet obeying the rigid laws of projective geometry.
Motivation: Visual Transformers are powerful learners, largely because they lack manually-specified priors, but this flexibility can be problematic for tasks involving the near-infinite variations of 3D shapes and viewpoints together with the precise nature of projective geometry.
Method: We propose a "light touch" strategy that uses epipolar lines to guide the Transformer's cross-attention maps, penalizing attention values outside the epipolar lines and encouraging higher attention along them, since they contain the geometrically plausible matches.
Results: Experiments show that the method outperforms existing approaches on object retrieval without needing camera pose information at test time.

Transformers are powerful visual learners, in large part due to their conspicuous lack of manually-specified priors. This flexibility can be problematic in tasks that involve multiple-view geometry, due to the near-infinite possible variations in 3D shapes and viewpoints (requiring flexibility), and the precise nature of projective geometry (obeying rigid laws). To resolve this conundrum, we propose a "light touch" approach, guiding visual Transformers to learn multiple-view geometry but allowing them to break free when needed. We achieve this by using epipolar lines to guide the Transformer's cross-attention maps, penalizing attention values outside the epipolar lines and encouraging higher attention along these lines since they contain geometrically plausible matches. Unlike previous methods, our proposal does not require any camera pose information at test-time. We focus on pose-invariant object instance retrieval, where standard Transformer networks struggle, due to the large differences in viewpoint between query and retrieved images. Experimentally, our method outperforms state-of-the-art approaches at object retrieval, without needing pose information at test-time.
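Guiding cross-attention with epipolar lines amounts to biasing the attention logits by each key location's distance to the query's epipolar line. A hedged sketch: the Gaussian falloff, `sigma`, and `weight` below are illustrative, not the paper's exact penalty:

```python
import numpy as np

def epipolar_attention_bias(query_pts, key_pts, F, sigma=2.0, weight=4.0):
    """Additive bias for cross-attention logits: near-zero for key
    locations on the query's epipolar line, about -weight far from it
    (names and the Gaussian falloff are illustrative assumptions)."""
    lines = query_pts @ F.T                               # (Q, 3) epipolar lines
    n = np.hypot(lines[:, 0], lines[:, 1])[:, None]
    d = np.abs(lines @ key_pts.T) / n                     # (Q, K) point-line distances
    return -weight * (1.0 - np.exp(-(d / sigma) ** 2))    # 0 on the line, -weight far away

# Toy setup: sideways translation, so epipolar lines are horizontal.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
q = np.array([[8.0, 3.0, 1.0]])
keys = np.array([[2.0, 3.0, 1.0],      # on the epipolar line
                 [2.0, 30.0, 1.0]])    # far from it
bias = epipolar_attention_bias(q, keys, F)
```

Adding such a bias before the softmax concentrates attention on geometrically plausible matches while still letting the network overrule the prior when the evidence demands it.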

NeRF-Supervised Deep Stereo
Tosi, Fabio and Tonioni, Alessio and De Gregorio, Daniele and Poggi, Matteo



Research question: This paper proposes a new framework for training deep stereo networks without any ground truth.
Motivation: Existing self-supervised methods trail supervised models by a large margin on the challenging Middlebury dataset; the goal is to close this gap and to generalize well in the zero-shot setting.
Method: State-of-the-art neural rendering is used to generate stereo training data from image sequences collected with a single handheld camera. A NeRF-supervised training procedure is then carried out, exploiting rendered stereo triplets to compensate for occlusions and rendered depth maps as proxy labels.
Results: Experiments show that models trained under this regime improve 30-40% over existing self-supervised methods on the challenging Middlebury dataset, filling the gap to supervised models and, in most cases, outperforming them in zero-shot generalization.

We introduce a novel framework for training deep stereo networks effortlessly and without any ground-truth. By leveraging state-of-the-art neural rendering solutions, we generate stereo training data from image sequences collected with a single handheld camera. On top of them, a NeRF-supervised training procedure is carried out, from which we exploit rendered stereo triplets to compensate for occlusions and depth maps as proxy labels. This results in stereo networks capable of predicting sharp and detailed disparity maps. Experimental results show that models trained under this regime yield a 30-40% improvement over existing self-supervised methods on the challenging Middlebury dataset, filling the gap to supervised models and, most times, outperforming them at zero-shot generalization.

DualRefine: Self-Supervised Depth and Pose Estimation Through Iterative Epipolar Sampling and Refinement Toward Equilibrium
Bangunharcana, Antyanta and Magd, Ahmed and Kim, Kyung-Soo



Research question: How to improve self-supervised depth and pose estimation, given that the accuracy of pose estimates shapes the epipolar geometry used to compute matching costs between adjacent frames.
Motivation: Self-supervised multi-frame depth estimation computes matching costs of pixel correspondences between adjacent frames based on relative pose estimates; accurate poses are essential for precise matching costs, and improved depth can in turn be used to refine the poses.
Method: Inspired by traditional structure-from-motion (SfM) principles, the DualRefine model tightly couples depth and pose estimation through a feedback loop. A deep equilibrium model framework iteratively refines the depth estimates and a hidden state of feature maps by computing local matching costs based on epipolar geometry, and the refined depth and features are used to update the pose at each step, gradually altering the epipolar geometry during refinement.
Results: Experiments on the KITTI dataset demonstrate competitive depth and odometry prediction performance, surpassing published self-supervised baselines.

Self-supervised multi-frame depth estimation achieves high accuracy by computing matching costs of pixel correspondences between adjacent frames, injecting geometric information into the network. These pixel-correspondence candidates are computed based on the relative pose estimates between the frames. Accurate pose predictions are essential for precise matching cost computation as they influence the epipolar geometry. Furthermore, improved depth estimates can, in turn, be used to align pose estimates. Inspired by traditional structure-from-motion (SfM) principles, we propose the DualRefine model, which tightly couples depth and pose estimation through a feedback loop. Our novel update pipeline uses a deep equilibrium model framework to iteratively refine depth estimates and a hidden state of feature maps by computing local matching costs based on epipolar geometry. Importantly, we used the refined depth estimates and feature maps to compute pose updates at each step. This update in the pose estimates slowly alters the epipolar geometry during the refinement process. Experimental results on the KITTI dataset demonstrate competitive depth prediction and odometry prediction performance surpassing published self-supervised baselines. The code is available at https://github.com/antabangun/DualRefine.
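The deep equilibrium framing treats refinement as solving for a fixed point rather than unrolling a fixed number of update steps. A minimal sketch, with a toy contraction standing in for the joint depth/pose update (the real model's update involves epipolar matching costs and learned features):

```python
import numpy as np

def fixed_point(f, z0, iters=100, tol=1e-10):
    """Deep-equilibrium-style solve: iterate z <- f(z) until the update is
    tiny, returning the equilibrium. A DEQ layer defines its output as this
    fixed point instead of the result of a fixed stack of refinement steps."""
    z = z0
    for _ in range(iters):
        z_next = f(z)
        if np.max(np.abs(z_next - z)) < tol:
            return z_next
        z = z_next
    return z

# Toy contraction: z* solves z = 0.5 * z + 1, i.e. z* = 2.
step = lambda z: 0.5 * z + 1.0
z_star = fixed_point(step, np.zeros(3))
```

For a contraction the iteration converges regardless of initialization, which is what lets DEQ-style models refine to an equilibrium with constant memory via implicit differentiation.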

Im2Hands: Learning Attentive Implicit Representation of Interacting Two-Hand Shapes
Lee, Jihyun and Sung, Minhyuk and Choi, Honggyu and Kim, Tae-Kyun



Research question: How to effectively reconstruct a neural implicit representation of two interacting hands.
Motivation: Existing two-hand reconstruction methods rely on parametric hand models and/or low-resolution meshes, whereas Im2Hands can produce fine-grained two-hand geometry with high hand-to-hand and hand-to-image coherency.
Method: Im2Hands models the occupancy volume of two hands with two novel attention-based modules responsible for (1) initial occupancy estimation and (2) context-aware occupancy refinement, respectively. It first learns per-hand neural articulated occupancy in a canonical space designed for each hand using query-image attention, and then refines the initial two-hand occupancy in the posed space to enhance the coherency between the two hand shapes.
Results: Experiments show that Im2Hands outperforms related methods on two-hand reconstruction, achieving state-of-the-art results.

We present Implicit Two Hands (Im2Hands), the first neural implicit representation of two interacting hands. Unlike existing methods on two-hand reconstruction that rely on a parametric hand model and/or low-resolution meshes, Im2Hands can produce fine-grained geometry of two hands with high hand-to-hand and hand-to-image coherency. To handle the shape complexity and interaction context between two hands, Im2Hands models the occupancy volume of two hands -- conditioned on an RGB image and coarse 3D keypoints -- by two novel attention-based modules responsible for (1) initial occupancy estimation and (2) context-aware occupancy refinement, respectively. Im2Hands first learns per-hand neural articulated occupancy in the canonical space designed for each hand using query-image attention. It then refines the initial two-hand occupancy in the posed space to enhance the coherency between the two hand shapes using query-anchor attention. In addition, we introduce an optional keypoint refinement module to enable robust two-hand shape estimation from predicted hand keypoints in a single-image reconstruction scenario. We experimentally demonstrate the effectiveness of Im2Hands on two-hand reconstruction in comparison to related methods, where ours achieves state-of-the-art results. Our code is publicly available at https://github.com/jyunlee/Im2Hands.

Long-Term Visual Localization With Mobile Sensors
Yan, Shen and Liu, Yu and Wang, Long and Shen, Zehong and Peng, Zhen and Liu, Haomin and Zhang, Maojun and Zhang, Guofeng and Zhou, Xiaowei



Research question: Image-based camera localization in temporally-varying outdoor environments remains challenging because of the huge appearance disparity between query and reference images caused by illumination, seasonal, and structural changes.
Motivation: Despite remarkable advances in image matching and pose estimation, this appearance disparity still defeats purely image-based localization, so we turn to additional signals available on mobile devices.
Method: We propose to leverage the additional sensors on a mobile phone, mainly GPS, compass, and gravity sensor, to solve this challenging problem. These mobile sensors provide decent initial poses and effective constraints that reduce the search space for image matching and final pose estimation.
Results: We collect a new dataset that provides a variety of mobile sensor data and significant scene appearance variations, and develop a system to acquire ground-truth poses for query images. Benchmarking our method against several state-of-the-art baselines demonstrates the effectiveness of the proposed approach.

Despite the remarkable advances in image matching and pose estimation, image-based localization of a camera in a temporally-varying outdoor environment is still a challenging problem due to huge appearance disparity between query and reference images caused by illumination, seasonal and structural changes. In this work, we propose to leverage additional sensors on a mobile phone, mainly GPS, compass, and gravity sensor, to solve this challenging problem. We show that these mobile sensors provide decent initial poses and effective constraints to reduce the searching space in image matching and final pose estimation. With the initial pose, we are also able to devise a direct 2D-3D matching network to efficiently establish 2D-3D correspondences instead of tedious 2D-2D matching in existing systems. As no public dataset exists for the studied problem, we collect a new dataset that provides a variety of mobile sensor data and significant scene appearance variations, and develop a system to acquire ground-truth poses for query images. We benchmark our method as well as several state-of-the-art baselines and demonstrate the effectiveness of the proposed approach. Our code and dataset are available on the project page: https://zju3dv.github.io/sensloc/.

Relightable Neural Human Assets From Multi-View Gradient Illuminations
Zhou, Taotao and He, Kai and Wu, Di and Xu, Teng and Zhang, Qixuan and Shao, Kuixiang and Chen, Wenzheng and Xu, Lan and Yu, Jingyi



Research question: Most existing human datasets provide multi-view portraits only under a single illumination, which is of limited help for human modeling and relighting under varying lighting.
Motivation: To advance research on both human modeling and relighting, this paper presents UltraStage, a new 3D human dataset with more than 2,000 high-quality human assets captured under multiple views and multiple illuminations.
Method: Each example provides 32 surrounding views lit by one white light and two gradient illuminations. Beyond regular multi-view images, the gradient illuminations help recover detailed surface normals and spatially-varying material maps, supporting various relighting applications. Inspired by recent advances in neural representations, each example is further interpreted as a neural human asset that supports novel view synthesis under arbitrary lighting.
Results: The neural human assets achieve extremely high capture quality and represent fine details such as facial wrinkles and cloth folds. UltraStage is validated on single-image relighting: training networks with virtual relit data from the neural assets yields more realistic renderings than prior art. UltraStage will be released to the community to stimulate future work on human modeling and rendering.

Human modeling and relighting are two fundamental problems in computer vision and graphics, where high-quality datasets can largely facilitate related research. However, most existing human datasets only provide multi-view human images captured under the same illumination. Although valuable for modeling tasks, they are not readily used in relighting problems. To promote research in both fields, in this paper, we present UltraStage, a new 3D human dataset that contains more than 2,000 high-quality human assets captured under both multi-view and multi-illumination settings. Specifically, for each example, we provide 32 surrounding views illuminated with one white light and two gradient illuminations. In addition to regular multi-view images, gradient illuminations help recover detailed surface normal and spatially-varying material maps, enabling various relighting applications. Inspired by recent advances in neural representation, we further interpret each example into a neural human asset which allows novel view synthesis under arbitrary lighting conditions. We show our neural human assets can achieve extremely high capture performance and are capable of representing fine details such as facial wrinkles and cloth folds. We also validate UltraStage in single image relighting tasks, training neural networks with virtual relighted data from neural assets and demonstrating realistic rendering improvements over prior arts. UltraStage will be publicly available to the community to stimulate significant future developments in various human modeling and rendering tasks. The dataset is available at https://miaoing.github.io/RNHA.
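The way gradient illuminations expose surface normals can be sketched under an idealized Lambertian model: if a linear gradient light along axis k produces I_k = albedo * (n_k + 1) / 2 and the full white light produces I_full = albedo, the normal follows from per-pixel ratios. The dataset's actual processing is more involved; this is a minimal illustration, and the function name is hypothetical.

```python
import numpy as np

def normals_from_gradient_illumination(i_x, i_y, i_z, i_full, eps=1e-8):
    """Recover per-pixel unit normals from three gradient-lit images and one
    fully-lit image, assuming the idealized model I_k = albedo * (n_k + 1) / 2
    and I_full = albedo, so that n_k = 2 * I_k / I_full - 1."""
    n = np.stack([2 * i_x / (i_full + eps) - 1,
                  2 * i_y / (i_full + eps) - 1,
                  2 * i_z / (i_full + eps) - 1], axis=-1)
    # Re-normalize to unit length to absorb small model violations.
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + eps)
```

Note the albedo cancels in the ratio, which is exactly why gradient/full image pairs yield normals independently of surface color.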

DyLiN: Making Light Field Networks Dynamic
Yu, Heng and Julin, Joel and Milacski, Zoltán Á. and Niinuma, Koichiro and Jeni, László A.



Research question: How can light field networks handle dynamic, non-rigidly deforming scenes efficiently?
Motivation: Current light field networks represent 3D structure from 2D observations efficiently, but are limited to holistic, static scenes and cannot handle non-rigid deformation.
Method: The proposed Dynamic Light Field Network (DyLiN) learns a deformation field from input rays to canonical rays and lifts them into a higher-dimensional space to handle discontinuities. CoDyLiN further augments DyLiN with controllable attribute inputs. Both models are trained via knowledge distillation from pretrained dynamic radiance fields.
Results: On synthetic and real-world datasets with various non-rigid deformations, DyLiN qualitatively outperforms and quantitatively matches state-of-the-art methods in visual fidelity while being 25-71x computationally faster. CoDyLiN also surpasses its teacher model on attribute-annotated data.

Light Field Networks, the re-formulations of radiance fields to oriented rays, are magnitudes faster than their coordinate network counterparts, and provide higher fidelity with respect to representing 3D structures from 2D observations. They would be well suited for generic scene representation and manipulation, but suffer from one problem: they are limited to holistic and static scenes. In this paper, we propose the Dynamic Light Field Network (DyLiN) method that can handle non-rigid deformations, including topological changes. We learn a deformation field from input rays to canonical rays, and lift them into a higher dimensional space to handle discontinuities. We further introduce CoDyLiN, which augments DyLiN with controllable attribute inputs. We train both models via knowledge distillation from pretrained dynamic radiance fields. We evaluated DyLiN using both synthetic and real world datasets that include various non-rigid deformations. DyLiN qualitatively outperformed and quantitatively matched state-of-the-art methods in terms of visual fidelity, while being 25 - 71x computationally faster. We also tested CoDyLiN on attribute annotated data and it surpassed its teacher model. Project page: https://dylin2023.github.io.

Neuralangelo: High-Fidelity Neural Surface Reconstruction
Li, Zhaoshuo and Müller, Thomas and Evans, Alex and Taylor, Russell H. and Unberath, Mathias and Liu, Ming-Yu and Lin, Chen-Hsuan



Research question: How to recover dense 3D surfaces via image-based neural rendering.
Motivation: Current methods struggle to recover the detailed structures of real-world scenes.
Method: Neuralangelo combines the representation power of multi-resolution 3D hash grids with neural surface rendering, using numerical gradients to compute higher-order derivatives as a smoothing operation, and coarse-to-fine optimization on the hash grids to control different levels of detail.
Results: Even without auxiliary inputs such as depth, Neuralangelo effectively recovers dense 3D surface structures from multi-view images with fidelity significantly surpassing previous methods, enabling detailed large-scale scene reconstruction from RGB video captures.

Neural surface reconstruction has been shown to be powerful for recovering dense 3D surfaces via image-based neural rendering. However, current methods struggle to recover detailed structures of real-world scenes. To address the issue, we present Neuralangelo, which combines the representation power of multi-resolution 3D hash grids with neural surface rendering. Two key ingredients enable our approach: (1) numerical gradients for computing higher-order derivatives as a smoothing operation and (2) coarse-to-fine optimization on the hash grids controlling different levels of details. Even without auxiliary inputs such as depth, Neuralangelo can effectively recover dense 3D surface structures from multi-view images with fidelity significantly surpassing previous methods, enabling detailed large-scale scene reconstruction from RGB video captures.
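The first key ingredient, numerical gradients as a smoothing operation, can be sketched directly: a central finite difference with step `eps` averages the surface field over a neighborhood, so a large `eps` yields a smoothed (coarse) gradient and shrinking `eps` over training recovers fine detail. A minimal sketch on an analytic SDF (the paper applies this to a hash-grid-encoded SDF):

```python
import numpy as np

def numerical_gradient(sdf, x, eps):
    """Central finite-difference gradient of an SDF at points x (N, D).
    Larger eps acts as a low-pass filter on the gradient; annealing eps
    from coarse to fine mirrors the coarse-to-fine optimization."""
    grad = np.zeros_like(x, dtype=float)
    for k in range(x.shape[-1]):
        off = np.zeros_like(x, dtype=float)
        off[..., k] = eps
        grad[..., k] = (sdf(x + off) - sdf(x - off)) / (2 * eps)
    return grad

def sphere_sdf(p):
    # Signed distance to the unit sphere, as a stand-in for a learned SDF.
    return np.linalg.norm(p, axis=-1) - 1.0
```

For the unit sphere, the analytic gradient at a point is the outward radial direction, which the finite difference reproduces.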

Neural Vector Fields: Implicit Representation by Explicit Learning
Yang, Xianghui and Lin, Guosheng and Chen, Zhenghao and Zhou, Luping



Research question: How to perform 3D surface reconstruction effectively.
Motivation: Existing methods are limited in resolution and topology, calling for a new representation.
Method: A new 3D representation, Neural Vector Fields (NVF), combines an explicit learning process that manipulates meshes directly with the powerful representation ability of implicit functions, breaking the barriers of resolution and topology.
Results: Experiments show the method outperforms existing approaches across evaluation scenarios, including watertight vs. non-watertight shapes, category-specific vs. category-agnostic reconstruction, category-unseen reconstruction, and cross-domain reconstruction.

Deep neural networks (DNNs) are widely applied for nowadays 3D surface reconstruction tasks and such methods can be further divided into two categories, which respectively warp templates explicitly by moving vertices or represent 3D surfaces implicitly as signed or unsigned distance functions. Taking advantage of both advanced explicit learning process and powerful representation ability of implicit functions, we propose a novel 3D representation method, Neural Vector Fields (NVF). It not only adopts the explicit learning process to manipulate meshes directly, but also leverages the implicit representation of unsigned distance functions (UDFs) to break the barriers in resolution and topology. Specifically, our method first predicts the displacements from queries towards the surface and models the shapes as Vector Fields. Rather than relying on network differentiation to obtain direction fields as most existing UDF-based methods, the produced vector fields encode the distance and direction fields both and mitigate the ambiguity at "ridge" points, such that the calculation of direction fields is straightforward and differentiation-free. The differentiation-free characteristic enables us to further learn a shape codebook via Vector Quantization, which encodes the cross-object priors, accelerates the training procedure, and boosts model generalization on cross-category reconstruction. The extensive experiments on surface reconstruction benchmarks indicate that our method outperforms those state-of-the-art methods in different evaluation scenarios including watertight vs non-watertight shapes, category-specific vs category-agnostic reconstruction, category-unseen reconstruction, and cross-domain reconstruction. Our code is released at https://github.com/Wi-sc/NVF.
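The core idea, a vector field whose value at a query point is the displacement toward the surface, jointly encoding the unsigned distance (its norm) and the direction field (its orientation), can be sketched with a nearest-neighbor oracle standing in for the learned network. This is an illustrative sketch, not the NVF architecture:

```python
import numpy as np

def vector_field(queries, surface_points):
    """For each query (N, D), return the displacement toward the nearest
    surface sample (M, D). The displacement encodes both the unsigned
    distance (its norm) and the direction field (its unit vector), so no
    network differentiation is needed to obtain directions."""
    d = queries[:, None, :] - surface_points[None, :, :]
    dist = np.linalg.norm(d, axis=-1)
    nearest = surface_points[np.argmin(dist, axis=1)]
    disp = nearest - queries
    udf = np.linalg.norm(disp, axis=-1)
    direction = disp / np.maximum(udf[:, None], 1e-9)
    return disp, udf, direction
```

In NVF, a network predicts `disp` directly; here the point cloud lookup merely demonstrates the differentiation-free distance/direction encoding.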

Overcoming the Trade-Off Between Accuracy and Plausibility in 3D Hand Shape Reconstruction
Yu, Ziwei and Li, Chen and Yang, Linlin and Zheng, Xiaoxu and Mi, Michael Bi and Lee, Gim Hee and Yao, Angela



Research question: How to reconstruct 3D hand shape accurately while keeping the result plausible.
Motivation: Direct mesh fitting reconstructs 3D hand shape with high accuracy but is prone to artifacts and implausible results, whereas parametric models such as MANO guarantee plausible hand shapes but are less accurate than non-parametric methods.
Method: A novel weakly-supervised hand shape estimation framework that integrates non-parametric mesh fitting with the MANO model in an end-to-end fashion.
Results: The joint model overcomes the accuracy-plausibility trade-off and produces well-aligned, high-quality 3D meshes, especially in challenging two-hand and hand-object interaction scenarios.

Direct mesh fitting for 3D hand shape reconstruction estimates highly accurate meshes. However, the resulting meshes are prone to artifacts and do not appear as plausible hand shapes. Conversely, parametric models like MANO ensure plausible hand shapes but are not as accurate as the non-parametric methods. In this work, we introduce a novel weakly-supervised hand shape estimation framework that integrates non-parametric mesh fitting with MANO models in an end-to-end fashion. Our joint model overcomes the tradeoff in accuracy and plausibility to yield well-aligned and high-quality 3D meshes, especially in challenging two-hand and hand-object interaction scenarios.

EditableNeRF: Editing Topologically Varying Neural Radiance Fields by Key Points
Zheng, Chengwei and Lin, Wenbin and Xu, Feng



Research question: How to edit dynamic scenes modeled by NeRF-based methods.
Motivation: NeRF models achieve highly photo-realistic novel view synthesis, but editing the scenes they model, especially dynamic scenes, remains challenging.
Method: Editable neural radiance fields trained fully automatically from a single-camera image sequence, modeling topologically varying dynamics with picked-out surface key points; end users edit the scene simply by dragging the key points to desired new positions.
Results: Experiments show the method achieves high-quality editing on various dynamic scenes and outperforms the state of the art.

Neural radiance fields (NeRF) achieve highly photo-realistic novel-view synthesis, but it's a challenging problem to edit the scenes modeled by NeRF-based methods, especially for dynamic scenes. We propose editable neural radiance fields that enable end-users to easily edit dynamic scenes and even support topological changes. Input with an image sequence from a single camera, our network is trained fully automatically and models topologically varying dynamics using our picked-out surface key points. Then end-users can edit the scene by easily dragging the key points to desired new positions. To achieve this, we propose a scene analysis method to detect and initialize key points by considering the dynamics in the scene, and a weighted key points strategy to model topologically varying dynamics by joint key points and weights optimization. Our method supports intuitive multi-dimensional (up to 3D) editing and can generate novel scenes that are unseen in the input sequence. Experiments demonstrate that our method achieves high-quality editing on various dynamic scenes and outperforms the state-of-the-art. Our code and captured data are available at https://chengwei-zheng.github.io/EditableNeRF/.

NeuralEditor: Editing Neural Radiance Fields via Manipulating Point Clouds
Chen, Jun-Kun and Lyu, Jipeng and Wang, Yu-Xiong



Research question: How to perform shape editing on neural radiance fields (NeRFs).
Motivation: Despite impressive results on novel view synthesis, editing the shape of a NeRF-modeled scene remains a fundamental challenge.
Method: NeuralEditor exploits an explicit point cloud representation as the underlying structure for constructing NeRFs, and introduces a novel rendering scheme based on density-adaptive voxels guided by a K-D tree, producing both high-quality renderings and precise point clouds through optimization. Shape editing is then performed by mapping associated points between point clouds.
Results: Extensive evaluation shows state-of-the-art performance on both shape deformation and scene morphing tasks. Notably, NeuralEditor supports zero-shot inference as well as further fine-tuning on the edited scene.

This paper proposes NeuralEditor that enables neural radiance fields (NeRFs) natively editable for general shape editing tasks. Despite their impressive results on novel-view synthesis, it remains a fundamental challenge for NeRFs to edit the shape of the scene. Our key insight is to exploit the explicit point cloud representation as the underlying structure to construct NeRFs, inspired by the intuitive interpretation of NeRF rendering as a process that projects or "plots" the associated 3D point cloud to a 2D image plane. To this end, NeuralEditor introduces a novel rendering scheme based on deterministic integration within K-D tree-guided density-adaptive voxels, which produces both high-quality rendering results and precise point clouds through optimization. NeuralEditor then performs shape editing via mapping associated points between point clouds. Extensive evaluation shows that NeuralEditor achieves state-of-the-art performance in both shape deformation and scene morphing tasks. Notably, NeuralEditor supports both zero-shot inference and further fine-tuning over the edited scene. Our code, benchmark, and demo video are available at https://immortalco.github.io/NeuralEditor.

NIKI: Neural Inverse Kinematics With Invertible Neural Networks for 3D Human Pose and Shape Estimation
Li, Jiefeng and Bian, Siyuan and Liu, Qi and Tang, Jiasheng and Wang, Fan and Lu, Cewu



Research question: State-of-the-art 3D human pose and shape estimation methods are either robust to occlusion or pixel-aligned in non-occluded cases, but cannot achieve both at once.
Motivation: Develop a method that is simultaneously robust to occlusion and pixel-aligned.
Method: NIKI (Neural Inverse Kinematics with Invertible Neural Networks) models bi-directional errors to improve occlusion robustness and obtain pixel-aligned accuracy. NIKI learns from both the forward and inverse processes with invertible networks.
Results: Experiments on standard and occlusion-specific benchmarks demonstrate NIKI's effectiveness, exhibiting robust and well-aligned results simultaneously.

With the progress of 3D human pose and shape estimation, state-of-the-art methods can either be robust to occlusions or obtain pixel-aligned accuracy in non-occlusion cases. However, they cannot obtain robustness and mesh-image alignment at the same time. In this work, we present NIKI (Neural Inverse Kinematics with Invertible Neural Network), which models bi-directional errors to improve the robustness to occlusions and obtain pixel-aligned accuracy. NIKI can learn from both the forward and inverse processes with invertible networks. In the inverse process, the model separates the error from the plausible 3D pose manifold for a robust 3D human pose estimation. In the forward process, we enforce the zero-error boundary conditions to improve the sensitivity to reliable joint positions for better mesh-image alignment. Furthermore, NIKI emulates the analytical inverse kinematics algorithms with the twist-and-swing decomposition for better interpretability. Experiments on standard and occlusion-specific benchmarks demonstrate the effectiveness of NIKI, where we exhibit robust and well-aligned results simultaneously. Code is available at https://github.com/Jeff-sjtu/NIKI

Transfer4D: A Framework for Frugal Motion Capture and Deformation Transfer
Maheshwari, Shubh and Narain, Rahul and Hebbalaguppe, Ramya



Research question: How to turn a real actor's performance into virtual character animation using low-cost depth sensors and an automated pipeline.
Motivation: Existing motion capture requires expensive setups and expert operation, limiting its accessibility. The goal is a frugal alternative, "Transfer4D", that lowers production cost and automates the workflow.
Method: Skeletons extracted from a monocular depth sequence serve as an intermediate representation between motion capture and transfer, incorporating additional geometric information for motion reconstruction and transfer. Motion is tracked from the depth sequence via non-rigid reconstruction, the source object is rigged via skinning decomposition, and finally the rig is embedded into the target object for motion retargeting.
Results: Experiments show Transfer4D outperforms contemporary methods in motion reconstruction and transfer, enabling virtual character animation built on commodity depth sensors.

Animating a virtual character based on a real performance of an actor is a challenging task that currently requires expensive motion capture setups and additional effort by expert animators, rendering it accessible only to large production houses. The goal of our work is to democratize this task by developing a frugal alternative termed "Transfer4D" that uses only commodity depth sensors and further reduces animators' effort by automating the rigging and animation transfer process. To handle sparse, incomplete videos from depth video inputs and large variations between source and target objects, we propose to use skeletons as an intermediary representation between motion capture and transfer. We propose a novel skeleton extraction pipeline from single-view depth sequence that incorporates additional geometric information, resulting in superior performance in motion reconstruction and transfer in comparison to the contemporary methods. We use non-rigid reconstruction to track motion from the depth sequence, and then we rig the source object using skinning decomposition. Finally, the rig is embedded into the target object for motion retargeting.
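The rigging step relies on skinning: once skinning decomposition has produced per-vertex bone weights, deforming the mesh is a weighted blend of per-bone rigid transforms. A minimal linear blend skinning (LBS) sketch, shown here as a generic illustration rather than Transfer4D's exact formulation:

```python
import numpy as np

def linear_blend_skinning(vertices, weights, transforms):
    """Deform vertices (V, 3) by blending per-bone 4x4 rigid transforms
    (B, 4, 4) with per-vertex weights (V, B); weight rows sum to 1."""
    homo = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)
    # Blend the bone transforms per vertex, then apply to each vertex.
    blended = np.einsum("vb,bij->vij", weights, transforms)
    out = np.einsum("vij,vj->vi", blended, homo)
    return out[:, :3]
```

Skinning decomposition runs this model in reverse: given tracked mesh motion, it solves for the weights and bone transforms that best reproduce it.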

Handy: Towards a High Fidelity 3D Hand Shape and Appearance Model
Potamias, Rolandos Alexandros and Ploumpis, Stylianos and Moschoglou, Stylianos and Triantafyllou, Vasileios and Zafeiriou, Stefanos



Research question: How to reconstruct and estimate human hand shape, pose, and appearance effectively.
Motivation: Current state-of-the-art hand reconstruction and pose estimation methods rely on the low-polygon MANO model, which suffers from limited expressive power, too few training subjects, and neglect of hand appearance.
Method: "Handy", a large-scale human hand model built from more than 1,200 subjects spanning diverse ages, genders, and ethnicities. A powerful generative adversarial network is also trained to generate high-resolution hand textures.
Results: Experiments show the model surpasses the state of the art in shape, pose, and texture reconstruction, accurately reconstructing out-of-distribution samples even in adverse "in-the-wild" conditions.

Over the last few years, with the advent of virtual and augmented reality, an enormous amount of research has been focused on modeling, tracking and reconstructing human hands. Given their power to express human behavior, hands have been a very important, but challenging component of the human body. Currently, most of the state-of-the-art reconstruction and pose estimation methods rely on the low polygon MANO model. Apart from its low polygon count, MANO model was trained with only 31 adult subjects, which not only limits its expressive power but also imposes unnecessary shape reconstruction constraints on pose estimation methods. Moreover, hand appearance remains almost unexplored and neglected from the majority of hand reconstruction methods. In this work, we propose "Handy", a large-scale model of the human hand, modeling both shape and appearance composed of over 1200 subjects which we make publicly available for the benefit of the research community. In contrast to current models, our proposed hand model was trained on a dataset with large diversity in age, gender, and ethnicity, which tackles the limitations of MANO and accurately reconstructs out-of-distribution samples. In order to create a high quality texture model, we trained a powerful GAN, which preserves high frequency details and is able to generate high resolution hand textures. To showcase the capabilities of the proposed model, we built a synthetic dataset of textured hands and trained a hand pose estimation network to reconstruct both the shape and appearance from single images. As it is demonstrated in an extensive series of quantitative as well as qualitative experiments, our model proves to be robust against the state-of-the-art and realistically captures the 3D hand shape and pose along with a high frequency detailed texture even in adverse "in-the-wild" conditions.

Semi-Supervised Hand Appearance Recovery via Structure Disentanglement and Dual Adversarial Discrimination
Zhao, Zimeng and Zuo, Binghui and Long, Zhiyu and Wang, Yangang



Research question: How to recover reliable appearance from the vast collection of hand images captured by marker-based motion capture (MoCap).
Motivation: Degradations caused by the markers limit these images' use for hand appearance reconstruction.
Method: First, the bare hand structure is disentangled from the degraded images under a semi-supervised learning paradigm; then appearance is wrapped onto that structure with a dual adversarial discrimination (DAD) scheme.
Results: Experiments show the framework robustly recovers photo-realistic hand appearance from diverse marker-contained and even object-occluded datasets, offering a new avenue for acquiring bare-hand appearance data for downstream learning problems.

Enormous hand images with reliable annotations are collected through marker-based MoCap. Unfortunately, degradations caused by markers limit their application in hand appearance reconstruction. A clear appearance recovery insight is an image-to-image translation trained with unpaired data. However, most frameworks fail because there exists structure inconsistency from a degraded hand to a bare one. The core of our approach is to first disentangle the bare hand structure from those degraded images and then wrap the appearance to this structure with a dual adversarial discrimination (DAD) scheme. Both modules take full advantage of the semi-supervised learning paradigm: The structure disentanglement benefits from the modeling ability of ViT, and the translator is enhanced by the dual discrimination on both translation processes and translation results. Comprehensive evaluations have been conducted to prove that our framework can robustly recover photo-realistic hand appearance from diverse marker-contained and even object-occluded datasets. It provides a novel avenue to acquire bare hand appearance data for other downstream learning problems.

Markerless Camera-to-Robot Pose Estimation via Self-Supervised Sim-to-Real Transfer
Lu, Jingpei and Richter, Florian and Yip, Michael C.



Research question: Solving the camera-to-robot pose for vision-based robot control, a process that demands considerable accuracy and care.
Motivation: Traditional approaches require modifying the robot with markers, while deep learning enables markerless feature extraction. Mainstream deep methods train only on synthetic data and rely on domain randomization to bridge the sim-to-real gap, because acquiring 3D annotations is labor-intensive.
Method: An end-to-end pose estimation framework capable of online camera-to-robot calibration, plus a self-supervised training method that scales to unlabeled real-world data. The framework combines deep learning and geometric vision to solve the robot pose, and the whole pipeline is fully differentiable. To train the Camera-to-Robot Pose Estimation Network (CtRNet), foreground segmentation and differentiable rendering provide image-level self-supervision: the pose prediction is visualized through a renderer, and the image loss against the input image is back-propagated to train the network.
Results: Experiments on two public real-world datasets confirm the approach outperforms existing work. The framework is also integrated into a visual servoing system, demonstrating the promise of real-time, precise robot pose estimation for automation tasks.

Solving the camera-to-robot pose is a fundamental requirement for vision-based robot control, and is a process that takes considerable effort and cares to make accurate. Traditional approaches require modification of the robot via markers, and subsequent deep learning approaches enabled markerless feature extraction. Mainstream deep learning methods only use synthetic data and rely on Domain Randomization to fill the sim-to-real gap, because acquiring the 3D annotation is labor-intensive. In this work, we go beyond the limitation of 3D annotations for real-world data. We propose an end-to-end pose estimation framework that is capable of online camera-to-robot calibration and a self-supervised training method to scale the training to unlabeled real-world data. Our framework combines deep learning and geometric vision for solving the robot pose, and the pipeline is fully differentiable. To train the Camera-to-Robot Pose Estimation Network (CtRNet), we leverage foreground segmentation and differentiable rendering for image-level self-supervision. The pose prediction is visualized through a renderer and the image loss with the input image is back-propagated to train the neural network. Our experimental results on two public real datasets confirm the effectiveness of our approach over existing works. We also integrate our framework into a visual servoing system to demonstrate the promise of real-time precise robot pose estimation for automation tasks.

CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects
Heppert, Nick and Irshad, Muhammad Zubair and Zakharov, Sergey and Liu, Katherine and Ambrus, Rares Andrei and Bohg, Jeannette and Valada, Abhinav and Kollar, Thomas



Research question: How to reconstruct multiple articulated objects from a single stereo RGB observation.
Motivation: Current reconstruction methods train a separate decoder for each category, which is inefficient.
Method: CARTO uses implicit object-centric representations and learns a single geometry and articulation decoder shared across multiple object categories.
Results: CARTO achieves reconstruction accuracy comparable to bespoke per-category decoders. Combined with a stereo image encoder, it infers the 3D shape, 6D pose, size, joint type, and joint state of multiple unknown objects in a single forward pass. It achieves a 20.4% absolute improvement in mAP 3D IOU50 on novel instances, and inference is fast, running at 1 Hz on an NVIDIA TITAN XP GPU for eight or fewer objects. Although trained only on simulated data, CARTO transfers to real-world object instances.

We present CARTO, a novel approach for reconstructing multiple articulated objects from a single stereo RGB observation. We use implicit object-centric representations and learn a single geometry and articulation decoder for multiple object categories. Despite training on multiple categories, our decoder achieves a comparable reconstruction accuracy to methods that train bespoke decoders separately for each category. Combined with our stereo image encoder we infer the 3D shape, 6D pose, size, joint type, and the joint state of multiple unknown objects in a single forward pass. Our method achieves a 20.4% absolute improvement in mAP 3D IOU50 for novel instances when compared to a two-stage pipeline. Inference time is fast and can run on a NVIDIA TITAN XP GPU at 1 HZ for eight or less objects present. While only trained on simulated data, CARTO transfers to real-world object instances. Code and evaluation data is available at: http://carto.cs.uni-freiburg.de

RGBD2: Generative Scene Synthesis via Incremental View Inpainting Using RGBD Diffusion Models
Lei, Jiabao and Tang, Jiapeng and Jia, Kui



Research question: How to recover the underlying scene geometry and colors from a sparse set of RGBD views.
Motivation: Existing methods struggle with multi-view inconsistency.
Method: A new approach, RGBD2, sequentially generates novel RGBD views and fuses them, so the scene geometry is simply the fusion result. An intermediate surface mesh and camera projection are used to resolve the multi-view inconsistency problem.
Results: Experiments on the ScanNet dataset show the method outperforms existing approaches on 3D scene synthesis.

We address the challenge of recovering an underlying scene geometry and colors from a sparse set of RGBD view observations. In this work, we present a new solution termed RGBD2 that sequentially generates novel RGBD views along a camera trajectory, and the scene geometry is simply the fusion result of these views. More specifically, we maintain an intermediate surface mesh used for rendering new RGBD views, which subsequently becomes complete by an inpainting network; each rendered RGBD view is later back-projected as a partial surface and is supplemented into the intermediate mesh. The use of intermediate mesh and camera projection helps solve the tough problem of multi-view inconsistency. We practically implement the RGBD inpainting network as a versatile RGBD diffusion model, which is previously used for 2D generative modeling; we make a modification to its reverse diffusion process to enable our use. We evaluate our approach on the task of 3D scene synthesis from sparse RGBD inputs; extensive experiments on the ScanNet dataset demonstrate the superiority of our approach over existing ones. Project page: https://jblei.site/proj/rgbd-diffusion.
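The fusion step hinges on back-projecting each rendered RGBD view into 3D via the camera model. A minimal pinhole back-projection/projection pair (intrinsics-only, camera frame; the paper's pipeline additionally handles poses and mesh fusion):

```python
import numpy as np

def backproject_depth(depth, K):
    """Lift a depth map (H, W) to camera-space 3D points (H, W, 3)
    using the pinhole intrinsics K (3, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)

def project_points(points, K):
    """Pinhole projection of camera-space points back to pixel coordinates."""
    x, y, z = points[..., 0], points[..., 1], points[..., 2]
    return np.stack([K[0, 0] * x / z + K[0, 2],
                     K[1, 1] * y / z + K[1, 2]], axis=-1)
```

The round trip (back-project, then project) returns the original pixel grid, which is the consistency property the intermediate mesh exploits.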

Deep Polarization Reconstruction With PDAVIS Events
Mei, Haiyang and Wang, Zuowen and Yang, Xin and Wei, Xiaopeng and Delbruck, Tobi



Research question: How to output polarization directly from raw input polarization events, improving polarization reconstruction performance.
Motivation: Current polarization reconstruction ignores the correlations between channels, introducing content inconsistency among the four reconstructed frames and degrading performance.
Method: The first large-scale event-to-polarization dataset is constructed and used to train the events-to-polarization network E2P. E2P extracts rich polarization patterns from the input polarization events and enhances features through cross-modality context integration.
Results: Experiments show E2P outperforms Polarization FireNet by a significant margin with no additional computing cost, and produces more accurate polarization measurements than the PDAVIS frames in challenging fast and high-dynamic-range scenes.

The polarization event camera PDAVIS is a novel bio-inspired neuromorphic vision sensor that reports both conventional polarization frames and asynchronous, continuously per-pixel polarization brightness changes (polarization events) with fast temporal resolution and large dynamic range. A deep neural network method (Polarization FireNet) was previously developed to reconstruct the polarization angle and degree from polarization events for bridging the gap between the polarization event camera and mainstream computer vision. However, Polarization FireNet applies a network pre-trained for normal event-based frame reconstruction independently on each of four channels of polarization events from four linear polarization angles, which ignores the correlations between channels and inevitably introduces content inconsistency between the four reconstructed frames, resulting in unsatisfactory polarization reconstruction performance. In this work, we strive to train an effective, yet efficient, DNN model that directly outputs polarization from the input raw polarization events. To this end, we constructed the first large-scale event-to-polarization dataset, which we subsequently employed to train our events-to-polarization network E2P. E2P extracts rich polarization patterns from input polarization events and enhances features through cross-modality context integration. We demonstrate that E2P outperforms Polarization FireNet by a significant margin with no additional computing cost. Experimental results also show that E2P produces more accurate measurement of polarization than the PDAVIS frames in challenging fast and high dynamic range scenes.
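The quantities being reconstructed, angle and degree of linear polarization, follow from the standard linear Stokes parameters measured behind polarizers at the four angles mentioned above. A reference computation (this is the textbook formula, not the E2P network):

```python
import numpy as np

def polarization_from_four_angles(i0, i45, i90, i135):
    """Linear Stokes parameters from intensities behind linear polarizers
    at 0/45/90/135 degrees, then the angle (AoLP, radians) and degree
    (DoLP, in [0, 1]) of linear polarization."""
    s0 = i0 + i90           # total intensity
    s1 = i0 - i90           # 0/90 contrast
    s2 = i45 - i135         # 45/135 contrast
    aolp = 0.5 * np.arctan2(s2, s1)
    dolp = np.sqrt(s1**2 + s2**2) / np.maximum(s0, 1e-9)
    return aolp, dolp
```

The four-channel coupling is visible here: AoLP and DoLP mix all four intensities, which is why reconstructing the channels independently (as Polarization FireNet does) can break their consistency.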

Semidefinite Relaxations for Robust Multiview Triangulation
Härenstam-Nielsen, Linus and Zeller, Niclas and Cremers, Daniel



Research question: A convex-relaxation approach for certifiably optimal robust multiview triangulation.
Motivation: Existing relaxation approaches handle only non-robust multiview triangulation; they are extended here by incorporating a least-squares cost function.
Method: Two formulations are proposed, one based on epipolar constraints and one based on fractional reprojection constraints. The first is lower-dimensional and remains tight under moderate noise and outlier levels; the second is higher-dimensional and therefore slower, but remains tight even under extreme noise and outlier levels.
Results: Extensive experiments show the proposed approaches compute provably optimal reconstructions even under significant noise and a large percentage of outliers.

We propose an approach based on convex relaxations for certifiably optimal robust multiview triangulation. To this end, we extend existing relaxation approaches to non-robust multiview triangulation by incorporating a least squares cost function. We propose two formulations, one based on epipolar constraints and one based on fractional reprojection constraints. The first is lower dimensional and remains tight under moderate noise and outlier levels, while the second is higher dimensional and therefore slower but remains tight even under extreme noise and outlier levels. We demonstrate through extensive experiments that the proposed approaches allow us to compute provably optimal reconstructions even under significant noise and a large percentage of outliers.
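For orientation, the plain (non-robust, non-certified) linear baseline that such relaxations improve upon is direct linear transform (DLT) triangulation: each observation x ~ P X contributes two linear equations, and the point is the null vector of the stacked system. A minimal sketch, not the paper's SDP formulation:

```python
import numpy as np

def triangulate_dlt(projections, points_2d):
    """Multiview DLT triangulation: for each view with 3x4 projection P and
    pixel (u, v), stack the rows u*P[2]-P[0] and v*P[2]-P[1], then take the
    null vector of the system via SVD and dehomogenize."""
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]
```

DLT minimizes an algebraic (not reprojection) error and has no outlier handling, which is exactly the gap the robust, certifiably optimal formulations address.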

ContraNeRF: Generalizable Neural Radiance Fields for Synthetic-to-Real Novel View Synthesis via Contrastive Learning
Yang, Hao and Hong, Lanqing and Li, Aoxue and Hu, Tianyang and Li, Zhenguo and Lee, Gim Hee and Wang, Liwei



Research question: Although many recent works investigate generalizable NeRF-based novel view synthesis for unseen scenes, they seldom consider synthetic-to-real generalization, which many practical applications require.
Motivation: This work first investigates the effects of synthetic data on synthetic-to-real novel view synthesis and surprisingly observes that models trained on synthetic data tend to produce sharper but less accurate volume densities.
Method: To keep the advantages of synthetic data while avoiding its negative effects, geometry-aware contrastive learning is introduced to learn multi-view consistent features under geometric constraints. Cross-view attention further enhances the features' geometry awareness by querying features across input views.
Results: Under the synthetic-to-real setting, the method renders higher-quality images with better fine-grained details, outperforming existing generalizable novel view synthesis methods on PSNR, SSIM, and LPIPS. When trained on real data, the method also achieves state-of-the-art results.

Although many recent works have investigated generalizable NeRF-based novel view synthesis for unseen scenes, they seldom consider the synthetic-to-real generalization, which is desired in many practical applications. In this work, we first investigate the effects of synthetic data in synthetic-to-real novel view synthesis and surprisingly observe that models trained with synthetic data tend to produce sharper but less accurate volume densities. For pixels where the volume densities are correct, fine-grained details will be obtained. Otherwise, severe artifacts will be produced. To maintain the advantages of using synthetic data while avoiding its negative effects, we propose to introduce geometry-aware contrastive learning to learn multi-view consistent features with geometric constraints. Meanwhile, we adopt cross-view attention to further enhance the geometry perception of features by querying features across input views. Experiments demonstrate that under the synthetic-to-real setting, our method can render images with higher quality and better fine-grained details, outperforming existing generalizable novel view synthesis methods in terms of PSNR, SSIM, and LPIPS. When trained on real data, our method also achieves state-of-the-art results. https://haoy945.github.io/contranerf/
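The contrastive objective underlying such geometry-aware feature learning is typically an InfoNCE-style loss: features of corresponding points across views should agree, against all non-corresponding pairs in the batch. A generic numpy sketch (ContraNeRF's actual loss adds the geometric constraints described above):

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: each anchor row should match its own positive row
    against all other positives in the batch (cosine-similarity logits)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

The loss is small when anchor i is most similar to positive i, and grows when correspondences are scrambled, which is what drives multi-view consistency.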

PaletteNeRF: Palette-Based Appearance Editing of Neural Radiance Fields
Kuang, Zhengfei and Luan, Fujun and Bi, Sai and Shu, Zhixin and Wetzstein, Gordon and Sunkavalli, Kalyan



Research question: How to edit the appearance of neural radiance fields efficiently while maintaining photorealism.
Motivation: Although NeRFs enable high-fidelity 3D reconstruction and novel view synthesis of complex scenes, efficient, photorealistic editing of their appearance remains underexplored.
Method: PaletteNeRF, a novel method for photorealistic appearance editing based on 3D color decomposition: the appearance of each 3D point is decomposed into a linear combination of palette-based bases (i.e., 3D segmentations defined by a group of NeRF-type functions) shared across the scene.
Results: By modifying the color palettes, users can efficiently edit the appearance of the 3D scene. The method outperforms baselines both quantitatively and qualitatively, especially for editing complex real-world scenes.

Recent advances in neural radiance fields have enabled the high-fidelity 3D reconstruction of complex scenes for novel view synthesis. However, it remains underexplored how the appearance of such representations can be efficiently edited while maintaining photorealism. In this work, we present PaletteNeRF, a novel method for photorealistic appearance editing of neural radiance fields (NeRF) based on 3D color decomposition. Our method decomposes the appearance of each 3D point into a linear combination of palette-based bases (i.e., 3D segmentations defined by a group of NeRF-type functions) that are shared across the scene. While our palette-based bases are view-independent, we also predict a view-dependent function to capture the color residual (e.g., specular shading). During training, we jointly optimize the basis functions and the color palettes, and we also introduce novel regularizers to encourage the spatial coherence of the decomposition. Our method allows users to efficiently edit the appearance of the 3D scene by modifying the color palettes. We also extend our framework with compressed semantic features for semantic-aware appearance editing. We demonstrate that our technique is superior to baseline methods both quantitatively and qualitatively for appearance editing of complex real-world scenes.
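The palette principle can be shown in miniature: express each color as weights over a small palette, then recolor by editing the palette while keeping the weights fixed. This toy least-squares version stands in for PaletteNeRF's jointly optimized, view-dependent decomposition:

```python
import numpy as np

def palette_weights(colors, palette):
    """Least-squares weights (N, P) expressing each RGB color (N, 3) as a
    linear combination of palette entries (P, 3), clamped nonnegative for
    an edit-friendly decomposition."""
    w, *_ = np.linalg.lstsq(palette.T, colors.T, rcond=None)
    return np.clip(w.T, 0.0, None)

def recolor(weights, edited_palette):
    # Editing a palette entry recolors every point weighted toward it.
    return weights @ edited_palette
```

Because the weights are tied to scene regions rather than colors, one palette edit propagates consistently across all views, which is the appeal of palette-based editing.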

NeRF-DS: Neural Radiance Fields for Dynamic Specular Objects
Yan, Zhiwen and Li, Chen and Lee, Gim Hee



Research question: Current dynamic NeRF methods fail to capture the changing reflected color of specular objects when rendering monocular RGB videos of dynamic scenes, degrading rendering quality.
Motivation: To address this, the neural radiance field function is reformulated to be conditioned on surface position and orientation in the observation space, so that a specular surface under different poses keeps its different reflected colors when mapped to the common canonical space.
Method: A mask of moving objects is additionally used to guide the deformation field. Since specular objects change color during motion, the mask mitigates the failure to find temporal correspondences under RGB-only supervision.
Results: Evaluated on a self-collected dataset of moving specular objects in realistic environments, the method significantly improves the reconstruction quality of moving specular objects from monocular RGB videos compared with existing NeRF models.

Dynamic Neural Radiance Field (NeRF) is a powerful algorithm capable of rendering photo-realistic novel view images from a monocular RGB video of a dynamic scene. Although it warps moving points across frames from the observation spaces to a common canonical space for rendering, dynamic NeRF does not model the change of the reflected color during the warping. As a result, this approach often fails drastically on challenging specular objects in motion. We address this limitation by reformulating the neural radiance field function to be conditioned on surface position and orientation in the observation space. This allows the specular surface at different poses to keep the different reflected colors when mapped to the common canonical space. Additionally, we add the mask of moving objects to guide the deformation field. As the specular surface changes color during motion, the mask mitigates the problem of failure to find temporal correspondences with only RGB supervision. We evaluate our model based on the novel view synthesis quality with a self-collected dataset of different moving specular objects in realistic environments. The experimental results demonstrate that our method significantly improves the reconstruction quality of moving specular objects from monocular RGB videos compared to the existing NeRF models. Our code and data are available at the project website https://github.com/JokerYan/NeRF-DS.

RealFusion: 360° Reconstruction of Any Object From a Single Image
Melas-Kyriazi, Luke and Laina, Iro and Rupprecht, Christian and Vedaldi, Andrea



Research question: How to reconstruct a full 360° photographic model of an object from a single image.
Motivation: Reconstructing a 3D model of an object from a single image is severely ill-posed.
Method: We take a diffusion-based conditional image generator and engineer a prompt that encourages it to "dream up" novel views of the object. Using the recent DreamFusion method, we fuse the given input view, the conditional prior, and other regularizers into a final, consistent reconstruction.
Results: We demonstrate state-of-the-art reconstruction results on benchmark images compared with prior methods. Qualitatively, our reconstructions faithfully match the input view and plausibly extrapolate its appearance and 3D shape, including the side of the object that is not visible.

We consider the problem of reconstructing a full 360° photographic model of an object from a single image of it. We do so by fitting a neural radiance field to the image, but find this problem to be severely ill-posed. We thus take an off-the-shelf conditional image generator based on diffusion and engineer a prompt that encourages it to "dream up" novel views of the object. Using the recent DreamFusion method, we fuse the given input view, the conditional prior, and other regularizers in a final, consistent reconstruction. We demonstrate state-of-the-art reconstruction results on benchmark images when compared to prior methods for monocular 3D reconstruction of objects. Qualitatively, our reconstructions provide a faithful match of the input view and a plausible extrapolation of its appearance and 3D shape, including to the side of the object not visible in the image.

TensoIR: Tensorial Inverse Rendering
Jin, Haian and Liu, Isabella and Xu, Peijia and Zhang, Xiaoshuai and Han, Songfang and Bi, Sai and Zhou, Xiaowei and Xu, Zexiang and Su, Hao



Research question: Propose a new inverse rendering approach based on tensor factorization and neural fields, addressing the low capacity and high computation cost of previous purely MLP-based neural fields.
Motivation: Extend TensoRF, a state-of-the-art radiance field technique, to estimate scene geometry, surface reflectance, and environment illumination from multi-view images captured under unknown lighting conditions, jointly achieving radiance field reconstruction and physically-based model estimation for photo-realistic novel view synthesis and relighting.
Method: Extend TensoRF to jointly perform radiance field reconstruction and physically-based model estimation, exploiting the efficiency and extensibility of its representation to accurately model secondary shading effects (such as shadows and indirect lighting) and to support input images captured under a single or multiple unknown lighting conditions.
Results: Qualitative and quantitative comparisons on various challenging synthetic and real-world scenes show that the method outperforms baseline approaches.

We propose TensoIR, a novel inverse rendering approach based on tensor factorization and neural fields. Unlike previous works that use purely MLP-based neural fields, thus suffering from low capacity and high computation costs, we extend TensoRF, a state-of-the-art approach for radiance field modeling, to estimate scene geometry, surface reflectance, and environment illumination from multi-view images captured under unknown lighting conditions. Our approach jointly achieves radiance field reconstruction and physically-based model estimation, leading to photo-realistic novel view synthesis and relighting. Benefiting from the efficiency and extensibility of the TensoRF-based representation, our method can accurately model secondary shading effects (like shadows and indirect lighting) and generally support input images captured under a single or multiple unknown lighting conditions. The low-rank tensor representation allows us to not only achieve fast and compact reconstruction but also better exploit shared information under an arbitrary number of capturing lighting conditions. We demonstrate the superiority of our method to baseline methods qualitatively and quantitatively on various challenging synthetic and real-world scenes.

DSFNet: Dual Space Fusion Network for Occlusion-Robust 3D Dense Face Alignment
Li, Heyuan and Wang, Bo and Cheng, Yu and Kankanhalli, Mohan and Tan, Robby T.



Research question: Existing monocular 3D dense face alignment methods are sensitive to severe occlusion and large view angles, limiting their usage scenarios.
Motivation: Existing 3DMM-based methods directly regress model coefficients, underutilizing the low-level 2D spatial and semantic information that can actually offer cues for face shape and orientation.
Method: We show how jointly modeling 3D facial geometry in image space and model space solves the occlusion and view-angle problems. Instead of predicting the whole face directly, we first regress image-space features in the visible facial region by dense prediction. We then predict the model's coefficients from the regressed features of the visible regions, leveraging the prior knowledge of whole-face geometry from the morphable model to complete the invisible regions. We further propose a fusion network that combines the advantages of the image-space and model-space predictions, achieving high robustness and accuracy in unconstrained scenarios.
Results: Thanks to the proposed fusion module, our method is robust not only to occlusion and large pitch and roll angles (the strength of our image-space approach) but also to noise and large yaw angles (the strength of our model-space approach). Comprehensive evaluations show that our method outperforms state-of-the-art approaches. On the 3D dense face alignment task, we achieve 3.80% NME on the AFLW2000-3D dataset, outperforming the state of the art by 5.5%. Code is available at https://github.com/lhyfst/DSFNet.

Sensitivity to severe occlusion and large view angles limits the usage scenarios of the existing monocular 3D dense face alignment methods. The state-of-the-art 3DMM-based method, directly regresses the model's coefficients, underutilizing the low-level 2D spatial and semantic information, which can actually offer cues for face shape and orientation. In this work, we demonstrate how modeling 3D facial geometry in image and model space jointly can solve the occlusion and view angle problems. Instead of predicting the whole face directly, we regress image space features in the visible facial region by dense prediction first. Subsequently, we predict our model's coefficients based on the regressed feature of the visible regions, leveraging the prior knowledge of whole face geometry from the morphable models to complete the invisible regions. We further propose a fusion network that combines the advantages of both the image and model space predictions to achieve high robustness and accuracy in unconstrained scenarios. Thanks to the proposed fusion module, our method is robust not only to occlusion and large pitch and roll view angles, which is the benefit of our image space approach, but also to noise and large yaw angles, which is the benefit of our model space method. Comprehensive evaluations demonstrate the superior performance of our method compared with the state-of-the-art methods. On the 3D dense face alignment task, we achieve 3.80% NME on the AFLW2000-3D dataset, which outperforms the state-of-the-art method by 5.5%. Code is available at https://github.com/lhyfst/DSFNet.

MAIR: Multi-View Attention Inverse Rendering With 3D Spatially-Varying Lighting Estimation
Choi, JunYong and Lee, SeokYeong and Park, Haesol and Jung, Seung-Won and Kim, Ig-Jae and Cho, Junghyun



Research question: This paper proposes a scene-level inverse rendering framework based on multi-view images that decomposes a scene into geometry, an SVBRDF, and 3D spatially-varying lighting.
Motivation: Owing to the absence of a multi-view HDR synthetic dataset, scene-level inverse rendering has mainly been studied with single-view images. Multi-view images, however, provide rich scene information, so we aim to perform scene-level inverse rendering from multi-view images by expanding the OpenRooms dataset, designing efficient pipelines for handling multi-view images, and splitting spatially-varying lighting.
Method: Our method decomposes the scene from multi-view images through efficient multi-view pipelines and split spatially-varying lighting. We also build a sophisticated 3D spatially-varying lighting volume that supports photorealistic object insertion at any 3D location.
Results: Experiments show that our method not only outperforms single-view-based methods but also performs robustly on unseen real-world scenes. Moreover, our sophisticated 3D spatially-varying lighting volume allows photorealistic object insertion at any 3D location.

We propose a scene-level inverse rendering framework that uses multi-view images to decompose the scene into geometry, a SVBRDF, and 3D spatially-varying lighting. Because multi-view images provide a variety of information about the scene, multi-view images in object-level inverse rendering have been taken for granted. However, owing to the absence of a multi-view HDR synthetic dataset, scene-level inverse rendering has mainly been studied using single-view images. We were able to successfully perform scene-level inverse rendering using multi-view images by expanding the OpenRooms dataset, designing efficient pipelines to handle multi-view images, and splitting spatially-varying lighting. Our experiments show that the proposed method not only achieves better performance than single-view-based methods, but also achieves robust performance on unseen real-world scenes. Also, our sophisticated 3D spatially-varying lighting volume allows for photorealistic object insertion in any 3D location.

Human Pose As Compositional Tokens
Geng, Zigang and Wang, Chunyu and Wei, Yixuan and Liu, Ze and Li, Houqiang and Hu, Han



Research question: How to represent and predict human pose more accurately?
Motivation: Existing pose representations (joint coordinate vectors or heatmap embeddings), while easy for data processing, admit unrealistic pose estimates because they do not model the dependencies between body joints.
Method: We propose a structured representation, Pose as Compositional Tokens (PCT), which represents a pose by M discrete tokens, each characterizing a sub-structure of several interdependent joints. Pose estimation is cast as a classification task, and a pre-learned decoder network recovers the pose from the tokens without further post-processing.
Results: The method achieves better or comparable pose estimation results than existing methods in general scenarios, and it continues to work well under the occlusions that are ubiquitous in practice.

Human pose is typically represented by a coordinate vector of body joints or their heatmap embeddings. While easy for data processing, unrealistic pose estimates are admitted due to the lack of dependency modeling between the body joints. In this paper, we present a structured representation, named Pose as Compositional Tokens (PCT), to explore the joint dependency. It represents a pose by M discrete tokens with each characterizing a sub-structure with several interdependent joints. The compositional design enables it to achieve a small reconstruction error at a low cost. Then we cast pose estimation as a classification task. In particular, we learn a classifier to predict the categories of the M tokens from an image. A pre-learned decoder network is used to recover the pose from the tokens without further post-processing. We show that it achieves better or comparable pose estimation results as the existing methods in general scenarios, yet continues to work well when occlusion occurs, which is ubiquitous in practice. The code and models are publicly available at https://github.com/Gengzigang/PCT.
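The token pipeline can be sketched in numpy: estimation predicts M codebook indices (a pure classification problem), and a pre-learned decoder maps the token embeddings back to joint coordinates. The codebook and the linear "decoder" below are random toy stand-ins for the learned components:

```python
import numpy as np

M, V, J = 4, 16, 17           # tokens per pose, codebook size, joint count
rng = np.random.default_rng(0)
codebook = rng.normal(size=(V, 8))          # token embedding table
decoder = rng.normal(size=(M * 8, J * 2))   # toy linear "decoder network"

def decode_pose(token_ids):
    # Look up the M token embeddings and map them to 2D joint coordinates;
    # no heatmap post-processing is needed.
    emb = codebook[token_ids].reshape(-1)    # (M*8,)
    return (emb @ decoder).reshape(J, 2)

# A classifier would predict these M indices from the image.
predicted_tokens = np.array([3, 7, 0, 12])
pose = decode_pose(predicted_tokens)
```

Each token jointly influences several joints through the decoder, which is how the compositional design encodes inter-joint dependency.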

HuManiFlow: Ancestor-Conditioned Normalising Flows on SO(3) Manifolds for Human Pose and Shape Distribution Estimation
Sengupta, Akash and Budvytis, Ignas and Cipolla, Roberto



Research question: Monocular 3D human pose and shape estimation is ill-posed, since multiple 3D solutions can explain one 2D image of a subject.
Motivation: Recent methods predict a probability distribution over plausible 3D pose and shape parameters conditioned on the image, but they trade off three key properties: accuracy, sample-input consistency, and sample diversity.
Method: Our method, HuManiFlow, predicts simultaneously accurate, consistent, and diverse distributions. We use the human kinematic tree to factorize full-body pose into ancestor-conditioned per-body-part pose distributions in an autoregressive manner. The per-body-part distributions are implemented with normalising flows that respect the manifold structure of SO(3), the Lie group of per-body-part poses.
Results: We show that ill-posed but ubiquitous 3D point-estimate losses reduce sample diversity, and therefore employ only probabilistic training losses. HuManiFlow outperforms state-of-the-art probabilistic approaches on the 3DPW and SSP-3D datasets.

Monocular 3D human pose and shape estimation is an ill-posed problem since multiple 3D solutions can explain a 2D image of a subject. Recent approaches predict a probability distribution over plausible 3D pose and shape parameters conditioned on the image. We show that these approaches exhibit a trade-off between three key properties: (i) accuracy - the likelihood of the ground-truth 3D solution under the predicted distribution, (ii) sample-input consistency - the extent to which 3D samples from the predicted distribution match the visible 2D image evidence, and (iii) sample diversity - the range of plausible 3D solutions modelled by the predicted distribution. Our method, HuManiFlow, predicts simultaneously accurate, consistent and diverse distributions. We use the human kinematic tree to factorise full body pose into ancestor-conditioned per-body-part pose distributions in an autoregressive manner. Per-body-part distributions are implemented using normalising flows that respect the manifold structure of SO(3), the Lie group of per-body-part poses. We show that ill-posed, but ubiquitous, 3D point estimate losses reduce sample diversity, and employ only probabilistic training losses. HuManiFlow outperforms state-of-the-art probabilistic approaches on the 3DPW and SSP-3D datasets.
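The ancestor-conditioned autoregressive factorisation can be sketched with a toy kinematic tree. Here each per-part "distribution" is a plain Gaussian in axis-angle coordinates rather than a normalising flow on SO(3); this is an illustrative simplification, not the paper's model:

```python
import numpy as np

# child -> parent map; insertion order is topological (parent before child)
parents = {"pelvis": None, "spine": "pelvis", "head": "spine",
           "l_hip": "pelvis", "l_knee": "l_hip"}

def sample_pose(rng):
    # Sample each body part conditioned on its already-sampled ancestors.
    pose = {}
    for part, anc in parents.items():
        mean = np.zeros(3) if anc is None else 0.5 * pose[anc]
        pose[part] = mean + 0.1 * rng.normal(size=3)  # axis-angle stand-in
    return pose

rng = np.random.default_rng(0)
samples = [sample_pose(rng) for _ in range(3)]
```

Conditioning each part on its ancestors lets samples stay kinematically coherent while remaining diverse, which is the property the full-body factorisation is after.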

Semantic Ray: Learning a Generalizable Semantic Field With Cross-Reprojection Attention
Liu, Fangfu and Zhang, Chubin and Zheng, Yu and Duan, Yueqi



Research question: How to learn an accurate, efficient, and generalizable semantic radiance field from multiple scenes.
Motivation: Most existing NeRFs target neural scene rendering, image synthesis, and multi-view reconstruction, while a few attempts such as Semantic-NeRF explore high-level semantic understanding with the NeRF structure. However, Semantic-NeRF learns color and semantic labels simultaneously from a single ray with multiple heads, and a single ray cannot provide rich semantic information. As a result, Semantic-NeRF relies on positional encoding and must train one specific model per scene.
Method: We propose Semantic Ray (S-Ray) to fully exploit the semantic information along the ray direction from its multi-view reprojections. Since dense attention directly over multi-view reprojected rays would incur heavy computational cost, we design a Cross-Reprojection Attention module with consecutive intra-view radial and cross-view sparse attentions, which decomposes contextual information along reprojected rays and across multiple views, then collects dense connections by stacking the modules.
Results: Experiments show that our S-Ray learns from multiple scenes and generalizes strongly to unseen scenes.

In this paper, we aim to learn a semantic radiance field from multiple scenes that is accurate, efficient and generalizable. While most existing NeRFs target at the tasks of neural scene rendering, image synthesis and multi-view reconstruction, there are a few attempts such as Semantic-NeRF that explore to learn high-level semantic understanding with the NeRF structure. However, Semantic-NeRF simultaneously learns color and semantic label from a single ray with multiple heads, where the single ray fails to provide rich semantic information. As a result, Semantic NeRF relies on positional encoding and needs to train one specific model for each scene. To address this, we propose Semantic Ray (S-Ray) to fully exploit semantic information along the ray direction from its multi-view reprojections. As directly performing dense attention over multi-view reprojected rays would suffer from heavy computational cost, we design a Cross-Reprojection Attention module with consecutive intra-view radial and cross-view sparse attentions, which decomposes contextual information along reprojected rays and cross multiple views and then collects dense connections by stacking the modules. Experiments show that our S-Ray is able to learn from multiple scenes, and it presents strong generalization ability to adapt to unseen scenes.

ORCa: Glossy Objects As Radiance-Field Cameras
Tiwary, Kushagra and Dave, Akshat and Behari, Nikhil and Klinghoffer, Tzofi and Veeraraghavan, Ashok and Raskar, Ramesh



Research question: How to exploit the reflections on glossy objects, converting them into cameras to image beyond the field of view and from seemingly impossible vantage points.
Motivation: Reflections on glossy objects contain valuable and hidden information about the surrounding environment. Converting these objects into cameras unlocks exciting applications, such as imaging beyond the camera's field of view and from seemingly impossible vantage points, e.g. from reflections on the human eye.
Method: We convert glossy objects with unknown geometry into radiance-field cameras that image the world from the object's perspective. The key idea is to convert the object surface into a virtual sensor that captures cast reflections as a 2D projection of the 5D environment radiance field visible to and surrounding the object.
Results: Recovering the environment radiance field enables not only depth and radiance estimation from the object to its surroundings but also beyond-field-of-view novel view synthesis, i.e. rendering views that are directly visible only to the glossy object in the scene, not to the observer. Using the radiance field, we can also image around occluders caused by nearby objects in the scene. The method is trained end to end on multi-view images of the object and jointly estimates object geometry, diffuse radiance, and the 5D environment radiance field.

Reflections on glossy objects contain valuable and hidden information about the surrounding environment. By converting these objects into cameras, we can unlock exciting applications, including imaging beyond the camera's field-of-view and from seemingly impossible vantage points, e.g. from reflections on the human eye. However, this task is challenging because reflections depend jointly on object geometry, material properties, the 3D environment, and the observer's viewing direction. Our approach converts glossy objects with unknown geometry into radiance-field cameras to image the world from the object's perspective. Our key insight is to convert the object surface into a virtual sensor that captures cast reflections as a 2D projection of the 5D environment radiance field visible to and surrounding the object. We show that recovering the environment radiance fields enables depth and radiance estimation from the object to its surroundings in addition to beyond field-of-view novel-view synthesis, i.e. rendering of novel views that are only directly visible to the glossy object present in the scene, but not the observer. Moreover, using the radiance field we can image around occluders caused by close-by objects in the scene. Our method is trained end-to-end on multi-view images of the object and jointly estimates object geometry, diffuse radiance, and the 5D environment radiance field.

SECAD-Net: Self-Supervised CAD Reconstruction by Learning Sketch-Extrude Operations
Li, Pu and Guo, Jianwei and Zhang, Xiaopeng and Yan, Dong-Ming



Research question: How to reverse engineer CAD models from raw geometry, a classic but strenuous research problem.
Motivation: Existing learning-based methods rely heavily on labels, or reconstruct CAD shapes that are not easily editable.
Method: We propose SECAD-Net, an end-to-end neural network that reconstructs compact and easily editable CAD models in a self-supervised manner. Drawing inspiration from the modeling language most commonly used in modern CAD software, we learn 2D sketches and 3D extrusion parameters from which a set of extrusion cylinders can be generated from the raw shape.
Results: Extensive experiments on the ABC and Fusion 360 datasets demonstrate the effectiveness of our method, which outperforms state-of-the-art alternatives, including the closely related supervised CAD reconstruction method. We further apply the approach to CAD editing and single-view CAD reconstruction.

Reverse engineering CAD models from raw geometry is a classic but strenuous research problem. Previous learning-based methods rely heavily on labels due to the supervised design patterns or reconstruct CAD shapes that are not easily editable. In this work, we introduce SECAD-Net, an end-to-end neural network aimed at reconstructing compact and easy-to-edit CAD models in a self-supervised manner. Drawing inspiration from the modeling language that is most commonly used in modern CAD software, we propose to learn 2D sketches and 3D extrusion parameters from raw shapes, from which a set of extrusion cylinders can be generated by extruding each sketch from a 2D plane into a 3D body. By incorporating the Boolean operation (i.e., union), these cylinders can be combined to closely approximate the target geometry. We advocate the use of implicit fields for sketch representation, which allows for creating CAD variations by interpolating latent codes in the sketch latent space. Extensive experiments on both ABC and Fusion 360 datasets demonstrate the effectiveness of our method, and show superiority over state-of-the-art alternatives including the closely related method for supervised CAD reconstruction. We further apply our approach to CAD editing and single-view CAD reconstruction. The code is released at https://github.com/BunnySoCrazy/SECAD-Net.
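The sketch-extrude primitive combines a 2D profile SDF with a z-slab (their pointwise max is the intersection), and the resulting cylinders are unioned with a pointwise min. A minimal numpy sketch, with illustrative profile functions and query points rather than learned implicit sketches:

```python
import numpy as np

def extrude_sdf(sketch_sdf, z0, z1, p):
    # Approximate SDF of a 2D sketch extruded along z from z0 to z1.
    # sketch_sdf: function of (x, y) giving signed distance in the plane.
    d_xy = sketch_sdf(p[..., 0], p[..., 1])
    d_z = np.maximum(z0 - p[..., 2], p[..., 2] - z1)
    return np.maximum(d_xy, d_z)     # intersect profile with the z-slab

def union(*sdfs):
    return np.minimum.reduce(sdfs)   # Boolean union of extrusion cylinders

circle = lambda x, y: np.hypot(x, y) - 1.0                  # radius-1 disk
square = lambda x, y: np.maximum(np.abs(x - 2), np.abs(y)) - 0.5

pts = np.array([[0.0, 0.0, 0.5],    # inside the extruded circle
                [2.0, 0.0, 0.5],    # inside the extruded square
                [5.0, 5.0, 5.0]])   # outside both
d = union(extrude_sdf(circle, 0.0, 1.0, pts),
          extrude_sdf(square, 0.0, 1.0, pts))
```

Negative values mark points inside the combined shape; in SECAD-Net the circle/square profiles would instead be implicit sketch fields whose latent codes can be interpolated.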

Common Pets in 3D: Dynamic New-View Synthesis of Real-Life Deformable Categories
Sinha, Samarth and Shapovalov, Roman and Reizenstein, Jeremy and Rocco, Ignacio and Neverova, Natalia and Vedaldi, Andrea and Novotny, David



Research question: How to obtain photorealistic reconstructions of objects from sparse views?
Motivation: Earlier sparse rigid-object reconstruction methods can learn suitable reconstruction priors from large datasets such as CO3D, but this approach does not apply to dynamic objects.
Method: Using cats and dogs as a representative example, we introduce CoP3D, a crowd-sourced collection of videos showing around 4,200 distinct pets. We also propose Tracker-NeRF, a method for learning 4D reconstruction from our dataset. At test time, given a few video frames of an unseen sequence, Tracker-NeRF predicts the trajectories and dynamics of 3D points and generates new views, interpolating viewpoint and time.
Results: Results on CoP3D show significantly better non-rigid new-view synthesis than existing baselines.

Obtaining photorealistic reconstructions of objects from sparse views is inherently ambiguous and can only be achieved by learning suitable reconstruction priors. Earlier works on sparse rigid object reconstruction successfully learned such priors from large datasets such as CO3D. In this paper, we extend this approach to dynamic objects. We use cats and dogs as a representative example and introduce Common Pets in 3D (CoP3D), a collection of crowd-sourced videos showing around 4,200 distinct pets. CoP3D is one of the first large-scale datasets for benchmarking non-rigid 3D reconstruction "in the wild". We also propose Tracker-NeRF, a method for learning 4D reconstruction from our dataset. At test time, given a small number of video frames of an unseen sequence, Tracker-NeRF predicts the trajectories and dynamics of the 3D points and generates new views, interpolating viewpoint and time. Results on CoP3D reveal significantly better non-rigid new-view synthesis performance than existing baselines. The data is available on the project webpage: https://cop3d.github.io/.

Normal-Guided Garment UV Prediction for Human Re-Texturing
Jafarian, Yasamin and Wang, Tuanfeng Y. and Ceylan, Duygu and Yang, Jimei and Carr, Nathan and Zhou, Yi and Park, Hyun Soo



Research question: How to edit human videos in a physically plausible way, accounting for the complex geometric deformations and appearance changes of clothing.
Motivation: Traditional 3D reconstruction struggles with dynamic clothing, calling for a new way to estimate a geometry-aware texture map (UV map) of garments.
Method: We propose a method for editing images and videos of dressed humans without 3D reconstruction. Using 3D surface normals predicted from the image, we design a UV map that preserves isometry with respect to the underlying 3D surface. The approach captures the underlying garment geometry in a self-supervised way, requires no ground-truth UV map annotations, and extends to predicting temporally coherent UV maps.
Results: The method outperforms state-of-the-art human UV map estimation approaches on both real and synthetic data.

Clothes undergo complex geometric deformations, which lead to appearance changes. To edit human videos in a physically plausible way, a texture map must take into account not only the garment transformation induced by the body movements and clothes fitting, but also its 3D fine-grained surface geometry. This poses, however, a new challenge of 3D reconstruction of dynamic clothes from an image or a video. In this paper, we show that it is possible to edit dressed human images and videos without 3D reconstruction. We estimate a geometry aware texture map between the garment region in an image and the texture space, a.k.a, UV map. Our UV map is designed to preserve isometry with respect to the underlying 3D surface by making use of the 3D surface normals predicted from the image. Our approach captures the underlying geometry of the garment in a self-supervised way, requiring no ground truth annotation of UV maps and can be readily extended to predict temporally coherent UV maps. We demonstrate that our method outperforms the state-of-the-art human UV map estimation approaches on both real and synthetic data.

Computational Flash Photography Through Intrinsics
Maralan, Sepideh Sarajian and Careaga, Chris and Aksoy, Yagiz



Research question: This work studies the computational control of flash in photographs taken with or without flash.
Motivation: Currently, the use of flash is a binary decision; once a photograph is captured, control over flash characteristics such as strength or color is limited.
Method: We present a physically motivated intrinsic formulation of flash photograph formation and develop flash decomposition and generation methods for flash and no-flash photographs, respectively.
Results: Experiments show that this intrinsic formulation outperforms alternatives in the literature and allows us to computationally control flash in in-the-wild images.

Flash is an essential tool as it often serves as the sole controllable light source in everyday photography. However, the use of flash is a binary decision at the time a photograph is captured with limited control over its characteristics such as strength or color. In this work, we study the computational control of the flash light in photographs taken with or without flash. We present a physically motivated intrinsic formulation for flash photograph formation and develop flash decomposition and generation methods for flash and no-flash photographs, respectively. We demonstrate that our intrinsic formulation outperforms alternatives in the literature and allows us to computationally control flash in in-the-wild images.
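A toy version of an intrinsic flash-formation model: the flash photograph is albedo times the sum of ambient shading and an inverse-square flash-shading term, so decomposition amounts to solving for the flash term given the intrinsic components. This assumed form is for illustration only and is not the paper's exact model:

```python
import numpy as np

def form_flash_photo(albedo, ambient_shading, depth, flash_strength=1.0):
    # Flash shading falls off with the square of distance to the flash.
    flash_shading = flash_strength / np.maximum(depth, 1e-6) ** 2
    return albedo * (ambient_shading + flash_shading)

albedo = np.full((2, 2), 0.5)
ambient = np.full((2, 2), 0.4)
depth = np.array([[1.0, 2.0], [1.0, 2.0]])

flash_photo = form_flash_photo(albedo, ambient, depth)   # "generation"
no_flash_photo = albedo * ambient
# "decomposition": invert the formation model to recover the flash term
recovered_flash_term = flash_photo / albedo - ambient
```

Once the flash term is isolated, its strength or color can be rescaled and the photo re-composed, which is the sense in which an intrinsic formulation makes flash computationally controllable.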

BITE: Beyond Priors for Improved Three-D Dog Pose Estimation
Rüegg, Nadine and Tripathi, Shashank and Schindler, Konrad and Black, Michael J. and Zuffi, Silvia



Research question: Inferring the 3D shape and pose of dogs from images.
Motivation: Given the lack of 3D training data, this problem is challenging, and the best methods lag behind those for estimating human shape and pose.
Method: First, we learn a dog-specific 3D parametric model, called D-SMAL. Second, we exploit contact with the ground as a form of side information. We then develop a novel neural network architecture to infer and exploit this contact information. Finally, we create a synthetic dataset containing rendered images of scanned 3D dogs for evaluation.
Results: With these advances, our method recovers significantly better dog shape and pose than the state of the art, and we evaluate this improvement in 3D.

We address the problem of inferring the 3D shape and pose of dogs from images. Given the lack of 3D training data, this problem is challenging, and the best methods lag behind those designed to estimate human shape and pose. To make progress, we attack the problem from multiple sides at once. First, we need a good 3D shape prior, like those available for humans. To that end, we learn a dog-specific 3D parametric model, called D-SMAL. Second, existing methods focus on dogs in standing poses because when they sit or lie down, their legs are self occluded and their bodies deform. Without access to a good pose prior or 3D data, we need an alternative approach. To that end, we exploit contact with the ground as a form of side information. We consider an existing large dataset of dog images and label any 3D contact of the dog with the ground. We exploit body-ground contact in estimating dog pose and find that it significantly improves results. Third, we develop a novel neural network architecture to infer and exploit this contact information. Fourth, to make progress, we have to be able to measure it. Current evaluation metrics are based on 2D features like keypoints and silhouettes, which do not directly correlate with 3D errors. To address this, we create a synthetic dataset containing rendered images of scanned 3D dogs. With these advances, our method recovers significantly better dog shape and pose than the state of the art, and we evaluate this improvement in 3D. Our code, model and test dataset are publicly available for research purposes at https://bite.is.tue.mpg.de.

SeSDF: Self-Evolved Signed Distance Field for Implicit 3D Clothed Human Reconstruction
Cao, Yukang and Han, Kai and Wong, Kwan-Yee K.



Research question: This paper addresses reconstructing clothed humans from a single image or uncalibrated multi-view images.
Motivation: Existing methods struggle to reconstruct the detailed geometry of clothed humans, and multi-view reconstruction typically requires a calibrated setup.
Method: We propose a flexible framework that, by leveraging the parametric SMPL-X model, reconstructs a clothed human model from an arbitrary number of input images in an uncalibrated setting. At its core is our novel self-evolved signed distance field (SeSDF) module, which learns to deform the signed distance field (SDF) derived from the fitted SMPL-X model so that detailed geometry reflecting the actual clothed human can be encoded for better reconstruction.
Results: We thoroughly evaluate our framework on public benchmarks, demonstrating clear superiority over the state of the art both qualitatively and quantitatively.

We address the problem of clothed human reconstruction from a single image or uncalibrated multi-view images. Existing methods struggle with reconstructing detailed geometry of a clothed human and often require a calibrated setting for multi-view reconstruction. We propose a flexible framework which, by leveraging the parametric SMPL-X model, can take an arbitrary number of input images to reconstruct a clothed human model under an uncalibrated setting. At the core of our framework is our novel self-evolved signed distance field (SeSDF) module which allows the framework to learn to deform the signed distance field (SDF) derived from the fitted SMPL-X model, such that detailed geometry reflecting the actual clothed human can be encoded for better reconstruction. Besides, we propose a simple method for self-calibration of multi-view images via the fitted SMPL-X parameters. This lifts the requirement of tedious manual calibration and largely increases the flexibility of our method. Further, we introduce an effective occlusion-aware feature fusion strategy to account for the most useful features to reconstruct the human model. We thoroughly evaluate our framework on public benchmarks, demonstrating significant superiority over the state-of-the-arts both qualitatively and quantitatively.

Deep Depth Estimation From Thermal Image
Shin, Ukcheol and Park, Jinsun and Kweon, In So



Research question: How to achieve robust and accurate geometric understanding for self-driving cars under adverse weather conditions, toward high-level autonomy.
Motivation: Current autonomous driving algorithms rely on the visible spectrum band and are easily affected by weather and lighting conditions. Long-wave infrared (thermal) cameras are a potential route to high robustness, but well-established large-scale datasets and public benchmark results are missing.
Method: We first build a large-scale Multi-Spectral Stereo (MS^2) dataset, including stereo RGB, stereo NIR, stereo thermal, and stereo LiDAR data along with GNSS/IMU information. The dataset provides about 195K synchronized data pairs from city, residential, road, campus, and suburban areas in the morning, daytime, and nighttime under clear-sky, cloudy, and rainy conditions. Second, we exhaustively validate monocular and stereo depth estimation algorithms designed for the visible spectrum band to benchmark their performance in the thermal image domain. Finally, we propose a unified depth network that effectively bridges the monocular and stereo depth tasks from a conditional random field perspective.
Results: Our dataset and source code are available at https://github.com/UkcheolShin/MS2-MultiSpectralStereoDataset.

Robust and accurate geometric understanding against adverse weather conditions is a top priority for achieving high-level autonomy of self-driving cars. However, autonomous driving algorithms relying on the visible spectrum band are easily impacted by weather and lighting conditions. A long-wave infrared camera, also known as a thermal imaging camera, is a potential solution for achieving high-level robustness. However, well-established large-scale datasets and public benchmark results are still missing. To this end, in this paper, we first built a large-scale Multi-Spectral Stereo (MS^2) dataset, including stereo RGB, stereo NIR, stereo thermal, and stereo LiDAR data along with GNSS/IMU information. The collected dataset provides about 195K synchronized data pairs taken from city, residential, road, campus, and suburban areas in the morning, daytime, and nighttime under clear-sky, cloudy, and rainy conditions. Secondly, we conduct an exhaustive validation process of monocular and stereo depth estimation algorithms designed on visible spectrum bands to benchmark their performance in the thermal image domain. Lastly, we propose a unified depth network that effectively bridges monocular depth and stereo depth tasks from a conditional random field approach perspective. Our dataset and source code are available at https://github.com/UkcheolShin/MS2-MultiSpectralStereoDataset.

Building Rearticulable Models for Arbitrary 3D Objects From 4D Point Clouds
Liu, Shaowei and Gupta, Saurabh and Wang, Shenlong



Research question: How to build rearticulable models for everyday objects composed of an arbitrary number of parts connected in arbitrary ways.
Motivation: From point cloud videos, we want to identify the distinct object parts, which parts are connected to which, and the properties of the joints connecting each part pair.
Method: We model the object by jointly optimizing part segmentation, transformation, and kinematics with a novel energy minimization framework.
Results: Tested on a new articulating robot dataset and the Sapiens dataset of common daily objects, experiments show the method outperforms two leading prior works on various metrics.

We build rearticulable models for arbitrary everyday man-made objects containing an arbitrary number of parts that are connected together in arbitrary ways via 1-degree-of-freedom joints. Given point cloud videos of such everyday objects, our method identifies the distinct object parts, what parts are connected to what other parts, and the properties of the joints connecting each part pair. We do this by jointly optimizing the part segmentation, transformation, and kinematics using a novel energy minimization framework. Our inferred animatable models, enables retargeting to novel poses with sparse point correspondences guidance. We test our method on a new articulating robot dataset and the Sapiens dataset with common daily objects. Experiments show that our method outperforms two leading prior works on various metrics.
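The joint energy can be sketched as follows: given per-point part labels and per-part rigid transforms, score how well transforming each part of frame t explains frame t+1. The paper minimizes such an energy jointly over segmentation, transforms, and kinematics; this toy only evaluates an assumed reconstruction term:

```python
import numpy as np

def energy(points_t0, points_t1, labels, rotations, translations):
    # Sum of squared residuals after moving each part rigidly.
    recon = 0.0
    for k in range(rotations.shape[0]):
        mask = labels == k
        pred = points_t0[mask] @ rotations[k].T + translations[k]
        recon += np.sum((pred - points_t1[mask]) ** 2)
    return recon

pts0 = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [5, 5, 0]])
labels = np.array([0, 0, 0, 1])                  # two parts
R = np.stack([np.eye(3), np.eye(3)])             # per-part rotations
t = np.stack([[1.0, 0, 0], [0.0, 0, 1]])         # per-part translations
pts1 = np.array([[1.0, 0, 0], [2, 0, 0], [1, 1, 0], [5, 5, 1]])
e_good = energy(pts0, pts1, labels, R, t)        # correct motion: zero energy
```

The correct segmentation and transforms drive the energy to zero, while wrong assignments or transforms raise it, which is what makes joint minimization over all three unknowns meaningful.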

Towards Stable Human Pose Estimation via Cross-View Fusion and Foot Stabilization
Zhuo, Li{\textquoteright



Research question: Stable human pose estimation from monocular images faces two main dilemmas. First, different perspectives (front, side, and top views) yield inconsistent performance due to depth ambiguity. Second, foot posture and foot-ground interaction play an important role in complicated pose estimation such as dance and sports, yet most general approaches and datasets omit them.
Motivation: To address these problems, this paper combines cross-view fusion with explicit foot pose and foot-ground contact modeling.
Method: We propose a Cross-View Fusion (CVF) module, built on a vision transformer encoder, to obtain a better 3D intermediate representation and alleviate view inconsistency. An optimization-based method is introduced to reconstruct foot pose and foot-ground contact for general multi-view datasets including AIST++ and Human3.6M, and a reversible kinematic topology strategy feeds the contact information into the full body through a foot pose regressor.
Results: Extensive experiments on popular benchmarks show that the method outperforms state-of-the-art approaches, achieving 40.1mm PA-MPJPE on the 3DPW test set and 43.8mm on the AIST++ test set.

Towards stable human pose estimation from monocular images, there remain two main dilemmas. On the one hand, the different perspectives, i.e., front view, side view, and top view, show inconsistent performance due to depth ambiguity. On the other hand, foot posture plays a significant role in complicated human pose estimation, e.g., dance and sports, and in foot-ground interaction, but unfortunately it is omitted in most general approaches and datasets. In this paper, we first propose the Cross-View Fusion (CVF) module to obtain a better 3D intermediate representation and alleviate the view inconsistency, based on a vision transformer encoder. Then an optimization-based method is introduced to reconstruct the foot pose and foot-ground contact for the general multi-view datasets including AIST++ and Human3.6M. Besides, a reversible kinematic topology strategy is introduced to feed the contact information into the full body via a foot pose regressor. Extensive experiments on the popular benchmarks demonstrate that our method outperforms the state-of-the-art approaches, achieving 40.1mm PA-MPJPE on the 3DPW test set and 43.8mm on the AIST++ test set.

Few-Shot Non-Line-of-Sight Imaging With Signal-Surface Collaborative Regularization
Liu, Xintong and Wang, Jianyu and Xiao, Leping and Fu, Xing and Qiu, Lingyun and Shi, Zuoqiang



Research question: How to reconstruct targets from multiply reflected light with non-line-of-sight imaging?
Motivation: Most existing methods raster-scan dense points on the relay surface to obtain high-quality reconstructions, which requires long acquisition times.
Method: We propose a signal-surface collaborative regularization (SSCR) framework that uses Bayesian inference to design joint regularizations of the estimated signal, the 3D voxel-based representation of the objects, and the 2D surface-based description of the targets.
Results: Experiments on synthetic and experimental datasets show the method is effective under both confocal and non-confocal settings. We report reconstructions of hidden targets with complex geometric structure from only 5 x 5 confocal measurements from public datasets, indicating an acceleration of the conventional measurement process by a factor of 10,000. The method also enjoys low time and memory complexity with sparse measurements, and has great potential for real-time non-line-of-sight imaging applications such as rescue operations and autonomous driving.

The non-line-of-sight imaging technique aims to reconstruct targets from multiply reflected light. For most existing methods, dense points on the relay surface are raster scanned to obtain high-quality reconstructions, which requires a long acquisition time. In this work, we propose a signal-surface collaborative regularization (SSCR) framework that provides noise-robust reconstructions with a minimal number of measurements. Using Bayesian inference, we design joint regularizations of the estimated signal, the 3D voxel-based representation of the objects, and the 2D surface-based description of the targets. To our best knowledge, this is the first work that combines regularizations in mixed dimensions for hidden targets. Experiments on synthetic and experimental datasets illustrated the efficiency of the proposed method under both confocal and non-confocal settings. We report the reconstruction of the hidden targets with complex geometric structures with only 5 x 5 confocal measurements from public datasets, indicating an acceleration of the conventional measurement process by a factor of 10,000. Besides, the proposed method enjoys low time and memory complexity with sparse measurements. Our approach has great potential in real-time non-line-of-sight imaging applications such as rescue operations and autonomous driving.

RelightableHands: Efficient Neural Relighting of Articulated Hand Models
Iwase, Shun and Saito, Shunsuke and Simon, Tomas and Lombardi, Stephen and Bagautdinov, Timur and Joshi, Rohan and Prada, Fabian and Shiratori, Takaaki and Sheikh, Yaser and Saragih, Jason



Research question: How to render high-fidelity personalized hands in real time, animated under novel illumination?
Motivation: Current neural relighting methods require heavy computation to synthesize hands and cannot render in real time.
Method: We adopt a teacher-student framework. The teacher learns appearance under a single point light from light-stage images, allowing hands to be synthesized under arbitrary illumination, but at heavy compute. Using images rendered by the teacher as training data, the student directly predicts appearance under natural illumination in real time. For generalization, we condition the student on physics-inspired illumination features such as visibility, diffuse shading, and specular reflections; these correlate strongly with subsequent global light transport effects and suffice as conditioning data for the neural relighting network.
Results: Experiments show that our illumination feature representation outperforms baseline approaches and enables photorealistic relighting of two interacting hands at real-time speeds.

We present the first neural relighting approach for rendering high-fidelity personalized hands that can be animated in real-time under novel illumination. Our approach adopts a teacher-student framework, where the teacher learns appearance under a single point light from images captured in a light-stage, allowing us to synthesize hands in arbitrary illuminations but with heavy compute. Using images rendered by the teacher model as training data, an efficient student model directly predicts appearance under natural illuminations in real-time. To achieve generalization, we condition the student model with physics-inspired illumination features such as visibility, diffuse shading, and specular reflections computed on a coarse proxy geometry, maintaining a small computational overhead. Our key insight is that these features have strong correlation with subsequent global light transport effects, which proves sufficient as conditioning data for the neural relighting network. Moreover, in contrast to bottleneck illumination conditioning, these features are spatially aligned based on underlying geometry, leading to better generalization to unseen illuminations and poses. In our experiments, we demonstrate the efficacy of our illumination feature representations, outperforming baseline approaches. We also show that our approach can photorealistically relight two interacting hands at real-time speeds.
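The distillation idea can be sketched in one dimension: a "teacher" renders appearance per point light, an environment is a nonnegative combination of point lights, and a cheap "student" is fit on teacher-rendered pairs. Everything below (the linear teacher, scalar features, the least-squares student) is a toy stand-in, not the paper's networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher(light_dir, feature):
    # expensive per-light appearance model (arbitrary fixed nonlinearity)
    return np.maximum(0.0, light_dir * feature)

def render_env(weights, light_dirs, feature):
    # natural illumination = nonnegative combination of point lights
    return sum(w * teacher(d, feature) for w, d in zip(weights, light_dirs))

# distill: fit a linear student on teacher-rendered (feature, appearance) pairs
feats = rng.uniform(0.1, 1.0, size=200)
env_w = np.array([0.3, 0.7])
env_d = np.array([1.0, 2.0])
targets = np.array([render_env(env_w, env_d, f) for f in feats])
slope = (feats @ targets) / (feats @ feats)   # least-squares "student"
student = lambda f: slope * f                 # one cheap evaluation per query
```

The student answers each query with a single multiply instead of one teacher pass per light, which is the source of the real-time speedup.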

AnyFlow: Arbitrary Scale Optical Flow With Implicit Neural Representation
Jung, Hyunyoung and Hui, Zhuo and Luo, Lei and Yang, Haitao and Liu, Feng and Yoo, Sungjoo and Ranjan, Rakesh and Demandolx, Denis



Research question: How can optical flow be estimated accurately from low-resolution inputs?
Motivation: In practice, inputs are often resized to smaller dimensions to reduce computational cost. This makes estimation more challenging because objects and motion ranges become smaller. Although existing methods perform well on high-resolution inputs, they often fail to accurately model small objects and precise boundaries when the input resolution is lowered.
Method: We propose AnyFlow, a robust network that accurately estimates flow from images of various resolutions. By representing optical flow as a continuous coordinate-based representation, AnyFlow generates outputs at arbitrary scales from low-resolution inputs, outperforming existing methods in capturing tiny objects and preserving detail.
Results: We establish a new state of the art in cross-dataset generalization on the KITTI dataset, while achieving accuracy comparable to other SOTA methods.

To apply optical flow in practice, it is often necessary to resize the input to smaller dimensions in order to reduce computational costs. However, downsizing inputs makes the estimation more challenging because objects and motion ranges become smaller. Even though recent approaches have demonstrated high-quality flow estimation, they tend to fail to accurately model small objects and precise boundaries when the input resolution is lowered, restricting their applicability to high-resolution inputs. In this paper, we introduce AnyFlow, a robust network that estimates accurate flow from images of various resolutions. By representing optical flow as a continuous coordinate-based representation, AnyFlow generates outputs at arbitrary scales from low-resolution inputs, demonstrating superior performance over prior works in capturing tiny objects with detail preservation on a wide range of scenes. We establish a new state-of-the-art performance of cross-dataset generalization on the KITTI dataset, while achieving comparable accuracy on the online benchmarks to other SOTA methods.

Shakes on a Plane: Unsupervised Depth Estimation From Unstabilized Photography
Chugunov, Ilya and Zhang, Yuxuan and Heide, Felix



Research question: Modern mobile burst photography pipelines capture and merge short frame sequences to recover enhanced images, but often disregard the 3D nature of the captured scene, treating inter-frame pixel motion as a 2D aggregation problem.
Motivation: The authors show that in a "long-burst" of 42 12-megapixel RAW frames captured over two seconds, natural hand tremor alone provides enough parallax to recover high-quality scene depth.
Method: The authors devise a test-time optimization approach that fits a neural RGB-D representation to the long-burst data and simultaneously estimates scene depth and camera motion. Their plane-plus-depth model is trained end to end and performs coarse-to-fine refinement by controlling which multi-resolution volume features the network can access at each stage of training.
Results: Experiments validate the approach, achieving geometrically accurate depth reconstruction without additional hardware or separate data pre-processing and pose-estimation steps.

Modern mobile burst photography pipelines capture and merge a short sequence of frames to recover an enhanced image, but often disregard the 3D nature of the scene they capture, treating pixel motion between images as a 2D aggregation problem. We show that in a "long-burst", forty-two 12-megapixel RAW frames captured in a two-second sequence, there is enough parallax information from natural hand tremor alone to recover high-quality scene depth. To this end, we devise a test-time optimization approach that fits a neural RGB-D representation to long-burst data and simultaneously estimates scene depth and camera motion. Our plane plus depth model is trained end-to-end, and performs coarse-to-fine refinement by controlling which multi-resolution volume features the network has access to at what time during training. We validate the method experimentally, and demonstrate geometrically accurate depth reconstructions with no additional hardware or separate data pre-processing and pose-estimation steps.

ShapeClipper: Scalable 3D Shape Learning From Single-View Images via Geometric and CLIP-Based Consistency
Huang, Zixuan and Jampani, Varun and Thai, Anh and Li, Yuanzhen and Stojanov, Stefan and Rehg, James M.



Research question: This paper proposes a novel method for reconstructing 3D object shapes from single-view RGB images.
Motivation: Existing 3D reconstruction methods require extensive 3D, multi-view, or camera pose annotation, which is laborious.
Method: The authors propose ShapeClipper, which learns shape reconstruction from a set of single-view segmented images. The key idea is to facilitate shape learning via CLIP-based shape consistency, encouraging objects with similar CLIP encodings to share similar shapes. Off-the-shelf normals are also leveraged as an additional geometric constraint so the model can better perform bottom-up reasoning about detailed surface geometry. These two novel consistency constraints regularize the model and improve its ability to learn both global shape structure and local geometric detail.
Results: Evaluated on three challenging real-world datasets, Pix3D, Pascal3D+, and OpenImages, the method outperforms state-of-the-art approaches.

We present ShapeClipper, a novel method that reconstructs 3D object shapes from real-world single-view RGB images. Instead of relying on laborious 3D, multi-view or camera pose annotation, ShapeClipper learns shape reconstruction from a set of single-view segmented images. The key idea is to facilitate shape learning via CLIP-based shape consistency, where we encourage objects with similar CLIP encodings to share similar shapes. We also leverage off-the-shelf normals as an additional geometric constraint so the model can learn better bottom-up reasoning of detailed surface geometry. These two novel consistency constraints, when used to regularize our model, improve its ability to learn both global shape structure and local geometric details. We evaluate our method over three challenging real-world datasets, Pix3D, Pascal3D+, and OpenImages, where we achieve superior performance over state-of-the-art methods.
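The CLIP-based shape consistency idea (images with similar CLIP encodings should map to similar shapes) can be sketched as a pairwise regularizer over a batch. The similarity weighting and squared-distance penalty below are illustrative assumptions in NumPy, not ShapeClipper's exact loss:

```python
import numpy as np

def clip_shape_consistency(clip_emb, shape_emb, tau=0.1):
    """Toy CLIP-based consistency loss: pairs of images with high CLIP
    similarity are pushed to have close shape latents.
    clip_emb: (N, D) L2-normalized CLIP embeddings; shape_emb: (N, K)."""
    sim = clip_emb @ clip_emb.T                  # pairwise cosine similarities
    w = np.exp(sim / tau)                        # soft weights favoring similar pairs
    np.fill_diagonal(w, 0.0)                     # ignore self-pairs
    # pairwise squared distances between shape latents
    d = ((shape_emb[:, None] - shape_emb[None]) ** 2).sum(-1)
    return (w * d).sum() / (w.sum() + 1e-8)      # similarity-weighted mean distance
```

Batches whose shape latents already agree for CLIP-similar images incur near-zero loss; disagreeing latents are penalized in proportion to the CLIP similarity of their images.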

Exact-NeRF: An Exploration of a Precise Volumetric Parameterization for Neural Radiance Fields
Isaac-Medina, Brian K. S. and Willcocks, Chris G. and Breckon, Toby P.



Research question: When rendering scene views, existing Neural Radiance Field (NeRF) models sample points along rays of zero width, which can lead to ambiguous representations and rendering artifacts such as aliasing.
Motivation: To address this, the recent mip-NeRF proposes an Integrated Positional Encoding (IPE) based on a conical view frustum. However, its approximation produces highly elongated regions for scene objects at large depths, where its performance degrades.
Method: This paper proposes an exact method, termed Exact-NeRF, that computes the IPE with a pyramid-based integral formulation, contributing the first precise analytical solution to the IPE in the NeRF domain.
Results: Experimental results show that Exact-NeRF matches the accuracy of mip-NeRF and naturally extends to more challenging scenarios, such as unbounded scenes, without further modification. This contribution aims to address the frustum-approximation issues of earlier NeRF work and to provide insight into analytical solutions for future NeRF extensions.

Neural Radiance Fields (NeRF) have attracted significant attention due to their ability to synthesize novel scene views with great accuracy. However, inherent to their underlying formulation, the sampling of points along a ray with zero width may result in ambiguous representations that lead to further rendering artifacts such as aliasing in the final scene. To address this issue, the recent variant mip-NeRF proposes an Integrated Positional Encoding (IPE) based on a conical view frustum. Although this is expressed with an integral formulation, mip-NeRF instead approximates this integral as the expected value of a multivariate Gaussian distribution. This approximation is reliable for short frustums but degrades with highly elongated regions, which arises when dealing with distant scene objects under a larger depth of field. In this paper, we explore the use of an exact approach for calculating the IPE by using a pyramid-based integral formulation instead of an approximated conical-based one. We denote this formulation as Exact-NeRF and contribute the first approach to offer a precise analytical solution to the IPE within the NeRF domain. Our exploratory work illustrates that such an exact formulation (Exact-NeRF) matches the accuracy of mip-NeRF and furthermore provides a natural extension to more challenging scenarios without further modification, such as in the case of unbounded scenes. Our contribution aims to both address the hitherto unexplored issues of frustum approximation in earlier NeRF work and additionally provide insight into the potential future consideration of analytical solutions in future NeRF extensions.
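For context, the Gaussian approximation that mip-NeRF uses (and that Exact-NeRF replaces with an exact pyramid integral) has a simple closed form: under a diagonal Gaussian with mean mu and variance var, E[sin(2^l x)] = sin(2^l mu) * exp(-4^l var / 2), and likewise for cosine. A minimal NumPy sketch of this approximated IPE, with a feature layout chosen for illustration only:

```python
import numpy as np

def integrated_pos_enc(mu, var, num_freqs=4):
    """Diagonal-Gaussian IPE (mip-NeRF style approximation):
    each frequency band's encoding is the plain positional encoding
    attenuated by exp(-0.5 * (2^l)^2 * var), so wide (uncertain)
    frustums suppress high frequencies. mu, var: (..., 3)."""
    feats = []
    for l in range(num_freqs):
        scale = 2.0 ** l
        decay = np.exp(-0.5 * (scale ** 2) * var)   # attenuation from the Gaussian
        feats.append(np.sin(scale * mu) * decay)
        feats.append(np.cos(scale * mu) * decay)
    return np.concatenate(feats, axis=-1)
```

The attenuation factor is exactly where the approximation breaks down for highly elongated frustums, since a long thin pyramid is poorly summarized by a single Gaussian.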

Non-Line-of-Sight Imaging With Signal Superresolution Network
Wang, Jianyu and Liu, Xintong and Xiao, Leping and Shi, Zuoqiang and Qiu, Lingyun and Fu, Xing



Research question: This paper addresses the degradation of reconstruction quality of hidden objects in non-line-of-sight (NLOS) imaging caused by the constraint of long exposure times.
Motivation: NLOS imaging has great potential in many fields, but its need for long exposures limits its practical use in latency-critical applications such as autonomous driving.
Method: A learning-based pipeline is proposed that trains a neural network to recover high-spatial-resolution signals, improving imaging quality with only a few scanning points.
Results: Experiments show that the method faithfully reconstructs hidden scenes under both confocal and non-confocal settings. Compared with the original measurements, acquisition is 16 times faster while maintaining similar reconstruction quality. Moreover, the method can be applied directly to existing optical systems and imaging algorithms as a plug-and-play module.

Non-line-of-sight (NLOS) imaging aims at reconstructing the location, shape, albedo, and surface normal of the hidden object around the corner with measured transient data. Due to its strong potential in various fields, it has drawn much attention in recent years. However, long exposure time is not always available for applications such as auto-driving, which hinders the practical use of NLOS imaging. Although scanning fewer points can reduce the total measurement time, it also brings the problem of imaging quality degradation. This paper proposes a general learning-based pipeline for increasing imaging quality with only a few scanning points. We tailor a neural network to learn the operator that recovers a high spatial resolution signal. Experiments on synthetic and measured data indicate that the proposed method provides faithful reconstructions of the hidden scene under both confocal and non-confocal settings. Compared with original measurements, the acquisition of our approach is 16 times faster while maintaining similar reconstruction quality. Besides, the proposed pipeline can be applied directly to existing optical systems and imaging algorithms as a plug-in-and-play module. We believe the proposed pipeline is powerful in increasing the frame rate in NLOS video imaging.

WildLight: In-the-Wild Inverse Rendering With a Flashlight
Cheng, Ziang and Li, Junxuan and Li, Hongdong



Research question: This paper tackles the challenging problem of in-the-wild inverse rendering under unknown ambient lighting.
Motivation: Existing inverse rendering methods require paired flash/no-flash images and struggle to handle ambient reflections.
Method: A practical photometric solution is proposed that uses a smartphone's built-in flashlight as a controlled light source and decomposes image intensities into two photometric components: a static ambient appearance and a dynamic reflection induced by the moving flashlight.
Results: Experiments show the method is easy to implement, casual to set up, and consistently outperforms existing in-the-wild inverse rendering techniques. The final neural reconstruction can be conveniently exported to a PBR-textured triangle mesh ready for industrial renderers.

This paper proposes a practical photometric solution for the challenging problem of in-the-wild inverse rendering under unknown ambient lighting. Our system recovers scene geometry and reflectance using only multi-view images captured by a smartphone. The key idea is to exploit smartphone's built-in flashlight as a minimally controlled light source, and decompose image intensities into two photometric components -- a static appearance corresponds to ambient flux, plus a dynamic reflection induced by the moving flashlight. Our method does not require flash/non-flash images to be captured in pairs. Building on the success of neural light fields, we use an off-the-shelf method to capture the ambient reflections, while the flashlight component enables physically accurate photometric constraints to decouple reflectance and illumination. Compared to existing inverse rendering methods, our setup is applicable to non-darkroom environments yet sidesteps the inherent difficulties of explicit solving ambient reflections. We demonstrate by extensive experiments that our method is easy to implement, casual to set up, and consistently outperforms existing in-the-wild inverse rendering techniques. Finally, our neural reconstruction can be easily exported to PBR textured triangle mesh ready for industrial renderers. Our source code and data are released to https://github.com/za-cheng/WildLight

A Probabilistic Attention Model With Occlusion-Aware Texture Regression for 3D Hand Reconstruction From a Single RGB Image
Jiang, Zheheng and Rahmani, Hossein and Black, Sue and Williams, Bryan M.



Research question: How can a 3D hand model be reconstructed from a single RGB image while overcoming existing methods' heavy dependence on the model's parameter space and their depth ambiguity?
Motivation: Current deep learning methods fall short on these problems, and a new approach is needed to improve reconstruction accuracy and robustness.
Method: A novel probabilistic model is proposed that incorporates a model-based network as a prior network to estimate the prior probability distribution of joints and vertices. An attention-based mesh-vertex uncertainty regression model captures dependencies among vertices and correlations between joints and mesh vertices to improve their feature representation. In addition, a learning-based occlusion-aware hand texture regression model is proposed for high-fidelity texture reconstruction.
Results: Experiments show the probabilistic model achieves state-of-the-art reconstruction accuracy in both supervised and weakly supervised training schemes, and maintains good performance even under severe occlusion.

Recently, deep learning based approaches have shown promising results in 3D hand reconstruction from a single RGB image. These approaches can be roughly divided into model-based approaches, which are heavily dependent on the model's parameter space, and model-free approaches, which require large numbers of 3D ground truths to reduce depth ambiguity and struggle in weakly-supervised scenarios. To overcome these issues, we propose a novel probabilistic model to achieve the robustness of model-based approaches and reduced dependence on the model's parameter space of model-free approaches. The proposed probabilistic model incorporates a model-based network as a prior-net to estimate the prior probability distribution of joints and vertices. An Attention-based Mesh Vertices Uncertainty Regression (AMVUR) model is proposed to capture dependencies among vertices and the correlation between joints and mesh vertices to improve their feature representation. We further propose a learning based occlusion-aware Hand Texture Regression model to achieve high-fidelity texture reconstruction. We demonstrate the flexibility of the proposed probabilistic model to be trained in both supervised and weakly-supervised scenarios. The experimental results demonstrate our probabilistic model's state-of-the-art accuracy in 3D hand and texture reconstruction from a single image in both training schemes, including in the presence of severe occlusions.

MixNeRF: Modeling a Ray With Mixture Density for Novel View Synthesis From Sparse Inputs
Seo, Seunghyeon and Han, Donghoon and Chang, Yeonjin and Kwak, Nojun



Research question: Existing Neural Radiance Field (NeRF) models require large numbers of images with different camera poses for training, which limits their practical applications.
Motivation: To address this, the paper proposes MixNeRF, a new training strategy for novel view synthesis from sparse inputs that models a ray with a mixture density model.
Method: MixNeRF estimates the joint distribution of RGB colors along the ray samples as a mixture of distributions. It also proposes a new task, ray depth estimation, as a useful training objective that correlates strongly with 3D scene geometry. Furthermore, it regenerates the blending weights based on the estimated ray depth, further improving robustness to colors and viewpoints.
Results: Experiments show that MixNeRF outperforms other state-of-the-art methods on various standard benchmarks, with superior training and inference efficiency.

Neural Radiance Field (NeRF) has broken new ground in the novel view synthesis due to its simple concept and state-of-the-art quality. However, it suffers from severe performance degradation unless trained with a dense set of images with different camera poses, which hinders its practical applications. Although previous methods addressing this problem achieved promising results, they relied heavily on the additional training resources, which goes against the philosophy of sparse-input novel-view synthesis pursuing the training efficiency. In this work, we propose MixNeRF, an effective training strategy for novel view synthesis from sparse inputs by modeling a ray with a mixture density model. Our MixNeRF estimates the joint distribution of RGB colors along the ray samples by modeling it with mixture of distributions. We also propose a new task of ray depth estimation as a useful training objective, which is highly correlated with 3D scene geometry. Moreover, we remodel the colors with regenerated blending weights based on the estimated ray depth and further improves the robustness for colors and viewpoints. Our MixNeRF outperforms other state-of-the-art methods in various standard benchmarks with superior efficiency of training and inference.
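The core idea of modeling a ray as a mixture density can be sketched directly: the volume-rendering blending weights act as mixing coefficients over per-sample color distributions, and the pixel color is scored by the mixture's likelihood. A toy NumPy sketch assuming Laplace components and simple renormalization (the paper's exact distribution family and losses may differ):

```python
import numpy as np

def mixture_nll(pixel_rgb, sample_rgbs, weights, scale=0.1):
    """Negative log-likelihood of a pixel color under a mixture of
    Laplace distributions centered at per-sample colors, with the
    volume-rendering blending weights as mixing coefficients.
    pixel_rgb: (3,), sample_rgbs: (N, 3), weights: (N,)."""
    w = weights / (weights.sum() + 1e-8)   # normalize mixing coefficients
    # per-component Laplace density, channels treated as independent
    dens = np.prod(np.exp(-np.abs(pixel_rgb - sample_rgbs) / scale) / (2 * scale),
                   axis=-1)
    return -np.log((w * dens).sum() + 1e-12)
```

Minimizing this objective rewards placing high blending weight on samples whose colors explain the observed pixel, which is the extra supervision a sparse-input setting needs.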

Cross-Domain 3D Hand Pose Estimation With Dual Modalities
Lin, Qiuxia and Yang, Linlin and Yao, Angela



Research question: How can neural networks for hand pose estimation be trained on synthetic data while solving the generalization gap to real-world data caused by domain differences?
Motivation: Existing hand pose estimation methods rely heavily on synthetic training data, which often fails to generalize well to real-world data.
Method: A cross-domain semi-supervised hand pose estimation framework is proposed that trains a dual-modality network on both synthetic RGB and synthetic depth images. During pre-training, the network uses multi-modal contrastive learning and attention-fused supervision to learn effective RGB image representations. During fine-tuning, a novel self-distillation technique is introduced to reduce pseudo-label noise.
Results: Experiments show the method significantly improves 3D hand pose estimation and 2D keypoint detection on benchmarks.

Recent advances in hand pose estimation have shed light on utilizing synthetic data to train neural networks, which however inevitably hinders generalization to real-world data due to domain gaps. To solve this problem, we present a framework for cross-domain semi-supervised hand pose estimation and target the challenging scenario of learning models from labelled multi-modal synthetic data and unlabelled real-world data. To that end, we propose a dual-modality network that exploits synthetic RGB and synthetic depth images. For pre-training, our network uses multi-modal contrastive learning and attention-fused supervision to learn effective representations of the RGB images. We then integrate a novel self-distillation technique during fine-tuning to reduce pseudo-label noise. Experiments show that the proposed method significantly improves 3D hand pose estimation and 2D keypoint detection on benchmarks.

Inverse Rendering of Translucent Objects Using Physical and Neural Renderers
Li, Chenhao and Ngo, Trung Thanh and Nagahara, Hajime



Research question: This paper proposes an inverse rendering model that estimates 3D shape, spatially varying reflectance, homogeneous subsurface scattering parameters, and environment illumination from only a pair of captured images of a translucent object.
Motivation: To resolve the ambiguity of inverse rendering, a physically based renderer and a neural renderer are used for scene reconstruction and material editing. Because both renderers are differentiable, a reconstruction loss can be computed to assist parameter estimation.
Method: A flash and no-flash image pair is used as input, and a large-scale synthetic dataset of translucent objects comprising 117K scenes was constructed to supervise training.
Results: Qualitative and quantitative results on both synthetic and real-world datasets demonstrate the effectiveness of the model.

In this work, we propose an inverse rendering model that estimates 3D shape, spatially-varying reflectance, homogeneous subsurface scattering parameters, and an environment illumination jointly from only a pair of captured images of a translucent object. In order to solve the ambiguity problem of inverse rendering, we use a physically-based renderer and a neural renderer for scene reconstruction and material editing. Because two renderers are differentiable, we can compute a reconstruction loss to assist parameter estimation. To enhance the supervision of the proposed neural renderer, we also propose an augmented loss. In addition, we use a flash and no-flash image pair as the input. To supervise the training, we constructed a large-scale synthetic dataset of translucent objects, which consists of 117K scenes. Qualitative and quantitative results on both synthetic and real-world datasets demonstrated the effectiveness of the proposed model.

Improving Fairness in Facial Albedo Estimation via Visual-Textual Cues
Ren, Xingyu and Deng, Jiankang and Ma, Chao and Yan, Yichao and Yang, Xiaokang



Research question: How can facial albedo be estimated accurately while avoiding the light-skin bias caused by racially biased albedo models and limited lighting constraints?
Motivation: Existing methods have made significant advances in geometry prediction, but progress on albedo lags behind because inferring albedo from appearance is an ill-posed problem.
Method: We reconsider the relationship between albedo and face attributes and propose ID2Albedo to estimate albedo directly without constraining illumination. Our key insight is that intrinsic semantic attributes such as race, skin color, and age can constrain the albedo map. We first introduce visual-textual cues and design a semantic loss to supervise facial albedo estimation.
Results: Experiments show that ID2Albedo outperforms state-of-the-art albedo estimation methods in accuracy and fidelity. Moreover, our method has excellent generalizability and fairness, especially on in-the-wild data.

Recent 3D face reconstruction methods have made significant advances in geometry prediction, yet further cosmetic improvements are limited by lagged albedo because inferring albedo from appearance is an ill-posed problem. Although some existing methods consider prior knowledge from illumination to improve albedo estimation, they still produce a light-skin bias due to racially biased albedo models and limited light constraints. In this paper, we reconsider the relationship between albedo and face attributes and propose an ID2Albedo to directly estimate albedo without constraining illumination. Our key insight is that intrinsic semantic attributes such as race, skin color, and age can constrain the albedo map. We first introduce visual-textual cues and design a semantic loss to supervise facial albedo estimation. Specifically, we pre-define text labels such as race, skin color, age, and wrinkles. Then, we employ the text-image model (CLIP) to compute the similarity between the text and the input image, and assign a pseudo-label to each facial image. We constrain generated albedos in the training phase to have the same attributes as the inputs. In addition, we train a high-quality, unbiased facial albedo generator and utilize the semantic loss to learn the mapping from illumination-robust identity features to the albedo latent codes. Finally, our ID2Albedo is trained in a self-supervised way and outperforms state-of-the-art albedo estimation methods in terms of accuracy and fidelity. It is worth mentioning that our approach has excellent generalizability and fairness, especially on in-the-wild data.

SfM-TTR: Using Structure From Motion for Test-Time Refinement of Single-View Depth Networks
Izquierdo, Sergio and Civera, Javier



Research question: How can a dense depth map be estimated from a single view?
Motivation: Current methods mainly rely on deep networks that learn the relation between depth and visual appearance, while Structure from Motion (SfM) methods exploit multi-view constraints to produce accurate but sparse maps.
Method: A novel test-time refinement (TTR) method, SfM-TTR, is proposed that uses SfM multi-view cues at test time to boost the performance of single-view depth networks. Specifically, the sparse SfM point cloud serves as a test-time self-supervisory signal, fine-tuning the network encoder to learn a better representation of the test scene.
Results: Results show that adding SfM-TTR to several state-of-the-art self-supervised and supervised networks significantly improves their performance, outperforming previous TTR baselines based mainly on photometric multi-view consistency.

Estimating a dense depth map from a single view is geometrically ill-posed, and state-of-the-art methods rely on learning depth's relation with visual appearance using deep neural networks. On the other hand, Structure from Motion (SfM) leverages multi-view constraints to produce very accurate but sparse maps, as matching across images is typically limited by locally discriminative texture. In this work, we combine the strengths of both approaches by proposing a novel test-time refinement (TTR) method, denoted as SfM-TTR, that boosts the performance of single-view depth networks at test time using SfM multi-view cues. Specifically, and differently from the state of the art, we use sparse SfM point clouds as test-time self-supervisory signal, fine-tuning the network encoder to learn a better representation of the test scene. Our results show how the addition of SfM-TTR to several state-of-the-art self-supervised and supervised networks improves significantly their performance, outperforming previous TTR baselines mainly based on photometric multi-view consistency. The code is available at https://github.com/serizba/SfM-TTR.
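The test-time self-supervision described above can be sketched as a loss term: sample the predicted depth map at the sparse SfM pixels, align the two up to scale, and penalize the remaining disagreement. The median alignment and L1 penalty in this NumPy sketch are illustrative assumptions, not necessarily the paper's exact formulation:

```python
import numpy as np

def sfm_ttr_loss(pred_depth, sfm_uv, sfm_depth):
    """Test-time self-supervision from a sparse SfM point cloud.
    pred_depth: (H, W) network depth map;
    sfm_uv: (N, 2) integer pixel coords (u, v) of SfM points;
    sfm_depth: (N,) SfM depths at those pixels."""
    sampled = pred_depth[sfm_uv[:, 1], sfm_uv[:, 0]]          # depth at SfM pixels
    scale = np.median(sfm_depth) / (np.median(sampled) + 1e-8) # resolve scale ambiguity
    return np.abs(scale * sampled - sfm_depth).mean()          # L1 disagreement
```

During refinement this scalar would be backpropagated to the encoder only, per the paper's choice of fine-tuning the encoder to re-represent the test scene.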

Implicit View-Time Interpolation of Stereo Videos Using Multi-Plane Disparities and Non-Uniform Coordinates
Paliwal, Avinash and Tsarov, Andrii and Kalantari, Nima Khademi



Research question: This paper proposes a view-time interpolation method for stereo videos.
Motivation: The existing X-Fields approach struggles to interpolate disparities for large-baseline cameras, so new techniques are needed to overcome these challenges.
Method: We propose multi-plane disparities and non-linear, non-uniform time coordinates, along with several further improvements.
Results: Experiments show the method outperforms the state of the art while running at near real-time rates with low memory and storage costs.

In this paper, we propose an approach for view-time interpolation of stereo videos. Specifically, we build upon X-Fields that approximates an interpolatable mapping between the input coordinates and 2D RGB images using a convolutional decoder. Our main contribution is to analyze and identify the sources of the problems with using X-Fields in our application and propose novel techniques to overcome these challenges. Specifically, we observe that X-Fields struggles to implicitly interpolate the disparities for large baseline cameras. Therefore, we propose multi-plane disparities to reduce the spatial distance of the objects in the stereo views. Moreover, we propose non-uniform time coordinates to handle the non-linear and sudden motion spikes in videos. We additionally introduce several simple, but important, improvements over X-Fields. We demonstrate that our approach is able to produce better results than the state of the art, while running in near real-time rates and having low memory and storage costs.

pCON: Polarimetric Coordinate Networks for Neural Scene Representations
Peters, Henry and Ba, Yunhao and Kadambi, Achuta



Research question: Current neural scene representation models are not optimized to preserve physical quantities when reconstructing images.
Motivation: Although existing architectures can reconstruct color images correctly, they create artifacts when fitting maps of polarimetric quantities.
Method: We propose polarimetric coordinate networks (pCON), a new neural scene representation model designed to preserve polarimetric information while accurately parameterizing the scene.
Results: Our model removes the artifacts that current coordinate network architectures create when reconstructing three polarimetric quantities of interest.

Neural scene representations have achieved great success in parameterizing and reconstructing images, but current state of the art models are not optimized with the preservation of physical quantities in mind. While current architectures can reconstruct color images correctly, they create artifacts when trying to fit maps of polar quantities. We propose polarimetric coordinate networks (pCON), a new model architecture for neural scene representations aimed at preserving polarimetric information while accurately parameterizing the scene. Our model removes artifacts created by current coordinate network architectures when reconstructing three polarimetric quantities of interest.

Visibility Aware Human-Object Interaction Tracking From Single RGB Camera
Xie, Xianghui and Bhatnagar, Bharat Lal and Pons-Moll, Gerard



Research question: How can 3D human-object interactions be reconstructed from a single RGB camera while tracking their relative translation across frames?
Motivation: Existing methods assume a fixed depth when reconstructing the 3D human and object, leading to inconsistent relative translation across frames, and their performance drops significantly when the object is occluded.
Method: A novel method is proposed that conditions the neural field reconstructions of human and object on per-frame SMPL model estimates obtained by pre-fitting SMPL to the video sequence, improving neural reconstruction accuracy and producing coherent relative translation. In addition, human and object motion from visible frames is leveraged to infer the occluded object.
Results: Experiments show the method significantly outperforms existing approaches on two datasets and tracks both human and object effectively even under occlusion.

Capturing the interactions between humans and their environment in 3D is important for many applications in robotics, graphics, and vision. Recent works to reconstruct the 3D human and object from a single RGB image do not have consistent relative translation across frames because they assume a fixed depth. Moreover, their performance drops significantly when the object is occluded. In this work, we propose a novel method to track the 3D human, object, contacts, and relative translation across frames from a single RGB camera, while being robust to heavy occlusions. Our method is built on two key insights. First, we condition our neural field reconstructions for human and object on per-frame SMPL model estimates obtained by pre-fitting SMPL to a video sequence. This improves neural reconstruction accuracy and produces coherent relative translation across frames. Second, human and object motion from visible frames provides valuable information to infer the occluded object. We propose a novel transformer-based neural network that explicitly uses object visibility and human motion to leverage neighboring frames to make predictions for the occluded frames. Building on these insights, our method is able to track both human and object robustly even under occlusions. Experiments on two datasets show that our method significantly improves over the state-of-the-art methods. Our code and pretrained models are available at: https://virtualhumans.mpi-inf.mpg.de/VisTracker.

Uncertainty-Aware Vision-Based Metric Cross-View Geolocalization
Fervers, Florian and Bullinger, Sebastian and Bodensteiner, Christoph and Arens, Michael and Stiefelhagen, Rainer



Research question: How can a vehicle's geo-pose be determined by matching its ground-level camera images against an aerial image?
Motivation: Aerial images are globally available at low cost and represent a potential compromise between the two established paradigms of autonomous driving: using expensive high-definition prior maps, or relying entirely on sensor data captured at runtime.
Method: A novel vision-based metric cross-view geolocalization (CVGL) method is proposed that uses ground and aerial images to predict a probability distribution over possible vehicle poses.
Results: Demonstrated on multiple vehicle datasets combined with aerial imagery, the method proves feasible. It improves substantially over the previous state of the art even without ground or aerial data from the test region, highlighting the model's potential for global-scale application.

This paper proposes a novel method for vision-based metric cross-view geolocalization (CVGL) that matches the camera images captured from a ground-based vehicle with an aerial image to determine the vehicle's geo-pose. Since aerial images are globally available at low cost, they represent a potential compromise between two established paradigms of autonomous driving, i.e. using expensive high-definition prior maps or relying entirely on the sensor data captured at runtime. We present an end-to-end differentiable model that uses the ground and aerial images to predict a probability distribution over possible vehicle poses. We combine multiple vehicle datasets with aerial images from orthophoto providers on which we demonstrate the feasibility of our method. Since the ground truth poses are often inaccurate w.r.t. the aerial images, we implement a pseudo-label approach to produce more accurate ground truth poses and make them publicly available. While previous works require training data from the target region to achieve reasonable localization accuracy (i.e. same-area evaluation), our approach overcomes this limitation and outperforms previous results even in the strictly more challenging cross-area case. We improve the previous state-of-the-art by a large margin even without ground or aerial data from the test region, which highlights the model's potential for global-scale application. We further integrate the uncertainty-aware predictions in a tracking framework to determine the vehicle's trajectory over time resulting in a mean position error on KITTI-360 of 0.78m.

DANI-Net: Uncalibrated Photometric Stereo by Differentiable Shadow Handling, Anisotropic Reflectance Modeling, and Neural Inverse Rendering
Li, Zongrui and Zheng, Qian and Shi, Boxin and Pan, Gang and Jiang, Xudong



Research question: Solving uncalibrated photometric stereo (UPS), especially for general objects with complex shapes that introduce irregular shadows and for materials with complex reflectance such as anisotropic reflectance.
Motivation: UPS is challenging due to the inherent ambiguity brought by the unknown light. Although the ambiguity is alleviated on non-Lambertian objects, the problem remains difficult for general objects that introduce irregular shadows and for general materials with complex reflectance such as anisotropic reflectance.
Method: We propose DANI-Net, an inverse rendering framework with differentiable shadow handling and anisotropic reflectance modeling. Unlike most previous methods, which use non-differentiable shadow maps and assume isotropic materials, our network benefits from cues of shadow and anisotropic reflectance through two differentiable paths.
Results: Experiments on multiple real-world datasets demonstrate the superior and robust performance of our method.

Uncalibrated photometric stereo (UPS) is challenging due to the inherent ambiguity brought by the unknown light. Although the ambiguity is alleviated on non-Lambertian objects, the problem is still difficult to solve for more general objects with complex shapes introducing irregular shadows and general materials with complex reflectance like anisotropic reflectance. To exploit cues from shadow and reflectance to solve UPS and improve performance on general materials, we propose DANI-Net, an inverse rendering framework with differentiable shadow handling and anisotropic reflectance modeling. Unlike most previous methods that use non-differentiable shadow maps and assume isotropic material, our network benefits from cues of shadow and anisotropic reflectance through two differentiable paths. Experiments on multiple real-world datasets demonstrate our superior and robust performance.

Ref-NPR: Reference-Based Non-Photorealistic Radiance Fields for Controllable Scene Stylization
Zhang, Yuechen and He, Zexin and Xing, Jinbo and Yao, Xufeng and Jia, Jiaya



Research question: Current 3D scene stylization methods transfer textures and colors as styles without meaningful semantic correspondences.
Motivation: To address this limitation, we propose Reference-Based Non-Photorealistic Radiance Fields (Ref-NPR).
Method: The method stylizes a 3D scene via radiance fields using a single stylized 2D view as a reference. We propose a ray registration process based on the stylized reference view to obtain pseudo-ray supervision in novel views. We then exploit semantic correspondences in the content images to fill occluded regions with perceptually similar styles, producing non-photorealistic and continuous novel view sequences.
Results: Experimental results show that Ref-NPR outperforms existing scene and video stylization methods in both visual quality and semantic correspondence.

Current 3D scene stylization methods transfer textures and colors as styles using arbitrary style references, lacking meaningful semantic correspondences. We introduce Reference-Based Non-Photorealistic Radiance Fields (Ref-NPR) to address this limitation. This controllable method stylizes a 3D scene using radiance fields with a single stylized 2D view as a reference. We propose a ray registration process based on the stylized reference view to obtain pseudo-ray supervision in novel views. Then we exploit semantic correspondences in content images to fill occluded regions with perceptually similar styles, resulting in non-photorealistic and continuous novel view sequences. Our experimental results demonstrate that Ref-NPR outperforms existing scene and video stylization methods regarding visual quality and semantic correspondence. The code and data are publicly available on the project page at https://ref-npr.github.io.

NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization
Min, Zhixiang and Zhuang, Bingbing and Schulter, Samuel and Liu, Buyu and Dunn, Enrique and Chandraker, Manmohan



Research question: Monocular 3D object localization in driving scenes is a crucial task but is challenging due to its ill-posed nature.
Motivation: Estimating 3D coordinates for each pixel on the object surface holds great potential, as it provides dense 2D-3D geometric constraints for the underlying PnP problem. However, high-quality ground-truth supervision is unavailable in driving scenes because of the sparsity and artifacts of Lidar data and the practical infeasibility of collecting per-instance CAD models.
Method: We propose NeurOCS, a framework that uses instance masks and 3D boxes as input to learn 3D object shapes via differentiable rendering, which in turn serves as supervision for learning dense object coordinates.
Results: Our framework achieves a new state of the art in monocular 3D object localization for driving scenes, ranking first on the KITTI-Object benchmark.

Monocular 3D object localization in driving scenes is a crucial task, but challenging due to its ill-posed nature. Estimating 3D coordinates for each pixel on the object surface holds great potential as it provides dense 2D-3D geometric constraints for the underlying PnP problem. However, high-quality ground truth supervision is not available in driving scenes due to sparsity and various artifacts of Lidar data, as well as the practical infeasibility of collecting per-instance CAD models. In this work, we present NeurOCS, a framework that uses instance masks and 3D boxes as input to learn 3D object shapes by means of differentiable rendering, which further serves as supervision for learning dense object coordinates. Our approach rests on insights in learning a category-level shape prior directly from real driving scenes, while properly handling single-view ambiguities. Furthermore, we study and make critical design choices to learn object coordinates more effectively from an object-centric view. Altogether, our framework leads to new state-of-the-art in monocular 3D localization that ranks 1st on the KITTI-Object benchmark among published monocular methods.

TMO: Textured Mesh Acquisition of Objects With a Mobile Device by Using Differentiable Rendering
Choi, Jaehoon and Jung, Dongki and Lee, Taejae and Kim, Sangwook and Jung, Youngdong and Manocha, Dinesh and Lee, Donghwan



Research question: This paper presents a new method for acquiring textured 3D models in the wild with a smartphone.
Motivation: Existing 3D reconstruction and texture-mapping methods require lab environments or accurate mask images, whereas the proposed method applies to any common real-world object without these requirements.
Method: First, an RGBD-aided structure from motion yields filtered depth maps and refines camera poses from the captured images, depth maps, and valid poses. Then a neural implicit surface reconstruction method produces a high-quality 3D mesh, with a new training process that applies regularization provided by classical multi-view stereo methods. Finally, differentiable rendering fine-tunes the incomplete texture maps, generating textures perceptually closer to the original scene.
Results: Experiments show the method can capture objects with complex shapes and numerically outperforms existing 3D reconstruction and texture-mapping methods.

We present a new pipeline for acquiring a textured mesh in the wild with a single smartphone which offers access to images, depth maps, and valid poses. Our method first introduces an RGBD-aided structure from motion, which can yield filtered depth maps and refines camera poses guided by corresponding depth. Then, we adopt the neural implicit surface reconstruction method, which allows for high quality mesh and develops a new training process for applying a regularization provided by classical multi-view stereo methods. Moreover, we apply a differentiable rendering to fine-tune incomplete texture maps and generate textures which are perceptually closer to the original scene. Our pipeline can be applied to any common objects in the real world without the need for either in-the-lab environments or accurate mask images. We demonstrate results of captured objects with complex shapes and validate our method numerically against existing 3D reconstruction and texture mapping methods.

Tensor4D: Efficient Neural 4D Decomposition for High-Fidelity Dynamic Reconstruction and Rendering
Shao, Ruizhi and Zheng, Zerong and Tu, Hanzhang and Liu, Boning and Zhang, Hongwen and Liu, Yebin



Research question: How to model dynamic scenes effectively.
Motivation: Existing dynamic scene modeling methods suffer from high memory consumption.
Method: The proposed Tensor4D represents a dynamic scene directly as a 4D spatio-temporal tensor via an efficient 4D tensor decomposition, and decomposes it hierarchically so that spatial information over time is captured in a compact and memory-efficient manner.
Effect: The method is validated on both synthetic and real-world scenes; experiments show it achieves high-quality dynamic reconstruction and rendering from sparse-view camera rigs or even a monocular camera.

We present Tensor4D, an efficient yet effective approach to dynamic scene modeling. The key of our solution is an efficient 4D tensor decomposition method so that the dynamic scene can be directly represented as a 4D spatio-temporal tensor. To tackle the accompanying memory issue, we decompose the 4D tensor hierarchically by projecting it first into three time-aware volumes and then nine compact feature planes. In this way, spatial information over time can be simultaneously captured in a compact and memory-efficient manner. When applying Tensor4D for dynamic scene reconstruction and rendering, we further factorize the 4D fields to different scales in the sense that structural motions and dynamic detailed changes can be learned from coarse to fine. The effectiveness of our method is validated on both synthetic and real-world scenes. Extensive experiments show that our method is able to achieve high-quality dynamic reconstruction and rendering from sparse-view camera rigs or even a monocular camera.
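The memory saving from plane-based factorization can be illustrated with a simplified sketch. The toy `Planar4D` class below (all names hypothetical, and deliberately simpler than the paper's three-volume/nine-plane hierarchy) approximates a 4D field as the sum of features bilinearly sampled from six 2D planes, so storage grows as O(R^2) instead of O(R^4) for a dense grid.

```python
import numpy as np

def bilerp(plane, u, v):
    """Bilinearly sample a (R, R, C) feature plane at (u, v) in [0, 1)."""
    R = plane.shape[0]
    x, y = u * (R - 1), v * (R - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, R - 1), min(y0 + 1, R - 1)
    fx, fy = x - x0, y - y0
    return ((1 - fx) * (1 - fy) * plane[x0, y0] + fx * (1 - fy) * plane[x1, y0]
            + (1 - fx) * fy * plane[x0, y1] + fx * fy * plane[x1, y1])

class Planar4D:
    """Approximate f(x, y, z, t) as a sum of features from six 2D planes:
    three spatial (xy, xz, yz) and three spatio-temporal (xt, yt, zt)."""
    def __init__(self, res=32, channels=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = {k: rng.normal(0.0, 0.1, (res, res, channels))
                       for k in ("xy", "xz", "yz", "xt", "yt", "zt")}

    def query(self, x, y, z, t):
        coords = {"xy": (x, y), "xz": (x, z), "yz": (y, z),
                  "xt": (x, t), "yt": (y, t), "zt": (z, t)}
        return sum(bilerp(self.planes[k], u, v) for k, (u, v) in coords.items())
```

With `res=32` and `channels=8`, the six planes hold 6 * 32 * 32 * 8 values, versus 32^4 * 8 for a dense 4D grid, which is the kind of compression the hierarchical decomposition exploits.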

Blowing in the Wind: CycleNet for Human Cinemagraphs From Still Images
Bertiche, Hugo and Mitra, Niloy J. and Kulkarni, Kuldeep and Huang, Chun-Hao P. and Wang, Tuanfeng Y. and Madadi, Meysam and Escalera, Sergio and Ceylan, Duygu



Research question: This paper aims to develop an automatic method for generating human cinemagraphs from a single RGB image.
Motivation: Current cinemagraph creation requires tedious manual authoring by artists, and automatic generation remains underexplored.
Method: By working in the image normal space, garment motion dynamics are learned on synthetic data and then generalized to real data.
Effect: Experiments demonstrate that the method produces compelling and plausible cinemagraphs on both synthetic and real data.

Cinemagraphs are short looping videos created by adding subtle motions to a static image. This kind of media is popular and engaging. However, automatic generation of cinemagraphs is an underexplored area and current solutions require tedious low-level manual authoring by artists. In this paper, we present an automatic method that allows generating human cinemagraphs from single RGB images. We investigate the problem in the context of dressed humans under the wind. At the core of our method is a novel cyclic neural network that produces looping cinemagraphs for the target loop duration. To circumvent the problem of collecting real data, we demonstrate that it is possible, by working in the image normal space, to learn garment motion dynamics on synthetic data and generalize to real data. We evaluate our method on both synthetic and real data and demonstrate that it is possible to create compelling and plausible cinemagraphs from single RGB images.

Panoptic Compositional Feature Field for Editable Scene Rendering With Network-Inferred Labels via Metric Learning
Cheng, Xinhua and Wu, Yanmin and Jia, Mengxi and Wang, Qian and Zhang, Jian



Research question: Although neural implicit representations demonstrate impressive high-quality view synthesis, decomposing such representations into instance-level objects for editing remains challenging.
Motivation: Existing works learn object-compositional representations supervised by ground-truth instance annotations and produce promising scene editing results; however, ground-truth annotations are manually labeled and expensive in practice, which limits their usage in real-world scenes.
Method: The paper learns an object-compositional neural implicit representation for editable scene rendering by leveraging labels inferred by off-the-shelf 2D panoptic segmentation networks instead of ground-truth annotations. The proposed framework, Panoptic Compositional Feature Field (PCFF), introduces an instance quadruplet metric learning to build a discriminating panoptic feature space for reliable scene editing. In addition, semantic-related strategies further exploit the correlations between semantic and appearance attributes to achieve better rendering results.
Effect: Experiments on multiple scene datasets, including ScanNet, Replica, and ToyDesk, show superior performance for novel view synthesis and convincing real-world scene editing results.

Despite neural implicit representations demonstrating impressive high-quality view synthesis capacity, decomposing such representations into objects for instance-level editing is still challenging. Recent works learn object-compositional representations supervised by ground truth instance annotations and produce promising scene editing results. However, ground truth annotations are manually labeled and expensive in practice, which limits their usage in real-world scenes. In this work, we attempt to learn an object-compositional neural implicit representation for editable scene rendering by leveraging labels inferred from the off-the-shelf 2D panoptic segmentation networks instead of the ground truth annotations. We propose a novel framework named Panoptic Compositional Feature Field (PCFF), which introduces an instance quadruplet metric learning to build a discriminating panoptic feature space for reliable scene editing. In addition, we propose semantic-related strategies to further exploit the correlations between semantic and appearance attributes for achieving better rendering results. Experiments on multiple scene datasets including ScanNet, Replica, and ToyDesk demonstrate that our proposed method achieves superior performance for novel view synthesis and produces convincing real-world scene editing results. The code will be available.
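The abstract does not give the exact form of the instance quadruplet objective, so as an illustration only, here is the standard quadruplet metric-learning loss: an anchor is pulled toward a same-instance positive while two negatives from different instances are pushed apart, with hypothetical margins `m1` and `m2`. PCFF's actual formulation over panoptic features may differ.

```python
import numpy as np

def quadruplet_loss(anchor, positive, neg1, neg2, m1=0.5, m2=0.25):
    """Standard quadruplet metric-learning loss on feature vectors:
    term 1 enforces d(a, p) + m1 < d(a, n1); term 2 additionally pushes
    the two negatives apart via d(a, p) + m2 < d(n1, n2)."""
    d = lambda a, b: float(np.sum((a - b) ** 2))  # squared Euclidean distance
    ap, an = d(anchor, positive), d(anchor, neg1)
    nn = d(neg1, neg2)
    return max(0.0, ap - an + m1) + max(0.0, ap - nn + m2)
```

When the anchor and positive coincide and both negatives are far away, the hinge terms vanish and the loss is zero, which is the configuration such a feature space is trained toward.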

Neural Kaleidoscopic Space Sculpting
Ahn, Byeongjoo and De Zeeuw, Michael and Gkioulekas, Ioannis and Sankaranarayanan, Aswin C.



Research question: How to recover a full-surround 3D reconstruction from a single kaleidoscopic image.
Motivation: Full-surround 3D reconstruction is critical for many applications, such as augmented and virtual reality; a kaleidoscope, which uses a single camera and multiple mirrors, conveniently achieves full-surround coverage, making it ideal for single-shot and dynamic full-surround 3D reconstruction.
Method: By carefully exploiting the silhouette, background, foreground, and texture information in the kaleidoscopic image, a neural surface representation is "sculpted", avoiding the need to explicitly estimate labels.
Effect: The method shows clear advantages in a range of simulated and real experiments, on both static and dynamic scenes.

We introduce a method that recovers full-surround 3D reconstructions from a single kaleidoscopic image using a neural surface representation. Full-surround 3D reconstruction is critical for many applications, such as augmented and virtual reality. A kaleidoscope, which uses a single camera and multiple mirrors, is a convenient way of achieving full-surround coverage, as it redistributes light directions and thus captures multiple viewpoints in a single image. This enables single-shot and dynamic full-surround 3D reconstruction. However, using a kaleidoscopic image for multi-view stereo is challenging, as we need to decompose the image into multi-view images by identifying which pixel corresponds to which virtual camera, a process we call labeling. To address this challenge, our approach avoids the need to explicitly estimate labels, but instead "sculpts" a neural surface representation through the careful use of silhouette, background, foreground, and texture information present in the kaleidoscopic image. We demonstrate the advantages of our method in a range of simulated and real experiments, on both static and dynamic scenes.
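The virtual cameras that a kaleidoscope creates arise from mirror reflections. As a minimal geometric sketch (not code from the paper), reflecting points across a mirror plane {x : n . x = d} with a Householder formula shows how one physical viewpoint yields additional virtual ones; composing reflections over several mirrors yields the full set of virtual views.

```python
import numpy as np

def reflect(points, n, d):
    """Householder reflection of points (N, 3) across the plane n . x = d.
    Each mirror maps the real camera (and scene points) to a virtual copy."""
    n = n / np.linalg.norm(n)
    return points - 2.0 * ((points @ n) - d)[:, None] * n
```

Reflecting twice across the same mirror recovers the original points, which is why compositions of an even number of identical reflections are identities and mirror chains generate distinct virtual viewpoints.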

Unsupervised Intrinsic Image Decomposition With LiDAR Intensity
Sato, Shogo and Yao, Yasuhiro and Yoshida, Taiga and Kaneko, Takuhiro and Ando, Shingo and Shimamura, Jun



Research question: This paper addresses intrinsic image decomposition (IID), the task of decomposing a natural image into albedo and shading; since ground-truth albedo and shading are difficult to observe in general scenes, the task is typically tackled with supervised learning, which is not ideal.
Motivation: Unsupervised learning methods currently underperform supervised ones because they lack a criterion for resolving the ill-posed problem. Recently, light detection and ranging (LiDAR) has become widely used thanks to its highly precise distance measurements, so the authors focus on exploiting LiDAR, and LiDAR intensity in particular, to address this issue.
Method: The paper proposes unsupervised intrinsic image decomposition with LiDAR intensity (IID-LI). Since conventional unsupervised methods consist of image-to-image transformations, simply feeding in LiDAR intensity is not effective; instead, an intensity consistency loss computes the error between LiDAR intensity and gray-scaled albedo, providing a criterion for the ill-posed problem. In addition, because LiDAR intensity is sparse and subject to occlusion, a LiDAR intensity densification module is proposed.
Effect: Estimation quality is verified on the authors' own dataset of RGB images, LiDAR intensity, and human-judged annotations; the method outperforms conventional unsupervised learning methods in estimation accuracy.

Intrinsic image decomposition (IID) is the task that decomposes a natural image into albedo and shade. While IID is typically solved through supervised learning methods, it is not ideal due to the difficulty in observing ground truth albedo and shade in general scenes. Conversely, unsupervised learning methods are currently underperforming supervised learning methods since there are no criteria for solving the ill-posed problems. Recently, light detection and ranging (LiDAR) is widely used due to its ability to make highly precise distance measurements. Thus, we have focused on the utilization of LiDAR, especially LiDAR intensity, to address this issue. In this paper, we propose unsupervised intrinsic image decomposition with LiDAR intensity (IID-LI). Since the conventional unsupervised learning methods consist of image-to-image transformations, simply inputting LiDAR intensity is not an effective approach. Therefore, we design an intensity consistency loss that computes the error between LiDAR intensity and gray-scaled albedo to provide a criterion for the ill-posed problem. In addition, LiDAR intensity is difficult to handle due to its sparsity and occlusion, hence, a LiDAR intensity densification module is proposed. We verified the estimation quality using our own dataset, which includes RGB images, LiDAR intensity, and human-judged annotations. As a result, we achieved an estimation accuracy that outperforms conventional unsupervised learning methods.
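A minimal sketch of an intensity consistency loss as described: the gray-scaled albedo is compared to the LiDAR intensity, but only at pixels with valid LiDAR returns (the sparsity the densification module addresses). Function and mask names are hypothetical, and the actual IID-LI loss may differ in its error norm and normalization.

```python
import numpy as np

def intensity_consistency_loss(albedo_rgb, lidar_intensity, valid_mask):
    """Mean L1 error between gray-scaled albedo and LiDAR intensity,
    evaluated only where LiDAR returns exist.
    albedo_rgb: (N, 3); lidar_intensity, valid_mask: (N,)."""
    gray = albedo_rgb @ np.array([0.299, 0.587, 0.114])  # ITU-R BT.601 weights
    diff = np.abs(gray - lidar_intensity)
    return diff[valid_mask].mean() if valid_mask.any() else 0.0
```

Because LiDAR intensity correlates with surface reflectance rather than illumination, a loss of this shape penalizes shading that leaks into the predicted albedo, which is the criterion the abstract describes for the ill-posed decomposition.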

PET-NeuS: Positional Encoding Tri-Planes for Neural Surfaces
Wang, Yiqun and Skorokhodov, Ivan and Wonka, Peter



Research question: How to improve neural surface reconstruction methods for higher reconstruction accuracy and quality.
Motivation: Current neural surface reconstruction methods such as NeuS, despite their success, still suffer from noise interference and limited expressiveness.
Method: Building on a signed distance function (SDF) parametrized by an MLP, three new components are introduced: first, the tri-plane representation borrowed from EG3D, representing signed distance fields as a mixture of tri-planes and MLPs; second, a novel positional encoding with learnable weights to combat noise during reconstruction; third, learnable convolution operations on the tri-plane features using self-attention convolution to produce features of different frequency bands.
Effect: Experiments show the new method achieves high-fidelity surface reconstruction on standard datasets, improving over the NeuS baseline by 57% on the NeRF-synthetic dataset and by 15.5% on DTU; qualitative evaluation shows the method better controls the interference of high-frequency noise.

A signed distance function (SDF) parametrized by an MLP is a common ingredient of neural surface reconstruction. We build on the successful recent method NeuS to extend it by three new components. The first component is to borrow the tri-plane representation from EG3D and represent signed distance fields as a mixture of tri-planes and MLPs instead of representing it with MLPs only. Using tri-planes leads to a more expressive data structure but will also introduce noise in the reconstructed surface. The second component is to use a new type of positional encoding with learnable weights to combat noise in the reconstruction process. We divide the features in the tri-plane into multiple frequency scales and modulate them with sin and cos functions of different frequencies. The third component is to use learnable convolution operations on the tri-plane features using self-attention convolution to produce features with different frequency bands. The experiments show that PET-NeuS achieves high-fidelity surface reconstruction on standard datasets. Following previous work and using the Chamfer metric as the most important way to measure surface reconstruction quality, we are able to improve upon the NeuS baseline by 57% on Nerf-synthetic (0.84 compared to 1.97) and by 15.5% on DTU (0.71 compared to 0.84). The qualitative evaluation reveals how our method can better control the interference of high-frequency noise.
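The frequency-modulated tri-plane features might be sketched as follows, under the assumption (ours, not stated exactly in the abstract) that channels are split evenly into frequency bands and each band is scaled by a learnable weight and by sin/cos of an increasing frequency of the query coordinate. PET-NeuS's exact formulation may differ.

```python
import numpy as np

def modulated_features(feat, coord, n_bands=4, weights=None):
    """Split a C-channel feature into n_bands groups and modulate group i
    with sin/cos at frequency 2^i * pi of the (normalized) coordinate.
    feat: (C,) with C divisible by n_bands; weights: learnable per-band scales."""
    groups = np.split(feat, n_bands, axis=-1)
    if weights is None:
        weights = np.ones(n_bands)
    out = []
    for i, g in enumerate(groups):
        freq = (2.0 ** i) * np.pi
        half = g.shape[-1] // 2
        out.append(weights[i] * np.concatenate(
            [g[..., :half] * np.sin(freq * coord),
             g[..., half:] * np.cos(freq * coord)], axis=-1))
    return np.concatenate(out, axis=-1)
```

Making `weights` learnable lets optimization damp the high-frequency bands that would otherwise inject noise into the reconstructed surface, which matches the stated purpose of the encoding.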

Nerflets: Local Radiance Fields for Efficient Structure-Aware 3D Scene Representation From 2D Supervision
Zhang, Xiaoshuai and Kundu, Abhijit and Funkhouser, Thomas and Guibas, Leonidas and Su, Hao and Genova, Kyle



Research question: This paper addresses efficient and structure-aware 3D scene representation from images.
Motivation: Traditional global NeRFs represent scenes inefficiently and need improvement.
Method: Nerflets are proposed: a set of local neural radiance fields that together represent a scene, each maintaining its own spatial position, orientation, and extent. Using only photometric and inferred panoptic image supervision, the parameters of a set of nerflets can be directly and jointly optimized to form a decomposed representation of the scene, in which each object instance is represented by a group of nerflets.
Effect: Experiments show that nerflets fit and approximate scenes more efficiently than traditional global NeRFs, allow the extraction of panoptic and photometric renderings from arbitrary views, and enable tasks rare for NeRFs, such as 3D panoptic segmentation and interactive editing.

We address efficient and structure-aware 3D scene representation from images. Nerflets are our key contribution-- a set of local neural radiance fields that together represent a scene. Each nerflet maintains its own spatial position, orientation, and extent, within which it contributes to panoptic, density, and radiance reconstructions. By leveraging only photometric and inferred panoptic image supervision, we can directly and jointly optimize the parameters of a set of nerflets so as to form a decomposed representation of the scene, where each object instance is represented by a group of nerflets. During experiments with indoor and outdoor environments, we find that nerflets: (1) fit and approximate the scene more efficiently than traditional global NeRFs, (2) allow the extraction of panoptic and photometric renderings from arbitrary views, and (3) enable tasks rare for NeRFs, such as 3D panoptic segmentation and interactive editing.

Multi-View Azimuth Stereo via Tangent Space Consistency
Cao, Xu and Santo, Hiroaki and Okura, Fumio and Matsushita, Yasuyuki



Research question: A method for 3D reconstruction using only calibrated multi-view surface azimuth maps.
Motivation: The method targets textureless or specular surfaces, which are difficult for conventional multi-view stereo methods.
Method: The concept of tangent space consistency is introduced, and shape is recovered by optimizing a neural implicit surface representation.
Effect: Experiments show the method recovers accurate shape even without zenith angles.

We present a method for 3D reconstruction only using calibrated multi-view surface azimuth maps. Our method, multi-view azimuth stereo, is effective for textureless or specular surfaces, which are difficult for conventional multi-view stereo methods. We introduce the concept of tangent space consistency: Multi-view azimuth observations of a surface point should be lifted to the same tangent space. Leveraging this consistency, we recover the shape by optimizing a neural implicit surface representation. Our method harnesses the robust azimuth estimation capabilities of photometric stereo methods or polarization imaging while bypassing potentially complex zenith angle estimation. Experiments using azimuth maps from various sources validate the accurate shape recovery with our method, even without zenith angles.
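Tangent space consistency can be made concrete with a small sketch: an image-plane azimuth of the projected surface normal defines an in-plane direction perpendicular to that projection, and lifting this direction to world space must yield a vector in the surface's tangent plane, i.e. orthogonal to the shared normal. The helper names below are hypothetical; the paper optimizes a neural implicit surface against this kind of residual rather than a closed-form check.

```python
import numpy as np

def azimuth_tangent(R, phi):
    """Lift an image-plane azimuth phi (angle of the projected normal) to a
    world-space tangent direction: the in-plane vector perpendicular to the
    projected normal, rotated by the camera-to-world rotation R^T."""
    t_cam = np.array([-np.sin(phi), np.cos(phi), 0.0])
    return R.T @ t_cam

def tangent_consistency(normal, Rs, phis):
    """Multi-view residual: every lifted azimuth tangent should be
    orthogonal to the shared surface normal (zero for a consistent shape)."""
    n = normal / np.linalg.norm(normal)
    return sum(float(n @ azimuth_tangent(R, phi)) ** 2
               for R, phi in zip(Rs, phis))
```

For a correct normal the residual vanishes in every view regardless of the zenith angle, which is why azimuth observations alone can constrain the surface.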

Self-Supervised Representation Learning for CAD
Jones, Benjamin T. and Hu, Michael and Kodnongbua, Milin and Kim, Vladimir G. and Schulz, Adriana



Research question: How to leverage unlabeled computer-aided design (CAD) geometry for supervised learning tasks.
Motivation: CAD's native format, the parametric boundary representation (B-Rep), lacks labeled data, which is at present a difficult obstacle for research to overcome.
Method: A novel hybrid implicit/explicit surface representation is learned for pre-training on B-Rep geometry; this pre-training is shown to significantly improve few-shot learning performance and to achieve state-of-the-art performance on several current B-Rep benchmarks.
Effect: Experiments show the approach effectively exploits unlabeled CAD geometry for supervised learning tasks and improves learning performance.

Virtually every object in the modern world was created, modified, analyzed and optimized using computer aided design (CAD) tools. An active CAD research area is the use of data-driven machine learning methods to learn from the massive repositories of geometric and program representations. However, the lack of labeled data in CAD's native format, i.e., the parametric boundary representation (B-Rep), poses an obstacle that is at present difficult to overcome. Several datasets of mechanical parts in B-Rep format have recently been released for machine learning research. However, large-scale databases are mostly unlabeled, and labeled datasets are small. Additionally, task-specific label sets are rare and costly to annotate. This work proposes to leverage unlabeled CAD geometry on supervised learning tasks. We learn a novel, hybrid implicit/explicit surface representation for B-Rep geometry. Further, we show that this pre-training both significantly improves few-shot learning performance and achieves state-of-the-art performance on several current B-Rep benchmarks.

BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects
Wen, Bowen and Tremblay, Jonathan and Blukis, Valts and Tyree, Stephen and Müller, Thomas and Evans, Alex and Fox, Dieter and Kautz, Jan and Birchfield, Stan



Research question: How to track the 6-DoF pose of an unknown object in near real time (10 Hz) from a monocular RGBD video sequence while simultaneously reconstructing it.
Motivation: Current methods require additional information or assumptions about the interaction agent; the goal is real-time tracking and reconstruction of arbitrary rigid objects without either.
Method: A Neural Object Field is learned concurrently with a pose graph optimization process, robustly accumulating information into a consistent 3D representation that captures both geometry and appearance.
Effect: Results on the HO3D, YCBInEOAT, and BEHAVE datasets demonstrate that the method significantly outperforms existing approaches.

We present a near real-time (10Hz) method for 6-DoF tracking of an unknown object from a monocular RGBD video sequence, while simultaneously performing neural 3D reconstruction of the object. Our method works for arbitrary rigid objects, even when visual texture is largely absent. The object is assumed to be segmented in the first frame only. No additional information is required, and no assumption is made about the interaction agent. Key to our method is a Neural Object Field that is learned concurrently with a pose graph optimization process in order to robustly accumulate information into a consistent 3D representation capturing both geometry and appearance. A dynamic pool of posed memory frames is automatically maintained to facilitate communication between these threads. Our approach handles challenging sequences with large pose changes, partial and full occlusion, untextured surfaces, and specular highlights. We show results on HO3D, YCBInEOAT, and BEHAVE datasets, demonstrating that our method significantly outperforms existing approaches. Project page: https://bundlesdf.github.io/

Humans As Light Bulbs: 3D Human Reconstruction From Thermal Reflection
Liu, Ruoshi and Vondrick, Carl



Research question: How to locate a person and reconstruct their pose from their thermal reflections, even when they are not directly visible.
Motivation: The relatively hot human body emits long-wave infrared light; since this light has a longer wavelength than visible light, the person effectively acts as an infrared light source that can be exploited for localization and pose reconstruction.
Method: An analysis-by-synthesis framework is proposed that jointly models the objects, people, and their thermal reflections, combining generative models with differentiable rendering of reflections.
Effect: Quantitative and qualitative experiments show the approach works in highly challenging cases, such as curved mirrors or when the person is completely unseen by a normal camera.

The relatively hot temperature of the human body causes people to turn into long-wave infrared light sources. Since this emitted light has a larger wavelength than visible light, many surfaces in typical scenes act as infrared mirrors with strong specular reflections. We exploit the thermal reflections of a person onto objects in order to locate their position and reconstruct their pose, even if they are not visible to a normal camera. We propose an analysis-by-synthesis framework that jointly models the objects, people, and their thermal reflections, which allows us to combine generative models with differentiable rendering of reflections. Quantitative and qualitative experiments show our approach works in highly challenging cases, such as with curved mirrors or when the person is completely unseen by a normal camera.

Hi4D: 4D Instance Segmentation of Close Human Interaction
Yin, Yifei and Guo, Chen and Kaufmann, Manuel and Zarate, Juan Jose and Song, Jie and Hilliges, Otmar



Research question: How to effectively analyze human-human interaction under prolonged contact.
Motivation: Due to occlusions and complex shapes, existing multi-view systems typically fuse the 3D surfaces of close subjects into a single connected mesh, making it challenging to disentangle multiple in-contact subjects.
Method: The proposed Hi4D leverages i) individually fitted neural implicit avatars; ii) an alternating optimization scheme that refines pose and surface through periods of close proximity; and iii) thereby segments the fused raw scans into individual instances.
Effect: From these instances the Hi4D dataset is compiled, containing 20 subject pairs, 100 sequences, and in total more than 11K frames of 4D textured scans. Hi4D provides rich interaction-centric 2D and 3D annotations alongside accurately registered parametric body models. Varied human pose and shape estimation tasks are defined on the dataset, with results from state-of-the-art methods on these benchmarks.

We propose Hi4D, a method and dataset for the automatic analysis of physically close human-human interaction under prolonged contact. Robustly disentangling several in-contact subjects is a challenging task due to occlusions and complex shapes. Hence, existing multi-view systems typically fuse 3D surfaces of close subjects into a single, connected mesh. To address this issue we leverage i) individually fitted neural implicit avatars; ii) an alternating optimization scheme that refines pose and surface through periods of close proximity; and iii) thus segment the fused raw scans into individual instances. From these instances we compile the Hi4D dataset of 4D textured scans of 20 subject pairs, 100 sequences, and a total of more than 11K frames. Hi4D contains rich interaction-centric annotations in 2D and 3D alongside accurately registered parametric body models. We define varied human pose and shape estimation tasks on this dataset and provide results from state-of-the-art methods on these benchmarks.

Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery From Sparse Image Ensemble
Yao, Chun-Han and Hung, Wei-Chih and Li, Yuanzhen and Rubinstein, Michael and Yang, Ming-Hsuan and Jampani, Varun



Research question: How to automatically estimate 3D skeleton, shape, camera viewpoints, and part articulation from only a sparse in-the-wild image ensemble.
Motivation: Most prior methods rely on large-scale image datasets, dense temporal correspondence, or human annotations such as camera poses, 2D keypoints, and shape templates, all of which demand substantial user input.
Method: Hi-LASSIE performs 3D articulated reconstruction from only 20-30 online in-the-wild images without any user-defined shape or skeleton templates. A class-specific skeleton is first estimated automatically from a selected reference image, and shape reconstruction is then improved with novel instance-specific optimization strategies that let reconstructions faithfully fit each instance while preserving the class-specific priors learned across all images.
Effect: Experiments show that Hi-LASSIE obtains higher-fidelity, state-of-the-art 3D reconstructions despite requiring minimal user input.

Automatically estimating 3D skeleton, shape, camera viewpoints, and part articulation from sparse in-the-wild image ensembles is a severely under-constrained and challenging problem. Most prior methods rely on large-scale image datasets, dense temporal correspondence, or human annotations like camera pose, 2D keypoints, and shape templates. We propose Hi-LASSIE, which performs 3D articulated reconstruction from only 20-30 online images in the wild without any user-defined shape or skeleton templates. We follow the recent work of LASSIE that tackles a similar problem setting and make two significant advances. First, instead of relying on a manually annotated 3D skeleton, we automatically estimate a class-specific skeleton from the selected reference image. Second, we improve the shape reconstructions with novel instance-specific optimization strategies that allow reconstructions to faithfully fit each instance while preserving the class-specific priors learned across all images. Experiments on in-the-wild image ensembles show that Hi-LASSIE obtains higher fidelity state-of-the-art 3D reconstructions despite requiring minimum user input. Project page: chhankyao.github.io/hi-lassie/

PointAvatar: Deformable Point-Based Head Avatars From Videos
Zheng, Yufeng and Yifan, Wang and Wetzstein, Gordon and Black, Michael J. and Hilliges, Otmar



Research question: How to create realistic, animatable, and relightable head avatars from casual video sequences, with wide-ranging applications in communication and entertainment.
Motivation: Current methods build either on explicit 3D morphable meshes (3DMM) or on neural implicit representations, and each suffers from its own limitations.
Method: PointAvatar is proposed, a deformable point-based representation that disentangles the source color into intrinsic albedo and normal-dependent shading.
Effect: PointAvatar bridges the gap between existing mesh and implicit representations, combining high-quality geometry and appearance with topological flexibility, ease of deformation, and efficient rendering. It is the first method able to generate animatable 3D avatars from monocular videos from multiple sources (including hand-held smartphones, laptop webcams, and internet videos), achieving state-of-the-art quality in challenging cases such as thin hair strands where previous methods fail, while being significantly more efficient to train than competing methods.

The ability to create realistic animatable and relightable head avatars from casual video sequences would open up wide ranging applications in communication and entertainment. Current methods either build on explicit 3D morphable meshes (3DMM) or exploit neural implicit representations. The former are limited by fixed topology, while the latter are non-trivial to deform and inefficient to render. Furthermore, existing approaches entangle lighting and albedo, limiting the ability to re-render the avatar in new environments. In contrast, we propose PointAvatar, a deformable point-based representation that disentangles the source color into intrinsic albedo and normal-dependent shading. We demonstrate that PointAvatar bridges the gap between existing mesh- and implicit representations, combining high-quality geometry and appearance with topological flexibility, ease of deformation and rendering efficiency. We show that our method is able to generate animatable 3D avatars using monocular videos from multiple sources including hand-held smartphones, laptop webcams and internet videos, achieving state-of-the-art quality in challenging cases where previous methods fail, e.g., thin hair strands, while being significantly more efficient in training than competing methods.
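The albedo/shading disentanglement can be illustrated with a Lambertian stand-in for the learned normal-dependent shading (the paper's shading term is learned and need not be Lambertian; function names below are hypothetical): each point's rendered color is its intrinsic albedo scaled by a term that depends only on the surface normal.

```python
import numpy as np

def shade_points(albedo, normals, light_dir):
    """Recompose per-point color as intrinsic albedo times a
    normal-dependent shading term (Lambertian stand-in).
    albedo: (N, 3); normals: (N, 3) unit vectors; light_dir: (3,)."""
    l = light_dir / np.linalg.norm(light_dir)
    shading = np.clip(normals @ l, 0.0, 1.0)[:, None]  # (N, 1)
    return albedo * shading
```

Because only the shading factor depends on lighting, swapping `light_dir` (or the shading model) relights the avatar while the albedo stays fixed, which is the point of the disentanglement.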

Seeing Through the Glass: Neural 3D Reconstruction of Object Inside a Transparent Container
Tong, Jinguang and Muthu, Sundaram and Maken, Fahira Afzal and Nguyen, Chuong and Li, Hongdong



Research question: This paper addresses 3D geometry reconstruction of an object confined in a transparent enclosure.
Motivation: Existing methods cannot accurately reconstruct the 3D geometry of objects inside transparent enclosures, because light undergoes multiple reflections and refractions at the interfaces between different propagation media such as air and glass, causing severe image distortions.
Method: A new approach is proposed that explicitly models the scene as two distinct sub-spaces, inside and outside the transparent enclosure. An existing neural reconstruction method (NeuS) implicitly represents the geometry and appearance of the inner subspace, and a hybrid rendering strategy combining volume rendering with ray tracing accounts for the complex light interactions. The underlying geometry and appearance are then recovered by minimizing the difference between the real and rendered images.
Effect: Evaluations on both synthetic and real data show the method outperforms state-of-the-art methods.

In this paper, we define a new problem of recovering the 3D geometry of an object confined in a transparent enclosure. We also propose a novel method for solving this challenging problem. Transparent enclosures pose challenges of multiple light reflections and refractions at the interface between different propagation media, e.g., air or glass. These multiple reflections and refractions cause serious image distortions which invalidate the single viewpoint assumption. Hence the 3D geometry of such objects cannot be reliably reconstructed using existing methods, such as traditional structure from motion or modern neural reconstruction methods. We solve this problem by explicitly modeling the scene as two distinct sub-spaces, inside and outside the transparent enclosure. We use an existing neural reconstruction method (NeuS) that implicitly represents the geometry and appearance of the inner subspace. In order to account for complex light interactions, we develop a hybrid rendering strategy that combines volume rendering with ray tracing. We then recover the underlying geometry and appearance of the model by minimizing the difference between the real and rendered images. We evaluate our method on both synthetic and real data. Experiment results show that our method outperforms the state-of-the-art (SOTA) methods. Code and data will be available at https://github.com/hirotong/ReNeuS

Neural Voting Field for Camera-Space 3D Hand Pose Estimation
Huang, Lin and Lin, Chung-Ching and Lin, Kevin and Liang, Lin and Wang, Lijuan and Yuan, Junsong and Liu, Zicheng



Research question: Camera-space 3D hand pose estimation from a single RGB image.
Motivation: Most existing methods first obtain a relative 3D hand pose via holistic or pixel-level dense regression and then apply complex second-stage operations to recover the 3D global root or scale; this paper instead proposes a novel unified 3D dense regression scheme that estimates camera-space 3D hand pose directly via dense 3D point-wise voting in the camera frustum.
Method: Through direct dense modeling in the 3D domain, inspired by pixel-aligned implicit functions for detailed 3D reconstruction, the proposed Neural Voting Field (NVF) fully models dense local 3D evidence and global hand geometry, helping to alleviate common 2D-to-3D ambiguities. Specifically, for a 3D query point in the camera frustum and its pixel-aligned image feature, NVF, represented by a multi-layer perceptron, regresses: (i) its signed distance to the hand surface; (ii) a set of 4D offset vectors (a 1D voting weight and a 3D directional vector to each hand joint). Following a vote-casting scheme, the 4D offset vectors of near-surface points are selected, and the 3D hand joint coordinates are computed by a weighted average.
Effect: Experiments show that NVF outperforms existing state-of-the-art algorithms for camera-space 3D hand pose estimation on the FreiHAND dataset. Adapted to the classic task of root-relative 3D hand pose estimation, NVF also achieves state-of-the-art results on the HO3D dataset.

We present a unified framework for camera-space 3D hand pose estimation from a single RGB image based on 3D implicit representation. As opposed to recent works, most of which first adopt holistic or pixel-level dense regression to obtain relative 3D hand pose and then follow with complex second-stage operations for 3D global root or scale recovery, we propose a novel unified 3D dense regression scheme to estimate camera-space 3D hand pose via dense 3D point-wise voting in camera frustum. Through direct dense modeling in 3D domain inspired by Pixel-aligned Implicit Functions for 3D detailed reconstruction, our proposed Neural Voting Field (NVF) fully models 3D dense local evidence and hand global geometry, helping to alleviate common 2D-to-3D ambiguities. Specifically, for a 3D query point in camera frustum and its pixel-aligned image feature, NVF, represented by a Multi-Layer Perceptron, regresses: (i) its signed distance to the hand surface; (ii) a set of 4D offset vectors (1D voting weight and 3D directional vector to each hand joint). Following a vote-casting scheme, 4D offset vectors from near-surface points are selected to calculate the 3D hand joint coordinates by a weighted average. Experiments demonstrate that NVF outperforms existing state-of-the-art algorithms on FreiHAND dataset for camera-space 3D hand pose estimation. We also adapt NVF to the classic task of root-relative 3D hand pose estimation, for which NVF also obtains state-of-the-art results on HO3D dataset.
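The vote-casting step described above might look like the following sketch. Array shapes and the near-surface threshold `tau` are assumptions for illustration: each near-surface query point casts a vote `p + d_j` for joint `j`, and the joints are the weight-normalized averages of those votes.

```python
import numpy as np

def vote_joints(points, sdf, weights, directions, tau=0.02):
    """Weighted-average voting for J hand joints from N query points.
    points: (N, 3); sdf: (N,) signed distances; weights: (N, J) voting
    weights; directions: (N, J, 3) offset vectors to each joint."""
    near = np.abs(sdf) < tau                  # keep near-surface points only
    p, w, d = points[near], weights[near], directions[near]
    votes = p[:, None, :] + d                 # (M, J, 3) per-point joint votes
    wsum = w.sum(axis=0)[:, None]             # (J, 1) normalizers
    return (w[:, :, None] * votes).sum(axis=0) / wsum  # (J, 3) joint coords
```

Aggregating many local votes is what lets dense 3D evidence override the ambiguity any single pixel-aligned feature would have about global depth.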

Pointersect: Neural Rendering With Cloud-Ray Intersection
Chang, Jen-Hao Rick and Chen, Wei-Yu and Ranjan, Anurag and Yi, Kwang Moo and Tuzel, Oncel



Research question: How to render point clouds directly as if they were surfaces, enabling out-of-the-box surface normal estimation, room-scale point cloud rendering, inverse rendering, and ray tracing with global illumination, all without scene-specific optimization.
Motivation: Existing work focuses on converting point clouds into other representations, such as surfaces or implicit functions; this method instead directly infers the intersection of a light ray with the underlying surface represented by the given point cloud.
Method: A set transformer is trained that, given a small number of local neighbor points along a light ray, outputs the intersection point, the surface normal, and the material blending weights used to render the outcome of that ray. Localizing the problem to small neighborhoods allows training on only 48 mesh models while applying the model to unseen point clouds.
Effect: On three test sets, the method achieves higher estimation accuracy than state-of-the-art surface reconstruction and point cloud rendering methods. Applied to room-scale point clouds, without any scene-specific optimization, it achieves quality competitive with state-of-the-art novel view rendering methods; it further demonstrates rendering and manipulation of Lidar-scanned point clouds, such as lighting control and object insertion.

We propose a novel method that renders point clouds as if they are surfaces. The proposed method is differentiable and requires no scene-specific optimization. This unique capability enables, out-of-the-box, surface normal estimation, rendering room-scale point clouds, inverse rendering, and ray tracing with global illumination. Unlike existing work that focuses on converting point clouds to other representations--e.g., surfaces or implicit functions--our key idea is to directly infer the intersection of a light ray with the underlying surface represented by the given point cloud. Specifically, we train a set transformer that, given a small number of local neighbor points along a light ray, provides the intersection point, the surface normal, and the material blending weights, which are used to render the outcome of this light ray. Localizing the problem into small neighborhoods enables us to train a model with only 48 meshes and apply it to unseen point clouds. Our model achieves higher estimation accuracy than state-of-the-art surface reconstruction and point-cloud rendering methods on three test sets. When applied to room-scale point clouds, without any scene-specific optimization, the model achieves competitive quality with the state-of-the-art novel-view rendering methods. Moreover, we demonstrate the ability to render and manipulate Lidar-scanned point clouds, e.g., lighting control and object insertion.
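The "small number of local neighbor points along a light ray" can be gathered with a simple point-to-ray distance query. This preprocessing sketch is our assumption about the pipeline's input stage, not the authors' code: it selects the k cloud points with the smallest perpendicular distance to the ray, which a pointersect-style set model would then consume.

```python
import numpy as np

def ray_neighbors(points, origin, direction, k=8):
    """Return the k cloud points closest to a ray (and their distances).
    points: (N, 3); origin, direction: (3,)."""
    d = direction / np.linalg.norm(direction)
    rel = points - origin
    t = np.maximum(rel @ d, 0.0)          # projection onto ray, clamped behind origin
    perp = rel - t[:, None] * d           # perpendicular offset from the ray
    dist = np.linalg.norm(perp, axis=1)
    idx = np.argsort(dist)[:k]
    return points[idx], dist[idx]
```

Restricting the model to such a local neighborhood is what makes training on only 48 meshes plausible: the network never sees whole scenes, only small ray-centered point sets.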

MagicPony: Learning Articulated 3D Animals in the Wild
Wu, Shangzhe and Li, Ruining and Jakab, Tomas and Rupprecht, Christian and Vedaldi, Andrea



Research question: Predicting the 3D shape, articulation, viewpoint, texture, and lighting of an articulated animal (such as a horse) given a single test image.
Motivation: Existing methods require many assumptions and much data; the proposed MagicPony needs only in-the-wild single-view images for training.
Method: The approach combines the strengths of neural fields and meshes, and distills the knowledge captured by an off-the-shelf self-supervised vision transformer to help the model understand the object's shape and pose.
Effect: MagicPony outperforms prior work on this challenging task and demonstrates excellent generalization in reconstructing art, despite being trained only on real images.

We consider the problem of predicting the 3D shape, articulation, viewpoint, texture, and lighting of an articulated animal like a horse given a single test image as input. We present a new method, dubbed MagicPony, that learns this predictor purely from in-the-wild single-view images of the object category, with minimal assumptions about the topology of deformation. At its core is an implicit-explicit representation of articulated shape and appearance, combining the strengths of neural fields and meshes. In order to help the model understand an object's shape and pose, we distil the knowledge captured by an off-the-shelf self-supervised vision transformer and fuse it into the 3D model. To overcome local optima in viewpoint estimation, we further introduce a new viewpoint sampling scheme that comes at no additional training cost. MagicPony outperforms prior work on this challenging task and demonstrates excellent generalisation in reconstructing art, despite the fact that it is only trained on real images. The code can be found on the project page at https://3dmagicpony.github.io/.

Unsupervised Inference of Signed Distance Functions From Single Sparse Point Clouds Without Learning Priors
Chen, Chao and Liu, Yu-Shen and Han, Zhizhong



Research question: How to infer signed distance functions (SDFs) from 3D point clouds.
Motivation: Existing methods rely on priors learned from large-scale supervision, but these priors generalize poorly to geometric variations unseen during training, especially on extremely sparse point clouds.
Method: A neural network is proposed that infers SDFs directly from a single sparse point cloud, without signed distance supervision, learned priors, or normals, learning surface parameterization and SDF inference in an end-to-end manner. To compensate for the sparsity, the parameterized surfaces serve as a coarse surface sampler that provides many coarse surface estimates across training iterations; supervision is mined from these estimates, and a thin plate spline (TPS) based network infers smooth SDFs in a statistical way.
Effect: The method significantly improves generalization ability and accuracy on unseen point clouds; experiments show it outperforms state-of-the-art methods in surface reconstruction for sparse point clouds on both synthetic datasets and real scans.

It is vital to infer signed distance functions (SDFs) from 3D point clouds. The latest methods rely on generalizing the priors learned from large scale supervision. However, the learned priors do not generalize well to various geometric variations that are unseen during training, especially for extremely sparse point clouds. To resolve this issue, we present a neural network to directly infer SDFs from single sparse point clouds without using signed distance supervision, learned priors or even normals. Our insight here is to learn surface parameterization and SDFs inference in an end-to-end manner. To compensate for the sparsity, we leverage parameterized surfaces as a coarse surface sampler to provide many coarse surface estimations in training iterations, according to which we mine supervision and our thin plate splines (TPS) based network infers SDFs as smooth functions in a statistical way. Our method significantly improves the generalization ability and accuracy in unseen point clouds. Our experimental results show our advantages over the state-of-the-art methods in surface reconstruction for sparse point clouds under synthetic datasets and real scans. The code is available at https://github.com/chenchao15/NeuralTPS.

Learning Neural Duplex Radiance Fields for Real-Time View Synthesis
Wan, Ziyu and Richardt, Christian and Bo\v{z



Research question: How to render photorealistic images efficiently.
Motivation: Current neural radiance fields (NeRFs) produce high-quality novel view synthesis but require many multilayer perceptron evaluations for every pixel, which is prohibitively expensive and makes real-time rendering infeasible.
Method: A novel approach is proposed to distill and bake NeRFs into highly efficient mesh-based neural representations that are fully compatible with the massively parallel graphics rendering pipeline. Scenes are represented as neural radiance features encoded on a two-layer duplex mesh, which effectively overcomes the inherent inaccuracies of 3D surface reconstruction by learning aggregated radiance information from a reliable interval of ray-surface intersections.
Effect: Extensive experiments on a range of standard datasets demonstrate the effectiveness and superiority of the approach.

Neural radiance fields (NeRFs) enable novel view synthesis with unprecedented visual quality. However, to render photorealistic images, NeRFs require hundreds of deep multilayer perceptron (MLP) evaluations -- for each pixel. This is prohibitively expensive and makes real-time rendering infeasible, even on powerful modern GPUs. In this paper, we propose a novel approach to distill and bake NeRFs into highly efficient mesh-based neural representations that are fully compatible with the massively parallel graphics rendering pipeline. We represent scenes as neural radiance features encoded on a two-layer duplex mesh, which effectively overcomes the inherent inaccuracies in 3D surface reconstruction by learning the aggregated radiance information from a reliable interval of ray-surface intersections. To exploit local geometric relationships of nearby pixels, we leverage screen-space convolutions instead of the MLPs used in NeRFs to achieve high-quality appearance. Finally, the performance of the whole framework is further boosted by a novel multi-view distillation optimization strategy. We demonstrate the effectiveness and superiority of our approach via extensive experiments on a range of standard datasets.

Occlusion-Free Scene Recovery via Neural Radiance Fields
Zhu, Chengxuan and Wan, Renjie and Tang, Yunkai and Shi, Boxin



Research question: How to remove occluders effectively by jointly optimizing camera parameters and scene reconstruction, without relying on external supervision.
Motivation: Although several occlusion removal methods have been proposed, their performance remains unsatisfactory because they depend on external supervision.
Method: A novel occlusion removal method that leverages Neural Radiance Fields (NeRF) to build a direct mapping between position, viewing angle, and occlusion-free scene details, together with an effective scheme that jointly optimizes camera parameters and scene reconstruction when occlusions are present.
Results: Experiments on existing and newly collected datasets validate the effectiveness of the method.

Our everyday lives are filled with occlusions that we strive to see through. By aggregating desired background information from different viewpoints, we can easily eliminate such occlusions without any external occlusion-free supervision. Though several occlusion removal methods have been proposed to empower machine vision systems with such ability, their performances are still unsatisfactory due to reliance on external supervision. We propose a novel method for occlusion removal by directly building a mapping between position and viewing angles and the corresponding occlusion-free scene details leveraging Neural Radiance Fields (NeRF). We also develop an effective scheme to jointly optimize camera parameters and scene reconstruction when occlusions are present. An additional depth constraint is applied to supervise the entire optimization without labeled external data for training. The experimental results on existing and newly collected datasets validate the effectiveness of our method.

Learning a 3D Morphable Face Reflectance Model From Low-Cost Data
Han, Yuxuan and Wang, Zhibo and Xu, Feng



Research question: How to build a 3D morphable face reflectance model with a spatially varying BRDF from low-cost publicly available data.
Motivation: Existing works build parametric models for diffuse and specular albedo from Light Stage data, but diffuse and specular albedo alone cannot determine the full BRDF, and the requirement of Light Stage data is hard for the research community to fulfill.
Method: The first 3D morphable face reflectance model with a spatially varying BRDF trained only on low-cost publicly available data. Linear shininess weighting is introduced into the parametric model to represent spatially varying specular intensity and shininess. An inverse rendering algorithm then reconstructs reflectance parameters from non-Light Stage data to train an initial morphable reflectance model, which is further fine-tuned on an in-the-wild dataset via an update-by-reconstruction strategy to improve its generalization capability and expressive power.
Results: Experiments show the method obtains decent rendering results with plausible facial specularities. The code is released at https://yxuhan.github.io/ReflectanceMM/index.html.

Modeling non-Lambertian effects such as facial specularity leads to a more realistic 3D Morphable Face Model. Existing works build parametric models for diffuse and specular albedo using Light Stage data. However, only diffuse and specular albedo cannot determine the full BRDF. In addition, the requirement of Light Stage data is hard to fulfill for the research community. This paper proposes the first 3D morphable face reflectance model with spatially varying BRDF using only low-cost publicly-available data. We apply linear shininess weighting into parametric modeling to represent spatially varying specular intensity and shininess. Then an inverse rendering algorithm is developed to reconstruct the reflectance parameters from non-Light Stage data, which are used to train an initial morphable reflectance model. To enhance the model's generalization capability and expressive power, we further propose an update-by-reconstruction strategy to finetune it on an in-the-wild dataset. Experimental results show that our method obtains decent rendering results with plausible facial specularities. Our code is released at https://yxuhan.github.io/ReflectanceMM/index.html.

SCoDA: Domain Adaptive Shape Completion for Real Scans
Wu, Yushuang and Yan, Zizheng and Chen, Ce and Wei, Lai and Li, Xiao and Li, Guanbin and Li, Yihao and Cui, Shuguang and Han, Xiaoguang



Research question: How to complete 3D shapes from point clouds, especially scans of real-world objects.
Motivation: Because real scans lack precise 3D ground truth, existing works mainly benchmark on synthetic data, which limits the generalizability of these methods.
Method: A new task, SCoDA, for adapting shape completion from synthetic data to real scans, together with a new dataset, ScanSalon, a novel cross-domain feature fusion method, and a novel volume-consistent self-training framework.
Results: Experiments show the method effectively brings an improvement of 6%-7% mIoU.

3D shape completion from point clouds is a challenging task, especially from scans of real-world objects. Considering the paucity of 3D shape ground truths for real scans, existing works mainly focus on benchmarking this task on synthetic data, e.g. 3D computer-aided design models. However, the domain gap between synthetic and real data limits the generalizability of these methods. Thus, we propose a new task, SCoDA, for the domain adaptation of real scan shape completion from synthetic data. A new dataset, ScanSalon, is contributed with a set of elaborate 3D models created by skillful artists according to scans. To address this new task, we propose a novel cross-domain feature fusion method for knowledge transfer and a novel volume-consistent self-training framework for robust learning from real data. Extensive experiments prove our method is effective to bring an improvement of 6%-7% mIoU.

I2-SDF: Intrinsic Indoor Scene Reconstruction and Editing via Raytracing in Neural SDFs
Zhu, Jingsen and Huo, Yuchi and Ye, Qi and Luan, Fujun and Li, Jifan and Xi, Dianbing and Wang, Lisha and Tang, Rui and Hua, Wei and Bao, Hujun and Wang, Rui



Research question: A new method for reconstructing and editing indoor scenes using differentiable Monte Carlo raytracing on neural signed distance fields (SDFs).
Motivation: Existing methods fall short in reconstruction and editing quality on large-scale indoor scenes.
Method: I^2-SDF, built on a neural SDF framework, jointly recovers the underlying shape, incident radiance, and materials from multi-view images, and introduces a novel bubble loss for fine-grained small objects together with an error-guided adaptive sampling scheme to improve reconstruction quality on large-scale indoor scenes.
Results: A series of qualitative and quantitative experiments demonstrate superior performance over state-of-the-art baselines on indoor scene reconstruction, novel view synthesis, and scene editing.

In this work, we present I^2-SDF, a new method for intrinsic indoor scene reconstruction and editing using differentiable Monte Carlo raytracing on neural signed distance fields (SDFs). Our holistic neural SDF-based framework jointly recovers the underlying shapes, incident radiance and materials from multi-view images. We introduce a novel bubble loss for fine-grained small objects and an error-guided adaptive sampling scheme to largely improve the reconstruction quality on large-scale indoor scenes. Further, we propose to decompose the neural radiance field into spatially-varying material of the scene as a neural field through surface-based, differentiable Monte Carlo raytracing and emitter semantic segmentations, which enables physically based and photorealistic scene relighting and editing applications. Through a number of qualitative and quantitative experiments, we demonstrate the superior quality of our method on indoor scene reconstruction, novel view synthesis, and scene editing compared to state-of-the-art baselines. Our project page is at https://jingsenzhu.github.io/i2-sdf.

Canonical Fields: Self-Supervised Learning of Pose-Canonicalized Neural Fields
Agaram, Rohith and Dewan, Shaurya and Sajnani, Rahul and Poulenard, Adrien and Krishna, Madhava and Sridhar, Srinath



Research question: How to build neural fields for object categories in 3D computer vision without datasets such as ShapeNet that provide "canonicalized" object instances.
Motivation: Despite recent progress, constructing neural fields for object categories remains challenging without such datasets.
Method: Canonical Field Network (CaFi-Net), a self-supervised method that canonicalizes the 3D pose of instances of an object category represented as neural fields, specifically neural radiance fields (NeRFs).
Results: Extensive experiments on a new dataset of 1300 NeRF models across 13 object categories show that the method matches or exceeds the performance of 3D point cloud-based methods.

Coordinate-based implicit neural networks, or neural fields, have emerged as useful representations of shape and appearance in 3D computer vision. Despite advances however, it remains challenging to build neural fields for categories of objects without datasets like ShapeNet that provide "canonicalized" object instances that are consistently aligned for their 3D position and orientation (pose). We present Canonical Field Network (CaFi-Net), a self-supervised method to canonicalize the 3D pose of instances from an object category represented as neural fields, specifically neural radiance fields (NeRFs). CaFi-Net directly learns from continuous and noisy radiance fields using a Siamese network architecture that is designed to extract equivariant field features for category-level canonicalization. During inference, our method takes pre-trained neural radiance fields of novel object instances at arbitrary 3D pose, and estimates a canonical field with consistent 3D pose across the entire category. Extensive experiments on a new dataset of 1300 NeRF models across 13 object categories show that our method matches or exceeds the performance of 3D point cloud-based methods.

Multi-View Inverse Rendering for Large-Scale Real-World Indoor Scenes
Li, Zhen and Wang, Lingli and Cheng, Mofang and Pan, Cihui and Yang, Jiaqi



Research question: How to effectively reconstruct the global illumination and physically reasonable SVBRDFs of large-scale real-world indoor scenes.
Motivation: Existing representations simplify the global illumination of large scenes as multiple environment maps; a compact representation called Texture-based Lighting (TBL) is proposed instead.
Method: An efficient multi-view inverse rendering method that reconstructs global illumination and physically reasonable SVBRDFs of large-scale real-world indoor scenes. It combines a compact representation based on a 3D mesh and HDR textures (TBL) with a hybrid lighting representation based on precomputed irradiance.
Results: Experiments show the method outperforms the state of the art both quantitatively and qualitatively, and enables physically reasonable mixed-reality applications such as material editing, editable novel view synthesis, and relighting.

We present an efficient multi-view inverse rendering method for large-scale real-world indoor scenes that reconstructs global illumination and physically-reasonable SVBRDFs. Unlike previous representations, where the global illumination of large scenes is simplified as multiple environment maps, we propose a compact representation called Texture-based Lighting (TBL). It consists of 3D mesh and HDR textures, and efficiently models direct and infinite-bounce indirect lighting of the entire large scene. Based on TBL, we further propose a hybrid lighting representation with precomputed irradiance, which significantly improves the efficiency and alleviates the rendering noise in the material optimization. To physically disentangle the ambiguity between materials, we propose a three-stage material optimization strategy based on the priors of semantic segmentation and room segmentation. Extensive experiments show that the proposed method outperforms the state-of-the-art quantitatively and qualitatively, and enables physically-reasonable mixed-reality applications such as material editing, editable novel view synthesis and relighting. The project page is at https://lzleejean.github.io/TexIR.

Marching-Primitives: Shape Abstraction From Signed Distance Function
Liu, Weixiao and Wu, Yuwei and Ruan, Sipu and Chirikjian, Gregory S.



Research question: How to represent complex objects with basic geometric primitives for higher computational efficiency and accuracy.
Motivation: Existing methods extract polygonal meshes from a signed distance function (SDF); this paper instead proposes Marching-Primitives, a new method that obtains a primitive-based abstraction directly from an SDF.
Method: Geometric primitives (such as superquadrics) are grown iteratively by analyzing voxel connectivity while marching at different levels of signed distance, while simultaneously solving for the primitive parameters that capture the underlying local geometry.
Results: Evaluations on synthetic and real-world datasets show the method outperforms the state of the art in accuracy and generalizes directly across categories and scales.

Representing complex objects with basic geometric primitives has long been a topic in computer vision. Primitive-based representations have the merits of compactness and computational efficiency in higher-level tasks such as physics simulation, collision checking, and robotic manipulation. Unlike previous works which extract polygonal meshes from a signed distance function (SDF), in this paper, we present a novel method, named Marching-Primitives, to obtain a primitive-based abstraction directly from an SDF. Our method grows geometric primitives (such as superquadrics) iteratively by analyzing the connectivity of voxels while marching at different levels of signed distance. For each valid connected volume of interest, we march on the scope of voxels from which a primitive is able to be extracted in a probabilistic sense and simultaneously solve for the parameters of the primitive to capture the underlying local geometry. We evaluate the performance of our method on both synthetic and real-world datasets. The results show that the proposed method outperforms the state-of-the-art in terms of accuracy, and is directly generalizable among different categories and scales. The code is open-sourced at https://github.com/ChirikjianLab/Marching-Primitives.git.
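
The connectivity-analysis step of the marching procedure can be illustrated with a stdlib-only sketch that groups 6-connected voxels lying below the current signed-distance level; `connected_components` is a hypothetical helper, not the released code:

```python
from collections import deque

def connected_components(sdf, level):
    # Group 6-connected voxels whose signed distance is below `level`,
    # i.e. inside the level set currently being marched.
    nz, ny, nx = len(sdf), len(sdf[0]), len(sdf[0][0])
    seen, comps = set(), []
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if (z, y, x) in seen or sdf[z][y][x] >= level:
                    continue
                comp, q = [], deque([(z, y, x)])
                seen.add((z, y, x))
                while q:
                    cz, cy, cx = q.popleft()
                    comp.append((cz, cy, cx))
                    for dz, dy, dx in ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                                       (0, -1, 0), (0, 0, 1), (0, 0, -1)):
                        n = (cz + dz, cy + dy, cx + dx)
                        if (0 <= n[0] < nz and 0 <= n[1] < ny and 0 <= n[2] < nx
                                and n not in seen and sdf[n[0]][n[1]][n[2]] < level):
                            seen.add(n)
                            q.append(n)
                comps.append(comp)
    return comps
```

Each connected volume of interest found this way would then seed one primitive fit in the full algorithm.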

Revisiting the P3P Problem
Ding, Yaqing and Yang, Jian and Larsson, Viktor and Olsson, Carl and \r{A



Research question: How to determine the absolute pose of a calibrated camera from three 2D-to-3D correspondences.
Motivation: Solving this problem is critical to many vision systems (e.g., localization and Structure-from-Motion), yet existing solvers can still fail in certain configurations.
Method: The problem is formulated as finding the intersection of two conics; this formulation enables an analytic characterization of the real roots of the polynomial system and a tailored solution strategy for each problem instance.
Results: The method outperforms the current state of the art in both speed and success rate, is completely stable, and correctly solves cases where competing methods fail.

One of the classical multi-view geometry problems is the so called P3P problem, where the absolute pose of a calibrated camera is determined from three 2D-to-3D correspondences. Since these solvers form a critical component of many vision systems (e.g. in localization and Structure-from-Motion), there has been significant effort in developing faster and more stable algorithms. While the current state-of-the-art solvers are both extremely fast and stable, there still exist configurations where they break down. In this paper we algebraically formulate the problem as finding the intersection of two conics. With this formulation we are able to analytically characterize the real roots of the polynomial system and employ a tailored solution strategy for each problem instance. The result is a fast and completely stable solver, that is able to correctly solve cases where competing methods fail. Our experimental evaluation shows that we outperform the current state-of-the-art methods both in terms of speed and success rate.
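
The algebraic core of P3P comes from the law of cosines; the sketch below (hypothetical names, not the paper's conic-intersection solver) simply evaluates those residuals for a candidate depth triple:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def p3p_residuals(depths, bearings, pair_dists):
    # Law-of-cosines constraints underlying P3P: unknown depths s_i along
    # unit bearing vectors f_i must satisfy, for each pair (i, j),
    #   s_i^2 + s_j^2 - 2 s_i s_j (f_i . f_j) = ||X_i - X_j||^2.
    pairs = [(0, 1), (0, 2), (1, 2)]
    res = []
    for (i, j), d in zip(pairs, pair_dists):
        c = dot(bearings[i], bearings[j])
        s_i, s_j = depths[i], depths[j]
        res.append(s_i * s_i + s_j * s_j - 2.0 * s_i * s_j * c - d * d)
    return res
```

A solver searches for depth triples that zero all three residuals; the paper's contribution is how those roots are characterized and computed stably.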

Combining Implicit-Explicit View Correlation for Light Field Semantic Segmentation
Cong, Ruixuan and Yang, Da and Chen, Rongshan and Wang, Sizhe and Cui, Zhenglong and Sheng, Hao



Research question: How to effectively exploit the multi-view information in light fields for semantic segmentation.
Motivation: A light field records both the spatial and angular information of light rays and holds great potential for many applications, including semantic segmentation. However, its high dimensionality and limited memory make it difficult to fully exploit inter-view relationships while maintaining single-view contextual information.
Method: A novel network, LF-IENet, for light field semantic segmentation. It mines complementary information from surrounding views to segment the central view in two ways: implicit feature integration, which uses an attention mechanism to compute inter-view and intra-view similarity and modulate the central view's features; and explicit feature propagation, which directly warps features from other views to the central view under the guidance of disparity. The two complement each other and jointly fuse cross-view complementary information in the light field.
Results: The method achieves excellent performance on both real-world and synthetic light field datasets, demonstrating the effectiveness of the new architecture.

Since light field simultaneously records spatial information and angular information of light rays, it is considered to be beneficial for many potential applications, and semantic segmentation is one of them. The regular variation of image information across views facilitates a comprehensive scene understanding. However, in the case of limited memory, the high-dimensional property of light field makes the problem more intractable than generic semantic segmentation, manifested in the difficulty of fully exploiting the relationships among views while maintaining contextual information in single view. In this paper, we propose a novel network called LF-IENet for light field semantic segmentation. It contains two different manners to mine complementary information from surrounding views to segment central view. One is implicit feature integration that leverages attention mechanism to compute inter-view and intra-view similarity to modulate features of central view. The other is explicit feature propagation that directly warps features of other views to central view under the guidance of disparity. They complement each other and jointly realize complementary information fusion across views in light field. The proposed method achieves outperforming performance on both real-world and synthetic light field datasets, demonstrating the effectiveness of this new architecture.

SunStage: Portrait Reconstruction and Relighting Using the Sun as a Light Stage
Wang, Yifan and Holynski, Aleksander and Zhang, Xiuming and Zhang, Xuaner



Research question: Develop a lightweight alternative to a light stage that captures facial appearance more cheaply and simply.
Motivation: Traditional light stages are expensive and technically demanding, which limits their use in facial reconstruction and relighting.
Method: SunStage, a lightweight light stage alternative that uses only a smartphone camera and the sun as a light source. The user simply captures a selfie video outdoors while rotating in place; the varying angles between the sun and the face guide the joint reconstruction of facial geometry, reflectance, camera pose, and lighting parameters.
Results: Despite the uncalibrated in-the-wild setting, the method reconstructs detailed facial appearance and geometry, enabling compelling effects such as relighting, novel view synthesis, and reflectance editing.

A light stage uses a series of calibrated cameras and lights to capture a subject's facial appearance under varying illumination and viewpoint. This captured information is crucial for facial reconstruction and relighting. Unfortunately, light stages are often inaccessible: they are expensive and require significant technical expertise for construction and operation. In this paper, we present SunStage: a lightweight alternative to a light stage that captures comparable data using only a smartphone camera and the sun. Our method only requires the user to capture a selfie video outdoors, rotating in place, and uses the varying angles between the sun and the face as guidance in joint reconstruction of facial geometry, reflectance, camera pose, and lighting parameters. Despite the in-the-wild un-calibrated setting, our approach is able to reconstruct detailed facial appearance and geometry, enabling compelling effects such as relighting, novel view synthesis, and reflectance editing.

Multi-View Reconstruction Using Signed Ray Distance Functions (SRDF)
Zins, Pierre and Xu, Yuanlu and Boyer, Edmond and Wuhrer, Stefanie and Tung, Tony



Research question: A new optimization framework for multi-view 3D shape reconstruction.
Motivation: Differentiable rendering approaches and multi-view stereo methods each have strengths for 3D shape reconstruction but also drawbacks: the former perform well yet can lack precision in the estimated geometry, while the latter are geometrically accurate but handle global optimization poorly.
Method: A novel volumetric shape representation that combines the advantages of both, parameterized with pixel depths to better materialize the shape surface consistently, and optimized via volumetric integration.
Results: Experiments show the method outperforms existing approaches on standard 3D benchmarks with better geometry estimation.

In this paper, we investigate a new optimization framework for multi-view 3D shape reconstructions. Recent differentiable rendering approaches have provided breakthrough performances with implicit shape representations though they can still lack precision in the estimated geometries. On the other hand multi-view stereo methods can yield pixel-wise geometric accuracy with local depth predictions along viewing rays. Our approach bridges the gap between the two strategies with a novel volumetric shape representation that is implicit but parameterized with pixel depths to better materialize the shape surface with consistent signed distances along viewing rays. The approach retains pixel-accuracy while benefiting from volumetric integration in the optimization. To this aim, depths are optimized by evaluating, at each 3D location within the volumetric discretization, the agreement between the depth prediction consistency and the photometric consistency for the corresponding pixels. The optimization is agnostic to the associated photo-consistency term which can vary from a median-based baseline to more elaborate criteria such as learned functions. Our experiments demonstrate the benefit of the volumetric integration with depth predictions. They also show that our approach outperforms existing approaches over standard 3D benchmarks with better geometry estimations.

Neural Pixel Composition for 3D-4D View Synthesis From Multi-Views
Bansal, Aayush and Zollhöfer, Michael



Research question: How to perform continuous 3D-4D view synthesis given only a discrete set of multi-view observations.
Motivation: Existing state-of-the-art techniques require dense multi-view supervision and a large computational budget, whereas this approach operates reliably on sparse, wide-baseline multi-view imagery and trains efficiently on high-resolution (12MP) content within a few seconds to 10 minutes.
Method: Neural Pixel Composition (NPC), with two core novelties: 1) a pixel representation that accumulates the color and depth information gathered from multiple views for a particular location and time along a line of sight; and 2) a multi-layer perceptron (MLP) that composes this rich per-pixel information into the final color output.
Results: Extensive experiments on a wide variety of multi-view sequences and comparisons with existing methods show better results in diverse and challenging settings.

We present Neural Pixel Composition (NPC), a novel approach for continuous 3D-4D view synthesis given only a discrete set of multi-view observations as input. Existing state-of-the-art approaches require dense multi-view supervision and an extensive computational budget. The proposed formulation reliably operates on sparse and wide-baseline multi-view imagery and can be trained efficiently within a few seconds to 10 minutes for hi-res (12MP) content, i.e., 200-400X faster convergence than existing methods. Crucial to our approach are two core novelties: 1) a representation of a pixel that contains color and depth information accumulated from multi-views for a particular location and time along a line of sight, and 2) a multi-layer perceptron (MLP) that enables the composition of this rich information provided for a pixel location to obtain the final color output. We experiment with a large variety of multi-view sequences, compare to existing approaches, and achieve better results in diverse and challenging settings.
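
As a loose, hand-crafted stand-in for the learned per-pixel MLP (the real network is trained; nothing below is the paper's code), one can blend multi-view (color, depth, confidence) samples with softmax weights that favor near, confident samples:

```python
import math

def compose_pixel(samples):
    # samples: (color, depth, confidence) triples gathered from several
    # views for one pixel. Softmax weights favor nearer, more confident
    # samples; the learned MLP in NPC replaces this hand-tuned rule.
    logits = [conf / (1.0 + depth) for _, depth, conf in samples]
    m = max(logits)
    ws = [math.exp(l - m) for l in logits]
    total = sum(ws)
    ws = [w / total for w in ws]
    channels = len(samples[0][0])
    return tuple(
        sum(w * color[k] for w, (color, _, _) in zip(ws, samples))
        for k in range(channels)
    )
```

The point of the sketch is the data layout: the pixel representation carries everything needed so that composition is a purely per-pixel operation.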

Hybrid Neural Rendering for Large-Scale Scenes With Motion Blur
Dai, Peng and Zhang, Yinda and Yu, Xin and Lyu, Xiaoyang and Qi, Xiaojuan



Research question: How to render high-fidelity, view-consistent novel views from in-the-wild images that contain unavoidable artifacts such as motion blur.
Motivation: Despite recent progress this remains challenging, calling for a new model.
Method: A hybrid neural rendering model that combines image-based and neural 3D representations to render high-quality, view-consistent images, together with strategies that simulate blur effects on the rendered images to mitigate the negative influence of blurry inputs.
Results: Extensive experiments on real and synthetic data show the model surpasses state-of-the-art point-based novel view synthesis methods.

Rendering novel view images is highly desirable for many applications. Despite recent progress, it remains challenging to render high-fidelity and view-consistent novel views of large-scale scenes from in-the-wild images with inevitable artifacts (e.g., motion blur). To this end, we develop a hybrid neural rendering model that makes image-based representation and neural 3D representation join forces to render high-quality, view-consistent images. Besides, images captured in the wild inevitably contain artifacts, such as motion blur, which deteriorates the quality of rendered images. Accordingly, we propose strategies to simulate blur effects on the rendered images to mitigate the negative influence of blurry images and reduce their importance during training based on precomputed quality-aware weights. Extensive experiments on real and synthetic data demonstrate our model surpasses state-of-the-art point-based methods for novel view synthesis. The code is available at https://daipengwa.github.io/Hybrid-Rendering-ProjectPage.

Learning 3D Scene Priors With 2D Supervision
Nie, Yinyu and Dai, Angela and Han, Xiaoguang and Nie{\ss



Research question: How to effectively estimate both layout configuration and object geometry in a 3D environment.
Motivation: Existing methods require large amounts of 3D annotation, which is expensive and difficult to collect.
Method: A new method that learns 3D scene layout and shape priors from 2D supervision in the form of multi-view RGB images, without any 3D ground truth.
Results: Experiments on 3D-FRONT and ScanNet show the method outperforms the state of the art in single-view reconstruction and achieves state-of-the-art results in scene synthesis.

Holistic 3D scene understanding entails estimation of both layout configuration and object geometry in a 3D environment. Recent works have shown advances in 3D scene estimation from various input modalities (e.g., images, 3D scans), by leveraging 3D supervision (e.g., 3D bounding boxes or CAD models), for which collection at scale is expensive and often intractable. To address this shortcoming, we propose a new method to learn 3D scene priors of layout and shape without requiring any 3D ground truth. Instead, we rely on 2D supervision from multi-view RGB images. Our method represents a 3D scene as a latent vector, from which we can progressively decode to a sequence of objects characterized by their class categories, 3D bounding boxes, and meshes. With our trained autoregressive decoder representing the scene prior, our method facilitates many downstream applications, including scene synthesis, interpolation, and single-view reconstruction. Experiments on 3D-FRONT and ScanNet show that our method outperforms state of the art in single-view reconstruction, and achieves state-of-the-art results in scene synthesis against baselines that require 3D supervision.

Grid-Guided Neural Radiance Fields for Large Urban Scenes
Xu, Linning and Xiangli, Yuanbo and Peng, Sida and Pan, Xingang and Zhao, Nanxuan and Theobalt, Christian and Dai, Bo and Lin, Dahua



Research question: Purely MLP-based neural radiance field (NeRF) methods underfit large-scale scenes because of limited model capacity, often producing blurred renderings.
Motivation: One current solution geographically divides the scene and models each region with a separate sub-NeRF, so training cost and the number of sub-NeRFs grow linearly as the scene expands. An alternative is a feature grid representation, which is computationally efficient and scales naturally to large scenes, but feature grids tend to be weakly constrained and often reach suboptimal solutions, producing noisy renderings, especially in regions with complex geometry and texture.
Method: A new framework that achieves high-fidelity rendering of large urban scenes while remaining computationally efficient: a compact multi-resolution ground feature plane representation coarsely captures the scene and is complemented with positional encoding inputs through another NeRF branch, trained jointly.
Results: Experiments show this integration enjoys the advantages of both alternatives: guided by the feature grid representation, a lightweight NeRF suffices to render photorealistic views with fine details, while the jointly optimized ground feature planes are further refined into a more accurate and compact feature space that yields much more natural renderings.

Purely MLP-based neural radiance fields (NeRF-based methods) often suffer from underfitting with blurred renderings on large-scale scenes due to limited model capacity. Recent approaches propose to geographically divide the scene and adopt multiple sub-NeRFs to model each region individually, leading to linear scale-up in training costs and the number of sub-NeRFs as the scene expands. An alternative solution is to use a feature grid representation, which is computationally efficient and can naturally scale to a large scene with increased grid resolutions. However, the feature grid tends to be less constrained and often reaches suboptimal solutions, producing noisy artifacts in renderings, especially in regions with complex geometry and texture. In this work, we present a new framework that realizes high-fidelity rendering on large urban scenes while being computationally efficient. We propose to use a compact multi-resolution ground feature plane representation to coarsely capture the scene, and complement it with positional encoding inputs through another NeRF branch for rendering in a joint learning fashion. We show that such an integration can utilize the advantages of two alternative solutions: a lightweight NeRF is sufficient, under the guidance of the feature grid representation, to render photorealistic novel views with fine details; and the jointly optimized ground feature planes can meanwhile gain further refinement, forming a more accurate and compact feature space and outputting much more natural rendering results.
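
The ground feature plane lookup reduces to bilinear sampling; a minimal sketch, assuming a row-major H x W x C nested-list plane (`sample_plane` is a hypothetical name, not the paper's code):

```python
import math

def sample_plane(plane, u, v):
    # Bilinearly sample a ground feature plane (H x W x C nested lists)
    # at continuous coordinates (u, v), u in [0, W-1], v in [0, H-1].
    H, W = len(plane), len(plane[0])
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    fx, fy = u - x0, v - y0
    feats = []
    for c in range(len(plane[0][0])):
        top = plane[y0][x0][c] * (1.0 - fx) + plane[y0][x1][c] * fx
        bot = plane[y1][x0][c] * (1.0 - fx) + plane[y1][x1][c] * fx
        feats.append(top * (1.0 - fy) + bot * fy)
    return feats
```

In a multi-resolution variant, this lookup runs once per level and the per-level features are concatenated before feeding the NeRF branch.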

Local Implicit Ray Function for Generalizable Radiance Field Representation
Huang, Xin and Zhang, Qi and Feng, Ying and Li, Xiaoyu and Wang, Xuan and Wang, Qing



Research question: This paper proposes LIRF, a generalizable neural rendering method for novel view rendering.
Motivation: Current generalizable neural radiance field (NeRF) methods sample only a single ray per pixel, so they may render blurred or aliased views when the input views and rendered views observe scene content at different resolutions.
Method: LIRF constructs a ray by aggregating information from conical frustums. Given a 3D position within a conical frustum, it takes the 3D coordinates and the frustum's features as input and predicts a local volumetric radiance field. Because the coordinates are continuous, LIRF renders high-quality novel views at continuously valued scales via volume rendering. In addition, visibility weights for each input view are predicted via transformer-based feature matching to improve performance in occluded regions.
Results: Experiments on real-world scenes show the method outperforms state-of-the-art methods on novel view rendering of unseen scenes at arbitrary scales.

We propose LIRF (Local Implicit Ray Function), a generalizable neural rendering approach for novel view rendering. Current generalizable neural radiance fields (NeRF) methods sample a scene with a single ray per pixel and may therefore render blurred or aliased views when the input views and rendered views observe scene content at different resolutions. To solve this problem, we propose LIRF to aggregate the information from conical frustums to construct a ray. Given 3D positions within conical frustums, LIRF takes 3D coordinates and the features of conical frustums as inputs and predicts a local volumetric radiance field. Since the coordinates are continuous, LIRF renders high-quality novel views at a continuously-valued scale via volume rendering. Besides, we predict the visible weights for each input view via transformer-based feature matching to improve the performance in occluded areas. Experimental results on real-world scenes validate that our method outperforms state-of-the-art methods on novel view rendering of unseen scenes at arbitrary scales.

FitMe: Deep Photorealistic 3D Morphable Model Avatars
Lattas, Alexandros and Moschoglou, Stylianos and Ploumpis, Stylianos and Gecer, Baris and Deng, Jiankang and Zafeiriou, Stefanos



Research question: How to acquire high-fidelity, renderable human avatars from a single photo or multiple photos.
Motivation: Existing techniques require substantial computation and time, and their results are often unsatisfactory.
Method: FitMe, a facial reflectance model and differentiable rendering optimization pipeline, consisting of a multi-modal style-based generator that captures the diffuse and specular components of facial appearance and a PCA-based shape model.
Results: FitMe achieves state-of-the-art reflectance acquisition and identity preservation on single "in-the-wild" facial images and produces impressive scan-like results when given multiple unconstrained images of the same identity. Unlike recent implicit avatar reconstructions, FitMe requires only one minute and produces relightable mesh- and texture-based avatars ready for end-user applications.

In this paper, we introduce FitMe, a facial reflectance model and a differentiable rendering optimization pipeline, that can be used to acquire high-fidelity renderable human avatars from single or multiple images. The model consists of a multi-modal style-based generator, that captures facial appearance in terms of diffuse and specular reflectance, and a PCA-based shape model. We employ a fast differentiable rendering process that can be used in an optimization pipeline, while also achieving photorealistic facial shading. Our optimization process accurately captures both the facial reflectance and shape in high-detail, by exploiting the expressivity of the style-based latent representation and of our shape model. FitMe achieves state-of-the-art reflectance acquisition and identity preservation on single "in-the-wild" facial images, while it produces impressive scan-like results, when given multiple unconstrained facial images pertaining to the same identity. In contrast with recent implicit avatar reconstructions, FitMe requires only one minute and produces relightable mesh and texture-based avatars, that can be used by end-user applications.

expOSE: Accurate Initialization-Free Projective Factorization Using Exponential Regularization
Iglesias, José Pedro and Nilsson, Amanda and Olsson, Carl



Research question: How to improve the accuracy and efficiency of bundle adjustment in Structure-from-Motion systems.
Motivation: Bundle adjustment is a key component of Structure-from-Motion systems, but it requires good initialization to converge to the right solution.
Method: expOSE, a new factorization-based surrogate for the bundle adjustment error whose exponential regularization addresses the penalization of large depths, solved iteratively with VarPro via a quadratic approximation. Radial distortion robustness is further added by decomposing the Object Space Error into radial and tangential components.
Results: Experiments confirm the method is robust to initialization and improves reconstruction quality over state-of-the-art methods even without bundle adjustment refinement.

Bundle adjustment is a key component in practically all available Structure from Motion systems. While it is crucial for achieving accurate reconstruction, convergence to the right solution hinges on good initialization. The recently introduced factorization-based pOSE methods formulate a surrogate for the bundle adjustment error without reliance on good initialization. In this paper, we show that pOSE has an undesirable penalization of large depths. To address this we propose expOSE which has an exponential regularization that is negligible for positive depths. To achieve efficient inference we use a quadratic approximation that allows an iterative solution with VarPro. Furthermore, we extend the method with radial distortion robustness by decomposing the Object Space Error into radial and tangential components. Experimental results confirm that the proposed method is robust to initialization and improves reconstruction quality compared to state-of-the-art methods even without bundle adjustment refinement.

A Large-Scale Homography Benchmark
Barath, Daniel and Mishkin, Dmytro and Polic, Michal and Förstner, Wolfgang and Matas, Jiri



Research question: This paper builds Pi3D, a large-scale dataset of planes in 3D, and leverages it to construct HEB, a large-scale homography estimation benchmark.
Motivation: To support training and evaluating monocular depth, surface normal estimation, and image matching algorithms, and to benchmark estimation under significant viewpoint and illumination changes.
Method: Roughly 1000 planes observed in about 10000 images from the 1DSfM dataset are extracted to create Pi3D; on top of it, HEB is built with 226260 homographies and roughly 4M correspondences.
Results: A rigorous evaluation of robust estimators and deep correspondence filtering methods establishes the current state of the art in robust homography estimation. The uncertainty of SIFT orientations and scales with respect to the underlying homographies is also evaluated, and code is provided for comparing the uncertainty of custom detectors.

We present a large-scale dataset of Planes in 3D, Pi3D, of roughly 1000 planes observed in 10 000 images from the 1DSfM dataset, and HEB, a large-scale homography estimation benchmark leveraging Pi3D. The applications of the Pi3D dataset are diverse, e.g. training or evaluating monocular depth, surface normal estimation and image matching algorithms. The HEB dataset consists of 226 260 homographies and includes roughly 4M correspondences. The homographies link images that often undergo significant viewpoint and illumination changes. As applications of HEB, we perform a rigorous evaluation of a wide range of robust estimators and deep learning-based correspondence filtering methods, establishing the current state-of-the-art in robust homography estimation. We also evaluate the uncertainty of the SIFT orientations and scales w.r.t. the ground truth coming from the underlying homographies and provide codes for comparing uncertainty of custom detectors.
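
A standard score for a correspondence under an estimated homography is the transfer error; a minimal sketch (hypothetical helper names, not the benchmark's evaluation code):

```python
import math

def apply_h(H, pt):
    # Map a 2D point through a 3x3 homography (row-major nested lists),
    # dividing by the homogeneous coordinate.
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def transfer_error(H, p, q):
    # One-way transfer error |H p - q| for a correspondence (p, q);
    # robust estimators threshold this to separate inliers from outliers.
    px, py = apply_h(H, p)
    return math.hypot(px - q[0], py - q[1])
```

Evaluations like HEB's typically aggregate such per-correspondence errors into inlier ratios or pose accuracy curves.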

Consistent View Synthesis With Pose-Guided Diffusion Models
Tseng, Hung-Yu and Li, Qinbo and Kim, Changil and Alsisan, Suhib and Huang, Jia-Bin and Kopf, Johannes



Research question: How to generate a consistent long-term video of novel views from a single image.
Motivation: Existing techniques fail to generate consistent, high-quality novel views under large camera motion.
Method: A pose-guided diffusion model with an attention layer that uses epipolar lines as constraints to facilitate the association between different viewpoints.
Results: Experiments on synthetic and real-world datasets show the method outperforms state-of-the-art transformer-based and GAN-based approaches.

Novel view synthesis from a single image has been a cornerstone problem for many Virtual Reality applications that provide immersive experiences. However, most existing techniques can only synthesize novel views within a limited range of camera motion or fail to generate consistent and high-quality novel views under significant camera movement. In this work, we propose a pose-guided diffusion model to generate a consistent long-term video of novel views from a single image. We design an attention layer that uses epipolar lines as constraints to facilitate the association between different viewpoints. Experimental results on synthetic and real-world datasets demonstrate the effectiveness of the proposed diffusion model against state-of-the-art transformer-based and GAN-based approaches. More qualitative results are available at https://poseguided-diffusion.github.io/.
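
The epipolar constraint that such an attention layer exploits can be sketched as follows, assuming a known fundamental matrix F (hypothetical helper names, not the paper's implementation):

```python
import math

def epipolar_line(F, p):
    # Homogeneous line l' = F p in the second image for pixel p = (x, y);
    # a true match q should satisfy [q; 1]^T l' = 0.
    x, y = p
    return tuple(F[i][0] * x + F[i][1] * y + F[i][2] for i in range(3))

def epipolar_distance(F, p, q):
    # Point-to-line distance of candidate match q from the epipolar
    # line of p; small values mark geometrically plausible pairs.
    a, b, c = epipolar_line(F, p)
    return abs(a * q[0] + b * q[1] + c) / math.hypot(a, b)
```

Restricting or biasing attention toward pixels with small epipolar distance is one way to tie features across viewpoints geometrically.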

Learning Neural Volumetric Representations of Dynamic Humans in Minutes
Geng, Chen and Peng, Sida and Xu, Zhen and Bao, Hujun and Zhou, Xiaowei



Research question: How to efficiently reconstruct volumetric videos of dynamic humans from sparse multi-view videos.
Motivation: Existing methods either spend hours on per-scene optimization or sacrifice visual quality to reduce optimization time.
Method: A novel approach to learning neural volumetric representations of dynamic humans that defines a new part-based voxelized human representation and an innovative 2D motion parameterization scheme.
Results: Experiments show the model learns 100 times faster than previous per-scene optimization methods while remaining competitive in rendering quality.

This paper addresses the challenge of efficiently reconstructing volumetric videos of dynamic humans from sparse multi-view videos. Some recent works represent a dynamic human as a canonical neural radiance field (NeRF) and a motion field, which are learned from input videos through differentiable rendering. But the per-scene optimization generally requires hours. Other generalizable NeRF models leverage learned prior from datasets to reduce the optimization time by only finetuning on new scenes at the cost of visual fidelity. In this paper, we propose a novel method for learning neural volumetric representations of dynamic humans in minutes with competitive visual quality. Specifically, we define a novel part-based voxelized human representation to better distribute the representational power of the network to different human parts. Furthermore, we propose a novel 2D motion parameterization scheme to increase the convergence rate of deformation field learning. Experiments demonstrate that our model can be learned 100 times faster than previous per-scene optimization methods while being competitive in the rendering quality. Training our model on a 512x512 video with 100 frames typically takes about 5 minutes on a single RTX 3090 GPU. The code is available on our project page: https://zju3dv.github.io/instant_nvr

Symmetric Shape-Preserving Autoencoder for Unsupervised Real Scene Point Cloud Completion
Ma, Changfeng and Chen, Yinuo and Guo, Pengxiao and Guo, Jie and Wang, Chongjun and Guo, Yanwen



Research question: How to achieve unsupervised completion of real-scene objects while preserving input shapes, predicting accurate results, and adapting to multi-category data.
Motivation: Existing methods struggle to preserve input shapes, predict accurate results, and adapt to multi-category data.
Method: USSPA, an unsupervised symmetric shape-preserving autoencoding network that predicts complete point clouds of real-scene objects by exploiting the significant symmetries of natural and man-made objects. A symmetry learning module learns and preserves structural symmetries, and a carefully designed upsampling refinement module refines the complete shape starting from an initial coarse predictor.
Results: Experimental results show state-of-the-art performance on unsupervised completion of real-scene objects, with consistent performance under single-category training.

Unsupervised completion of real scene objects is of vital importance but still remains extremely challenging in preserving input shapes, predicting accurate results, and adapting to multi-category data. To solve these problems, we propose in this paper an Unsupervised Symmetric Shape-Preserving Autoencoding Network, termed USSPA, to predict complete point clouds of objects from real scenes. One of our main observations is that many natural and man-made objects exhibit significant symmetries. To accommodate this, we devise a symmetry learning module to learn from those objects and to preserve structural symmetries. Starting from an initial coarse predictor, our autoencoder refines the complete shape with a carefully designed upsampling refinement module. Besides the discriminative process on the latent space, the discriminators of our USSPA also take predicted point clouds as direct guidance, enabling more detailed shape prediction. Clearly different from previous methods which train each category separately, our USSPA can be adapted to the training of multi-category data in one pass through a classifier-guided discriminator, with consistent performance on single category. For more accurate evaluation, we contribute to the community a real scene dataset with paired CAD models as ground truth. Extensive experiments and comparisons demonstrate our superiority and generalization and show that our method achieves state-of-the-art performance on unsupervised completion of real scene objects.
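The symmetry-learning idea above can be illustrated with a minimal numpy sketch: reflect a partial point cloud about a (here fixed, in the paper learned) symmetry plane, and measure agreement with a Chamfer distance. The plane parameterization and function names are illustrative, not the paper's API.

```python
import numpy as np

def reflect_points(points, plane_normal, plane_offset=0.0):
    """Reflect a point cloud (N,3) about the plane n.x = d (n need not be unit)."""
    n = plane_normal / np.linalg.norm(plane_normal)
    # Signed distance of each point to the plane, then mirror across it.
    dist = points @ n - plane_offset
    return points - 2.0 * dist[:, None] * n[None, :]

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets (N,3) and (M,3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

# A scan that is symmetric about the x=0 plane should be close to its own
# reflection under the Chamfer metric; a symmetry loss can exploit this.
pts = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
mirrored = reflect_points(pts, np.array([1.0, 0.0, 0.0]))
```

In USSPA the plane (or more general symmetry) is learned; this sketch only shows the geometric operation such a module preserves.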

Physically Realizable Natural-Looking Clothing Textures Evade Person Detectors via 3D Modeling
Hu, Zhanhao and Chu, Wenda and Zhu, Xiaopei and Zhang, Hui and Zhang, Bo and Hu, Xiaolin



Research question: How to craft adversarial clothing textures that evade person detectors.
Motivation: Most current adversarial clothes are effective only at limited viewing angles or are too conspicuous to humans.
Method: Craft adversarial clothing textures via 3D modeling, parameterizing camouflage textures with a Voronoi diagram and the Gumbel-softmax trick and optimizing the parameters through 3D modeling. Also propose an efficient 3D mesh augmentation pipeline combining topologically plausible projection (TopoProj) and thin plate splines (TPS).
Results: Experiments show high attack success rates of these clothes against multiple detectors.

Recent works have proposed to craft adversarial clothes for evading person detectors, while they are either only effective at limited viewing angles or very conspicuous to humans. We aim to craft adversarial texture for clothes based on 3D modeling, an idea that has been used to craft rigid adversarial objects such as a 3D-printed turtle. Unlike rigid objects, humans and clothes are non-rigid, leading to difficulties in physical realization. In order to craft natural-looking adversarial clothes that can evade person detectors at multiple viewing angles, we propose adversarial camouflage textures (AdvCaT) that resemble one kind of the typical textures of daily clothes, camouflage textures. We leverage the Voronoi diagram and Gumbel-softmax trick to parameterize the camouflage textures and optimize the parameters via 3D modeling. Moreover, we propose an efficient augmentation pipeline on 3D meshes combining topologically plausible projection (TopoProj) and Thin Plate Spline (TPS) to narrow the gap between digital and real-world objects. We printed the developed 3D texture pieces on fabric materials and tailored them into T-shirts and trousers. Experiments show high attack success rates of these clothes against multiple detectors.
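The Gumbel-softmax trick mentioned above makes a discrete color choice differentiable. A minimal sketch, assuming a hypothetical palette of K camouflage colors and per-Voronoi-cell logits (names and shapes are illustrative, not from the paper):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Relaxed (differentiable) sample from the categorical given by logits."""
    rng = np.random.default_rng(rng)
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logits.shape)))
    y = (logits + g) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))
    return y / y.sum(axis=-1, keepdims=True)

# Each cell's rendered color is a soft mixture over the palette; annealing
# tau -> 0 drives the weights toward a hard one-hot color pick.
palette = np.array([[0.2, 0.3, 0.1], [0.5, 0.5, 0.3], [0.1, 0.1, 0.1]])
cell_logits = np.array([2.0, 0.1, -1.0])          # trainable per-cell logits
weights = gumbel_softmax(cell_logits, tau=0.5, rng=0)
cell_color = weights @ palette
```

In the actual attack the gradient of the detector loss flows back through `weights` into the logits; here the point is only the relaxed sampling mechanism.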

SurfelNeRF: Neural Surfel Radiance Fields for Online Photorealistic Reconstruction of Indoor Scenes
Gao, Yiming and Cao, Yan-Pei and Shan, Ying



Research question: How to efficiently reconstruct and render large-scale indoor scenes online.
Motivation: Current online reconstruction methods such as SLAM can reconstruct 3D scene geometry in real time but cannot produce photorealistic results; NeRF-based methods produce promising novel view synthesis, but their long offline optimization time and lack of geometric constraints make online input challenging.
Method: We propose SurfelNeRF, a novel method that marries explicit geometric representation with NeRF rendering. It uses a flexible and scalable neural surfel representation to store geometric attributes and appearance features extracted from input images, and progressively integrates input frames into the reconstructed global neural scene representation. We further propose an efficient differentiable rasterization scheme for rendering neural surfel radiance fields, giving SurfelNeRF 10x speedups in training and inference time, respectively.
Results: Experimental results show our method achieves state-of-the-art 23.82 PSNR and 29.58 PSNR on ScanNet in the feedforward inference and per-scene optimization settings, respectively.

Online reconstructing and rendering of large-scale indoor scenes is a long-standing challenge. SLAM-based methods can reconstruct 3D scene geometry progressively in real time but can not render photorealistic results. While NeRF-based methods produce promising novel view synthesis results, their long offline optimization time and lack of geometric constraints pose challenges to efficiently handling online input. Inspired by the complementary advantages of classical 3D reconstruction and NeRF, we thus investigate marrying explicit geometric representation with NeRF rendering to achieve efficient online reconstruction and high-quality rendering. We introduce SurfelNeRF, a variant of neural radiance field which employs a flexible and scalable neural surfel representation to store geometric attributes and extracted appearance features from input images. We further extend conventional surfel-based fusion scheme to progressively integrate incoming input frames into the reconstructed global neural scene representation. In addition, we propose a highly-efficient differentiable rasterization scheme for rendering neural surfel radiance fields, which helps SurfelNeRF achieve 10x speedups in training and inference time, respectively. Experimental results show that our method achieves the state-of-the-art 23.82 PSNR and 29.58 PSNR on ScanNet in feedforward inference and per-scene optimization settings, respectively.

NeUDF: Learning Neural Unsigned Distance Fields With Volume Rendering
Liu, Yu-Tao and Wang, Li and Yang, Jie and Chen, Weikai and Meng, Xiaoxu and Yang, Bo and Gao, Lin



Research question: Existing methods based on signed distance functions (SDFs) are limited to reconstructing closed surfaces and fail on the wide range of real-world objects that contain open-surface structures.
Motivation: To address this, we introduce a new neural rendering framework, coded NeUDF, that can reconstruct surfaces of arbitrary topology solely from multi-view supervision.
Method: NeUDF leverages the unsigned distance function (UDF) as the surface representation to gain the flexibility of representing arbitrary surfaces. We propose two new formulations of the weight function tailored for UDF-based volume rendering and, for open-surface rendering where the in/out test is no longer valid, a dedicated normal regularization strategy to resolve surface orientation ambiguity.
Results: We evaluate extensively on several challenging datasets, including DTU, MGN, and Deep Fashion 3D. Experimental results show that NeUDF significantly outperforms state-of-the-art methods on multi-view surface reconstruction, especially for complex shapes with open boundaries.

Multi-view shape reconstruction has achieved impressive progress thanks to the latest advances in neural implicit surface rendering. However, existing methods based on signed distance function (SDF) are limited to closed surfaces, failing to reconstruct a wide range of real-world objects that contain open-surface structures. In this work, we introduce a new neural rendering framework, coded NeUDF, that can reconstruct surfaces with arbitrary topologies solely from multi-view supervision. To gain the flexibility of representing arbitrary surfaces, NeUDF leverages the unsigned distance function (UDF) as the surface representation. While a naive extension of an SDF-based neural renderer cannot scale to UDF, we propose two new formulations of weight function specially tailored for UDF-based volume rendering. Furthermore, to cope with open surface rendering, where the in/out test is no longer valid, we present a dedicated normal regularization strategy to resolve the surface orientation ambiguity. We extensively evaluate our method over a number of challenging datasets, including DTU, MGN, and Deep Fashion 3D. Experimental results demonstrate that NeUDF can significantly outperform the state-of-the-art method in the task of multi-view surface reconstruction, especially for complex shapes with open boundaries.
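The key property of a UDF versus an SDF is that an open surface has no inside/outside: both sides of a sheet get the same positive distance. A minimal 2D sketch, using an open line segment as a stand-in for an open surface (the segment and names are illustrative, not the paper's formulation):

```python
import numpy as np

def udf_to_segment(p, a, b):
    """Unsigned distance from points p (N,2) to the open segment a-b.

    A closed-curve SDF would flip sign across the boundary; an open sheet
    has no interior, so the unsigned distance is the natural representation.
    """
    ab = b - a
    t = np.clip(((p - a) @ ab) / (ab @ ab), 0.0, 1.0)
    closest = a + t[:, None] * ab[None, :]
    return np.linalg.norm(p - closest, axis=-1)

a, b = np.array([0.0, 0.0]), np.array([1.0, 0.0])
pts = np.array([[0.5, 0.3],    # above the sheet
                [0.5, -0.3],   # below the sheet: same distance, no sign flip
                [2.0, 0.0]])   # beyond the open boundary
d = udf_to_segment(pts, a, b)
```

NeUDF's contribution is how to turn such a UDF into rendering weights; the sketch only shows why the representation itself admits open boundaries.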

NEF: Neural Edge Fields for 3D Parametric Curve Reconstruction From Multi-View Images
Ye, Yunfan and Yi, Renjiao and Gao, Zhirui and Zhu, Chenyang and Cai, Zhiping and Xu, Kai



Research question: How to reconstruct the 3D feature curves of an object from a set of calibrated multi-view images.
Motivation: Existing methods require supervision of 3D edges or geometric operators; we propose a new method that needs neither.
Method: We learn a neural implicit field representing the density distribution of 3D edges (NEF), optimized with a view-based rendering loss.
Results: On synthetic data, NEF outperforms existing state-of-the-art methods on all metrics.

We study the problem of reconstructing 3D feature curves of an object from a set of calibrated multi-view images. To do so, we learn a neural implicit field representing the density distribution of 3D edges which we refer to as Neural Edge Field (NEF). Inspired by NeRF, NEF is optimized with a view-based rendering loss where a 2D edge map is rendered at a given view and is compared to the ground-truth edge map extracted from the image of that view. The rendering-based differentiable optimization of NEF fully exploits 2D edge detection, without needing a supervision of 3D edges, a 3D geometric operator or cross-view edge correspondence. Several technical designs are devised to ensure learning a range-limited and view-independent NEF for robust edge extraction. The final parametric 3D curves are extracted from NEF with an iterative optimization method. On our benchmark with synthetic data, we demonstrate that NEF outperforms existing state-of-the-art methods on all metrics. Project page: https://yunfan1202.github.io/NEF/.
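The view-based rendering loss above composites a per-sample edge density along each ray into a 2D edge value, NeRF-style, which is then compared to the detected 2D edge map. A minimal sketch of that compositing step for a single ray (standard volume-rendering weights, assumed rather than taken from the paper):

```python
import numpy as np

def render_edge(density, deltas):
    """Composite per-sample edge densities along one ray into a scalar edge
    response: alpha_i = 1 - exp(-sigma_i * delta_i), w_i = T_i * alpha_i."""
    alpha = 1.0 - np.exp(-density * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return weights.sum()  # in [0, 1]; compared against the 2D edge map

sigma = np.array([0.0, 5.0, 0.0, 0.0])   # a 3D edge crossed at sample 1
edge_val = render_edge(sigma, deltas=np.full(4, 0.5))
```

The differentiable loss `(edge_val - edge_map_pixel)**2`, summed over rays and views, is what lets 2D edge detection supervise the 3D edge field without 3D labels.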

Inverting the Imaging Process by Learning an Implicit Camera Model
Huang, Xin and Zhang, Qi and Feng, Ying and Li, Hongdong and Wang, Qing



Research question: How to effectively replace the traditional discrete signal representation of visual signals with implicit coordinate-based neural networks.
Motivation: Existing implicit neural representations focus on modeling the scene only; this paper proposes a novel implicit camera model that represents the physical imaging process of a camera as a deep neural network.
Method: Jointly learn an implicit scene model and the implicit camera model under multi-focus stack and multi-exposure bracket supervision. An implicit blur generator and an implicit tone mapper model the aperture and exposure of the camera's imaging process, respectively.
Results: Demonstrated the effectiveness of the new model on a large number of test images and videos, producing accurate and visually appealing all-in-focus and high-dynamic-range images. In principle, the new implicit neural camera model can potentially benefit a wide array of other inverse imaging tasks.

Representing visual signals with implicit coordinate-based neural networks, as an effective replacement of the traditional discrete signal representation, has gained considerable popularity in computer vision and graphics. In contrast to existing implicit neural representations which focus on modelling the scene only, this paper proposes a novel implicit camera model which represents the physical imaging process of a camera as a deep neural network. We demonstrate the power of this new implicit camera model on two inverse imaging tasks: i) generating all-in-focus photos, and ii) HDR imaging. Specifically, we devise an implicit blur generator and an implicit tone mapper to model the aperture and exposure of the camera's imaging process, respectively. Our implicit camera model is jointly learned together with implicit scene models under multi-focus stack and multi-exposure bracket supervision. We have demonstrated the effectiveness of our new model on large number of test images and videos, producing accurate and visually appealing all-in-focus and high dynamic range images. In principle, our new implicit neural camera model has the potential to benefit a wide array of other inverse imaging tasks.

Detecting Human-Object Contact in Images
Chen, Yixin and Dwivedi, Sai Kumar and Black, Michael J. and Tzionas, Dimitrios



Research question: There is currently no robust method, and no dataset, for detecting human-object contact from images.
Motivation: Humans constantly contact objects when performing tasks, so detecting human-object contact is important for building human-centered artificial intelligence.
Method: We build a new dataset, HOT ("Human-Object conTact"), from two data sources: (1) the PROX dataset of 3D human meshes moving in 3D scenes, with 2D image areas annotated automatically via 3D mesh proximity and projection; and (2) the V-COCO, HAKE, and Watch-n-Patch datasets, with trained annotators drawing polygons around the 2D image areas where contact takes place. We also annotate the body parts in contact. We then use HOT to train a new contact detector that takes a single color image as input and outputs 2D contact heatmaps along with the body-part labels in contact.
Results: Our detector performs strongly in extensive evaluations; quantitative results show the model outperforms baselines and that all components contribute to performance. Results on images from an online repository show reasonable detections and generalizability. The HOT dataset and model are available for research at https://hot.is.tue.mpg.de.

Humans constantly contact objects to move and perform tasks. Thus, detecting human-object contact is important for building human-centered artificial intelligence. However, there exists no robust method to detect contact between the body and the scene from an image, and there exists no dataset to learn such a detector. We fill this gap with HOT ("Human-Object conTact"), a new dataset of human-object contacts in images. To build HOT, we use two data sources: (1) We use the PROX dataset of 3D human meshes moving in 3D scenes, and automatically annotate 2D image areas for contact via 3D mesh proximity and projection. (2) We use the V-COCO, HAKE and Watch-n-Patch datasets, and ask trained annotators to draw polygons around the 2D image areas where contact takes place. We also annotate the involved body part of the human body. We use our HOT dataset to train a new contact detector, which takes a single color image as input, and outputs 2D contact heatmaps as well as the body-part labels that are in contact. This is a new and challenging task, that extends current foot-ground or hand-object contact detectors to the full generality of the whole body. The detector uses a part-attention branch to guide contact estimation through the context of the surrounding body parts and scene. We evaluate our detector extensively, and quantitative results show that our model outperforms baselines, and that all components contribute to better performance. Results on images from an online repository show reasonable detections and generalizability. Our HOT data and model are available for research at https://hot.is.tue.mpg.de.

Human Body Shape Completion With Implicit Shape and Flow Learning
Zhou, Boyao and Meng, Di and Franco, Jean-S\'ebastien and Boyer, Edmond



Research question: How to complete human body shape models by combining shape and flow estimation from two consecutive depth images.
Motivation: Shape completion from partial depth observations is a challenging and highly under-constrained task in computer vision.
Method: Adopt a learning-based approach and explore how the motion flow between two consecutive frames contributes to shape completion. To exploit the flow effectively, the architecture combines both estimations and implements two features for robustness: an all-to-all attention module that encodes correlations between points in the same frame and between corresponding points in different frames, and a coarse-dense to fine-sparse strategy that balances representation ability against computational cost.
Results: Experiments show that flow indeed benefits human body model completion, and that the method outperforms state-of-the-art shape completion approaches on two benchmarks across different human shapes, poses, and clothing.

In this paper, we investigate how to complete human body shape models by combining shape and flow estimation given two consecutive depth images. Shape completion is a challenging task in computer vision that is highly under-constrained when considering partial depth observations. Besides model based strategies that exploit strong priors, and consequently struggle to preserve fine geometric details, learning based approaches build on weaker assumptions and can benefit from efficient implicit representations. We adopt such a representation and explore how the motion flow between two consecutive frames can contribute to the shape completion task. In order to effectively exploit the flow information, our architecture combines both estimations and implements two features for robustness: First, an all-to-all attention module that encodes the correlation between points in the same frame and between corresponding points in different frames; Second, a coarse-dense to fine-sparse strategy that balances the representation ability and the computational cost. Our experiments demonstrate that the flow actually benefits human body model completion. They also show that our method outperforms the state-of-the-art approaches for shape completion on 2 benchmarks, considering different human shapes, poses, and clothing.

Towards Unbiased Volume Rendering of Neural Implicit Surfaces With Geometry Priors
Zhang, Yongqiang and Hu, Zhipeng and Wu, Haoqian and Zhao, Minda and Li, Lincheng and Zou, Zhengxia and Fan, Changjie



Research question: Existing neural surface reconstruction methods have limited accuracy in multi-view reconstruction.
Motivation: This limitation stems from the bias of their volume rendering strategies, especially when the viewing direction is nearly tangent to the surface.
Method: We propose a new rendering method that removes the bias by scaling the SDF field with the angle between the viewing direction and the surface normal vector.
Results: Experimental results show that our rendering method reduces the bias of SDF-based volume rendering and surpasses state-of-the-art neural implicit surface methods on the DTU dataset.

Learning surface by neural implicit rendering has been a promising way for multi-view reconstruction in recent years. Existing neural surface reconstruction methods, such as NeuS and VolSDF, can produce reliable meshes from multi-view posed images. Although they build a bridge between volume rendering and Signed Distance Function (SDF), the accuracy is still limited. In this paper, we argue that this limited accuracy is due to the bias of their volume rendering strategies, especially when the viewing direction is close to tangent to the surface. We revise and provide an additional condition for the unbiased volume rendering. Following this analysis, we propose a new rendering method by scaling the SDF field with the angle between the viewing direction and the surface normal vector. Experiments on simulated data indicate that our rendering method reduces the bias of SDF-based volume rendering. Moreover, there still exists non-negligible bias when the learnable standard deviation of SDF is large at the early stage, which means that it is hard to supervise the rendered depth with depth priors. Alternatively, we supervise the zero-level set with surface points obtained from a pre-trained Multi-View Stereo network. We evaluate our method on the DTU dataset and show that it outperforms state-of-the-art neural implicit surface methods without mask supervision.
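The angle-based scaling idea can be sketched in a few lines: divide the SDF by the cosine between the viewing direction and the surface normal, so the field varies more slowly along near-tangent rays. The exact scaling in the paper may differ; this only shows the mechanism, with illustrative names.

```python
import numpy as np

def scaled_sdf(sdf, view_dir, normal, eps=1e-4):
    """Scale an SDF value by |cos| of the angle between the (unit) viewing
    direction and the surface normal; eps guards exact tangency."""
    cos = np.abs(np.dot(view_dir, normal))
    return sdf / max(cos, eps)

n = np.array([0.0, 0.0, 1.0])
# Head-on ray: the field is unchanged.
head_on = scaled_sdf(0.1, np.array([0.0, 0.0, -1.0]), n)
# Near-tangent ray: the same SDF value is stretched along the ray,
# which is where the bias of naive SDF-based volume rendering appears.
grazing = scaled_sdf(0.1, np.array([0.99995, 0.0, -0.01]), n)
```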

NeRFLight: Fast and Light Neural Radiance Fields Using a Shared Feature Grid
Rivas-Manzaneque, Fernando and Sierra-Acosta, Jorge and Penate-Sanchez, Adrian and Moreno-Noguer, Francesc and Ribeiro, Angela



Research question: Current Neural Radiance Field (NeRF) models excel at modeling scene appearance but cannot achieve real-time rendering.
Motivation: To address this, researchers have tried baking NeRF outputs into data structures or arranging trainable parameters in explicit feature grids, but these approaches greatly increase the model's memory footprint, preventing deployment in bandwidth-constrained applications.
Method: This paper proposes a novel architecture that splits the density field of the NeRF-based representation into N regions and models density with N different decoders that share the same feature grid. This yields a smaller grid in which each feature occupies multiple spatial positions, forcing the features to learn a compact representation valid for different parts of the scene.
Results: Disposing the features symmetrically on each region further reduces the final model size, favoring feature pruning after training while allowing smooth gradient transitions between neighboring voxels. Experiments show that the method achieves real-time performance and competitive quality metrics, with a more than 2x improvement in the FPS/MB ratio over the state of the art.

While original Neural Radiance Fields (NeRF) have shown impressive results in modeling the appearance of a scene with compact MLP architectures, they are not able to achieve real-time rendering. This has been recently addressed by either baking the outputs of NeRF into a data structure or arranging trainable parameters in an explicit feature grid. These strategies, however, significantly increase the memory footprint of the model, which prevents their deployment on bandwidth-constrained applications. In this paper, we extend the grid-based approach to achieve real-time view synthesis at more than 150 FPS using a lightweight model. Our main contribution is a novel architecture in which the density field of NeRF-based representations is split into N regions and the density is modeled using N different decoders which reuse the same feature grid. This results in a smaller grid where each feature is located in more than one spatial position, forcing them to learn a compact representation that is valid for different parts of the scene. We further reduce the size of the final model by disposing of the features symmetrically on each region, which favors feature pruning after training while also allowing smooth gradient transitions between neighboring voxels. An exhaustive evaluation demonstrates that our method achieves real-time performance and quality metrics on a par with the state of the art, with an improvement of more than 2x in the FPS/MB ratio.
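The shared-grid, per-region-decoder design can be sketched minimally: one feature grid, N tiny decoders, and a region index chosen from the query position. Everything here (grid size, slab-based region assignment, linear "decoders") is an illustrative stand-in for the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
GRID = rng.normal(size=(16, 16, 16, 8))     # ONE shared feature grid
N_REGIONS = 4
# One small "decoder" per region (a linear map here; MLPs in practice).
DECODERS = [rng.normal(size=(8,)) for _ in range(N_REGIONS)]

def density(x):
    """Decode density at x in [0,1)^3. All regions reuse GRID; only the
    decoder differs, chosen here by which z-slab x falls into."""
    region = min(int(x[2] * N_REGIONS), N_REGIONS - 1)
    idx = np.minimum((np.asarray(x) * GRID.shape[0]).astype(int),
                     GRID.shape[0] - 1)
    feat = GRID[idx[0], idx[1], idx[2]]
    return float(np.maximum(feat @ DECODERS[region], 0.0))  # non-negative density

d = density([0.1, 0.2, 0.9])
```

Because the grid is shared, its features must work for every region's decoder, which is the compression pressure the paper exploits.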

Compressing Volumetric Radiance Fields to 1 MB
Li, Lingzhi and Shen, Zhen and Wang, Zhongshu and Shen, Li and Bo, Liefeng



Research question: How to improve NeRFs with faster training and more efficient rendering.
Motivation: Existing methods such as DVGO, Plenoxels, and TensoRF markedly improve NeRFs but require substantial disk storage and runtime memory.
Method: This paper proposes vector quantized radiance fields (VQRF), a simple yet effective framework for compressing volume-grid-based radiance fields. It introduces trainable vector quantization to improve the compactness of grid models and, combined with an efficient joint tuning strategy and post-processing, compresses model size while preserving visual quality.
Results: Experiments show the method achieves unrivaled performance and good generalization across multiple methods with distinct volumetric structures, facilitating the wide use of volumetric radiance field methods in practical applications.

Approximating radiance fields with discretized volumetric grids is one of the promising directions for improving NeRFs, represented by methods like DVGO, Plenoxels and TensoRF, which achieve super-fast training convergence and real-time rendering. However, these methods typically require a tremendous storage overhead, costing up to hundreds of megabytes of disk space and runtime memory for a single scene. We address this issue in this paper by introducing a simple yet effective framework, called vector quantized radiance fields (VQRF), for compressing these volume-grid-based radiance fields. We first present a robust and adaptive metric for estimating redundancy in grid models and performing voxel pruning by better exploring intermediate outputs of volumetric rendering. A trainable vector quantization is further proposed to improve the compactness of grid models. In combination with an efficient joint tuning strategy and post-processing, our method can achieve a compression ratio of 100x by reducing the overall model size to 1 MB with negligible loss on visual quality. Extensive experiments demonstrate that the proposed framework is capable of achieving unrivaled performance and good generalization across multiple methods with distinct volumetric structures, facilitating the wide use of volumetric radiance fields methods in real-world applications. Code is available at https://github.com/AlgoHunt/VQRF.
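The core of vector quantization is replacing each voxel feature with the index of its nearest codebook entry, so only small integer indices plus a small codebook need to be stored. A minimal numpy sketch (toy shapes; VQRF's codebook is trained jointly with the radiance field):

```python
import numpy as np

def quantize(features, codebook):
    """Assign each feature (N,D) to its nearest codebook entry (K,D).

    Storage drops from N*D floats to N indices + K*D floats.
    """
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]          # indices and dequantized features

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
feats = np.array([[0.1, -0.1], [0.9, 1.2], [0.4, 0.7]])
idx, dequantized = quantize(feats, codebook)
```

Making this assignment trainable (e.g. with a straight-through estimator) and combining it with voxel pruning is what yields the paper's ~100x compression.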

Gated Stereo: Joint Depth Estimation From Gated and Wide-Baseline Active Stereo Cues
Walz, Stefanie and Bijelic, Mario and Ramazzina, Andrea and Walia, Amanpreet and Mannan, Fahim and Heide, Felix



Research question: Propose Gated Stereo, a high-resolution, long-range depth estimation technique.
Motivation: Exploit multi-view cues alongside time-of-flight intensity cues from active gating, using active and high-dynamic-range passive captures.
Method: Propose a depth estimation method with monocular and stereo depth prediction branches combined in a final fusion stage; each module is supervised by a combination of supervised and gated self-supervision losses.
Results: For distances up to 160 m, the method improves MAE by more than 50% over the next-best RGB stereo method and by 74% over existing monocular gated methods.

We propose Gated Stereo, a high-resolution and long-range depth estimation technique that operates on active gated stereo images. Using active and high dynamic range passive captures, Gated Stereo exploits multi-view cues alongside time-of-flight intensity cues from active gating. To this end, we propose a depth estimation method with a monocular and stereo depth prediction branch which are combined in a final fusion stage. Each block is supervised through a combination of supervised and gated self-supervision losses. To facilitate training and validation, we acquire a long-range synchronized gated stereo dataset for automotive scenarios. We find that the method achieves an improvement of more than 50 % MAE compared to the next best RGB stereo method, and 74 % MAE to existing monocular gated methods for distances up to 160 m. Our code, models and datasets are available here: https://light.princeton.edu/gatedstereo/.

Hand Avatar: Free-Pose Hand Animation and Rendering From Monocular Video
Chen, Xingyu and Wang, Baoyuan and Shum, Heung-Yeung



Research question: This paper proposes HandAvatar, a novel representation for hand animation and rendering that generates smoothly compositional geometry and self-occlusion-aware texture.
Motivation: Existing hand animation and rendering techniques cannot produce high-quality personalized hand shapes together with realistic compositional geometry and self-occlusion-aware texture.
Method: First, we develop MANO-HD, a high-resolution mesh topology, to fit personalized hand shapes. We then decompose hand geometry into per-bone rigid parts and re-compose paired geometry encodings to obtain a consistent occupancy field. For texture modeling, we propose a self-occlusion-aware shading field (SelF), in which drivable anchors paved on the MANO-HD surface record albedo information under a wide variety of hand poses. Directed soft occupancy describes the ray-to-surface relation and generates an illumination field, disentangling pose-independent albedo from pose-dependent illumination.
Results: Trained from monocular video data, HandAvatar performs free-pose hand animation and rendering with superior appearance fidelity. We also show that HandAvatar provides a route for hand appearance editing.

We present HandAvatar, a novel representation for hand animation and rendering, which can generate smoothly compositional geometry and self-occlusion-aware texture. Specifically, we first develop a MANO-HD model as a high-resolution mesh topology to fit personalized hand shapes. Sequentially, we decompose hand geometry into per-bone rigid parts, and then re-compose paired geometry encodings to derive an across-part consistent occupancy field. As for texture modeling, we propose a self-occlusion-aware shading field (SelF). In SelF, drivable anchors are paved on the MANO-HD surface to record albedo information under a wide variety of hand poses. Moreover, directed soft occupancy is designed to describe the ray-to-surface relation, which is leveraged to generate an illumination field for the disentanglement of pose-independent albedo and pose-dependent illumination. Trained from monocular video data, our HandAvatar can perform free-pose hand animation and rendering while at the same time achieving superior appearance fidelity. We also demonstrate that HandAvatar provides a route for hand appearance editing.

DiffRF: Rendering-Guided 3D Radiance Field Diffusion
M\"uller, Norman and Siddiqui, Yawar and Porzi, Lorenzo and Bul\`o, Samuel Rota and Kontschieder, Peter and Nie{\ss}ner, Matthias



Research question: This paper proposes DiffRF, a novel 3D radiance field synthesis method based on denoising diffusion probabilistic models.
Motivation: Existing diffusion-based methods operate mainly on images, latent codes, or point clouds; we are the first to directly generate volumetric radiance fields.
Method: We propose a 3D denoising model that operates directly on an explicit voxel grid representation. Since radiance fields generated from a set of posed images can be ambiguous and contain artifacts, we pair the denoising formulation with a rendering loss, enabling the model to learn a deviated prior that favors good image quality instead of replicating fitting errors such as floating artifacts.
Results: Compared with 2D diffusion models, our model learns multi-view consistent priors, supporting free-view synthesis and accurate shape generation. Compared with 3D GANs, our diffusion-based approach naturally enables conditional generation such as masked completion or single-view 3D synthesis.

We introduce DiffRF, a novel approach for 3D radiance field synthesis based on denoising diffusion probabilistic models. While existing diffusion-based methods operate on images, latent codes, or point cloud data, we are the first to directly generate volumetric radiance fields. To this end, we propose a 3D denoising model which directly operates on an explicit voxel grid representation. However, as radiance fields generated from a set of posed images can be ambiguous and contain artifacts, obtaining ground truth radiance field samples is non-trivial. We address this challenge by pairing the denoising formulation with a rendering loss, enabling our model to learn a deviated prior that favours good image quality instead of trying to replicate fitting errors like floating artifacts. In contrast to 2D-diffusion models, our model learns multi-view consistent priors, enabling free-view synthesis and accurate shape generation. Compared to 3D GANs, our diffusion-based approach naturally enables conditional generation like masked completion or single-view 3D synthesis at inference time.

SUDS: Scalable Urban Dynamic Scenes
Turki, Haithem and Zhang, Jason Y. and Ferroni, Francesco and Ramanan, Deva



Research question: Extending neural radiance fields (NeRFs) to large-scale dynamic urban scenes.
Motivation: Prior work mostly reconstructs short video clips and requires supervision via 3D bounding boxes and panoptic labels.
Method: Factorize the scene into three separate hash table data structures that efficiently encode static, dynamic, and far-field radiance fields, and exploit unlabeled target signals including RGB images, sparse LiDAR, off-the-shelf self-supervised 2D descriptors, and, most importantly, 2D optical flow.
Results: Reconstructions scale to thousands of objects across the geographic coverage of 1700 videos, surpassing state-of-the-art methods that rely on ground-truth 3D bounding box annotations while training 10x faster.

We extend neural radiance fields (NeRFs) to dynamic large-scale urban scenes. Prior work tends to reconstruct single video clips of short durations (up to 10 seconds). Two reasons are that such methods (a) tend to scale linearly with the number of moving objects and input videos because a separate model is built for each and (b) tend to require supervision via 3D bounding boxes and panoptic labels, obtained manually or via category-specific models. As a step towards truly open-world reconstructions of dynamic cities, we introduce two key innovations: (a) we factorize the scene into three separate hash table data structures to efficiently encode static, dynamic, and far-field radiance fields, and (b) we make use of unlabeled target signals consisting of RGB images, sparse LiDAR, off-the-shelf self-supervised 2D descriptors, and most importantly, 2D optical flow. Operationalizing such inputs via photometric, geometric, and feature-metric reconstruction losses enables SUDS to decompose dynamic scenes into the static background, individual objects, and their motions. When combined with our multi-branch table representation, such reconstructions can be scaled to tens of thousands of objects across 1.2 million frames from 1700 videos spanning geospatial footprints of hundreds of kilometers, (to our knowledge) the largest dynamic NeRF built to date. We present qualitative initial results on a variety of tasks enabled by our representations, including novel-view synthesis of dynamic urban scenes, unsupervised 3D instance segmentation, and unsupervised 3D cuboid detection. To compare to prior work, we also evaluate on KITTI and Virtual KITTI 2, surpassing state-of-the-art methods that rely on ground truth 3D bounding box annotations while being 10x quicker to train.
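The three hash tables above all rely on the same primitive: a spatial hash that maps integer coordinates to a fixed-size feature table, Instant-NGP style. A minimal sketch with separate static (x, y, z) and dynamic (x, y, z, t) tables (table sizes, primes as commonly used in hash encodings, and names are illustrative, not SUDS's exact configuration):

```python
import numpy as np

def hash_lookup(table, coords):
    """Spatial hash lookup: XOR coordinate-times-prime terms, then index a
    fixed-size feature table. Collisions are tolerated and resolved by
    training, as in hash-grid encodings."""
    primes = [1, 2654435761, 805459861, 3674653429]
    h = 0
    for v, p in zip(coords, primes):   # works for 3D (xyz) or 4D (xyzt) keys
        h ^= int(v) * p
    return table[h % len(table)]

rng = np.random.default_rng(0)
static_table = rng.normal(size=(2**14, 4))   # indexed by (x, y, z)
dynamic_table = rng.normal(size=(2**14, 4))  # indexed by (x, y, z, t)
f_static = hash_lookup(static_table, (10, 3, 7))
f_dynamic = hash_lookup(dynamic_table, (10, 3, 7, 42))
```

Keeping static, dynamic, and far-field features in separate tables is what lets the decomposition scale: each branch only pays for the capacity it needs.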

HandNeRF: Neural Radiance Fields for Animatable Interacting Hands
Guo, Zhiyang and Zhou, Wengang and Wang, Min and Li, Li and Li, Houqiang



Research question: Propose a novel framework that reconstructs accurate appearance and geometry of interacting hands with neural radiance fields (NeRF).
Motivation: Photo-realistic rendering of gesture animation from arbitrary views requires accurate modeling of hand appearance and geometry.
Method: First parameterize hand poses with an off-the-shelf skeleton estimator, then design a pose-driven deformation field that establishes correspondences between different poses to optimize a pose-disentangled NeRF for a single hand. This unified modeling effectively complements geometry and texture cues in rarely observed areas for both hands. Pose priors are further leveraged to generate pseudo depth maps as guidance for occlusion-aware density learning, and a neural feature distillation method achieves cross-domain alignment for color optimization.
Results: Extensive experiments on the large-scale InterHand2.6M dataset verify the merits of the proposed HandNeRF, with a series of state-of-the-art results both qualitatively and quantitatively.

We propose a novel framework to reconstruct accurate appearance and geometry with neural radiance fields (NeRF) for interacting hands, enabling the rendering of photo-realistic images and videos for gesture animation from arbitrary views. Given multi-view images of a single hand or interacting hands, an off-the-shelf skeleton estimator is first employed to parameterize the hand poses. Then we design a pose-driven deformation field to establish correspondence from those different poses to a shared canonical space, where a pose-disentangled NeRF for one hand is optimized. Such unified modeling efficiently complements the geometry and texture cues in rarely-observed areas for both hands. Meanwhile, we further leverage the pose priors to generate pseudo depth maps as guidance for occlusion-aware density learning. Moreover, a neural feature distillation method is proposed to achieve cross-domain alignment for color optimization. We conduct extensive experiments to verify the merits of our proposed HandNeRF and report a series of state-of-the-art results both qualitatively and quantitatively on the large-scale InterHand2.6M dataset.

Weakly-Supervised Single-View Image Relighting
Yi, Renjiao and Zhu, Chenyang and Xu, Kai



Research question: How to effectively relight a single image of Lambertian and low-frequency specular objects.
Motivation: AR applications that insert objects from photographs into new scenes and relight them under the new environment lighting require solving both inverse rendering and re-rendering.
Method: A learning-based approach that resolves the ill-posed inverse rendering with a weakly-supervised low-rank constraint, and uses a differentiable specular rendering layer to re-render low-frequency non-Lambertian materials under various spherical-harmonics illuminations.
Results: Validated on a large-scale dataset and extensive experiments, the method achieves state-of-the-art performance and supports a mobile AR object-insertion application.

We present a learning-based approach to relight a single image of Lambertian and low-frequency specular objects. Our method enables inserting objects from photographs into new scenes and relighting them under the new environment lighting, which is essential for AR applications. To relight the object, we solve both inverse rendering and re-rendering. To resolve the ill-posed inverse rendering, we propose a weakly-supervised method by a low-rank constraint. To facilitate the weakly-supervised training, we contribute Relit, a large-scale (750K images) dataset of videos with aligned objects under changing illuminations. For re-rendering, we propose a differentiable specular rendering layer to render low-frequency non-Lambertian materials under various illuminations of spherical harmonics. The whole pipeline is end-to-end and efficient, allowing for a mobile app implementation of AR object insertion. Extensive evaluations demonstrate that our method achieves state-of-the-art performance. Project page: https://renjiaoyi.github.io/relighting/.
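Spherical-harmonics (SH) lighting, used above for re-rendering, represents the environment light with a handful of coefficients and evaluates shading as a dot product with the SH basis at the surface normal. A minimal sketch with the standard real SH basis up to degree 2 (this is the general SH machinery, not the paper's specific rendering layer):

```python
import numpy as np

def sh_basis(n):
    """Real spherical-harmonic basis up to degree 2 at a unit normal n."""
    x, y, z = n
    return np.array([
        0.282095,                                   # l=0
        0.488603 * y, 0.488603 * z, 0.488603 * x,   # l=1
        1.092548 * x * y, 1.092548 * y * z,         # l=2
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z, 0.546274 * (x * x - y * y),
    ])

def shade(normal, sh_coeffs):
    """Shading as a dot product of 9 SH lighting coefficients with the
    basis at the surface normal (one set of coefficients per channel)."""
    return sh_coeffs @ sh_basis(normal)

# A light with only the constant term is direction-independent (ambient):
ambient = np.array([1.0, 0, 0, 0, 0, 0, 0, 0, 0])
up = shade(np.array([0.0, 0.0, 1.0]), ambient)
side = shade(np.array([1.0, 0.0, 0.0]), ambient)
```

Because shading is linear in the 9 coefficients, relighting under a new environment only changes `sh_coeffs`, which is what makes SH lighting attractive for mobile AR.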

H2ONet: Hand-Occlusion-and-Orientation-Aware Network for Real-Time 3D Hand Mesh Reconstruction
Xu, Hao and Wang, Tianyu and Tang, Xiao and Fu, Chi-Wing



Research question: Real-time 3D hand mesh reconstruction, particularly challenging when the hand is holding an object.
Motivation: Previous reconstruction methods fail to fully exploit non-occluded information across multiple frames to improve reconstruction quality.
Method: Design H2ONet, which decouples hand mesh reconstruction into two branches: one exploiting finger-level non-occluded information and the other exploiting global hand orientation. Also propose finger-level and hand-level occlusion-aware feature fusion strategies to fetch non-occluded information across time frames.
Results: Experiments on the Dex-YCB and HO3D-v2 datasets show that H2ONet runs in real time and achieves state-of-the-art performance in both hand mesh and pose accuracy.

Real-time 3D hand mesh reconstruction is challenging, especially when the hand is holding some object. Beyond the previous methods, we design H2ONet to fully exploit non-occluded information from multiple frames to boost the reconstruction quality. First, we decouple hand mesh reconstruction into two branches, one to exploit finger-level non-occluded information and the other to exploit global hand orientation, with lightweight structures to promote real-time inference. Second, we propose finger-level occlusion-aware feature fusion, leveraging predicted finger-level occlusion information as guidance to fuse finger-level information across time frames. Further, we design hand-level occlusion-aware feature fusion to fetch non-occluded information from nearby time frames. We conduct experiments on the Dex-YCB and HO3D-v2 datasets with challenging hand-object occlusion cases, manifesting that H2ONet is able to run in real-time and achieves state-of-the-art performance on both the hand mesh and pose precision. The code will be released on GitHub.

Structured 3D Features for Reconstructing Controllable Avatars
Corona, Enric and Zanfir, Mihai and Alldieck, Thiemo and Bazavan, Eduard Gabriel and Zanfir, Andrei and Sminchisescu, Cristian



Research question: How to use a novel implicit 3D representation that pools pixel-aligned image features onto dense 3D points sampled from a parametric, statistical human mesh surface, for optimal coverage of the person of interest and an animatable 3D reconstruction.
Motivation: Existing models capture only body shape and cannot handle details such as accessories, hair, and loose clothing. We therefore propose a 3D transformer-based attention framework that, given a single image of a person in an unconstrained pose, generates an animatable 3D reconstruction with albedo and illumination decomposition.
Method: Pool pixel-aligned image features onto dense 3D points sampled from a parametric, statistical human mesh surface, forming semantically meaningful 3D points that can move freely in 3D space; then train a single end-to-end model for monocular 3D reconstruction together with albedo and shading decomposition.
Results: The S3F model surpasses the previous state of the art on various tasks, including monocular 3D reconstruction as well as albedo and shading estimation. The method also supports novel view synthesis, relighting, and re-posing the reconstruction, and extends naturally to multiple input images (e.g., multiple views of the same person, or different poses in a video). Finally, we show the model's editing capabilities for 3D virtual try-on applications.

We introduce Structured 3D Features, a model based on a novel implicit 3D representation that pools pixel-aligned image features onto dense 3D points sampled from a parametric, statistical human mesh surface. The 3D points have associated semantics and can move freely in 3D space. This allows for optimal coverage of the person of interest, beyond just the body shape, which in turn, additionally helps modeling accessories, hair, and loose clothing. Owing to this, we present a complete 3D transformer-based attention framework which, given a single image of a person in an unconstrained pose, generates an animatable 3D reconstruction with albedo and illumination decomposition, as a result of a single end-to-end model, trained semi-supervised, and with no additional postprocessing. We show that our S3F model surpasses the previous state-of-the-art on various tasks, including monocular 3D reconstruction, as well as albedo & shading estimation. Moreover, we show that the proposed methodology allows novel view synthesis, relighting, and re-posing the reconstruction, and can naturally be extended to handle multiple input images (e.g. different views of a person, or the same view, in different poses, in video). Finally, we demonstrate the editing capabilities of our model for 3D virtual try-on applications.

In-Hand 3D Object Scanning From an RGB Sequence
Hampali, Shreyas and Hodan, Tomas and Tran, Luan and Ma, Lingni and Keskin, Cem and Lepetit, Vincent



Research question: How to perform in-hand 3D scanning of an unknown object with a monocular camera.
Motivation: Most current NeRF-based methods require known camera-object relative poses; this method removes that assumption.
Method: An incremental approach that first splits the sequence into carefully selected overlapping segments, then reconstructs the object shape and tracks its pose independently within each segment, and finally merges all segments and performs a global optimization.
Results: The method reconstructs the shape and color of both textured and challenging texture-less objects, outperforms classical methods that rely only on appearance features, and approaches the performance of recent methods that assume known camera poses.

We propose a method for in-hand 3D scanning of an unknown object with a monocular camera. Our method relies on a neural implicit surface representation that captures both the geometry and the appearance of the object, however, by contrast with most NeRF-based methods, we do not assume that the camera-object relative poses are known. Instead, we simultaneously optimize both the object shape and the pose trajectory. As direct optimization over all shape and pose parameters is prone to fail without coarse-level initialization, we propose an incremental approach that starts by splitting the sequence into carefully selected overlapping segments within which the optimization is likely to succeed. We reconstruct the object shape and track its poses independently within each segment, then merge all the segments before performing a global optimization. We show that our method is able to reconstruct the shape and color of both textured and challenging texture-less objects, outperforms classical methods that rely only on appearance features, and that its performance is close to recent methods that assume known camera poses.

OmniVidar: Omnidirectional Depth Estimation From Multi-Fisheye Images
Xie, Sheng and Wang, Daochuan and Liu, Yun-Hui



Research question: How to estimate depth from four large field-of-view (FoV) cameras, a difficult and understudied problem.
Motivation: Existing methods cannot solve this problem effectively, so a new solution is needed.
Method: A new system, OmniVidar, reduces the complex multi-view depth estimation problem to a more tractable binocular one. It contains three components: (1) a new camera model that addresses the shortcomings of existing models; (2) a new multi-fisheye-camera-based epipolar rectification method that handles image distortion and simplifies the depth estimation problem; (3) an improved binocular depth estimation network that strikes a better balance between accuracy and efficiency.
Results: Experiments show that OmniVidar outperforms all other methods in both accuracy and performance.

Estimating depth from four large field of view (FoV) cameras has been a difficult and understudied problem. In this paper, we propose a novel and simple system that converts this difficult problem into an easier binocular depth estimation problem. We name this system OmniVidar, as its results are similar to those of LiDAR but rely only on vision. OmniVidar contains three components: (1) a new camera model to address the shortcomings of existing models, (2) a new multi-fisheye camera based epipolar rectification method for solving the image distortion and simplifying the depth estimation problem, and (3) an improved binocular depth estimation network, which achieves a better balance between accuracy and efficiency. Unlike other omnidirectional stereo vision methods, OmniVidar does not contain any 3D convolution, so it can achieve higher resolution depth estimation at fast speed. Results demonstrate that OmniVidar outperforms all other methods in terms of accuracy and performance.

Octree Guided Unoriented Surface Reconstruction
Koneputugodage, Chamin Hewa and Ben-Shabat, Yizhak and Gould, Stephen



Research question: Surface reconstruction from unoriented point clouds.
Motivation: Implicit neural representations (INRs) are popular for this task, but when information about the inside versus outside of the shape is unavailable (such as shape occupancy, signed distances, or surface normal orientation), optimization relies on heuristics and regularizers to recover the surface, which can converge slowly and easily get stuck in local minima.
Method: A two-step approach, OG-INR: first construct a discrete octree and label inside versus outside, then optimize a continuous, high-fidelity shape with an INR initially guided by the octree labelling. To solve the labelling problem, an energy function is defined over the discrete structure, together with an efficient move-making algorithm that explores many possible labellings. Knowledge can also easily be injected into the discrete octree, providing a simple way to influence the result of the continuous INR.
Results: The method is evaluated on two unoriented surface reconstruction datasets and compared against other unoriented, and some oriented, methods, showing competitive performance. The results indicate that the exploration performed by the move-making algorithm avoids many of the bad local minima reached by purely gradient-descent-optimized methods (see Figure 1).

We address the problem of surface reconstruction from unoriented point clouds. Implicit neural representations (INRs) have become popular for this task, but when information relating to the inside versus outside of a shape is not available (such as shape occupancy, signed distances or surface normal orientation) optimization relies on heuristics and regularizers to recover the surface. These methods can be slow to converge and easily get stuck in local minima. We propose a two-step approach, OG-INR, where we (1) construct a discrete octree and label what is inside and outside, and (2) optimize for a continuous and high-fidelity shape using an INR that is initially guided by the octree's labelling. To solve for our labelling, we propose an energy function over the discrete structure and provide an efficient move-making algorithm that explores many possible labellings. Furthermore, we show that we can easily inject knowledge into the discrete octree, providing a simple way to influence the result from the continuous INR. We evaluate the effectiveness of our approach on two unoriented surface reconstruction datasets and show competitive performance compared to other unoriented, and some oriented, methods. Our results show that the exploration by the move-making algorithm avoids many of the bad local minima reached by purely gradient descent optimized methods (see Figure 1).
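The move-making idea — greedily exploring label flips on a discrete structure to lower an energy combining per-cell (unary) costs with a smoothness penalty — can be sketched on a toy 2-D grid. The Potts-style energy and the single-flip move set here are assumptions for illustration; the paper's actual energy and moves operate on an octree.

```python
import numpy as np

def energy(labels, unary, lam=0.5):
    """Unary cost of each cell's label plus lam per disagreeing 4-neighbour pair."""
    h, w = labels.shape
    e = unary[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    e += lam * (labels[1:, :] != labels[:-1, :]).sum()   # vertical neighbours
    e += lam * (labels[:, 1:] != labels[:, :-1]).sum()   # horizontal neighbours
    return float(e)

def move_making(labels, unary, lam=0.5, max_passes=50):
    """Greedy move-making: accept any single-cell flip that lowers the energy."""
    labels = labels.copy()
    best = energy(labels, unary, lam)
    for _ in range(max_passes):
        improved = False
        for i in range(labels.shape[0]):
            for j in range(labels.shape[1]):
                labels[i, j] ^= 1                  # tentative flip
                e = energy(labels, unary, lam)
                if e < best:
                    best, improved = e, True
                else:
                    labels[i, j] ^= 1              # revert
        if not improved:
            break
    return labels, best

# Unary costs favouring label 0 on the left half and label 1 on the right half.
unary = np.zeros((4, 4, 2))
unary[:, 2:, 0] = 2.0   # right half: keeping label 0 is expensive
unary[:, :2, 1] = 2.0   # left half: taking label 1 is expensive
start = np.zeros((4, 4), dtype=int)
final_labels, final_energy = move_making(start, unary)
```

Starting from an all-zero labelling (energy 16.0), the greedy flips cascade from the right half until the labelling matches the unary preferences, leaving only the unavoidable boundary cost.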

Rigidity-Aware Detection for 6D Object Pose Estimation
Hai, Yang and Song, Rui and Li, Jiaojiao and Salzmann, Mathieu and Hu, Yinlin



Research question: Existing 6D object pose estimation methods suffer in practice because inaccurate initial 2D bounding-box localization provides poor initialization for the subsequent pose network.
Motivation: To address this, the paper proposes a rigidity-aware detection method that exploits the fact that the target objects in 6D pose estimation are rigid, sampling positive regions from the entire visible object area rather than naively from the bounding-box center, which may be occluded.
Method: A visibility map is built using the minimum barrier distance between every pixel in the bounding box and the box boundary, so that every visible object part can contribute to the final bounding-box prediction, improving detection robustness.
Results: Experiments on seven challenging 6D pose estimation datasets show a large improvement over general detection frameworks. Combined with a pose regression network, the method achieves state-of-the-art pose estimation results on the challenging BOP benchmark.

Most recent 6D object pose estimation methods first use object detection to obtain 2D bounding boxes before actually regressing the pose. However, the general object detection methods they use are ill-suited to handle cluttered scenes, thus producing poor initialization to the subsequent pose network. To address this, we propose a rigidity-aware detection method exploiting the fact that, in 6D pose estimation, the target objects are rigid. This lets us introduce an approach to sampling positive object regions from the entire visible object area during training, instead of naively drawing samples from the bounding box center where the object might be occluded. As such, every visible object part can contribute to the final bounding box prediction, yielding better detection robustness. Key to the success of our approach is a visibility map, which we propose to build using a minimum barrier distance between every pixel in the bounding box and the box boundary. Our results on seven challenging 6D pose estimation datasets evidence that our method outperforms general detection frameworks by a large margin. Furthermore, combined with a pose regression network, we obtain state-of-the-art pose estimation results on the challenging BOP benchmark.
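The paper's visibility map uses a minimum barrier distance, which depends on image content along paths to the boundary. As a simplified, purely geometric proxy, the distance from each pixel to the nearest box edge can be computed as below — an illustrative sketch only, not the paper's actual map.

```python
import numpy as np

def boundary_distance_map(h, w):
    """Per-pixel distance to the nearest bounding-box edge, normalized to [0, 1]."""
    ys = np.arange(h)[:, None]          # (h, 1) row indices
    xs = np.arange(w)[None, :]          # (1, w) column indices
    d = np.minimum(np.minimum(ys, h - 1 - ys),
                   np.minimum(xs, w - 1 - xs)).astype(float)
    return d / d.max()

vis = boundary_distance_map(5, 7)
```

Pixels on the box boundary score 0 and pixels near the centre score highest, which is the shape of prior a sampling scheme could use before weighting by actual visibility.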

DP-NeRF: Deblurred Neural Radiance Field With Physical Scene Priors
Lee, Dogyoon and Lee, Minhyeok and Shin, Chajin and Lee, Sangyoun



Research question: Existing NeRF models for blurred images do not account for geometric and appearance consistency in 3D space, degrading the perceptual quality of the reconstructed scene.
Motivation: To address this, the paper proposes DP-NeRF, a novel clean NeRF framework for handling blurred images.
Method: DP-NeRF constrains the model with two physical priors derived from the actual blurring process during image acquisition by the camera. Specifically, it proposes a rigid blurring kernel that uses the physical priors to impose 3D consistency, and an adaptive weight proposal that accounts for the relationship between depth and blur to refine the color composition error.
Results: Experiments show that DP-NeRF successfully improves the perceptual quality of the reconstructed NeRF while ensuring 3D geometric and appearance consistency; comprehensive ablation analysis further demonstrates the model's effectiveness.

Neural Radiance Field (NeRF) has exhibited outstanding three-dimensional (3D) reconstruction quality via novel view synthesis from multi-view images and paired calibrated camera parameters. However, previous NeRF-based systems have been demonstrated under strictly controlled settings, with little attention paid to less ideal scenarios, including the presence of noise such as exposure, illumination changes, and blur. In particular, though blur frequently occurs in real situations, NeRF that can handle blurred images has received little attention. The few studies that have investigated NeRF for blurred images have not considered geometric and appearance consistency in 3D space, which is one of the most important factors in 3D reconstruction. This leads to inconsistency and the degradation of the perceptual quality of the constructed scene. Hence, this paper proposes DP-NeRF, a novel clean NeRF framework for blurred images, which is constrained with two physical priors. These priors are derived from the actual blurring process during image acquisition by the camera. DP-NeRF proposes a rigid blurring kernel to impose 3D consistency utilizing the physical priors and an adaptive weight proposal to refine the color composition error in consideration of the relationship between depth and blur. We present extensive experimental results for synthetic and real scenes with two types of blur: camera motion blur and defocus blur. The results demonstrate that DP-NeRF successfully improves the perceptual quality of the constructed NeRF ensuring 3D geometric and appearance consistency. We further demonstrate the effectiveness of our model with comprehensive ablation analysis.

MACARONS: Mapping and Coverage Anticipation With RGB Online Self-Supervision
Gu\'edon, Antoine and Monnier, Tom and Monasse, Pascal and Lepetit, Vincent



Research question: How to simultaneously explore and reconstruct large environments from color images alone, solving the Next Best View problem.
Motivation: Most current methods rely on depth sensors, need 3D supervision, and cannot scale to large scenes.
Method: A self-supervised approach that uses only a color camera to predict a volume occupancy field and, from this field, predict the next best view.
Results: Tested on a dataset of various 3D scenes, the method outperforms recent approaches that require a depth sensor, and suits outdoor scenes captured with a flying drone.

We introduce a method that simultaneously learns to explore new large environments and to reconstruct them in 3D from color images only. This is closely related to the Next Best View problem (NBV), where one has to identify where to move the camera next to improve the coverage of an unknown scene. However, most of the current NBV methods rely on depth sensors, need 3D supervision and/or do not scale to large scenes. Our method requires only a color camera and no 3D supervision. It simultaneously learns in a self-supervised fashion to predict a volume occupancy field from color images and, from this field, to predict the NBV. Thanks to this approach, our method performs well on new scenes as it is not biased towards any training 3D data. We demonstrate this on a recent dataset made of various 3D scenes and show it performs even better than recent methods requiring a depth sensor, which is not a realistic assumption for outdoor scenes captured with a flying drone.
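The Next Best View step can be illustrated as greedy coverage-gain maximization: among candidate camera poses, pick the one whose predicted observation would reveal the most voxels not yet covered. The data structures and names below are hypothetical; in the paper the candidate coverage comes from the learned occupancy field rather than precomputed sets.

```python
def next_best_view(candidate_views, covered):
    """Greedy NBV: choose the view that would reveal the most uncovered voxels.

    candidate_views: dict mapping view id -> set of voxel ids it would observe.
    covered: set of voxel ids already reconstructed.
    """
    return max(candidate_views, key=lambda v: len(candidate_views[v] - covered))

views = {
    "left":  {1, 2, 3},      # mostly re-observes known voxels
    "ahead": {4, 5, 6},      # all new coverage
    "right": {2, 3, 7},
}
choice = next_best_view(views, covered={1, 2, 3})
```

Here "ahead" wins because all three of its voxels are new, while the other views largely revisit already-covered space.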

REC-MV: REconstructing 3D Dynamic Cloth From Monocular Videos
Qiu, Lingteng and Chen, Guanying and Zhou, Jiapeng and Xu, Mutian and Wang, Junle and Han, Xiaoguang



Research question: How to reconstruct dynamic 3D garment surfaces with open boundaries from monocular videos?
Motivation: Existing neural rendering methods cannot separate the garment surface from the body, and garment reconstruction methods based on feature-curve representations struggle to produce temporally consistent surfaces from video input.
Method: The paper formulates this task as an optimization problem over 3D garment feature curves and surface reconstruction, and proposes a novel method, REC-MV, that jointly optimizes the explicit feature curves and the garments' implicit signed distance field (SDF). The open garment meshes are then extracted via garment template registration in canonical space.
Results: Experiments show the method outperforms existing approaches and produces high-quality dynamic garment surfaces.

Reconstructing dynamic 3D garment surfaces with open boundaries from monocular videos is an important problem as it provides a practical and low-cost solution for clothes digitization. Recent neural rendering methods achieve high-quality dynamic clothed human reconstruction results from monocular video, but these methods cannot separate the garment surface from the body. Moreover, despite existing garment reconstruction methods based on feature curve representation demonstrating impressive results for garment reconstruction from a single image, they struggle to generate temporally consistent surfaces for the video input. To address the above limitations, in this paper, we formulate this task as an optimization problem of 3D garment feature curves and surface reconstruction from monocular video. We introduce a novel approach, called REC-MV to jointly optimize the explicit feature curves and the implicit signed distance field (SDF) of the garments. Then the open garment meshes can be extracted via garment template registration in the canonical space. Experiments on multiple casually captured datasets show that our approach outperforms existing methods and can produce high-quality dynamic garment surfaces.

RUST: Latent Neural Scene Representations From Unposed Imagery
Sajjadi, Mehdi S.M. and Mahendran, Aravindh and Kipf, Thomas and Pot, Etienne and Duckworth, Daniel and Lu\v{c



Research question: Inferring the structure of 3D scenes from 2D observations is a fundamental challenge in computer vision.
Motivation: Recently popularized approaches based on neural scene representations have had tremendous impact across a variety of applications, but training a single model whose latent representations generalize effectively beyond a single scene remains one of the major challenges in this space.
Method: The paper proposes RUST (Really Unposed Scene representation Transformer), a pose-free novel view synthesis method trained on RGB images alone. The key insight is that a Pose Encoder can be trained to peek at the target image and learn a latent pose embedding used for view synthesis.
Results: An empirical study of the learned latent pose structure shows that it allows meaningful test-time camera transformations and accurate explicit pose readouts. Surprisingly, RUST achieves quality similar to methods with access to perfect camera poses, unlocking the potential for large-scale training of shared neural scene representations.

Inferring the structure of 3D scenes from 2D observations is a fundamental challenge in computer vision. Recently popularized approaches based on neural scene representations have achieved tremendous impact and have been applied across a variety of applications. One of the major remaining challenges in this space is training a single model which can provide latent representations which effectively generalize beyond a single scene. Scene Representation Transformer (SRT) has shown promise in this direction, but scaling it to a larger set of diverse scenes is challenging and necessitates accurately posed ground truth data. To address this problem, we propose RUST (Really Unposed Scene representation Transformer), a pose-free approach to novel view synthesis trained on RGB images alone. Our main insight is that one can train a Pose Encoder that peeks at the target image and learns a latent pose embedding which is used by the decoder for view synthesis. We perform an empirical investigation into the learned latent pose structure and show that it allows meaningful test-time camera transformations and accurate explicit pose readouts. Perhaps surprisingly, RUST achieves similar quality as methods which have access to perfect camera pose, thereby unlocking the potential for large-scale training of amortized neural scene representations.

Spatio-Focal Bidirectional Disparity Estimation From a Dual-Pixel Image
Kim, Donggun and Jang, Hyeonjoong and Kim, Inchul and Kim, Min H.



Research question: How to fully exploit the ultra-high resolution of dual-pixel photography while addressing its degraded depth estimation performance.
Motivation: Dual-pixel photography exhibits bidirectional disparity, and capturing a wide range of dual-pixel disparity requires a shallow depth of field, which severely blurs the image and degrades depth estimation performance.
Method: A self-supervised learning method that learns bidirectional disparity by exploiting the nature of anisotropic blur kernels in dual-pixel photography.
Results: The method does not rely on a dual-pixel disparity training dataset, which does not yet exist, and can estimate a complete disparity map from a dual-pixel image, outperforming baseline dual-pixel methods.

Dual-pixel photography is monocular RGB-D photography with an ultra-high resolution, enabling many applications in computational photography. However, there are still several challenges to fully utilizing dual-pixel photography. Unlike the conventional stereo pair, the dual pixel exhibits a bidirectional disparity that includes positive and negative values, depending on the focus plane depth in an image. Furthermore, capturing a wide range of dual-pixel disparity requires a shallow depth of field, resulting in a severely blurred image, degrading depth estimation performance. Recently, several data-driven approaches have been proposed to mitigate these two challenges. However, due to the lack of the ground-truth dataset of the dual-pixel disparity, existing data-driven methods estimate either inverse depth or blurriness map. In this work, we propose a self-supervised learning method that learns bidirectional disparity by utilizing the nature of anisotropic blur kernels in dual-pixel photography. We observe that the dual-pixel left/right images have reflective-symmetric anisotropic kernels, so their sum is equivalent to that of a conventional image. We take a self-supervised training approach with the novel kernel-split symmetry loss accounting for the phenomenon. Our method does not rely on a training dataset of dual-pixel disparity that does not exist yet. Our method can estimate a complete disparity map with respect to the focus-plane depth from a dual-pixel image, outperforming the baseline dual-pixel methods.
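The key observation — that the left/right dual-pixel images have mirror-symmetric blur kernels whose sum behaves like a conventional image — follows from the linearity of convolution and can be checked numerically in 1-D. The half-kernel shapes below are made up for illustration; only the mirror relationship matters.

```python
import numpy as np

rng = np.random.default_rng(0)
scene = rng.random(32)                      # 1-D all-in-focus signal

# Hypothetical half kernels: the right kernel mirrors the left one.
k_left = np.array([0.15, 0.20, 0.15, 0.0, 0.0])
k_right = k_left[::-1]

left_view = np.convolve(scene, k_left, mode="same")
right_view = np.convolve(scene, k_right, mode="same")
conventional = np.convolve(scene, k_left + k_right, mode="same")
```

Because convolution is linear, the sum of the two dual-pixel views equals the scene blurred by the combined (and symmetric) kernel — the property the paper's kernel-split symmetry loss builds on.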

Four-View Geometry With Unknown Radial Distortion
Hruby, Petr and Korotynskiy, Viktor and Duff, Timothy and Oeding, Luke and Pollefeys, Marc and Pajdla, Tomas and Larsson, Viktor



Research question: Estimating relative pose from images whose calibration parameters (focal lengths and radial distortion) are unknown.
Motivation: Existing methods must model these parameters to obtain a metric reconstruction, whereas this approach does not.
Method: A new solution that treats both the calibrated and uncalibrated cameras via a minimal configuration of 13 points in 4 views, decomposing the problem into a sequence of subproblems.
Results: Experiments on simulated and real data show the method outperforms previous calibration-free solutions and provides an efficient startup for an SfM pipeline with radial cameras.

We present novel solutions to previously unsolved problems of relative pose estimation from images whose calibration parameters, namely focal lengths and radial distortion, are unknown. Our approach enables metric reconstruction without modeling these parameters. The minimal case for reconstruction requires 13 points in 4 views for both the calibrated and uncalibrated cameras. We describe and implement the first solution to these minimal problems. In the calibrated case, this may be modeled as a polynomial system of equations with 3584 solutions. Despite the apparent intractability, the problem decomposes spectacularly. Each solution falls into a Euclidean symmetry class of size 16, and we can estimate 224 class representatives by solving a sequence of three subproblems with 28, 2, and 4 solutions. We highlight the relationship between internal constraints on the radial quadrifocal tensor and the relations among the principal minors of a 4x4 matrix. We also address the case of 4 upright cameras, where 7 points are minimal. Finally, we evaluate our approach on simulated and real data and benchmark against previous calibration-free solutions, and show that our method provides an efficient startup for an SfM pipeline with radial cameras.
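The solution counts quoted in the abstract are internally consistent, as a quick arithmetic check shows:

```python
total_solutions = 3584          # solutions of the calibrated polynomial system
symmetry_class_size = 16        # size of each Euclidean symmetry class
representatives = total_solutions // symmetry_class_size
assert representatives == 224

# The 224 class representatives are estimated by solving a sequence of
# three subproblems with 28, 2, and 4 solutions, whose counts multiply:
assert 28 * 2 * 4 == representatives
```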

HOOD: Hierarchical Graphs for Generalized Modelling of Clothing Dynamics
Grigorev, Artur and Black, Michael J. and Hilliges, Otmar



Research question: A method leveraging graph neural networks, multi-level message passing, and unsupervised training to predict realistic clothing dynamics in real time.
Motivation: Existing methods based on linear blend skinning must be trained for specific garments, whereas this method is agnostic to body shape and applies to tight-fitting garments as well as loose, free-flowing clothing.
Method: The method further handles changes in topology (e.g., garments with buttons or zippers) and in material properties at inference time. A hierarchical message-passing scheme is proposed that efficiently propagates stiff stretching modes while preserving local detail.
Results: Experiments show the method quantitatively outperforms strong baselines, and its results are perceived as more realistic than those of state-of-the-art methods.

We propose a method that leverages graph neural networks, multi-level message passing, and unsupervised training to enable real-time prediction of realistic clothing dynamics. Whereas existing methods based on linear blend skinning must be trained for specific garments, our method is agnostic to body shape and applies to tight-fitting garments as well as loose, free-flowing clothing. Our method furthermore handles changes in topology (e.g., garments with buttons or zippers) and material properties at inference time. As one key contribution, we propose a hierarchical message-passing scheme that efficiently propagates stiff stretching modes while preserving local detail. We empirically show that our method outperforms strong baselines quantitatively and that its results are perceived as more realistic than state-of-the-art methods.

HyperReel: High-Fidelity 6-DoF Video With Ray-Conditioned Sampling
Attal, Benjamin and Huang, Jia-Bin and Richardt, Christian and Zollh\"ofer, Michael and Kopf, Johannes and O{\textquoteright



Research question: Existing volumetric scene representations require careful trade-offs among quality, rendering speed, and memory efficiency, particularly with respect to real-time performance, small memory footprint, and high-quality rendering.
Motivation: Existing methods fail to simultaneously achieve real-time performance, a small memory footprint, and high-quality rendering for challenging real-world scenes.
Method: The paper proposes HyperReel, a novel 6-DoF video representation with two core components: (1) a ray-conditioned sample prediction network enabling high-fidelity, high-frame-rate rendering at high resolutions; (2) a compact, memory-efficient dynamic volume representation.
Results: Compared with prior and contemporary approaches, HyperReel achieves the best visual quality with small memory requirements, while rendering at up to 18 frames per second at megapixel resolution without any custom CUDA code.

Volumetric scene representations enable photorealistic view synthesis for static scenes and form the basis of several existing 6-DoF video techniques. However, the volume rendering procedures that drive these representations necessitate careful trade-offs in terms of quality, rendering speed, and memory efficiency. In particular, existing methods fail to simultaneously achieve real-time performance, small memory footprint, and high-quality rendering for challenging real-world scenes. To address these issues, we present HyperReel --- a novel 6-DoF video representation. The two core components of HyperReel are: (1) a ray-conditioned sample prediction network that enables high-fidelity, high frame rate rendering at high resolutions and (2) a compact and memory-efficient dynamic volume representation. Our 6-DoF video pipeline achieves the best performance compared to prior and contemporary approaches in terms of visual quality with small memory requirements, while also rendering at up to 18 frames-per-second at megapixel resolution without any custom CUDA code.

Pose Synchronization Under Multiple Pair-Wise Relative Poses
Sun, Yifan and Huang, Qixing



Research question: How to perform pose synchronization when a large fraction of the relative pose estimates between object pairs are incorrect.
Motivation: Existing methods that solve pose synchronization by recovering a low-rank matrix encoding the relative poses fail in the presence of many incorrect relative pose estimates.
Method: A three-step algorithm for pose synchronization under multiple relative pose inputs. The first step performs diffusion and clustering to compute candidate poses for the input objects; the second jointly optimizes the best pose for each object; the third refines the output of the second step.
Results: Experiments on structure-from-motion and scan-based geometry reconstruction benchmarks show the approach offers more accurate absolute poses than state-of-the-art pose synchronization techniques.

Pose synchronization, which seeks to estimate consistent absolute poses among a collection of objects from noisy relative poses estimated between pairs of objects in isolation, is a fundamental problem in many inverse applications. This paper studies an extreme setting where multiple relative pose estimates exist between each object pair, and the majority is incorrect. Popular methods that solve pose synchronization via recovering a low-rank matrix that encodes relative poses in blocks fail under this extreme setting. We introduce a three-step algorithm for pose synchronization under multiple relative pose inputs. The first step performs diffusion and clustering to compute the candidate poses of the input objects. We present a theoretical result to justify our diffusion formulation. The second step jointly optimizes the best pose for each object. The final step refines the output of the second step. Experimental results on benchmark datasets of structure-from-motion and scan-based geometry reconstruction show that our approach offers more accurate absolute poses than state-of-the-art pose synchronization techniques.
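The intuition behind the first step — that correct relative pose estimates concentrate while the incorrect majority scatters, so clustering can extract candidates — can be sketched on a toy 1-D rotation example. The binning scheme and function name are illustrative assumptions, far simpler than the paper's diffusion formulation.

```python
import numpy as np

def candidate_pose(angle_estimates, bin_width=10.0):
    """Toy stand-in for the clustering step: return the densest cluster's mean.

    angle_estimates: noisy relative-rotation estimates (degrees) for one pair;
    most may be outliers, but the correct ones agree with each other.
    """
    angles = np.asarray(angle_estimates, dtype=float)
    bins = np.round(angles / bin_width)
    values, counts = np.unique(bins, return_counts=True)
    densest = values[np.argmax(counts)]
    return angles[bins == densest].mean()

# Four consistent estimates near 30 degrees among scattered outliers.
estimates = [31.0, 29.5, 120.0, 250.0, 30.5, 80.0, 30.2]
cand = candidate_pose(estimates)
```

Even though a majority-vote over all estimates would fail here, the densest cluster recovers a candidate close to the consistent value of 30 degrees.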

Virtual Occlusions Through Implicit Depth
Watson, Jamie and Sayed, Mohamed and Qureshi, Zawar and Brostow, Gabriel J. and Vicente, Sara and Mac Aodha, Oisin and Firman, Michael



Research question: How to improve the occlusion of virtual elements by the real world in augmented reality so that it looks natural.
Motivation: Current depth estimation models are inconsistent near boundaries and across time, which harms the occlusion of virtual elements.
Method: An implicit depth model that directly predicts the occlusion mask, taking as input one or more color images plus the known depth of the virtual geometry.
Results: Experiments show the predictions are more accurate and more temporally stable than those of traditional depth estimation models, achieving state-of-the-art occlusion results on the ScanNetv2 dataset and superior qualitative results on real scenes.

For augmented reality (AR), it is important that virtual assets appear to 'sit among' real world objects. The virtual element should variously occlude and be occluded by real matter, based on a plausible depth ordering. This occlusion should be consistent over time as the viewer's camera moves. Unfortunately, small mistakes in the estimated scene depth can ruin the downstream occlusion mask, and thereby the AR illusion. Especially in real-time settings, depths inferred near boundaries or across time can be inconsistent. In this paper, we challenge the need for depth-regression as an intermediate step. We instead propose an implicit model for depth and use that to predict the occlusion mask directly. The inputs to our network are one or more color images, plus the known depths of any virtual geometry. We show how our occlusion predictions are more accurate and more temporally stable than predictions derived from traditional depth-estimation models. We obtain state-of-the-art occlusion results on the challenging ScanNetv2 dataset and superior qualitative results on real scenes.
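The conventional intermediate step that the paper argues against — regress scene depth, then compare it per pixel against the virtual geometry's depth — looks like the sketch below (the paper instead predicts the mask directly; this is the baseline, with illustrative names and values).

```python
import numpy as np

def occlusion_mask_from_depth(scene_depth, virtual_depth):
    """Per-pixel depth test: the virtual asset is visible where it is nearer."""
    return virtual_depth < scene_depth

scene_depth = np.array([[2.0, 1.0],
                        [3.0, 0.5]])          # estimated real-world depth (m)
virtual_depth = np.full((2, 2), 1.5)          # flat virtual object at 1.5 m
mask = occlusion_mask_from_depth(scene_depth, virtual_depth)
```

A small error in `scene_depth` near an object boundary flips the comparison and ruins the mask, which is exactly why the paper predicts the occlusion mask directly instead.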

Instant Multi-View Head Capture Through Learnable Registration
Bolkart, Timo and Li, Tianye and Black, Michael J.



Research question: How to infer 3D heads in dense correspondence directly from calibrated multi-view images.
Motivation: Existing methods for capturing 3D head datasets are slow and usually proceed in two separate steps: multi-view stereo (MVS) reconstruction followed by non-rigid registration.
Method: The proposed TEMPEH model infers 3D heads in dense correspondence directly from calibrated multi-view images, while jointly registering the 3D head dataset during training.
Results: Predicting one head takes about 0.3 seconds, with a median reconstruction error of 0.26 mm, 64% lower than the current state of the art, enabling efficient capture of large datasets containing multiple people and diverse facial motions.

Existing methods for capturing datasets of 3D heads in dense semantic correspondence are slow and commonly address the problem in two separate steps; multi-view stereo (MVS) reconstruction followed by non-rigid registration. To simplify this process, we introduce TEMPEH (Towards Estimation of 3D Meshes from Performances of Expressive Heads) to directly infer 3D heads in dense correspondence from calibrated multi-view images. Registering datasets of 3D scans typically requires manual parameter tuning to find the right balance between accurately fitting the scans' surfaces and being robust to scanning noise and outliers. Instead, we propose to jointly register a 3D head dataset while training TEMPEH. Specifically, during training, we minimize a geometric loss commonly used for surface registration, effectively leveraging TEMPEH as a regularizer. Our multi-view head inference builds on a volumetric feature representation that samples and fuses features from each view using camera calibration information. To account for partial occlusions and a large capture volume that enables head movements, we use view- and surface-aware feature fusion, and a spatial transformer-based head localization module, respectively. We use raw MVS scans as supervision during training, but, once trained, TEMPEH directly predicts 3D heads in dense correspondence without requiring scans. Predicting one head takes about 0.3 seconds with a median reconstruction error of 0.26 mm, 64% lower than the current state-of-the-art. This enables the efficient capture of large datasets containing multiple people and diverse facial motions. Code, model, and data are publicly available at https://tempeh.is.tue.mpg.de.

NeAT: Learning Neural Implicit Surfaces With Arbitrary Topologies From Multi-View Images
Meng, Xiaoxu and Chen, Weikai and Yang, Bo



Research question: Existing neural implicit functions for reconstructing high-fidelity 3D shapes can handle only closed surfaces.
Motivation: Current neural rendering methods require the surface to be represented by a signed distance field and are therefore limited to closed 3D shapes.
Method: The paper proposes NeAT, a new neural rendering framework that learns implicit surfaces with arbitrary topologies from multi-view images. Specifically, NeAT represents the 3D surface as the level set of a signed distance function (SDF) augmented with a validity branch that estimates the probability of surface existence at the query positions.
Results: Experiments show that NeAT not only faithfully reconstructs watertight surfaces but also significantly outperforms state-of-the-art methods on the task of open surface reconstruction.

Recent progress in neural implicit functions has set new state-of-the-art in reconstructing high-fidelity 3D shapes from a collection of images. However, these approaches are limited to closed surfaces as they require the surface to be represented by a signed distance field. In this paper, we propose NeAT, a new neural rendering framework that can learn implicit surfaces with arbitrary topologies from multi-view images. In particular, NeAT represents the 3D surface as a level set of a signed distance function (SDF) with a validity branch for estimating the surface existence probability at the query positions. We also develop a novel neural volume rendering method, which uses SDF and validity to calculate the volume opacity and avoids rendering points with low validity. NeAT supports easy field-to-mesh conversion using the classic Marching Cubes algorithm. Extensive experiments on DTU, MGN, and Deep Fashion 3D datasets indicate that our approach is able to faithfully reconstruct both watertight and non-watertight surfaces. In particular, NeAT significantly outperforms the state-of-the-art methods in the task of open surface reconstruction both quantitatively and qualitatively.
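The validity-gated rendering idea — skip samples whose estimated surface-existence probability is low when compositing along a ray — can be sketched as standard front-to-back alpha compositing with a validity mask. The threshold, shapes, and direct use of alphas are simplifying assumptions; in the paper, opacity is derived from the SDF.

```python
import numpy as np

def composite(alphas, colors, validity, v_thresh=0.5):
    """Front-to-back alpha compositing that skips low-validity samples.

    alphas: (N,) per-sample opacities along the ray.
    colors: (N, 3) per-sample colors.
    validity: (N,) estimated surface-existence probabilities.
    """
    a = np.where(validity >= v_thresh, alphas, 0.0)   # invalid samples drop out
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - a[:-1]]))
    weights = transmittance * a
    return weights @ colors

alphas = np.array([0.5, 0.5])
colors = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
validity = np.array([1.0, 0.0])      # second sample deemed non-surface
rgb = composite(alphas, colors, validity)
```

With the second sample gated out, only the first (valid) sample contributes, which is how open surfaces avoid spurious color from the "missing" back side.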

SPARF: Neural Radiance Fields From Sparse and Noisy Poses
Truong, Prune and Rakotosaona, Marie-Julie and Manhardt, Fabian and Tombari, Federico



Research question: How to perform novel view synthesis from sparse input views.
Motivation: Existing Neural Radiance Field (NeRF) models require dense input views with highly accurate camera poses, limiting their applicability in real-world scenarios.
Method: The paper proposes Sparse Pose Adjusting Radiance Field (SPARF), which exploits multi-view geometry constraints to jointly learn the NeRF and refine the camera poses.
Results: The method sets a new state of the art in the sparse-view regime on multiple challenging datasets.

Neural Radiance Field (NeRF) has recently emerged as a powerful representation to synthesize photorealistic novel views. While showing impressive performance, it relies on the availability of dense input views with highly accurate camera poses, thus limiting its application in real-world scenarios. In this work, we introduce Sparse Pose Adjusting Radiance Field (SPARF), to address the challenge of novel-view synthesis given only few wide-baseline input images (as low as 3) with noisy camera poses. Our approach exploits multi-view geometry constraints in order to jointly learn the NeRF and refine the camera poses. By relying on pixel matches extracted between the input views, our multi-view correspondence objective enforces the optimized scene and camera poses to converge to a global and geometrically accurate solution. Our depth consistency loss further encourages the reconstructed scene to be consistent from any viewpoint. Our approach sets a new state of the art in the sparse-view regime on multiple challenging datasets.
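The multi-view correspondence objective penalizes, for each pixel match between input views, the reprojection error under the current scene and pose estimates. The residual for a single 3-D point under a pinhole model can be sketched as below (names and values are illustrative; the paper optimizes this jointly with the NeRF).

```python
import numpy as np

def reprojection_residual(K, R, t, X, x_obs):
    """Pixel distance between a projected 3-D point and its observed match.

    K: (3, 3) intrinsics; R, t: world-to-camera rotation and translation;
    X: (3,) world point; x_obs: (2,) observed pixel location of the match.
    """
    Xc = R @ X + t                 # world -> camera coordinates
    p = K @ Xc
    x_proj = p[:2] / p[2]          # perspective division
    return np.linalg.norm(x_proj - x_obs)

K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
resid = reprojection_residual(K, np.eye(3), np.zeros(3),
                              X=np.array([0.0, 0.0, 2.0]),
                              x_obs=np.array([50.0, 50.0]))
```

When the poses and scene geometry are correct, residuals like this vanish; gradients of the accumulated residuals are what drag noisy poses toward a globally consistent solution.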

ABLE-NeRF: Attention-Based Rendering With Learnable Embeddings for Neural Radiance Field
Tang, Zhe Jun and Cham, Tat-Jen and Zhao, Haiyu



Research question: Existing Neural Radiance Field (NeRF) methods render glossy and transparent objects poorly.
Motivation: To address this, the paper proposes a self-attention-based framework over volumes along a ray and, inspired by the light probes used in game engines to store local lighting, introduces learnable embeddings to capture view-dependent effects within the scene.
Method: The proposed ABLE-NeRF represents the 3D scene by optimizing a continuous volumetric scene function, applies self-attention over the volumes along each ray, and uses learnable embeddings to capture view-dependent effects.
Results: Experiments show that ABLE-NeRF achieves state-of-the-art results on the Blender dataset, surpassing Ref-NeRF on all three image quality metrics (PSNR, SSIM, LPIPS), significantly reducing "blurry" glossy surfaces and producing realistic translucent surfaces.

Neural Radiance Field (NeRF) is a popular method for representing 3D scenes by optimising a continuous volumetric scene function. Volumetric rendering (VR), which underlies its large success, is also its Achilles' heel in producing view-dependent effects. As a consequence, glossy and transparent surfaces often appear murky. A remedy to reduce these artefacts is to constrain this VR equation by excluding volumes with back-facing normal. While this approach has some success in rendering glossy surfaces, translucent objects are still poorly represented. In this paper, we present an alternative to the physics-based VR approach by introducing a self-attention-based framework on volumes along a ray. In addition, inspired by modern game engines which utilise Light Probes to store local lighting passing through the scene, we incorporate Learnable Embeddings to capture view dependent effects within the scene. Our method, which we call ABLE-NeRF, significantly reduces 'blurry' glossy surfaces in rendering and produces realistic translucent surfaces which prior art lacks. In the Blender dataset, ABLE-NeRF achieves SOTA results and surpasses Ref-NeRF in all 3 image quality metrics PSNR, SSIM, LPIPS.

PermutoSDF: Fast Multi-View Reconstruction With Implicit Surfaces Using Permutohedral Lattices
Rosu, Radu Alexandru and Behnke, Sven



Research question: How to combine hybrid neural radiance-density field methods with hash-based positional encoding to improve the accuracy and efficiency of novel view rendering.
Motivation: Current methods struggle to recover accurate surface geometry, and training and inference are slow.
Method: A new hash-based implicit surface representation that replaces the voxel hash encoding with a permutohedral lattice, which optimizes faster, together with a regularization scheme for recovering high-frequency geometric detail.
Results: Experiments show the method recovers micro-geometry such as pores and wrinkles while maintaining high frame rates, and it is evaluated effectively on multiple datasets.

Neural radiance-density field methods have become increasingly popular for the task of novel-view rendering. Their recent extension to hash-based positional encoding ensures fast training and inference with visually pleasing results. However, density-based methods struggle with recovering accurate surface geometry. Hybrid methods alleviate this issue by optimizing the density based on an underlying SDF. However, current SDF methods are overly smooth and miss fine geometric details. In this work, we combine the strengths of these two lines of work in a novel hash-based implicit surface representation. We propose improvements to the two areas by replacing the voxel hash encoding with a permutohedral lattice which optimizes faster, especially for higher dimensions. We additionally propose a regularization scheme which is crucial for recovering high-frequency geometric detail. We evaluate our method on multiple datasets and show that we can recover geometric detail at the level of pores and wrinkles while using only RGB images for supervision. Furthermore, using sphere tracing we can render novel views at 30 fps on an RTX 3090. Code is publicly available at https://radualexandru.github.io/permuto_sdf
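One reason a permutohedral lattice "optimizes faster, especially for higher dimensions" is interpolation cost: a hypercubic voxel cell requires looking up 2^d corners per query, whereas a permutohedral simplex has only d + 1 vertices. A quick check of that scaling:

```python
# Interpolation cost per query point: hypercube corners vs. simplex vertices.
for d in range(1, 8):
    voxel_corners = 2 ** d        # exponential in dimension
    simplex_vertices = d + 1      # linear in dimension
    assert simplex_vertices <= voxel_corners

assert 2 ** 3 == 8 and 3 + 1 == 4   # in 3-D: 8 lookups vs. 4
```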

1000 FPS HDR Video With a Spike-RGB Hybrid Camera
Chang, Yakun and Zhou, Chu and Hong, Yuchen and Hu, Liwen and Xu, Chao and Huang, Tiejun and Shi, Boxin



Research question: How to capture high-frame-rate and high-dynamic-range (HFR&HDR) color video of high-speed scenes.
Motivation: Capturing HFR&HDR color video of high-speed scenes with conventional frame-based cameras is very challenging, since higher frame rates usually require shorter exposure times, leaving the captured video severely corrupted by noise.
Method: A hybrid camera system composed of a spiking camera and an alternating-exposure RGB camera captures HFR&HDR scenes with high fidelity. The spike frames are first reconstructed to obtain motion information; guided by them, the missing temporal information of the middle- and long-exposure RGB images is recovered while their reliable color appearance is retained. Finally, with the strong temporal constraints estimated from the spike trains, the missing and distorted colors across RGB frames are recovered to generate temporally consistent high-frame-rate color frames.
Results: A new Spike-RGB dataset is collected, containing 300 synthetic sequences and 20 groups of real-world data; experiments show the method produces HDR video above 1000 FPS, outperforming HDR video reconstruction methods and commercial high-speed cameras.

Capturing high frame rate and high dynamic range (HFR&HDR) color videos in high-speed scenes with conventional frame-based cameras is very challenging. A higher frame rate is usually achieved by using a shorter exposure time, so the captured video is severely interfered with by noise. Alternating exposures could alleviate the noise issue but sacrifice frame rate due to involving long-exposure frames. The neuromorphic spiking camera records high-speed scenes of high dynamic range without colors using a completely different sensing mechanism and visual representation. We introduce a hybrid camera system composed of a spiking and an alternating-exposure RGB camera to capture HFR&HDR scenes with high fidelity. Our insight is to bring each camera's superiority into full play. The spike frames, with accurate fast motion information encoded, are first reconstructed for motion representation, from which the spike-based optical flows guide the recovery of missing temporal information for middle- and long-exposure RGB images while retaining their reliable color appearances. With the strong temporal constraint estimated from spike trains, both missing and distorted colors across RGB frames are recovered to generate time-consistent and HFR color frames. We collect a new Spike-RGB dataset that contains 300 sequences of synthetic data and 20 groups of real-world data to demonstrate 1000 FPS HDR videos outperforming HDR video reconstruction methods and commercial high-speed cameras.

Learning To Fuse Monocular and Multi-View Cues for Multi-Frame Depth Estimation in Dynamic Scenes
Li, Rui and Gong, Dong and Yin, Wei and Chen, Hao and Zhu, Yu and Wang, Kaixuan and Chen, Xiaozhi and Sun, Jinqiu and Zhang, Yanning



Research question: Multi-frame depth estimation in dynamic scenes typically relies on multi-view geometric consistency, but this consistency is usually violated in dynamic regions, corrupting the estimates.
Motivation: Many multi-frame methods identify dynamic regions with explicit masks and compensate the multi-view cues with local monocular depth or features as monocular cues, but the gains are limited because the mask quality cannot be controlled and the benefits of fusing the two types of cues are underexploited.
Method: This paper proposes a novel method that learns to fuse multi-view and monocular cues encoded as volumes, without hand-crafted masks. Our analysis shows that multi-view cues capture more accurate geometric information in static regions, while monocular cues capture more useful context in dynamic regions. To propagate the geometric perception learned from multi-view cues in static regions to the monocular representation in dynamic regions, and to let monocular cues enhance the multi-view cost volume, we propose a cross-cue fusion (CCF) module that includes cross-cue attention (CCA) to encode spatially non-local relative intra-relations and thereby enhance the other representation.
Results: Experiments on real-world datasets demonstrate the significant effectiveness and generalization ability of the method.

Multi-frame depth estimation generally achieves high accuracy relying on the multi-view geometric consistency. When applied in dynamic scenes, e.g., autonomous driving, this consistency is usually violated in the dynamic areas, leading to corrupted estimations. Many multi-frame methods handle dynamic areas by identifying them with explicit masks and compensating the multi-view cues with monocular cues represented as local monocular depth or features. The improvements are limited due to the uncontrolled quality of the masks and the underutilized benefits of the fusion of the two types of cues. In this paper, we propose a novel method to learn to fuse the multi-view and monocular cues encoded as volumes without needing the heuristically crafted masks. As unveiled in our analyses, the multi-view cues capture more accurate geometric information in static areas, and the monocular cues capture more useful contexts in dynamic areas. To let the geometric perception learned from multi-view cues in static areas propagate to the monocular representation in dynamic areas and let monocular cues enhance the representation of multi-view cost volume, we propose a cross-cue fusion (CCF) module, which includes the cross-cue attention (CCA) to encode the spatially non-local relative intra-relations from each source to enhance the representation of the other. Experiments on real-world datasets prove the significant effectiveness and generalization ability of the proposed method.
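The cross-cue attention idea, tokens of one cue attending over the other cue to enhance their own representation, can be sketched with plain NumPy. This is an illustrative single-head attention over flattened toy volumes, not the paper's CCA module; the shapes and the residual form are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_cue_attention(query_feats, context_feats):
    """One direction of a cross-cue attention sketch: tokens of one cue
    (e.g., the monocular volume) attend over tokens of the other cue
    (e.g., the multi-view cost volume) to enhance their representation.

    query_feats:   (N, C) flattened features of the cue being enhanced.
    context_feats: (M, C) flattened features of the other cue.
    """
    scale = 1.0 / np.sqrt(query_feats.shape[1])
    attn = softmax(query_feats @ context_feats.T * scale, axis=-1)  # (N, M)
    return query_feats + attn @ context_feats  # residual enhancement

rng = np.random.default_rng(0)
mono = rng.normal(size=(16, 8))   # toy monocular cue tokens
mview = rng.normal(size=(16, 8))  # toy multi-view cue tokens
enhanced = cross_cue_attention(mono, mview)
print(enhanced.shape)  # (16, 8)
```

The paper applies this in both directions, so each cue's volume is enhanced by the other before depth regression.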

Neural Volumetric Memory for Visual Locomotion Control
Yang, Ruihan and Yang, Ge and Wang, Xiaolong



Research question: How to enable legged robots to move autonomously over challenging terrain.
Motivation: Due to partial observability, the robot must rely on past observations to infer the terrain currently beneath it.
Method: Following the computer-vision paradigm of explicitly modeling the 3D geometry of a scene, we propose Neural Volumetric Memory (NVM), a geometric memory architecture that explicitly accounts for the SE(3) equivariance of the 3D world.
Results: Testing the learned visual-locomotion policy on a physical robot shows that our approach, learning legged locomotion with neural volumetric memory, yields performance gains on challenging terrain.

Legged robots have the potential to expand the reach of autonomy beyond paved roads. In this work, we consider the difficult problem of locomotion on challenging terrains using a single forward-facing depth camera. Due to the partial observability of the problem, the robot has to rely on past observations to infer the terrain currently beneath it. To solve this problem, we follow the paradigm in computer vision that explicitly models the 3D geometry of the scene and propose Neural Volumetric Memory (NVM), a geometric memory architecture that explicitly accounts for the SE(3) equivariance of the 3D world. NVM aggregates feature volumes from multiple camera views by first bringing them back to the ego-centric frame of the robot. We test the learned visual-locomotion policy on a physical robot and show that our approach, learning legged locomotion with neural volumetric memory, produces performance gains over prior works on challenging terrains. We include ablation studies and show that the representations stored in the neural volumetric memory capture sufficient geometric information to reconstruct the scene. Our project page with videos is https://rchalyang.github.io/NVM/
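The core of a memory that "explicitly accounts for the SE(3) equivariance" is transforming past feature volumes into the current ego-centric frame before aggregating them. A minimal coordinate-level sketch, where the pose values are hypothetical and the real NVM also resamples the features at the transformed coordinates:

```python
import numpy as np

def se3_transform(points, R, t):
    """Map 3D points from a past camera frame into the current ego-centric
    frame with a relative SE(3) pose (R, t): x_cur = R @ x_past + t."""
    return points @ R.T + t

# Toy voxel-center grid of a feature volume built from a past view.
xs = np.linspace(-1.0, 1.0, 4)
grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1).reshape(-1, 3)

# Hypothetical relative ego-motion: a 90-degree yaw plus a small translation.
yaw = np.pi / 2
R = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
              [np.sin(yaw),  np.cos(yaw), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.0, 0.0, 0.5])

aligned = se3_transform(grid, R, t)
# Applying the inverse pose recovers the original coordinates exactly.
restored = (aligned - t) @ R
print(np.allclose(restored, grid))  # True
```

Because the alignment is an exact group action, aggregating aligned volumes is invariant to how the robot moved between frames, which is the equivariance property the memory relies on.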

Propagate and Calibrate: Real-Time Passive Non-Line-of-Sight Tracking
Wang, Yihao and Wang, Zhigang and Zhao, Bin and Wang, Dong and Chen, Mulin and Li, Xuelong



Research question: How to achieve non-line-of-sight (NLOS) tracking, i.e., tracking an object that is out of the line of sight.
Motivation: Most existing NLOS tracking techniques rely on active illumination such as lasers, which is costly and requires elaborate experimental conditions, and their oversimplified settings keep them far from practical application.
Method: We propose a purely passive tracking method that tracks a person walking in an invisible room by only observing a relay wall. We introduce difference frames as an essential carrier of temporal-local motion information, and build PAC-Net, which alternates propagation and calibration so that it can leverage both dynamic and static information at frame-level granularity.
Results: We build and release the first dynamic passive NLOS tracking dataset, NLOS-Track, filling the vacuum of realistic NLOS datasets. It contains thousands of NLOS video clips with corresponding trajectories, including both real-shot and synthetic data.

Non-line-of-sight (NLOS) tracking has drawn increasing attention in recent years, due to its ability to detect object motion out of sight. Most previous works on NLOS tracking rely on active illumination, e.g., laser, and suffer from high cost and elaborate experimental conditions. Besides, these techniques are still far from practical application due to oversimplified settings. In contrast, we propose a purely passive method to track a person walking in an invisible room by only observing a relay wall, which is more in line with real application scenarios, e.g., security. To excavate imperceptible changes in videos of the relay wall, we introduce difference frames as an essential carrier of temporal-local motion messages. In addition, we propose PAC-Net, which consists of alternating propagation and calibration, making it capable of leveraging both dynamic and static messages on a frame-level granularity. To evaluate the proposed method, we build and publish the first dynamic passive NLOS tracking dataset, NLOS-Track, which fills the vacuum of realistic NLOS datasets. NLOS-Track contains thousands of NLOS video clips and corresponding trajectories. Both real-shot and synthetic data are included. Our codes and dataset are available at https://againstentropy.github.io/NLOS-Track/.
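Difference frames themselves are simple to compute; the sketch below shows why they help: subtracting consecutive frames cancels the static relay-wall appearance and isolates the imperceptible changes that carry motion cues. The array sizes and the 1e-3 perturbation are toy stand-ins, not the paper's data.

```python
import numpy as np

def difference_frames(video):
    """Compute difference frames D_t = I_{t+1} - I_t from a video tensor of
    shape (T, H, W). The static wall appearance cancels out, leaving only
    the tiny temporal-local changes caused by the hidden person."""
    return np.diff(video.astype(np.float64), axis=0)

# Toy relay-wall video: a bright static background plus a weak moving blob.
T, H, W = 5, 8, 8
video = np.full((T, H, W), 100.0)
for t in range(T):
    video[t, 2, t] += 1e-3  # imperceptible change moving across columns

diff = difference_frames(video)
print(diff.shape)  # (4, 8, 8)
# The 100.0 background is gone; only the 1e-3 motion signal remains.
```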

Neural Fields Meet Explicit Geometric Representations for Inverse Rendering of Urban Scenes
Wang, Zian and Shen, Tianchang and Gao, Jun and Huang, Shengyu and Munkberg, Jacob and Hasselgren, Jon and Gojcic, Zan and Chen, Wenzheng and Fidler, Sanja



Research question: How to reconstruct and intrinsically decompose scenes from captured imagery, enabling applications such as relighting and virtual object insertion.
Motivation: Existing NeRF-based methods achieve impressive 3D reconstruction accuracy but bake lighting and shadows into the radiance field, while mesh-based methods that facilitate intrinsic decomposition through differentiable rendering have not yet scaled to the complexity and scale of outdoor scenes.
Method: We present a novel inverse rendering framework that jointly reconstructs scene geometry, spatially varying materials, and HDR lighting from a set of posed RGB images with optional depth. Specifically, we use a neural field to represent the primary rays and an explicit mesh (reconstructed from the underlying neural field) to model the secondary rays that produce higher-order lighting effects such as cast shadows.
Results: By faithfully disentangling complex geometry and materials from lighting effects, our method enables photorealistic relighting with specular and shadow effects on several outdoor datasets. It also supports physics-based scene manipulation such as virtual object insertion with ray-traced shadow casting.

Reconstruction and intrinsic decomposition of scenes from captured imagery would enable many applications such as relighting and virtual object insertion. Recent NeRF based methods achieve impressive fidelity of 3D reconstruction, but bake the lighting and shadows into the radiance field, while mesh-based methods that facilitate intrinsic decomposition through differentiable rendering have not yet scaled to the complexity and scale of outdoor scenes. We present a novel inverse rendering framework for large urban scenes capable of jointly reconstructing the scene geometry, spatially-varying materials, and HDR lighting from a set of posed RGB images with optional depth. Specifically, we use a neural field to account for the primary rays, and use an explicit mesh (reconstructed from the underlying neural field) for modeling secondary rays that produce higher-order lighting effects such as cast shadows. By faithfully disentangling complex geometry and materials from lighting effects, our method enables photorealistic relighting with specular and shadow effects on several outdoor datasets. Moreover, it supports physics-based scene manipulations such as virtual object insertion with ray-traced shadow casting.

NeRF-RPN: A General Framework for Object Detection in NeRFs
Hu, Benran and Huang, Junkai and Liu, Yichen and Tai, Yu-Wing and Tang, Chi-Keung



Research question: This paper presents NeRF-RPN, the first object detection framework that operates directly on NeRFs.
Motivation: Existing object detection frameworks cannot operate directly on a NeRF, so the authors propose a new method to fill this gap.
Method: By exploiting a novel voxel representation that incorporates multi-scale 3D neural volumetric features, the method regresses the 3D bounding boxes of objects directly, without rendering the NeRF from any viewpoint.
Results: Experiments show that NeRF-RPN detects objects in NeRFs effectively and can be applied to detection without class labels. The authors also build a new benchmark dataset to facilitate future research on object detection in NeRFs.

This paper presents the first significant object detection framework, NeRF-RPN, which directly operates on NeRF. Given a pre-trained NeRF model, NeRF-RPN aims to detect all bounding boxes of objects in a scene. By exploiting a novel voxel representation that incorporates multi-scale 3D neural volumetric features, we demonstrate it is possible to regress the 3D bounding boxes of objects in NeRF directly without rendering the NeRF at any viewpoint. NeRF-RPN is a general framework and can be applied to detect objects without class labels. We experimented with NeRF-RPN using various backbone architectures, RPN head designs, and loss functions. All of them can be trained in an end-to-end manner to estimate high-quality 3D bounding boxes. To facilitate future research in object detection for NeRF, we built a new benchmark dataset which consists of both synthetic and real-world data with careful labeling and cleanup. Code and dataset are available at https://github.com/lyclyc52/NeRF_RPN.

Masked Wavelet Representation for Compact Neural Radiance Fields
Rho, Daniel and Lee, Byeonghyeon and Nam, Seungtae and Lee, JooChan and Ko, JongHwan and Park, Eunbyung



Research question: How to reduce the enormous computational resources and time needed to represent a 3D scene or object with a multilayer perceptron (MLP).
Motivation: Although recent work reduces this computational inefficiency with additional data structures such as grids or trees, these explicit data structures require a substantial amount of memory.
Method: We present a method to reduce their size without sacrificing the advantages of the additional data structures. Specifically, we propose using the wavelet transform on grid-based neural fields.
Results: Experiments show that non-spatial grid coefficients, such as wavelet coefficients, achieve higher sparsity than spatial grid coefficients, yielding a more compact representation. With our masking and compression pipeline, we achieve state-of-the-art performance within a 2 MB memory budget.

Neural radiance fields (NeRF) have demonstrated the potential of coordinate-based neural representation (neural fields or implicit neural representation) in neural rendering. However, using a multi-layer perceptron (MLP) to represent a 3D scene or object requires enormous computational resources and time. There have been recent studies on how to reduce these computational inefficiencies by using additional data structures, such as grids or trees. Despite the promising performance, the explicit data structure necessitates a substantial amount of memory. In this work, we present a method to reduce the size without compromising the advantages of having additional data structures. In detail, we propose using the wavelet transform on grid-based neural fields. Grid-based neural fields are for fast convergence, and the wavelet transform, whose efficiency has been demonstrated in high-performance standard codecs, is to improve the parameter efficiency of grids. Furthermore, in order to achieve a higher sparsity of grid coefficients while maintaining reconstruction quality, we present a novel trainable masking approach. Experimental results demonstrate that non-spatial grid coefficients, such as wavelet coefficients, are capable of attaining a higher level of sparsity than spatial grid coefficients, resulting in a more compact representation. With our proposed mask and compression pipeline, we achieved state-of-the-art performance within a memory budget of 2 MB. Our code is available at https://github.com/daniel03c1/masked_wavelet_nerf.
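The energy-compaction argument behind storing grids as wavelet coefficients can be illustrated with a single-level Haar transform: for a smooth grid, nearly all the energy moves into the low-pass band, so the detail coefficients are sparse and can be masked away. This hand-rolled Haar sketch is illustrative only; the paper uses trainable masks and full multi-level transforms.

```python
import numpy as np

def haar2d(x):
    """Single-level 2D Haar transform of a (2N, 2M) array into four
    subbands: a low-pass band LL and detail bands LH, HL, HH."""
    a = (x[0::2] + x[1::2]) / 2.0   # vertical average
    d = (x[0::2] - x[1::2]) / 2.0   # vertical detail
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

# A smooth toy feature grid: after the transform, almost all of its energy
# sits in the low-pass band, so the detail coefficients can be masked.
n = 32
xs = np.linspace(0.0, 1.0, n)
grid = np.add.outer(xs, xs)  # smooth additive ramp

ll, lh, hl, hh = haar2d(grid)
frac_ll = np.sum(ll**2) / sum(np.sum(b**2) for b in (ll, lh, hl, hh))
print(f"fraction of energy in the low-pass band: {frac_ll:.4f}")
```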

PersonNeRF: Personalized Reconstruction From Photo Collections
Weng, Chung-Yi and Srinivasan, Pratul P. and Curless, Brian and Kemelmacher-Shlizerman, Ira



Research question: How to build, from photos of a person spanning multiple viewpoints, poses, and appearances, a 3D model that can render the person under arbitrary novel combinations of viewpoint, pose, and appearance.
Motivation: Existing methods struggle with sparse observations: a given pose may be observed from only a single viewpoint with a single appearance, and a given appearance may be observed under only a handful of different poses.
Method: We propose PersonNeRF, which builds a customized neural volumetric 3D model of the subject that can render the entire space spanned by camera viewpoint, body pose, and appearance. It recovers a canonical T-pose neural volumetric representation that allows appearance to vary across observations while sharing a pose-dependent motion field across all observations.
Results: The method generates compelling images of the person under novel viewpoints, poses, and appearances from these challenging unstructured photo collections, outperforming prior work on free-viewpoint human rendering.

We present PersonNeRF, a method that takes a collection of photos of a subject (e.g., Roger Federer) captured across multiple years with arbitrary body poses and appearances, and enables rendering the subject with arbitrary novel combinations of viewpoint, body pose, and appearance. PersonNeRF builds a customized neural volumetric 3D model of the subject that is able to render an entire space spanned by camera viewpoint, body pose, and appearance. A central challenge in this task is dealing with sparse observations; a given body pose is likely only observed by a single viewpoint with a single appearance, and a given appearance is only observed under a handful of different body poses. We address this issue by recovering a canonical T-pose neural volumetric representation of the subject that allows for changing appearance across different observations, but uses a shared pose-dependent motion field across all observations. We demonstrate that this approach, along with regularization of the recovered volumetric geometry to encourage smoothness, is able to recover a model that renders compelling images from novel combinations of viewpoint, pose, and appearance from these challenging unstructured photo collections, outperforming prior work for free-viewpoint human rendering.

Learning a Depth Covariance Function
Dexheimer, Eric and Davison, Andrew J.



Research question: We propose learning a depth covariance function and apply it to geometric vision tasks.
Motivation: Given RGB images as input, the covariance function can flexibly define priors over depth functions, predictive distributions given observations, and methods for active point selection.
Method: We leverage these techniques for a selection of downstream tasks: depth completion, bundle adjustment, and monocular dense visual odometry.
Results: Experiments show that the method performs well across these geometric vision tasks.

We propose learning a depth covariance function with applications to geometric vision tasks. Given RGB images as input, the covariance function can be flexibly used to define priors over depth functions, predictive distributions given observations, and methods for active point selection. We leverage these techniques for a selection of downstream tasks: depth completion, bundle adjustment, and monocular dense visual odometry.
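A depth covariance function can be read as the kernel of a Gaussian process over pixel locations: it defines a prior on depth maps and a closed-form predictive distribution once sparse depths are observed, which is the depth-completion setting. The RBF kernel below is a hand-picked stand-in for the learned, image-conditioned covariance in the paper.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2, variance=1.0):
    """Stand-in squared-exponential covariance over 2D pixel locations
    (the paper predicts this function from the RGB image instead)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-4):
    """Predictive mean and variance of depth at query pixels given sparse
    depth observations, under the GP prior defined by the covariance."""
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf_kernel(x_query, x_obs)
    Kss = rbf_kernel(x_query, x_query)
    mean = Ks @ np.linalg.solve(K, y_obs)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)

# Sparse depth at three pixels; predict elsewhere (toy depth completion).
x_obs = np.array([[0.1, 0.1], [0.5, 0.5], [0.9, 0.9]])
y_obs = np.array([1.0, 2.0, 3.0])
x_query = np.array([[0.5, 0.5], [0.3, 0.3]])
mean, var = gp_posterior(x_obs, y_obs, x_query)
print(mean)  # the first query sits on an observation, so its mean is ~2.0
```

The predictive variance is also what drives active point selection: query pixels with high posterior variance are the most informative ones to observe next.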

What You Can Reconstruct From a Shadow
Liu, Ruoshi and Menon, Sachit and Mao, Chengzhi and Park, Dennis and Stent, Simon and Vondrick, Carl



Research question: This paper addresses 3D reconstruction, a fundamental problem in computer vision, which is especially challenging when the object to reconstruct is partially or fully occluded.
Motivation: Since reconstruction becomes particularly difficult under partial or full occlusion, we propose to use the shadows cast by an unobserved object to infer its possible 3D volumes under occlusion.
Method: We create a differentiable image formation model that jointly infers the 3D shape of an object, its pose, and the position of the light source. Since the approach is end-to-end differentiable, we can integrate learned priors of object geometry to generate realistic 3D shapes of different object categories.
Results: Experiments and visualizations show that the method generates multiple possible solutions consistent with the shadow observation. It works even when both the light source position and the object pose are unknown, and it is robust on real-world images where the ground-truth shadow mask is unknown.

3D reconstruction is a fundamental problem in computer vision, and the task is especially challenging when the object to reconstruct is partially or fully occluded. We introduce a method that uses the shadows cast by an unobserved object in order to infer the possible 3D volumes under occlusion. We create a differentiable image formation model that allows us to jointly infer the 3D shape of an object, its pose, and the position of a light source. Since the approach is end-to-end differentiable, we are able to integrate learned priors of object geometry in order to generate realistic 3D shapes of different object categories. Experiments and visualizations show that the method is able to generate multiple possible solutions that are consistent with the observation of the shadow. Our approach works even when the position of the light source and object pose are both unknown. Our approach is also robust to real-world images where ground-truth shadow mask is unknown.

HelixSurf: A Robust and Efficient Neural Implicit Surface Learning of Indoor Scenes With Iterative Intertwined Regularization
Liang, Zhihao and Huang, Zhangjin and Ding, Changxing and Jia, Kui



Research question: How to recover the underlying scene geometry from multi-view images.
Motivation: Existing neural implicit methods struggle on complex scene surfaces, whereas traditional multi-view stereo recovers geometry well in scenes with rich textures.
Method: We propose HelixSurf, which exploits the complementary strengths of the two strategies by using the intermediate prediction of one as guidance to regularize the learning of the other, iteratively and in an intertwined manner during training.
Results: Experiments show that HelixSurf outperforms existing methods on indoor scene surface reconstruction and is orders of magnitude faster, even compared to methods assisted by auxiliary training data.

Recovery of an underlying scene geometry from multi-view images stands as a long-time challenge in computer vision research. The recent promise leverages neural implicit surface learning and differentiable volume rendering, and achieves both the recovery of scene geometry and synthesis of novel views, where deep priors of neural models are used as an inductive smoothness bias. While promising for object-level surfaces, these methods suffer when coping with complex scene surfaces. In the meanwhile, traditional multi-view stereo can recover the geometry of scenes with rich textures, by globally optimizing the local, pixel-wise correspondences across multiple views. We are thus motivated to make use of the complementary benefits from the two strategies, and propose a method termed Helix-shaped neural implicit Surface learning or HelixSurf; HelixSurf uses the intermediate prediction from one strategy as the guidance to regularize the learning of the other one, and conducts such intertwined regularization iteratively during the learning process. We also propose an efficient scheme for differentiable volume rendering in HelixSurf. Experiments on surface reconstruction of indoor scenes show that our method compares favorably with existing methods and is orders of magnitude faster, even when some of existing methods are assisted with auxiliary training data. The source code is available at https://github.com/Gorilla-Lab-SCUT/HelixSurf.

3D-Aware Facial Landmark Detection via Multi-View Consistent Training on Synthetic Data
Zeng, Libing and Chen, Lele and Bao, Wentao and Li, Zhong and Xu, Yi and Yuan, Junsong and Kalantari, Nima Khademi



Research question: How to improve the accuracy of facial landmark detection on in-the-wild images.
Motivation: Due to the lack of multi-view in-the-wild training data, existing methods struggle to maintain 3D consistency when detecting 3D/2D facial landmarks.
Method: We leverage recent advances in generative visual models and neural rendering to construct a synthetic dataset, and propose a novel multi-view consistent learning strategy to improve 3D facial landmark detection accuracy on in-the-wild images.
Results: The proposed 3D-aware module can be plugged into any learning-based landmark detection algorithm to improve its accuracy. Extensive comparisons against state-of-the-art methods on several real and synthetic datasets demonstrate the superiority of the proposed plug-in module.

Accurate facial landmark detection on wild images plays an essential role in human-computer interaction, entertainment, and medical applications. Existing approaches have limitations in enforcing 3D consistency while detecting 3D/2D facial landmarks due to the lack of multi-view in-the-wild training data. Fortunately, with the recent advances in generative visual models and neural rendering, we have witnessed rapid progress towards high quality 3D image synthesis. In this work, we leverage such approaches to construct a synthetic dataset and propose a novel multi-view consistent learning strategy to improve 3D facial landmark detection accuracy on in-the-wild images. The proposed 3D-aware module can be plugged into any learning-based landmark detection algorithm to enhance its accuracy. We demonstrate the superiority of the proposed plug-in module with extensive comparison against state-of-the-art methods on several real and synthetic datasets.

MobileNeRF: Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures
Chen, Zhiqin and Funkhouser, Thomas and Hedman, Peter and Tagliasacchi, Andrea



Research question: How to render neural radiance fields (NeRFs) efficiently on widely deployed graphics hardware, including mobile devices.
Motivation: NeRFs rely on specialized ray-marching-based volumetric rendering algorithms that are mismatched to the capabilities of standard graphics pipelines.
Method: This paper introduces a new NeRF representation based on textured polygons: a set of polygons whose textures store binary opacities and feature vectors. Traditional z-buffer rasterization yields an image with features at every pixel, which a small view-dependent MLP running in a fragment shader interprets to produce the final pixel color.
Results: This approach renders NeRFs with the traditional polygon rasterization pipeline, providing massive pixel-level parallelism and achieving interactive frame rates on a wide range of compute platforms, including mobile phones.

Neural Radiance Fields (NeRFs) have demonstrated amazing ability to synthesize images of 3D scenes from novel views. However, they rely upon specialized volumetric rendering algorithms based on ray marching that are mismatched to the capabilities of widely deployed graphics hardware. This paper introduces a new NeRF representation based on textured polygons that can synthesize novel images efficiently with standard rendering pipelines. The NeRF is represented as a set of polygons with textures representing binary opacities and feature vectors. Traditional rendering of the polygons with a z-buffer yields an image with features at every pixel, which are interpreted by a small, view-dependent MLP running in a fragment shader to produce a final pixel color. This approach enables NeRFs to be rendered with the traditional polygon rasterization pipeline, which provides massive pixel-level parallelism, achieving interactive frame rates on a wide range of compute platforms, including mobile phones.
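The deferred-shading step, rasterized per-pixel features decoded by a tiny view-dependent MLP, can be sketched in NumPy. The weights, layer sizes, and two-layer shape below are toy assumptions standing in for the fragment-shader MLP.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fragment_mlp(features, view_dir, W1, b1, W2, b2):
    """Sketch of MobileNeRF-style deferred shading: the rasterizer writes a
    feature vector per pixel; a tiny view-dependent MLP (two toy layers
    here) maps feature + view direction to an RGB color in [0, 1]."""
    v = np.broadcast_to(view_dir, features.shape[:-1] + (3,))
    x = np.concatenate([features, v], axis=-1)
    h = relu(x @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # sigmoid to [0, 1]

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8                     # toy 4x4 "G-buffer" of 8-dim features
gbuffer = rng.normal(size=(H, W, C))  # what z-buffer rasterization produces
view = np.array([0.0, 0.0, 1.0])
W1, b1 = rng.normal(size=(C + 3, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(16, 3)) * 0.1, np.zeros(3)

image = fragment_mlp(gbuffer, view, W1, b1, W2, b2)
print(image.shape)  # (4, 4, 3)
```

On device this per-pixel decode runs in a fragment shader, so it inherits the GPU's pixel-level parallelism for free.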

POEM: Reconstructing Hand in a Point Embedded Multi-View Stereo
Yang, Lixin and Xu, Jian and Zhong, Licheng and Zhan, Xinyu and Wang, Zhicheng and Wu, Kejian and Lu, Cewu



Research question: How to enable neural networks to capture 3D geometry-aware features in multi-view vision tasks.
Motivation: Previous multi-view stereo methods usually encode the 3D information into 2D features; in contrast, we propose POEM, a novel method that operates directly on 3D points to reconstruct the hand mesh.
Method: POEM represents the complex 3D hand mesh with 3D points embedded in the multi-view stereo that carry features from the different views and encircle the hand. We design two operations: point-based feature fusion and a cross-set point attention mechanism.
Results: Evaluation on three challenging multi-view datasets shows that POEM outperforms the state of the art in hand mesh reconstruction. Code and models are available for research at github.com/lixiny/POEM.

Enabling neural networks to capture 3D geometry-aware features is essential in multi-view vision tasks. Previous methods usually encode the 3D information of multi-view stereo into the 2D features. In contrast, we present a novel method, named POEM, that directly operates on the 3D POints Embedded in the Multi-view stereo for reconstructing the hand mesh in it. A point is a natural form of 3D information and an ideal medium for fusing features across views, as it has different projections on different views. Our method thus builds on a simple yet effective idea: a complex 3D hand mesh can be represented by a set of 3D points that 1) are embedded in the multi-view stereo, 2) carry features from the multi-view images, and 3) encircle the hand. To leverage the power of points, we design two operations: point-based feature fusion and a cross-set point attention mechanism. Evaluation on three challenging multi-view datasets shows that POEM outperforms the state of the art in hand mesh reconstruction. Code and models are available for research at github.com/lixiny/POEM.
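Point-based feature fusion can be sketched as: project each embedded 3D point into every view, gather a feature at each projection, and combine. The plain mean below is a stand-in for POEM's learned fusion and cross-set attention; the camera intrinsics, poses, and feature maps are toy values.

```python
import numpy as np

def project(points, K, R, t):
    """Pinhole projection: each 3D point lands on a different pixel per view."""
    cam = points @ R.T + t
    return (cam[:, :2] / cam[:, 2:3]) * K[0, 0] + K[:2, 2]

def fuse_point_features(points, views, feat_maps):
    """Project each embedded 3D point into every view, gather the
    nearest-pixel feature there, and average across views (POEM uses
    learned fusion and attention instead of this plain mean)."""
    gathered = []
    for (K, R, t), fmap in zip(views, feat_maps):
        uv = np.rint(project(points, K, R, t)).astype(int)
        uv = np.clip(uv, 0, fmap.shape[0] - 1)
        gathered.append(fmap[uv[:, 1], uv[:, 0]])  # (N, C) per view
    return np.mean(gathered, axis=0)

rng = np.random.default_rng(0)
# Toy points near the hand center, two toy views, random 16-dim feature maps.
points = rng.uniform(-0.1, 0.1, size=(8, 3)) + np.array([0.0, 0.0, 2.0])
K = np.array([[32.0, 0.0, 32.0], [0.0, 32.0, 32.0], [0.0, 0.0, 1.0]])
views = [(K, np.eye(3), np.zeros(3)), (K, np.eye(3), np.array([0.05, 0.0, 0.0]))]
feat_maps = [rng.normal(size=(64, 64, 16)) for _ in views]

fused = fuse_point_features(points, views, feat_maps)
print(fused.shape)  # (8, 16): one fused feature per embedded point
```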

DrapeNet: Garment Generation and Self-Supervised Draping
De Luigi, Luca and Li, Ren and Guillard, Benoît



Research question: How to train a model that can quickly drape arbitrary garments over arbitrary human bodies while reducing the dependence on large training sets.
Motivation: Current garment-draping models train one network per clothing item, which severely limits their generalization ability.
Method: We use self-supervision to train a single network that drapes multiple garments, by predicting a 3D deformation field conditioned on the latent codes of a generative network.
Results: The model can generate and drape previously unseen garments of any topology, whose shapes can be edited by manipulating their latent codes, and accurate 3D garment models can be recovered from partial observations (such as images or 3D scans) via gradient descent.

Recent approaches to drape garments quickly over arbitrary human bodies leverage self-supervision to eliminate the need for large training sets. However, they are designed to train one network per clothing item, which severely limits their generalization abilities. In our work, we rely on self-supervision to train a single network to drape multiple garments. This is achieved by predicting a 3D deformation field conditioned on the latent codes of a generative network, which models garments as unsigned distance fields. Our pipeline can generate and drape previously unseen garments of any topology, whose shape can be edited by manipulating their latent codes. Being fully differentiable, our formulation makes it possible to recover accurate 3D models of garments from partial observations -- images or 3D scans -- via gradient descent. Our code is publicly available at https://github.com/liren2515/DrapeNet.

Progressively Optimized Local Radiance Fields for Robust View Synthesis
Meuleman, Andréas and Liu, Yu-Lun and Gao, Chen and Huang, Jia-Bin and Kim, Changil and Kim, Min H. and Kopf, Johannes



Research question: How to reconstruct the radiance field of a large-scale scene from a single casually captured video.
Motivation: Most existing radiance field reconstruction methods rely on accurate camera poses pre-estimated by Structure-from-Motion algorithms, which frequently fail on in-the-wild videos, and a single global radiance field with finite representational capacity does not scale to longer trajectories in unbounded scenes.
Method: We jointly and progressively estimate the camera poses and the radiance field to handle unknown poses, and dynamically allocate new locally trained radiance fields to handle large unbounded scenes.
Results: Extensive evaluation on the Tanks and Temples dataset and on our collected outdoor dataset, Static Hikes, shows that our approach compares favorably with the state of the art.

We present an algorithm for reconstructing the radiance field of a large-scale scene from a single casually captured video. The task poses two core challenges. First, most existing radiance field reconstruction approaches rely on accurate pre-estimated camera poses from Structure-from-Motion algorithms, which frequently fail on in-the-wild videos. Second, using a single, global radiance field with finite representational capacity does not scale to longer trajectories in an unbounded scene. For handling unknown poses, we jointly estimate the camera poses with radiance field in a progressive manner. We show that progressive optimization significantly improves the robustness of the reconstruction. For handling large unbounded scenes, we dynamically allocate new local radiance fields trained with frames within a temporal window. This further improves robustness (e.g., performs well even under moderate pose drifts) and allows us to scale to large scenes. Our extensive evaluation on the Tanks and Temples dataset and our collected outdoor dataset, Static Hikes, show that our approach compares favorably with the state-of-the-art.
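The dynamic-allocation idea can be sketched as a simple rule: start a new local radiance field whenever the camera strays beyond the current field's neighborhood. The distance threshold and trajectory below are hypothetical; the paper allocates fields over temporal windows during progressive optimization rather than by a fixed radius.

```python
import numpy as np

def allocate_local_fields(frame_positions, dist_thresh=2.0):
    """Toy allocation policy: walk the camera trajectory and spawn a new
    local field whenever the camera moves more than dist_thresh from the
    current field's anchor. Returns per-frame field indices and anchors."""
    assignments, anchors = [], []
    for p in frame_positions:
        if not anchors or np.linalg.norm(p - anchors[-1]) > dist_thresh:
            anchors.append(p)  # spawn a new local radiance field here
        assignments.append(len(anchors) - 1)
    return assignments, anchors

# A camera walking a long straight trajectory through an unbounded scene.
trajectory = np.stack(
    [np.linspace(0.0, 10.0, 21), np.zeros(21), np.zeros(21)], axis=1)
assignments, anchors = allocate_local_fields(trajectory, dist_thresh=2.0)
print(len(anchors))  # 5: several local fields cover the 10-unit walk
print(assignments)
```

Each local field only has to represent a bounded chunk of the scene, which is what lets the representation scale with trajectory length.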

AligNeRF: High-Fidelity Neural Radiance Fields via Alignment-Aware Training
Jiang, Yifan and Hedman, Peter and Mildenhall, Ben and Xu, Dejia and Barron, Jonathan T. and Wang, Zhangyang and Xue, Tianfan



Research question: This paper explores the limitations of Neural Radiance Fields (NeRFs) for high-resolution scene reconstruction and proposes corresponding solutions.
Motivation: Existing NeRF-based methods face several challenges when reconstructing high-resolution real scenes, including a very large number of parameters, misaligned input data, and overly smooth details.
Method: This paper proposes a novel NeRF training strategy: combining the multilayer perceptron with convolutional layers to encode more neighborhood information while reducing the number of parameters, a new approach to the misalignment caused by moving objects or small camera calibration errors, and a high-frequency-aware loss function.
Results: Experiments show that the method recovers more high-frequency details than current state-of-the-art NeRF models, with almost no extra training/testing cost.

Neural Radiance Fields (NeRFs) are a powerful representation for modeling a 3D scene as a continuous function. Though NeRF is able to render complex 3D scenes with view-dependent effects, few efforts have been devoted to exploring its limits in a high-resolution setting. Specifically, existing NeRF-based methods face several limitations when reconstructing high-resolution real scenes, including a very large number of parameters, misaligned input data, and overly smooth details. In this work, we conduct the first pilot study on training NeRF with high-resolution data and propose the corresponding solutions: 1) marrying the multilayer perceptron (MLP) with convolutional layers which can encode more neighborhood information while reducing the total number of parameters; 2) a novel training strategy to address misalignment caused by moving objects or small camera calibration errors; and 3) a high-frequency aware loss. Our approach is nearly free without introducing obvious training/testing costs, while experiments on different datasets demonstrate that it can recover more high-frequency details compared with the current state-of-the-art NeRF models. Project page: https://yifanjiang19.github.io/alignerf.

Implicit 3D Human Mesh Recovery Using Consistency With Pose and Shape From Unseen-View
Cho, Hanbyel and Cho, Yooshin and Ahn, Jaesung and Kim, Junmo



Research question: How to infer a person's natural 3D pose and shape from an image, even in the presence of ambiguity.
Motivation: Due to structural limitations, existing methods only consider the direction from which the image was taken, whereas our method can implicitly imagine the person in 3D space at the feature level via neural feature fields.
Method: We propose "Implicit 3D Human Mesh Recovery (ImpHMR)", which generates feature fields with a CNN-based image encoder, volume-renders a 2D feature map from the feature field for a given viewing direction, and regresses the pose and shape parameters from the features.
Results: Extensive evaluations show the efficacy of the proposed method.

From an image of a person, we can easily infer the natural 3D pose and shape of the person even if ambiguity exists. This is because we have a mental model that allows us to imagine a person's appearance at different viewing directions from a given image and utilize the consistency between them for inference. However, existing human mesh recovery methods only consider the direction in which the image was taken due to their structural limitations. Hence, we propose "Implicit 3D Human Mesh Recovery (ImpHMR)" that can implicitly imagine a person in 3D space at the feature-level via Neural Feature Fields. In ImpHMR, feature fields are generated by CNN-based image encoder for a given image. Then, the 2D feature map is volume-rendered from the feature field for a given viewing direction, and the pose and shape parameters are regressed from the feature. To utilize consistency with pose and shape from unseen-view, if there are 3D labels, the model predicts results including the silhouette from an arbitrary direction and makes it equal to the rotated ground-truth. In the case of only 2D labels, we perform self-supervised learning through the constraint that the pose and shape parameters inferred from different directions should be the same. Extensive evaluations show the efficacy of the proposed method.

Teleidoscopic Imaging System for Microscale 3D Shape Reconstruction
Kawahara, Ryo and Kuo, Meng-Yu Jennifer and Nobuhara, Shohei



Research question: This paper proposes a practical method for microscale 3D shape capture using a teleidoscopic imaging system.
Motivation: The main challenge in microscale 3D shape reconstruction is capturing the target from multiple viewpoints with a sufficiently large depth of field.
Method: We employ a teleidoscopic measurement system consisting of three planar mirrors and a monocentric lens. The planar mirrors virtually define multiple viewpoints through multiple reflections, and the monocentric lens achieves high magnification with little blur and a surround view even in close-up imaging.
Results: Our contributions include a structured ray-pixel camera model that handles refractive and reflective projection rays, an analytical evaluation of the depth of field of the teleidoscopic imaging system, and a practical calibration algorithm for it. Evaluations with real images prove the concept of our measurement system.

This paper proposes a practical method of microscale 3D shape capture by a teleidoscopic imaging system. The main challenge in microscale 3D shape reconstruction is to capture the target from multiple viewpoints with a large enough depth of field. Our idea is to employ a teleidoscopic measurement system consisting of three planar mirrors and a monocentric lens. The planar mirrors virtually define multiple viewpoints by multiple reflections, and the monocentric lens realizes high magnification with less blur and a surround view even in close-up imaging. Our contributions include a structured ray-pixel camera model which handles refractive and reflective projection rays efficiently, analytical evaluations of the depth of field of our teleidoscopic imaging system, and a practical calibration algorithm for the teleidoscopic imaging system. Evaluations with real images prove the concept of our measurement system.

UV Volumes for Real-Time Rendering of Editable Free-View Human Performance
Chen, Yue and Wang, Xuan and Chen, Xingyu and Zhang, Qi and Li, Xiaoyu and Guo, Yu and Wang, Jue and Wang, Fei



Research question: How to reduce the high computational cost of neural volume rendering for immersive VR/AR applications and achieve real-time, editable free-viewpoint video rendering of human performers.
Motivation: Existing neural volume rendering methods are severely limited in practice by their high computational cost.
Method: We propose UV Volumes, a new approach that separates the high-frequency (i.e., non-smooth) human appearance from the 3D volume and encodes it into 2D neural texture stacks (NTS). The mapping between the parameterized human model and the smooth texture coordinates also enables better generalization to novel poses and shapes.
Results: Extensive experiments on the CMU Panoptic, ZJU Mocap, and H36M datasets show that the method renders 960 x 540 images at 30 FPS on average with photorealism comparable to state-of-the-art methods.

Neural volume rendering enables photo-realistic renderings of a human performer in free-view, a critical task in immersive VR/AR applications. But the practice is severely limited by high computational costs in the rendering process. To solve this problem, we propose the UV Volumes, a new approach that can render an editable free-view video of a human performer in real-time. It separates the high-frequency (i.e., non-smooth) human appearance from the 3D volume, and encodes them into 2D neural texture stacks (NTS). The smooth UV volumes allow much smaller and shallower neural networks to obtain densities and texture coordinates in 3D while capturing detailed appearance in 2D NTS. For editability, the mapping between the parameterized human model and the smooth texture coordinates allows us a better generalization on novel poses and shapes. Furthermore, the use of NTS enables interesting applications, e.g., retexturing. Extensive experiments on CMU Panoptic, ZJU Mocap, and H36M datasets show that our model can render 960 x 540 images in 30FPS on average with comparable photo-realism to state-of-the-art methods.
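Looking up appearance in a 2D neural texture stack is continuous UV sampling; the bilinear sampler below sketches that step with a toy 2-channel texture. In the real pipeline the NTS channels are learned features consumed by a shading network, not colors directly.

```python
import numpy as np

def sample_nts(texture, uv):
    """Bilinearly sample a 2D neural texture stack at continuous UV
    coordinates in [0, 1]^2. texture: (H, W, C); uv: (N, 2) as (u, v)."""
    H, W, _ = texture.shape
    x = uv[:, 0] * (W - 1)
    y = uv[:, 1] * (H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = (x - x0)[:, None], (y - y0)[:, None]
    top = texture[y0, x0] * (1 - wx) + texture[y0, x1] * wx
    bot = texture[y1, x0] * (1 - wx) + texture[y1, x1] * wx
    return top * (1 - wy) + bot * wy

# Toy 2-channel texture; the smooth 3D volume would predict these UVs.
tex = np.zeros((4, 4, 2))
tex[..., 0] = np.arange(4)[None, :]  # channel 0 ramps along u
uv = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.5]])
print(sample_nts(tex, uv)[:, 0])  # values 0.0, 3.0, 1.5 along the u ramp
```

Because the high-frequency detail lives in this 2D lookup, the 3D networks only need to predict densities and smooth UV coordinates, which is what makes real-time rendering feasible.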

Multi-View Stereo Representation Revisit: Region-Aware MVSNet
Zhang, Yisu and Zhu, Jianke and Lin, Lixiang



Research question: Existing deep learning-based multi-view stereo methods usually estimate only pixel-wise depth values by minimizing the gap between the predicted point and the intersection of the ray with the surface, which typically ignores the surface topology and prevents textureless regions and surface boundaries from being reconstructed properly.
Motivation: To address this, we propose exploiting the point-to-surface distance so the model can perceive a wider range of the surface; to this end, we predict a distance volume from the cost volume to estimate the signed distance of points around the surface.
Method: Our RA-MVSNet is patch-aware: the perception range is enhanced by associating hypothetical planes with surface patches, which improves completeness in textureless regions and reduces outliers at boundaries. Moreover, the introduced distance volume can generate mesh topology with fine details.
Results: Compared with conventional deep learning-based multi-view stereo methods, RA-MVSNet obtains more complete reconstructions by exploiting signed distance supervision. Experiments on the DTU and Tanks & Temples datasets show that our method achieves state-of-the-art results.

Deep learning-based multi-view stereo has emerged as a powerful paradigm for reconstructing complete, geometrically detailed objects from multiple views. Most existing approaches only estimate the pixel-wise depth value by minimizing the gap between the predicted point and the intersection of the ray and the surface, which usually ignores the surface topology. This is especially harmful for textureless regions and surface boundaries, which cannot be properly reconstructed. To address this issue, we suggest taking advantage of the point-to-surface distance so that the model is able to perceive a wider range of surfaces. To this end, we predict the distance volume from the cost volume to estimate the signed distance of points around the surface. Our proposed RA-MVSNet is patch-aware, since the perception range is enhanced by associating hypothetical planes with a patch of surface. Therefore, it can increase the completeness of textureless regions and reduce the outliers at the boundary. Moreover, mesh topologies with fine details can be generated by the introduced distance volume. Compared with conventional deep learning-based multi-view stereo methods, our proposed RA-MVSNet approach obtains more complete reconstruction results by taking advantage of signed distance supervision. Experiments on both the DTU and Tanks & Temples datasets demonstrate that our approach achieves state-of-the-art results.
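The shift from per-pixel depth supervision to a distance volume can be sketched along a single ray: every depth hypothesis receives a signed-distance target relative to the surface, rather than only the one sample at the ray-surface intersection. The sign convention and sampling below are illustrative choices, not necessarily the paper's.

```python
import numpy as np

def signed_distance_volume(depth_hypotheses, surface_depth):
    """Per-hypothesis signed distance to the surface along one ray
    (negative in front of the surface, positive behind it; an
    illustrative sign convention). Unlike a single depth label, this
    supervises every sample around the surface."""
    return depth_hypotheses - surface_depth

hypotheses = np.linspace(1.0, 3.0, 9)  # sampled depth planes along a ray
surface = 2.0                          # ground-truth ray-surface intersection
sdf = signed_distance_volume(hypotheses, surface)
print(sdf)  # negative before the surface, zero at it, positive behind
```

The zero-crossing of this volume marks the surface, which is also what allows extracting mesh topology with fine detail.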

Robust Dynamic Radiance Fields
Liu, Yu-Lun and Gao, Chen and Meuleman, Andréas and Tseng, Hung-Yu and Saraf, Ayush and Kim, Changil and Chuang, Yung-Yu and Kopf, Johannes and Huang, Jia-Bin



Research question: Modeling the structure and appearance of dynamic scenes.
Motivation: Existing methods rely on Structure-from-Motion (SfM) algorithms to estimate accurate camera poses, but these often fail or produce erroneous poses on challenging videos with highly dynamic objects, poorly textured surfaces, and rotating camera motion.
Method: We address this issue by jointly estimating the static and dynamic radiance fields along with the camera parameters (poses and focal length).
Results: Extensive quantitative and qualitative experiments demonstrate the strong robustness of our method, which outperforms state-of-the-art dynamic view synthesis methods.

Dynamic radiance field reconstruction methods aim to model the time-varying structure and appearance of a dynamic scene. Existing methods, however, assume that accurate camera poses can be reliably estimated by Structure from Motion (SfM) algorithms. These methods, thus, are unreliable as SfM algorithms often fail or produce erroneous poses on challenging videos with highly dynamic objects, poorly textured surfaces, and rotating camera motion. We address this issue by jointly estimating the static and dynamic radiance fields along with the camera parameters (poses and focal length). We demonstrate the robustness of our approach via extensive quantitative and qualitative experiments. Our results show favorable performance over the state-of-the-art dynamic view synthesis methods.

PLIKS: A Pseudo-Linear Inverse Kinematic Solver for 3D Human Body Estimation
Shetty, Karthik and Birkhold, Annette and Jaganathan, Srikrishna and Strobel, Norbert and Kowarschik, Markus and Maier, Andreas and Egger, Bernhard



Research question: Reconstructing a 3D human body model from a single 2D image.
Motivation: Current direct-regression approaches offer minimal flexibility to external influences; we address this through model-in-the-loop optimization.
Method: We propose PLIKS (Pseudo-Linear Inverse Kinematic Solver), built on a linearized formulation of the parametric SMPL model, which analytically reconstructs the human model from 2D pixel-aligned vertices.
Results: Experiments show that, compared with other state-of-the-art methods, PLIKS achieves more than 10% more accurate reconstruction on standard 3D human pose and shape benchmarks, and improves reconstruction error by 12.9 mm on the newer AGORA dataset.

We introduce PLIKS (Pseudo-Linear Inverse Kinematic Solver) for reconstruction of a 3D mesh of the human body from a single 2D image. Current techniques directly regress the shape, pose, and translation of a parametric model from an input image through a non-linear mapping with minimal flexibility to any external influences. We approach the task as a model-in-the-loop optimization problem. PLIKS is built on a linearized formulation of the parametric SMPL model. Using PLIKS, we can analytically reconstruct the human model via 2D pixel-aligned vertices. This enables us with the flexibility to use accurate camera calibration information when available. PLIKS offers an easy way to introduce additional constraints such as shape and translation. We present quantitative evaluations which confirm that PLIKS achieves more accurate reconstruction with greater than 10% improvement compared to other state-of-the-art methods with respect to the standard 3D human pose and shape benchmarks while also obtaining a reconstruction error improvement of 12.9 mm on the newer AGORA dataset.

gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction
Chen, Zerui and Chen, Shizhe and Schmid, Cordelia and Laptev, Ivan



Research question: Using the hand structure as guidance to improve SDF-based 3D shape reconstruction.
Motivation: Although SDFs perform well for 3D shape reconstruction, they lack explicit modeling of the underlying 3D geometry.
Method: We use hand structure and poses to guide SDF-based shape reconstruction, predicting kinematic chains of pose transformations to align with highly articulated hand poses. We also improve the visual features of 3D points through geometry alignment and further exploit temporal information to enhance robustness to occlusion and motion blur.
Results: Extensive experiments on the challenging ObMan and DexYCB benchmarks demonstrate significant improvements over the state of the art.

Signed distance functions (SDFs) are an attractive framework that has recently shown promising results for 3D shape reconstruction from images. SDFs seamlessly generalize to different shape resolutions and topologies but lack explicit modelling of the underlying 3D geometry. In this work, we exploit the hand structure and use it as guidance for SDF-based shape reconstruction. In particular, we address reconstruction of hands and manipulated objects from monocular RGB images. To this end, we estimate poses of hands and objects and use them to guide 3D reconstruction. More specifically, we predict kinematic chains of pose transformations and align SDFs with highly-articulated hand poses. We improve the visual features of 3D points with geometry alignment and further leverage temporal information to enhance the robustness to occlusion and motion blur. We conduct extensive experiments on the challenging ObMan and DexYCB benchmarks and demonstrate significant improvements of the proposed method over the state of the art.

topic-3

Topic words :  domain,  training,  distribution,  propose,  samples,  performance,  target,  loss

Deep Frequency Filtering for Domain Generalization
Lin, Shiqi and Zhang, Zhizheng and Huang, Zhipeng and Lu, Yan and Lan, Cuiling and Chu, Peng and You, Quanzeng and Wang, Jiang and Liu, Zicheng and Parulkar, Amey and Navkal, Viraj and Chen, Zhibo



Research question: How to improve the generalization ability of deep neural networks for practical applications.
Motivation: Deep neural networks prefer certain frequency components during training, which may affect the robustness of the learned features.
Method: We propose Deep Frequency Filtering (DFF), which learns domain-generalizable features by explicitly modulating, during training, frequency components of different cross-domain transfer difficulty. Concretely, we apply a Fast Fourier Transform (FFT) to feature maps at different layers, then use a lightweight module to learn attention masks from the frequency representations after FFT, enhancing transferable components while suppressing components harmful to generalization.
Results: Experiments show that DFF effectively improves the generalization ability of deep neural networks and outperforms state-of-the-art methods on different domain generalization tasks.

Improving the generalization ability of Deep Neural Networks (DNNs) is critical for their practical uses, which has been a longstanding challenge. Some theoretical studies have uncovered that DNNs have preferences for some frequency components in the learning process and indicated that this may affect the robustness of learned features. In this paper, we propose Deep Frequency Filtering (DFF) for learning domain-generalizable features, which is the first endeavour to explicitly modulate the frequency components of different transfer difficulties across domains in the latent space during training. To achieve this, we perform Fast Fourier Transform (FFT) for the feature maps at different layers, then adopt a light-weight module to learn attention masks from the frequency representations after FFT to enhance transferable components while suppressing the components not conducive to generalization. Further, we empirically compare the effectiveness of adopting different types of attention designs for implementing DFF. Extensive experiments demonstrate the effectiveness of our proposed DFF and show that applying our DFF on a plain baseline outperforms the state-of-the-art methods on different domain generalization tasks, including close-set classification and open-set retrieval.
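The FFT-then-mask operation that DFF describes can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: here the mask logits are a given array rather than the output of a learned lightweight module, and the function name `deep_frequency_filter` is invented for illustration.

```python
import numpy as np

def deep_frequency_filter(feat, mask_weights):
    """Illustrative frequency-domain filtering of a feature map.

    feat: (C, H, W) feature map.
    mask_weights: (C, H, W) logits for the attention mask (in DFF these
                  would come from a learned lightweight module).
    Returns the filtered feature map (real-valued, same shape).
    """
    # 1. FFT of each channel's spatial map.
    freq = np.fft.fft2(feat, axes=(-2, -1))
    # 2. Sigmoid attention mask over frequency components:
    #    values near 1 keep a component, values near 0 suppress it.
    mask = 1.0 / (1.0 + np.exp(-mask_weights))
    # 3. Apply the mask in the frequency domain and transform back.
    filtered = np.fft.ifft2(freq * mask, axes=(-2, -1))
    return filtered.real

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
w = rng.standard_normal((4, 8, 8))
y = deep_frequency_filter(x, w)
```

With very large mask logits the sigmoid saturates to 1 and the filter reduces to the identity, which is a handy sanity check on the round trip through the frequency domain.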

Unsupervised Cumulative Domain Adaptation for Foggy Scene Optical Flow
Zhou, Hanyu and Chang, Yi and Yan, Wending and Yan, Luxin



Research question: Existing optical flow methods perform well in clean scenes but degrade under fog.
Motivation: To bridge the clean-to-foggy gap, existing methods typically use domain adaptation to transfer motion knowledge from the clean to the synthetic foggy domain, but they neglect the synthetic-to-real gap and therefore fail when applied to real-world scenes.
Method: We propose a novel unsupervised cumulative domain adaptation optical flow (UCDA-Flow) framework with two stages: depth-association motion adaptation and correlation-alignment motion adaptation. We observe that depth is a key factor influencing optical flow (the deeper the scene, the worse the flow), which motivates a depth-association motion adaptation module to bridge the clean-to-foggy gap. We also find that the cost-volume correlation has a similar distribution in synthetic and real foggy images, which inspires a correlation-alignment motion adaptation module to distill motion knowledge from the synthetic foggy domain to the real foggy domain.
Results: Under this unified framework, the proposed cumulative adaptation progressively transfers knowledge from clean scenes to real foggy scenes. Extensive experiments verify the superiority of the proposed method.

Optical flow has achieved great success under clean scenes, but suffers from restricted performance under foggy scenes. To bridge the clean-to-foggy domain gap, the existing methods typically adopt the domain adaptation to transfer the motion knowledge from clean to synthetic foggy domain. However, these methods unexpectedly neglect the synthetic-to-real domain gap, and thus are erroneous when applied to real-world scenes. To handle the practical optical flow under real foggy scenes, in this work, we propose a novel unsupervised cumulative domain adaptation optical flow (UCDA-Flow) framework: depth-association motion adaptation and correlation-alignment motion adaptation. Specifically, we discover that depth is a key ingredient to influence the optical flow: the deeper the scene, the worse the optical flow, which motivates us to design a depth-association motion adaptation module to bridge the clean-to-foggy domain gap. Moreover, we figure out that the cost volume correlation shares similar distribution of the synthetic and real foggy images, which enlightens us to devise a correlation-alignment motion adaptation module to distill motion knowledge of the synthetic foggy domain to the real foggy domain. Note that synthetic fog is designed as the intermediate domain. Under this unified framework, the proposed cumulative adaptation progressively transfers knowledge from clean scenes to real foggy scenes. Extensive experiments have been performed to verify the superiority of the proposed method.

NoisyTwins: Class-Consistent and Diverse Image Generation Through StyleGANs
Rangwani, Harsh and Bansal, Lavish and Sharma, Kartik and Karmali, Tejan and Jampani, Varun and Babu, R. Venkatesh



Research question: The performance degradation of StyleGANs trained on large-scale long-tailed datasets.
Motivation: StyleGANs offer a semantically disentangled latent space well suited to image editing and manipulation, but their performance degrades severely under class-conditional training on large-scale long-tailed datasets.
Method: We propose NoisyTwins, which introduces an effective, low-cost augmentation strategy for class embeddings and decorrelates the latents via self-supervision in the W space.
Results: On large-scale real-world long-tailed datasets such as ImageNet-LT and iNaturalist 2019, the method outperforms other methods by 19% on FID, establishing a new state of the art.

StyleGANs are at the forefront of controllable image generation as they produce a latent space that is semantically disentangled, making it suitable for image editing and manipulation. However, the performance of StyleGANs severely degrades when trained via class-conditioning on large-scale long-tailed datasets. We find that one reason for degradation is the collapse of latents for each class in the W latent space. With NoisyTwins, we first introduce an effective and inexpensive augmentation strategy for class embeddings, which then decorrelates the latents based on self-supervision in the W space. This decorrelation mitigates collapse, ensuring that our method preserves intra-class diversity with class-consistency in image generation. We show the effectiveness of our approach on large-scale real-world long-tailed datasets of ImageNet-LT and iNaturalist 2019, where our method outperforms other methods by 19% on FID, establishing a new state-of-the-art.
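The two ingredients NoisyTwins combines, noise-augmented class embeddings and latent decorrelation, can be sketched roughly as below. This is an illustrative NumPy sketch under the assumption of a Barlow-Twins-style decorrelation term; the function names are invented, and the actual method operates on StyleGAN's W latent space rather than raw arrays.

```python
import numpy as np

def noisy_twin_embeddings(class_emb, sigma=0.1, rng=None):
    """Create two noise-augmented copies ("twins") of class embeddings.
    class_emb: (B, D) per-sample class embeddings."""
    if rng is None:
        rng = np.random.default_rng()
    t1 = class_emb + sigma * rng.standard_normal(class_emb.shape)
    t2 = class_emb + sigma * rng.standard_normal(class_emb.shape)
    return t1, t2

def decorrelation_loss(w1, w2, off_weight=0.005, eps=1e-8):
    """Barlow-Twins-style self-supervised loss on the twins' latents:
    pushes their cross-correlation matrix toward the identity, which
    discourages per-class latent collapse."""
    w1 = (w1 - w1.mean(axis=0)) / (w1.std(axis=0) + eps)
    w2 = (w2 - w2.mean(axis=0)) / (w2.std(axis=0) + eps)
    c = w1.T @ w2 / w1.shape[0]            # (D, D) cross-correlation
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return on_diag + off_weight * off_diag

rng = np.random.default_rng(0)
emb = rng.standard_normal((32, 8))
t1, t2 = noisy_twin_embeddings(emb, sigma=0.1, rng=rng)
loss = decorrelation_loss(t1, t2)
```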

Robust Outlier Rejection for 3D Registration With Variational Bayes
Jiang, Haobo and Dang, Zheng and Wei, Zhen and Xie, Jin and Yang, Jian and Salzmann, Mathieu



Research question: How to effectively remove outliers (mismatched correspondences) in 3D registration for robust alignment.
Motivation: Existing learning-based outlier rejection methods typically formulate outlier removal as an inlier/outlier classification problem, whose success hinges on learning discriminative inlier/outlier feature representations.
Method: We propose a novel variational non-local network-based outlier rejection framework for robust alignment. By reformulating non-local feature learning with variational Bayesian inference, Bayesian-driven long-range dependencies can be modeled to aggregate discriminative geometric context information for inlier/outlier distinction.
Results: Extensive experiments on the 3DMatch, 3DLoMatch, and KITTI datasets verify the effectiveness of our method.

Learning-based outlier (mismatched correspondence) rejection for robust 3D registration generally formulates the outlier removal as an inlier/outlier classification problem. The core for this to be successful is to learn the discriminative inlier/outlier feature representations. In this paper, we develop a novel variational non-local network-based outlier rejection framework for robust alignment. By reformulating the non-local feature learning with variational Bayesian inference, the Bayesian-driven long-range dependencies can be modeled to aggregate discriminative geometric context information for inlier/outlier distinction. Specifically, to achieve such Bayesian-driven contextual dependencies, each query/key/value component in our non-local network predicts a prior feature distribution and a posterior one. Embedded with the inlier/outlier label, the posterior feature distribution is label-dependent and discriminative. Thus, pushing the prior to be close to the discriminative posterior in the training step enables the features sampled from this prior at test time to model high-quality long-range dependencies. Notably, to achieve effective posterior feature guidance, a specific probabilistic graphical model is designed over our non-local model, which lets us derive a variational lower bound as our optimization objective for model training. Finally, we propose a voting-based inlier searching strategy to cluster the high-quality hypothetical inliers for transformation estimation. Extensive experiments on 3DMatch, 3DLoMatch, and KITTI datasets verify the effectiveness of our method.

Dynamically Instance-Guided Adaptation: A Backward-Free Approach for Test-Time Domain Adaptive Semantic Segmentation
Wang, Wei and Zhong, Zhun and Wang, Weijie and Chen, Xi and Ling, Charles and Wang, Boyu and Sebe, Nicu



Research question: Test-time domain adaptation for semantic segmentation, where efficiency and effectiveness must be improved simultaneously.
Motivation: Existing methods are either inefficient (e.g., backward optimization) or neglect semantic adaptation (e.g., distribution alignment). They also suffer from accumulated errors caused by unstable optimization and abnormal distributions.
Method: We propose a novel backward-free approach called Dynamically Instance-Guided Adaptation (DIGA), in which each instance dynamically guides its own adaptation in a non-parametric way, avoiding error accumulation and expensive optimization cost. Specifically, DIGA consists of a distribution adaptation module (DAM) and a semantic adaptation module (SAM), allowing us to jointly adapt the model in these two indispensable aspects.
Results: Extensive experiments on five target domains demonstrate the effectiveness and efficiency of the proposed method. Our DIGA establishes new state-of-the-art performance in TTDA-Seg.

In this paper, we study the application of Test-time domain adaptation in semantic segmentation (TTDA-Seg) where both efficiency and effectiveness are crucial. Existing methods either have low efficiency (e.g., backward optimization) or ignore semantic adaptation (e.g., distribution alignment). Besides, they would suffer from the accumulated errors caused by unstable optimization and abnormal distributions. To solve these problems, we propose a novel backward-free approach for TTDA-Seg, called Dynamically Instance-Guided Adaptation (DIGA). Our principle is utilizing each instance to dynamically guide its own adaptation in a non-parametric way, which avoids the error accumulation issue and expensive optimizing cost. Specifically, DIGA is composed of a distribution adaptation module (DAM) and a semantic adaptation module (SAM), enabling us to jointly adapt the model in two indispensable aspects. DAM mixes the instance and source BN statistics to encourage the model to capture robust representation. SAM combines the historical prototypes with instance-level prototypes to adjust semantic predictions, which can be associated with the parametric classifier to mutually benefit the final results. Extensive experiments evaluated on five target domains demonstrate the effectiveness and efficiency of the proposed method. Our DIGA establishes new state-of-the-art performance in TTDA-Seg.
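DAM's mixing of instance and source BN statistics can be illustrated with a small sketch. This is not the authors' code: `mixed_bn` and the fixed mixing weight `lam` are illustrative assumptions about how such a statistics mix could be computed.

```python
import numpy as np

def mixed_bn(x, source_mean, source_var, lam=0.7, eps=1e-5):
    """Normalize features with a mix of source-domain BN statistics and
    the current test instance's statistics (DAM-style idea).

    x: (N, C) features of the current test instance/batch.
    lam: weight on the source statistics (lam=1 -> pure source BN,
         lam=0 -> pure instance normalization).
    """
    inst_mean = x.mean(axis=0)
    inst_var = x.var(axis=0)
    mean = lam * source_mean + (1.0 - lam) * inst_mean
    var = lam * source_var + (1.0 - lam) * inst_var
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 4)) * 2.0 + 1.0   # shifted test features
src_mean = np.zeros(4)                          # stored source BN stats
src_var = np.ones(4)
y = mixed_bn(x, src_mean, src_var)
```

Setting `lam=0` recovers plain instance normalization (zero mean per channel), which makes the two extremes of the mix easy to check.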

LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data
Park, Jihye and Kim, Sunwoo and Kim, Soohyun and Cho, Seokju and Yoo, Jaejun and Uh, Youngjung and Kim, Seungryong



Research question: Existing image-to-image translation techniques rely heavily on per-sample domain annotation and/or cannot handle multiple attributes per image.
Motivation: Recent truly unsupervised methods adopt clustering to easily provide per-sample one-hot domain labels, but they cannot handle the real-world case where a sample has multiple attributes, and the semantics of the clusters do not couple easily to human understanding.
Method: We propose a language-driven image-to-image translation model, called LANIT. We leverage easy-to-obtain candidate attributes given in text; the similarity between images and attributes yields per-sample domain labels. This formulation naturally enables multi-hot labels, letting users specify the target domain with a set of attributes in language. To handle inaccurate initial prompts, we also propose prompt learning, along with a domain regularization loss that enforces translated images to map to the corresponding domain.
Results: Experiments on several standard benchmarks show that LANIT achieves performance comparable or superior to existing models. Code is available at github.com/KU-CVLAB/LANIT.

Existing techniques for image-to-image translation commonly have suffered from two critical problems: heavy reliance on per-sample domain annotation and/or inability to handle multiple attributes per image. Recent truly-unsupervised methods adopt clustering approaches to easily provide per-sample one-hot domain labels. However, they cannot account for the real-world setting: one sample may have multiple attributes. In addition, the semantics of the clusters are not easily coupled to human understanding. To overcome these, we present LANguage-driven Image-to-image Translation model, dubbed LANIT. We leverage easy-to-obtain candidate attributes given in texts for a dataset: the similarity between images and attributes indicates per-sample domain labels. This formulation naturally enables multi-hot labels so that users can specify the target domain with a set of attributes in language. To account for the case that the initial prompts are inaccurate, we also present prompt learning. We further present domain regularization loss that enforces translated images to be mapped to the corresponding domain. Experiments on several standard benchmarks demonstrate that LANIT achieves comparable or superior performance to existing models. The code is available at github.com/KU-CVLAB/LANIT.
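The per-sample multi-hot labeling that LANIT derives from image-attribute similarity can be sketched as below, assuming CLIP-style embeddings where cosine similarity is meaningful. The threshold value and the function name are illustrative, not the paper's exact formulation.

```python
import numpy as np

def multi_hot_domains(img_emb, attr_emb, thresh=0.2):
    """Per-sample multi-hot domain labels from image-attribute similarity.

    img_emb:  (N, D) image embeddings.
    attr_emb: (K, D) text embeddings of K candidate attributes.
    Returns an (N, K) 0/1 matrix; a row may have several 1s, so one
    sample can belong to multiple attribute domains at once.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    att = attr_emb / np.linalg.norm(attr_emb, axis=1, keepdims=True)
    sim = img @ att.T                  # cosine similarity (N, K)
    return (sim > thresh).astype(int)

attrs = np.eye(3)                      # 3 orthogonal attribute embeddings
imgs = np.array([[1.0, 1.0, 0.0],      # shares two attributes
                 [0.0, 0.0, 1.0]])     # matches only the third
labels = multi_hot_domains(imgs, attrs)
```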

Contrastive Semi-Supervised Learning for Underwater Image Restoration via Reliable Bank
Huang, Shirui and Wang, Keyan and Liu, Huan and Chen, Jun and Li, Yunsong



Research question: Despite the remarkable achievements of recent underwater image restoration techniques, the lack of labeled data has become a major hurdle to further progress.
Motivation: To address this, we propose a mean-teacher-based semi-supervised underwater image restoration (Semi-UIR) framework that incorporates unlabeled data into network training.
Method: We first introduce a reliable bank that stores the "best-ever" outputs as pseudo ground truth. To assess output quality, we conduct an empirical analysis based on the monotonicity property to select the most trustworthy no-reference image quality assessment (NR-IQA) method. In addition, to prevent overfitting to wrong labels, we incorporate contrastive regularization.
Results: Experiments on both full-reference and no-reference underwater benchmarks show that our algorithm clearly outperforms state-of-the-art methods both quantitatively and qualitatively.

Despite the remarkable achievement of recent underwater image restoration techniques, the lack of labeled data has become a major hurdle for further progress. In this work, we propose a mean-teacher based Semi-supervised Underwater Image Restoration (Semi-UIR) framework to incorporate the unlabeled data into network training. However, the naive mean-teacher method suffers from two main problems: (1) The consistency loss used in training might become ineffective when the teacher's prediction is wrong. (2) Using L1 distance may cause the network to overfit wrong labels, resulting in confirmation bias. To address the above problems, we first introduce a reliable bank to store the "best-ever" outputs as pseudo ground truth. To assess the quality of outputs, we conduct an empirical analysis based on the monotonicity property to select the most trustworthy NR-IQA method. Besides, in view of the confirmation bias problem, we incorporate contrastive regularization to prevent the overfitting on wrong labels. Experimental results on both full-reference and non-reference underwater benchmarks demonstrate that our algorithm has obvious improvement over SOTA methods quantitatively and qualitatively. Code has been released at https://github.com/Huang-ShiRui/Semi-UIR.
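The reliable bank's "best-ever" rule can be sketched as a keyed maximum over NR-IQA scores. This is a minimal sketch under the assumptions that a higher score means better quality and that outputs are keyed per sample; `update_reliable_bank` is an invented name, not the paper's API.

```python
def update_reliable_bank(bank, sample_id, output, score):
    """Keep the best-ever output per sample as pseudo ground truth.

    bank:      dict mapping sample_id -> (best_score, best_output).
    score:     quality score from a no-reference IQA method
               (assumed higher = better).
    Only replaces the stored entry when the new score is strictly higher.
    """
    best = bank.get(sample_id)
    if best is None or score > best[0]:
        bank[sample_id] = (score, output)
    return bank

# Three successive teacher outputs for the same image: only the
# highest-scoring one survives as the pseudo ground truth.
bank = {}
update_reliable_bank(bank, "img_001", "restored_v1", 0.41)
update_reliable_bank(bank, "img_001", "restored_v2", 0.67)
update_reliable_bank(bank, "img_001", "restored_v3", 0.55)
```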

EcoTTA: Memory-Efficient Continual Test-Time Adaptation via Self-Distilled Regularization
Song, Junha and Lee, Jungsoo and Kweon, In So and Choi, Sungha



Research question: How to make continual test-time adaptation (TTA) efficient under limited memory.
Motivation: TTA is mostly conducted on edge devices with limited memory, so reducing memory consumption is crucial, yet it has been overlooked in previous TTA studies. Moreover, long-term adaptation often leads to catastrophic forgetting and error accumulation, hindering real-world deployment of TTA.
Method: We propose two components. First, lightweight meta networks adapt the frozen original network to the target domain; this new architecture minimizes memory consumption by reducing the size of the intermediate activations required for backpropagation. Second, our self-distilled regularization keeps the meta networks' output from deviating significantly from the frozen original network's output, preserving well-trained knowledge from the source domain. Without additional memory, this regularization prevents error accumulation and catastrophic forgetting, maintaining stable performance even in long-term test-time adaptation.
Results: Across various benchmarks, the approach outperforms other state-of-the-art methods on image classification and semantic segmentation tasks. Notably, with ResNet-50 and WideResNet-40 our method uses 86% and 80% less memory, respectively, than the recent state-of-the-art CoTTA.

This paper presents a simple yet effective approach that improves continual test-time adaptation (TTA) in a memory-efficient manner. TTA may primarily be conducted on edge devices with limited memory, so reducing memory is crucial but has been overlooked in previous TTA studies. In addition, long-term adaptation often leads to catastrophic forgetting and error accumulation, which hinders applying TTA in real-world deployments. Our approach consists of two components to address these issues. First, we present lightweight meta networks that can adapt the frozen original networks to the target domain. This novel architecture minimizes memory consumption by decreasing the size of intermediate activations required for backpropagation. Second, our novel self-distilled regularization controls the output of the meta networks not to deviate significantly from the output of the frozen original networks, thereby preserving well-trained knowledge from the source domain. Without additional memory, this regularization prevents error accumulation and catastrophic forgetting, resulting in stable performance even in long-term test-time adaptation. We demonstrate that our simple yet effective strategy outperforms other state-of-the-art methods on various benchmarks for image classification and semantic segmentation tasks. Notably, our proposed method with ResNet-50 and WideResNet-40 takes 86% and 80% less memory than the recent state-of-the-art method, CoTTA.
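The self-distilled regularization can be sketched as a penalty between the adapted (meta) output and the frozen network's output, added to the unsupervised adaptation objective. The L1 distance and the weight `alpha` here are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def self_distill_reg(meta_out, frozen_out):
    """Regularizer keeping the adapted output close to the frozen
    original network's output (preserves source knowledge)."""
    return np.abs(meta_out - frozen_out).mean()

def tta_loss(adapt_term, meta_out, frozen_out, alpha=0.5):
    """Total test-time objective: an unsupervised adaptation term
    (e.g. prediction entropy) plus the self-distilled regularizer."""
    return adapt_term + alpha * self_distill_reg(meta_out, frozen_out)

frozen = np.array([0.2, 0.5, 0.3])     # frozen network's prediction
adapted = np.array([0.1, 0.6, 0.3])    # meta networks' prediction
loss = tta_loss(1.0, adapted, frozen)
```

When the adapted output matches the frozen one exactly, the regularizer vanishes and only the adaptation term remains, which is the intended anchor behavior.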

Unlearnable Clusters: Towards Label-Agnostic Unlearnable Examples
Zhang, Jiaming and Ma, Xingjun and Yi, Qi and Sang, Jitao and Jiang, Yu-Gang and Wang, Yaowei and Xu, Changsheng



Research question: How to effectively prevent visual privacy leaks on the Internet.
Motivation: Existing methods for generating unlearnable examples (UEs) all rely on a label-consistency assumption, yet in practice hackers may exploit the protected data quite differently from the protectors.
Method: We propose and promote a more practical label-agnostic setting, together with a new technique, Unlearnable Clusters (UCs), that generates label-agnostic unlearnable examples with cluster-wise perturbations. We further leverage vision-and-language pretrained models such as CLIP as surrogate models to improve the transferability of the crafted UCs to diverse domains.
Results: We empirically verify the effectiveness of the proposed approach under a variety of settings with different datasets, target models, and even commercial platforms such as Microsoft Azure and Baidu PaddlePaddle.

There is a growing interest in developing unlearnable examples (UEs) against visual privacy leaks on the Internet. UEs are training samples added with invisible but unlearnable noise, which have been found to prevent unauthorized training of machine learning models. UEs typically are generated via a bilevel optimization framework with a surrogate model to remove (minimize) errors from the original samples, and then applied to protect the data against unknown target models. However, existing UE generation methods all rely on an ideal assumption called label-consistency, where the hackers and protectors are assumed to hold the same label for a given sample. In this work, we propose and promote a more practical label-agnostic setting, where the hackers may exploit the protected data quite differently from the protectors. E.g., an m-class unlearnable dataset held by the protector may be exploited by the hacker as an n-class dataset. Existing UE generation methods are rendered ineffective in this challenging setting. To tackle this challenge, we present a novel technique called Unlearnable Clusters (UCs) to generate label-agnostic unlearnable examples with cluster-wise perturbations. Furthermore, we propose to leverage Vision-and-Language Pretrained Models (VLPMs) like CLIP as the surrogate model to improve the transferability of the crafted UCs to diverse domains. We empirically verify the effectiveness of our proposed approach under a variety of settings with different datasets, target models, and even commercial platforms Microsoft Azure and Baidu PaddlePaddle. Code is available at https://github.com/jiamingzhang94/Unlearnable-Clusters.

Rethinking Federated Learning With Domain Shift: A Prototype View
Huang, Wenke and Ye, Mang and Shi, Zekun and Li, He and Du, Bo



Research question: How federated learning can improve model generalization when the distributed data come from different domains.
Motivation: Existing federated learning methods mainly focus on private data from the same domain; when distributed data come from diverse domains, local models show degraded performance on other domains (domain shift).
Method: We propose Federated Prototypes Learning (FPL), which constructs cluster prototypes and unbiased prototypes to provide valuable domain knowledge and a fair convergence target. On the one hand, sample embeddings are pulled closer to cluster prototypes of the same semantics than to cluster prototypes of distinct classes; on the other hand, a consistency regularization aligns each local instance with its respective unbiased prototype.
Results: Empirical results on the Digits and Office Caltech tasks demonstrate the effectiveness of the method and the efficiency of its key modules.

Federated learning shows a bright promise as a privacy-preserving collaborative learning technique. However, prevalent solutions mainly focus on all private data sampled from the same domain. An important challenge arises when distributed data are derived from diverse domains: the private model presents degenerative performance on other domains (with domain shift). Therefore, we expect that the global model optimized after the federated learning process stably provides generalization performance on multiple domains. In this paper, we propose Federated Prototypes Learning (FPL) for federated learning under domain shift. The core idea is to construct cluster prototypes and unbiased prototypes, providing fruitful domain knowledge and a fair convergent target. On the one hand, we pull the sample embedding closer to cluster prototypes belonging to the same semantics than cluster prototypes from distinct classes. On the other hand, we introduce consistency regularization to align the local instance with the respective unbiased prototype. Empirical results on Digits and Office Caltech tasks demonstrate the effectiveness of the proposed solution and the efficiency of crucial modules.
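Pulling an embedding toward same-class prototypes and away from other classes' prototypes can be written as an InfoNCE-style loss. This sketch simplifies FPL's design by assuming one prototype per class and cosine similarity; the function name and temperature are illustrative.

```python
import numpy as np

def prototype_contrastive_loss(z, prototypes, label, tau=0.5):
    """InfoNCE-style loss pulling embedding z toward the prototype of
    its own class and pushing it from other classes' prototypes.

    z: (D,) sample embedding; prototypes: (K, D), one per class;
    label: index of z's class; tau: temperature.
    """
    z = z / np.linalg.norm(z)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = p @ z / tau               # cosine similarities / temperature
    logits = logits - logits.max()     # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[label])       # cross-entropy against own class

protos = np.eye(3)                     # 3 toy class prototypes
z = np.array([0.9, 0.1, 0.0])          # embedding near class-0 prototype
```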

Augmentation Matters: A Simple-Yet-Effective Approach to Semi-Supervised Semantic Segmentation
Zhao, Zhen and Yang, Lihe and Long, Sifan and Pi, Jimin and Zhou, Luping and Wang, Jingdong



Research question: Improving semi-supervised semantic segmentation (SSS) through data perturbations.
Motivation: Despite their strong performance, current state-of-the-art methods tend toward increasingly complex designs, introducing more network components and additional training procedures.
Method: Following the standard teacher-student framework, we propose AugSeg, a simple and clean approach that focuses mainly on data perturbations to boost SSS performance. Rather than directly applying augmentation techniques from supervised learning, we adjust them to better suit semi-supervised scenarios.
Results: Without bells and whistles, our simple AugSeg readily achieves new state-of-the-art performance on SSS benchmarks under different partition protocols.

Recent studies on semi-supervised semantic segmentation (SSS) have seen fast progress. Despite their promising performance, current state-of-the-art methods tend toward increasingly complex designs at the cost of introducing more network components and additional training procedures. Differently, in this work, we follow a standard teacher-student framework and propose AugSeg, a simple and clean approach that focuses mainly on data perturbations to boost the SSS performance. We argue that various data augmentations should be adjusted to better adapt to the semi-supervised scenarios instead of directly applying these techniques from supervised learning. Specifically, we adopt a simplified intensity-based augmentation that selects a random number of data transformations with distortion strengths uniformly sampled from a continuous space. Based on the estimated confidence of the model on different unlabeled samples, we also randomly inject labelled information to augment the unlabeled samples in an adaptive manner. Without bells and whistles, our simple AugSeg can readily achieve new state-of-the-art performance on SSS benchmarks under different partition protocols.
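The simplified intensity-based augmentation, a random subset of transforms with continuously sampled strengths, might look like the sketch below. The concrete op set and ranges are invented for illustration; AugSeg's actual transform list differs.

```python
import numpy as np

def random_intensity_augment(img, ops, k, rng=None):
    """Apply k randomly chosen intensity augmentations, each with a
    distortion strength drawn uniformly from a continuous range.

    img: float image in [0, 1]; ops: dict name -> fn(img, strength).
    """
    if rng is None:
        rng = np.random.default_rng()
    names = rng.choice(sorted(ops), size=k, replace=False)
    out = img
    for name in names:
        strength = rng.uniform(0.0, 1.0)   # continuous strength
        out = ops[name](out, strength)
    return np.clip(out, 0.0, 1.0)

# Illustrative intensity ops (not AugSeg's exact transform set).
ops = {
    "brightness": lambda im, s: im * (1.0 + 0.5 * s),
    "contrast":   lambda im, s: (im - im.mean()) * (1.0 + s) + im.mean(),
    "gamma":      lambda im, s: np.clip(im, 0.0, 1.0) ** (1.0 + s),
}
rng = np.random.default_rng(0)
img = rng.uniform(size=(4, 4))
aug = random_intensity_augment(img, ops, k=2, rng=rng)
```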

Multiclass Confidence and Localization Calibration for Object Detection
Pathiraja, Bimsara and Gunawardhana, Malitha and Khan, Muhammad Haris



Research question: Despite achieving high predictive accuracy on many challenging computer vision problems, deep neural networks tend to be overconfident, yielding poorly calibrated predictions.
Motivation: Most existing work on improving DNN calibration is limited to classification tasks and to calibrating in-domain predictions, while little attention has been paid to calibrating object detection methods, which are central to vision-based security-sensitive and safety-critical applications.
Method: We propose a new train-time calibration technique for object detection methods that jointly calibrates multiclass confidence and box localization by leveraging their predictive uncertainties.
Results: Extensive experiments on several in-domain and out-of-domain detection benchmarks show that our novel train-time calibration method consistently outperforms several baselines in reducing calibration error for both in-domain and out-of-domain predictions. Our code and models are available at https://github.com/bimsarapathiraja/MCCL.

Albeit achieving high predictive accuracy across many challenging computer vision problems, recent studies suggest that deep neural networks (DNNs) tend to make overconfident predictions, rendering them poorly calibrated. Most of the existing attempts for improving DNN calibration are limited to classification tasks and restricted to calibrating in-domain predictions. Surprisingly, very little to no attempts have been made in studying the calibration of object detection methods, which occupy a pivotal space in vision-based security-sensitive and safety-critical applications. In this paper, we propose a new train-time technique for calibrating modern object detection methods. It is capable of jointly calibrating multiclass confidence and box localization by leveraging their predictive uncertainties. We perform extensive experiments on several in-domain and out-of-domain detection benchmarks. Results demonstrate that our proposed train-time calibration method consistently outperforms several baselines in reducing calibration error for both in-domain and out-of-domain predictions. Our code and models are available at https://github.com/bimsarapathiraja/MCCL.

Leveraging Inter-Rater Agreement for Classification in the Presence of Noisy Labels
Bucarelli, Maria Sofia and Cassano, Lucas and Siciliano, Federico and Mantrach, Amin and Silvestri, Fabrizio



Research question: How to use inter-annotator statistics to estimate the noise distribution to which labels are subject, and how to use that noise distribution to learn from a noisy dataset.
Motivation: In practice, classification datasets are obtained through human labelling; labels may be noisy because the same sample is annotated by multiple, possibly disagreeing, annotators.
Method: We show how to leverage inter-annotator statistics to estimate the noise distribution affecting the labels, and introduce methods that use this estimated distribution to learn from the noisy dataset.
Results: Experiments demonstrate the effectiveness of the proposed methods.

In practical settings, classification datasets are obtained through a labelling process that is usually done by humans. Labels can be noisy as they are obtained by aggregating the different individual labels assigned to the same sample by multiple, and possibly disagreeing, annotators. The inter-rater agreement on these datasets can be measured while the underlying noise distribution to which the labels are subject is assumed to be unknown. In this work, we: (i) show how to leverage the inter-annotator statistics to estimate the noise distribution to which labels are subject; (ii) introduce methods that use the estimate of the noise distribution to learn from the noisy dataset; and (iii) establish generalization bounds in the empirical risk minimization framework that depend on the estimated quantities. We conclude the paper by providing experiments that illustrate our findings.
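One simple way to estimate a label-noise transition matrix from inter-annotator statistics is to take the majority vote as a proxy for the clean label and tally how individual annotations deviate from it. This is a rough illustrative estimator under that majority-vote assumption, not the paper's method.

```python
import numpy as np

def estimate_noise_matrix(annotations, num_classes):
    """Estimate T[i, j] ~ P(observed label = j | clean label = i) from
    multi-annotator labels, using the per-sample majority vote as a
    proxy for the unknown clean label.

    annotations: (N, R) integer labels from R annotators per sample.
    Returns a row-stochastic (num_classes, num_classes) matrix.
    """
    T = np.zeros((num_classes, num_classes))
    for row in annotations:
        clean = np.bincount(row, minlength=num_classes).argmax()
        for lab in row:
            T[clean, lab] += 1
    row_sums = T.sum(axis=1, keepdims=True)
    return T / np.maximum(row_sums, 1.0)   # avoid division by zero

ann = np.array([
    [0, 0, 0],
    [0, 0, 1],   # one annotator disagrees with the majority
    [1, 1, 1],
    [1, 1, 1],
])
T = estimate_noise_matrix(ann, 2)
```

With perfect agreement the estimate is the identity matrix; disagreement shows up as off-diagonal mass on the corresponding clean-label row.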

Meta Compositional Referring Expression Segmentation
Xu, Li and Huang, Mark He and Shang, Xindi and Yuan, Zehuan and Sun, Ying and Liu, Jun



Research question: In referring expression segmentation, existing models may not fully capture the semantics and visual representations of individual concepts.
Motivation: Despite progress on this task, existing models show limited generalization when handling novel compositions of learned concepts.
Method: Through the lens of meta learning, we propose a Meta Compositional Referring Expression Segmentation (MCRES) framework to enhance compositional generalization. The framework first uses the training data to construct a virtual training set and multiple virtual testing sets, then applies a novel meta-optimization scheme so that the model, after training on the virtual training set, achieves good testing performance on the virtual testing sets.
Results: Experiments show that the framework effectively improves the model's ability to capture the semantics and visual representations of individual concepts, yielding robust generalization when handling novel compositions.

Referring expression segmentation aims to segment an object described by a language expression from an image. Despite the recent progress on this task, existing models tackling this task may not be able to fully capture semantics and visual representations of individual concepts, which limits their generalization capability, especially when handling novel compositions of learned concepts. In this work, through the lens of meta learning, we propose a Meta Compositional Referring Expression Segmentation (MCRES) framework to enhance model compositional generalization performance. Specifically, to handle various levels of novel compositions, our framework first uses training data to construct a virtual training set and multiple virtual testing sets, where data samples in each virtual testing set contain a level of novel compositions w.r.t. the support set. Then, following a novel meta optimization scheme to optimize the model to obtain good testing performance on the virtual testing sets after training on the virtual training set, our framework can effectively drive the model to better capture semantics and visual representations of individual concepts, and thus obtain robust generalization performance even when handling novel compositions. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our framework.

Continual Semantic Segmentation With Automatic Memory Sample Selection
Zhu, Lanyun and Chen, Tianrun and Yin, Jianxiong and See, Simon and Liu, Jun



Research question: Alleviating the catastrophic forgetting caused by incrementally introduced new classes in continual semantic segmentation (CSS).
Motivation: Existing CSS methods select memory samples for replay either randomly or based on a single hand-crafted factor, with no guarantee of optimality.
Method: We propose a novel memory sample selection mechanism that automatically selects informative samples for effective replay by considering comprehensive factors, including sample diversity and class performance.
Results: Extensive experiments on the Pascal-VOC 2012 and ADE 20K datasets demonstrate the effectiveness of the approach, which achieves state-of-the-art performance, outperforming the second-best method by 12.54% under the 6-stage setting.

Continual Semantic Segmentation (CSS) extends static semantic segmentation by incrementally introducing new classes for training. To alleviate the catastrophic forgetting issue in CSS, a memory buffer that stores a small number of samples from the previous classes is constructed for replay. However, existing methods select the memory samples either randomly or based on a single-factor-driven hand-crafted strategy, which has no guarantee to be optimal. In this work, we propose a novel memory sample selection mechanism that selects informative samples for effective replay in a fully automatic way by considering comprehensive factors including sample diversity and class performance. Our mechanism regards the selection operation as a decision-making process and learns an optimal selection policy that directly maximizes the validation performance on a reward set. To facilitate the selection decision, we design a novel state representation and a dual-stage action space. Our extensive experiments on Pascal-VOC 2012 and ADE 20K datasets demonstrate the effectiveness of our approach with state-of-the-art (SOTA) performance achieved, outperforming the second-place one by 12.54% for the 6-stage setting on Pascal-VOC 2012.

Meta-Tuning Loss Functions and Data Augmentation for Few-Shot Object Detection
Demirel, Berkan and Baran, Orhun Bu\u{g}



Research question: Modeling novel object detection categories from few training instances.
Motivation: Detection performance on novel object categories with current few-shot learning and object detection techniques still leaves room for improvement.
Method: We propose a meta-learning-based training scheme that optimizes the loss functions and augmentation strategies driving fine-tuning, thereby improving existing few-shot techniques.
Results: Experiments show that the method outperforms existing fine-tuning-based few-shot object detection baselines on both standard and generalized few-shot performance metrics, with significant improvements on the Pascal VOC and MS-COCO datasets.

Few-shot object detection, the problem of modelling novel object detection categories with few training instances, is an emerging topic in the area of few-shot learning and object detection. Contemporary techniques can be divided into two groups: fine-tuning based and meta-learning based approaches. While meta-learning approaches aim to learn dedicated meta-models for mapping samples to novel class models, fine-tuning approaches tackle few-shot detection in a simpler manner, by adapting the detection model to novel classes through gradient based optimization. Despite their simplicity, fine-tuning based approaches typically yield competitive detection results. Based on this observation, we focus on the role of loss functions and augmentations as the force driving the fine-tuning process, and propose to tune their dynamics through meta-learning principles. The proposed training scheme, therefore, allows learning inductive biases that can boost few-shot detection, while keeping the advantages of fine-tuning based approaches. In addition, the proposed approach yields interpretable loss functions, as opposed to highly parametric and complex few-shot meta-models. The experimental results highlight the merits of the proposed scheme, with significant improvements over the strong fine-tuning based few-shot detection baselines on benchmark Pascal VOC and MS-COCO datasets, in terms of both standard and generalized few-shot performance metrics.

GCFAgg: Global and Cross-View Feature Aggregation for Multi-View Clustering
Yan, Weiqing and Zhang, Yuanyang and Lv, Chenlei and Tang, Chang and Yue, Guanghui and Liao, Liang and Lin, Weisi



Research question: How to partition multi-view data samples into their categories by learning a consensus representation in an unsupervised way.
Motivation: Most existing deep multi-view clustering methods learn consensus or view-specific representations via view-wise aggregation, ignoring the structural relationships among all samples.
Method: We propose Global and Cross-view Feature Aggregation for Multi-View Clustering (GCFAggMVC). The consensus representation is obtained via cross-sample and cross-view feature aggregation, fully exploiting the complementarity of similar samples, and a structure-guided contrastive learning module aligns the consensus and view-specific representations so that view-specific representations of samples with high structural relationship become similar. The module is flexible and can also be plugged into other frameworks for incomplete multi-view data clustering.
Results: Extensive experiments show excellent performance on both complete and incomplete multi-view data clustering tasks.

Multi-view clustering can partition data samples into their categories by learning a consensus representation in an unsupervised way and has received more and more attention in recent years. However, most existing deep clustering methods learn consensus representation or view-specific representations from multiple views via view-wise aggregation, ignoring the structural relationships among all samples. In this paper, we propose a novel multi-view clustering network to address these problems, called Global and Cross-view Feature Aggregation for Multi-View Clustering (GCFAggMVC). Specifically, the consensus data representation from multiple views is obtained via cross-sample and cross-view feature aggregation, which fully explores the complementarity of similar samples. Moreover, we align the consensus representation and the view-specific representation by the structure-guided contrastive learning module, which makes the view-specific representations from different samples with high structural relationship similar. The proposed module is a flexible multi-view data representation module, which can also be embedded into the incomplete multi-view data clustering task by plugging our module into other frameworks. Extensive experiments show that the proposed method achieves excellent performance in both complete multi-view data clustering tasks and incomplete multi-view data clustering tasks.

Class Balanced Adaptive Pseudo Labeling for Federated Semi-Supervised Learning
Li, MingandLi, QingliandWang, Yan



Research question: This paper addresses federated semi-supervised learning (FSSL), in which a few clients hold fully labeled data (labeled clients) while the training data on the remaining clients are entirely unlabeled (unlabeled clients).
Motivation: Existing methods try to handle the challenges of the non-IID setting. Although approaches such as sub-consensus models have been proposed, they usually apply standard pseudo labeling or consistency regularization on unlabeled clients, which is easily affected by imbalanced class distributions, so the FSSL problem remains unsolved.
Method: We propose Class Balanced Adaptive Pseudo Labeling (CBAFed), which studies FSSL from the pseudo-labeling perspective. The first key element of CBAFed is a fixed pseudo-labeling strategy that handles the catastrophic forgetting problem; the second is class-balanced adaptive thresholds designed from the empirical distribution of all training data on local clients, to encourage a balanced training process. To reach a better optimum, we further propose a residual weight connection between local supervised training and global model aggregation.
Results: Extensive experiments on five datasets demonstrate the superiority of CBAFed. Code will be released.

This paper focuses on federated semi-supervised learning (FSSL), assuming that few clients have fully labeled data (labeled clients) and the training datasets in other clients are fully unlabeled (unlabeled clients). Existing methods attempt to deal with the challenges caused by not independent and identically distributed data (Non-IID) setting. Though methods such as sub-consensus models have been proposed, they usually adopt standard pseudo labeling or consistency regularization on unlabeled clients which can be easily influenced by imbalanced class distribution. Thus, problems in FSSL are still yet to be solved. To seek for a fundamental solution to this problem, we present Class Balanced Adaptive Pseudo Labeling (CBAFed), to study FSSL from the perspective of pseudo labeling. In CBAFed, the first key element is a fixed pseudo labeling strategy to handle the catastrophic forgetting problem, where we keep a fixed set by letting pass information of unlabeled data at the beginning of the unlabeled client training in each communication round. The second key element is that we design class balanced adaptive thresholds via considering the empirical distribution of all training data in local clients, to encourage a balanced training process. To make the model reach a better optimum, we further propose a residual weight connection in local supervised training and global model aggregation. Extensive experiments on five datasets demonstrate the superiority of CBAFed. Code will be released.
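The second key element of the abstract, class-balanced adaptive thresholds derived from the empirical pseudo-label distribution, can be illustrated with a toy sketch: classes that are rare in the current pseudo-label pool get a lower confidence bar. The function name, scaling rule, and 0.5 floor below are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def class_balanced_thresholds(pseudo_labels, num_classes, base_tau=0.95):
    """Illustrative per-class thresholds: under-represented classes in the
    current pseudo-label pool get a lower confidence bar, encouraging a
    more balanced training process (a sketch, not CBAFed's exact rule)."""
    counts = np.bincount(pseudo_labels, minlength=num_classes).astype(float)
    # empirical class distribution, smoothed so empty classes are defined
    freq = (counts + 1.0) / (counts.sum() + num_classes)
    # classes at or above the uniform frequency keep the full threshold
    scale = np.minimum(1.0, freq * num_classes)
    # floor the scaling so very rare classes do not accept pure noise
    return base_tau * np.maximum(scale, 0.5)
```

With 90 pseudo-labels of class 0 and 10 of class 1, the minority class receives a lower threshold while the majority class keeps the base value.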

Rethinking Out-of-Distribution (OOD) Detection: Masked Image Modeling Is All You Need
Li, JingyaoandChen, PengguangandHe, ZexinandYu, ShaozuoandLiu, ShuandJia, Jiaya



Research question: How to improve model performance on out-of-distribution (OOD) detection.
Motivation: Existing OOD detection methods are mostly recognition-based, which tend to learn shortcuts rather than comprehensive representations.
Method: This paper adopts a reconstruction-based pretext task, masked image modeling, for an OOD detection framework (MOOD).
Results: Experiments show that MOOD achieves significant gains on a range of OOD detection tasks, including one-class, multi-class, and near-distribution OOD detection, and even beats 10-shot-per-class outlier-exposure OOD detection without using any OOD samples.

The core of out-of-distribution (OOD) detection is to learn the in-distribution (ID) representation, which is distinguishable from OOD samples. Previous work applied recognition-based methods to learn the ID features, which tend to learn shortcuts instead of comprehensive representations. In this work, we find surprisingly that simply using reconstruction-based methods could boost the performance of OOD detection significantly. We deeply explore the main contributors of OOD detection and find that reconstruction-based pretext tasks have the potential to provide a generally applicable and efficacious prior, which benefits the model in learning intrinsic data distributions of the ID dataset. Specifically, we take Masked Image Modeling as a pretext task for our OOD detection framework (MOOD). Without bells and whistles, MOOD outperforms previous SOTA of one-class OOD detection by 5.7%, multi-class OOD detection by 3.0%, and near-distribution OOD detection by 2.1%. It even defeats the 10-shot-per-class outlier exposure OOD detection, although we do not include any OOD samples for our detection.

Multi Domain Learning for Motion Magnification
Singh, JasdeepandMurala, SubrahmanyamandKosuru, G.SankaraRaju



Research question: How to effectively magnify subtle motions in video, such as small chest movements while breathing and slight vibrations of moving objects, while suppressing the effects of noise, illumination changes, and large motions.
Motivation: Existing state-of-the-art methods rely mainly on hand-crafted concepts; they achieve some magnification but produce problems such as ringing artifacts. Deep-learning-based approaches offer higher magnification but suffer severe artifacts in some scenarios.
Method: We propose a new phase-based deep network for video motion magnification that operates in both the frequency and spatial domains: it first generates motion magnification from phase fluctuations in the frequency domain and then improves its quality in the spatial domain. The proposed models are lightweight networks with few parameters (0.11M and 0.05M, respectively).
Results: We compare the proposed method with state-of-the-art approaches and evaluate it on real-world and synthetic videos. Finally, an ablation study shows the impact of different parts of the network.

Video motion magnification makes subtle invisible motions visible, such as small chest movements while breathing and subtle vibrations of moving objects. But small motions are prone to noise, illumination changes, large motions, etc., making the task difficult. Most state-of-the-art methods use hand-crafted concepts, which result in small magnification, ringing artifacts, etc. The deep-learning-based approach has higher magnification but is prone to severe artifacts in some scenarios. We propose a new phase-based deep network for video motion magnification that operates in both domains (frequency and spatial) to address this issue. It generates motion magnification from frequency-domain phase fluctuations and then improves its quality in the spatial domain. The proposed models are lightweight networks with few parameters (0.11M and 0.05M, respectively). Further, the proposed networks' performance is compared to the SOTA approaches and evaluated on real-world and synthetic videos. Finally, an ablation study is also conducted to show the impact of different parts of the network.

You Do Not Need Additional Priors or Regularizers in Retinex-Based Low-Light Image Enhancement
Fu, HuiyuanandZheng, WenkaiandMeng, XiangyuandWang, XinandWang, ChuanmingandMa, Huadong



Research question: How to improve the quality of images captured in low-light conditions.
Motivation: Existing deep Retinex-based methods must decompose an image into reflectance and illumination components, which is a highly ill-posed problem with no available ground truth.
Method: We propose a contrastive learning and self-knowledge distillation approach that trains the model for Retinex decomposition without elaborate hand-crafted regularization functions.
Results: Experiments show that the method outperforms the state of the art.

Images captured in low-light conditions often suffer from significant quality degradation. Recent works have built a large variety of deep Retinex-based networks to enhance low-light images. The Retinex-based methods require decomposing the image into reflectance and illumination components, which is a highly ill-posed problem and there is no available ground truth. Previous works addressed this problem by imposing some additional priors or regularizers. However, finding an effective prior or regularizer that can be applied in various scenes is challenging, and the performance of the model suffers from too many additional constraints. We propose a contrastive learning method and a self-knowledge distillation method that allow training our Retinex-based model for Retinex decomposition without elaborate hand-crafted regularization functions. Rather than estimating reflectance and illuminance images and representing the final images as their element-wise products as in previous works, our regularizer-free Retinex decomposition and synthesis network (RFR) extracts reflectance and illuminance features and synthesizes them end-to-end. In addition, we propose a loss function for contrastive learning and a progressive learning strategy for self-knowledge distillation. Extensive experimental results demonstrate that our proposed methods can achieve superior performance compared with state-of-the-art approaches.

Re-Thinking Model Inversion Attacks Against Deep Neural Networks
Nguyen, Ngoc-BaoandChandrasegaran, KeshigeyanandAbdollahzadeh, MiladandCheung, Ngai-Man



Research question: Model inversion (MI) attacks aim to infer and reconstruct private training data by abusing access to a model, raising concerns about the leakage of sensitive information (e.g., private face images used to train a face recognition system).
Motivation: All state-of-the-art MI algorithms share two fundamental issues, for which we propose solutions that significantly boost attack performance across the board.
Method: We revisit MI and study two fundamental issues of all state-of-the-art MI algorithms. Our contributions are two-fold: 1) we analyze the optimization objective of SOTA MI algorithms, argue that it is sub-optimal for achieving MI, and propose an improved objective that significantly boosts attack performance; 2) we analyze "MI overfitting", show that it can prevent reconstructed images from learning the semantics of the training data, and propose a novel "model augmentation" idea to overcome this issue.
Results: For example, on the standard CelebA benchmark our solutions improve accuracy by 11.8% and achieve over 90% attack accuracy for the first time. Our findings show a clear risk of sensitive-information leakage from deep learning models, and we urge serious consideration of the privacy implications. Our code, demos, and models are available at https://ngoc-nguyen-0.github.io/re-thinking_model_inversion_attacks/.

Model inversion (MI) attacks aim to infer and reconstruct private training data by abusing access to a model. MI attacks have raised concerns about the leaking of sensitive information (e.g. private face images used in training a face recognition system). Recently, several algorithms for MI have been proposed to improve the attack performance. In this work, we revisit MI, study two fundamental issues pertaining to all state-of-the-art (SOTA) MI algorithms, and propose solutions to these issues which lead to a significant boost in attack performance for all SOTA MI. In particular, our contributions are two-fold: 1) We analyze the optimization objective of SOTA MI algorithms, argue that the objective is sub-optimal for achieving MI, and propose an improved optimization objective that boosts attack performance significantly. 2) We analyze "MI overfitting", show that it would prevent reconstructed images from learning semantics of training data, and propose a novel "model augmentation" idea to overcome this issue. Our proposed solutions are simple and improve all SOTA MI attack accuracy significantly. E.g., in the standard CelebA benchmark, our solutions improve accuracy by 11.8% and achieve for the first time over 90% attack accuracy. Our findings demonstrate that there is a clear risk of leaking sensitive information from deep learning models. We urge serious consideration to be given to the privacy implications. Our code, demo, and models are available at https://ngoc-nguyen-0.github.io/re-thinking_model_inversion_attacks/

MetaMix: Towards Corruption-Robust Continual Learning With Temporally Self-Adaptive Data Transformation
Wang, ZhenyiandShen, LiandZhan, DonglinandSuo, QiulingandZhu, YanjunandDuan, TiehangandGao, Mingchen



Research question: How to evaluate and improve the corruption robustness of continual learning models so that they are trustworthy and robust in safety-critical scenarios.
Motivation: Existing continual learning models are highly vulnerable to various data corruptions, especially at test time.
Method: We propose MetaMix, a meta-learning framework of self-adaptive data augmentation that automatically transforms new-task data or memory data to tackle corruption robustness in continual learning.
Results: We construct continual-learning corruption datasets with different severity levels and run comprehensive experiments on both task- and class-continual learning, demonstrating that the method is more effective than the best existing baselines.

Continual Learning (CL) has achieved rapid progress in recent years. However, it is still largely unknown how to determine whether a CL model is trustworthy and how to foster its trustworthiness. This work focuses on evaluating and improving the robustness to corruptions of existing CL models. Our empirical evaluation results show that existing state-of-the-art (SOTA) CL models are particularly vulnerable to various data corruptions during testing. To make them trustworthy and robust to corruptions deployed in safety-critical scenarios, we propose a meta-learning framework of self-adaptive data augmentation to tackle the corruption robustness in CL. The proposed framework, MetaMix, learns to augment and mix data, automatically transforming the new task data or memory data. It directly optimizes the generalization performance against data corruptions during training. To evaluate the corruption robustness of our proposed approach, we construct several CL corruption datasets with different levels of severity. We perform comprehensive experiments on both task- and class-continual learning. Extensive experiments demonstrate the effectiveness of our proposed method compared to SOTA baselines.

DART: Diversify-Aggregate-Repeat Training Improves Generalization of Neural Networks
Jain, SamyakandAddepalli, SravantiandSahu, PawanKumarandDey, PriyamandBabu, R.Venkatesh



Research question: How to improve the generalization of neural networks for safe real-world deployment.
Motivation: Common training strategies such as data augmentation, ensembling, and model averaging help improve the generalization of neural networks.
Method: We first establish a simple yet strong generalization baseline that uses diverse augmentations within a training minibatch, and find that it learns a more balanced distribution of features. We then propose the Diversify-Aggregate-Repeat Training (DART) strategy, which first trains diverse models with different augmentations (or domains) to explore the loss basin, then aggregates their weights to combine their expertise and obtain improved generalization. We find that repeating the aggregation step throughout training improves the overall optimization trajectory and keeps the loss barriers between individual models low enough for their combination to generalize better.
Results: Besides improvements in in-domain generalization, we show state-of-the-art performance on the domain generalization benchmarks of the popular DomainBed framework. The method is generic and can easily be integrated with several base training algorithms for performance gains.

Generalization of Neural Networks is crucial for deploying them safely in the real world. Common training strategies to improve generalization involve the use of data augmentations, ensembling and model averaging. In this work, we first establish a surprisingly simple but strong benchmark for generalization which utilizes diverse augmentations within a training minibatch, and show that this can learn a more balanced distribution of features. Further, we propose Diversify-Aggregate-Repeat Training (DART) strategy that first trains diverse models using different augmentations (or domains) to explore the loss basin, and further Aggregates their weights to combine their expertise and obtain improved generalization. We find that Repeating the step of Aggregation throughout training improves the overall optimization trajectory and also ensures that the individual models have sufficiently low loss barrier to obtain improved generalization on combining them. We theoretically justify the proposed approach and show that it indeed generalizes better. In addition to improvements in In-Domain generalization, we demonstrate SOTA performance on the Domain Generalization benchmarks in the popular DomainBed framework as well. Our method is generic and can easily be integrated with several base training algorithms to achieve performance gains. Our code is available here: https://github.com/val-iisc/DART.
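The Aggregate step amounts to weight-space averaging of same-architecture models. A minimal sketch with parameters represented as dicts of plain arrays (an illustrative stand-in for real model state, not the released DART code):

```python
import numpy as np

def aggregate_weights(state_dicts):
    """Average parameter tensors of several same-architecture models
    (the Aggregate step); the Repeat step would copy the result back
    to every diverse model before continuing training."""
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0)
            for k in state_dicts[0]}
```

For two models whose weights are `[1, 2]` and `[3, 4]`, the aggregated weights are the element-wise mean `[2, 3]`.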

Finding Geometric Models by Clustering in the Consensus Space
Barath, DanielandRozumnyi, DenysandEichhardt, IvanandHajder, LeventeandMatas, Jiri



Research question: We propose a new algorithm for finding an unknown number of geometric models, e.g., homographies.
Motivation: The problem is formalized as progressively finding dominant model instances without forming crisp point-to-model assignments.
Method: Dominant instances are found via RANSAC-like sampling and a consolidation process driven by a model quality function that accounts for previously proposed instances; new instances are found by clustering in the consensus space. This formulation yields a simple, efficient iterative algorithm with state-of-the-art accuracy that runs in real time on a number of vision problems.
Results: The algorithm is at least two orders of magnitude faster than its competitors on two-view motion estimation, and using multiple geometric models improves accuracy in several applications, including pose estimation from multiple generalized homographies and trajectory estimation of fast-moving objects.

We propose a new algorithm for finding an unknown number of geometric models, e.g., homographies. The problem is formalized as finding dominant model instances progressively without forming crisp point-to-model assignments. Dominant instances are found via a RANSAC-like sampling and a consolidation process driven by a model quality function considering previously proposed instances. New ones are found by clustering in the consensus space. This new formulation leads to a simple iterative algorithm with state-of-the-art accuracy while running in real-time on a number of vision problems -- at least two orders of magnitude faster than the competitors on two-view motion estimation. Also, we propose a deterministic sampler reflecting the fact that real-world data tend to form spatially coherent structures. The sampler returns connected components in a progressively densified neighborhood-graph. We present a number of applications where the use of multiple geometric models improves accuracy. These include pose estimation from multiple generalized homographies; trajectory estimation of fast-moving objects; and we also propose a way of using multiple homographies in global SfM algorithms. Source code: https://github.com/danini/clustering-in-consensus-space.

DaFKD: Domain-Aware Federated Knowledge Distillation
Wang, HaozhaoandLi, YichenandXu, WenchaoandLi, RuixuanandZhan, YufengandZeng, Zhigang



Research question: Existing federated distillation methods treat all local models equally when handling statistically heterogeneous data from distributed clients, which degrades the performance of the aggregated model.
Motivation: To remedy the performance loss caused by ignoring the diversity across local models in existing federated distillation.
Method: We propose a novel domain-knowledge-aware federated distillation method (DaFKD) that treats each client's local data as a specific domain and designs a domain discriminator to discern each model's importance for a given distillation sample, thereby optimizing the ensemble of soft predictions from different models.
Results: Extensive experiments on various datasets and settings show that the method improves model accuracy by up to 6.02% over state-of-the-art baselines.

Federated Distillation (FD) has recently attracted increasing attention for its efficiency in aggregating multiple diverse local models trained from statistically heterogeneous data of distributed clients. Existing FD methods generally treat these models equally by merely computing the average of their output soft predictions for some given input distillation sample, which does not take the diversity across all local models into account, thus leading to degraded performance of the aggregated model, especially when some local models learn little knowledge about the sample. In this paper, we propose a new perspective that treats the local data in each client as a specific domain and design a novel domain knowledge aware federated distillation method, dubbed DaFKD, that can discern the importance of each model to the distillation sample, and thus is able to optimize the ensemble of soft predictions from diverse models. Specifically, we employ a domain discriminator for each client, which is trained to identify the correlation factor between the sample and the corresponding domain. Then, to facilitate the training of the domain discriminator while saving communication costs, we propose sharing its partial parameters with the classification model. Extensive experiments on various datasets and settings show that the proposed method can improve the model accuracy by up to 6.02% compared to state-of-the-art baselines.

Spectral Bayesian Uncertainty for Image Super-Resolution
Liu, TaoandCheng, JunandTan, Shan



Research question: How to quantify reconstruction uncertainty for image super-resolution (SR)?
Motivation: Existing SR uncertainty estimation methods focus mainly on pixel-wise uncertainty in the spatial domain, while frequency-domain SR uncertainty, which is highly relevant to image SR, is seldom explored.
Method: We propose a Dual-Domain Learning (DDL) framework which, combined with Bayesian approaches, accurately estimates spectral uncertainty, enabling a reliability assessment of high-frequency reasoning from the frequency-domain perspective.
Results: Extensive experiments under non-ideal conditions demonstrate the effectiveness of the proposed spectral uncertainty. We further propose a novel Spectral Uncertainty based Decoupled Frequency (SUDF) training scheme for perceptual SR; results show that SUDF clearly boosts the perceptual quality of SR results without sacrificing much pixel accuracy.

Recently deep learning techniques have significantly advanced image super-resolution (SR). Due to the black-box nature, quantifying reconstruction uncertainty is crucial when employing these deep SR networks. Previous approaches for SR uncertainty estimation mostly focus on capturing pixel-wise uncertainty in the spatial domain. SR uncertainty in the frequency domain which is highly related to image SR is seldom explored. In this paper, we propose to quantify spectral Bayesian uncertainty in image SR. To achieve this, a Dual-Domain Learning (DDL) framework is first proposed. Combined with Bayesian approaches, the DDL model is able to estimate spectral uncertainty accurately, enabling a reliability assessment for high frequencies reasoning from the frequency domain perspective. Extensive experiments under non-ideal premises are conducted and demonstrate the effectiveness of the proposed spectral uncertainty. Furthermore, we propose a novel Spectral Uncertainty based Decoupled Frequency (SUDF) training scheme for perceptual SR. Experimental results show the proposed SUDF can evidently boost perceptual quality of SR results without sacrificing much pixel accuracy.

BiasBed - Rigorous Texture Bias Evaluation
Kalischek, NikolaiandDaudt, RodrigoCayeandPeters, TorbenandFurrer, ReinhardandWegner, JanD.andSchindler, Konrad



Research question: The well-documented texture bias of modern convolutional neural networks has spawned many algorithms that emphasize shape cues to support generalization to new domains; yet common datasets, benchmarks, and general model selection strategies are missing, and there is no agreed rigorous evaluation protocol.
Motivation: We investigate the difficulties and limitations of training networks with reduced texture bias and point out that proper evaluation and meaningful comparison between methods are not trivial.
Method: We introduce BiasBed, a testbed for texture- and style-biased training that includes multiple datasets and a range of existing algorithms. It comes with an extensive evaluation protocol, including rigorous hypothesis testing to gauge the significance of results despite the training instability of some style-bias methods.
Results: Our extensive experiments reveal the need for careful, statistically grounded evaluation protocols for style bias (and beyond); for example, we find that some algorithms proposed in the literature do not significantly mitigate the impact of style bias at all. By releasing BiasBed, we hope to foster a common understanding of consistent and meaningful comparison, and thus faster progress toward learning methods free of texture bias. Code is available at https://github.com/D1noFuzi/BiasBed.

The well-documented presence of texture bias in modern convolutional neural networks has led to a plethora of algorithms that promote an emphasis on shape cues, often to support generalization to new domains. Yet, common datasets, benchmarks and general model selection strategies are missing, and there is no agreed, rigorous evaluation protocol. In this paper, we investigate difficulties and limitations when training networks with reduced texture bias. In particular, we also show that proper evaluation and meaningful comparisons between methods are not trivial. We introduce BiasBed, a testbed for texture- and style-biased training, including multiple datasets and a range of existing algorithms. It comes with an extensive evaluation protocol that includes rigorous hypothesis testing to gauge the significance of the results, despite the considerable training instability of some style bias methods. Our extensive experiments, shed new light on the need for careful, statistically founded evaluation protocols for style bias (and beyond). E.g., we find that some algorithms proposed in the literature do not significantly mitigate the impact of style bias at all. With the release of BiasBed, we hope to foster a common understanding of consistent and meaningful comparisons, and consequently faster progress towards learning methods free of texture bias. Code is available at https://github.com/D1noFuzi/BiasBed.

Explicit Boundary Guided Semi-Push-Pull Contrastive Learning for Supervised Anomaly Detection
Yao, XinchengandLi, RuoqiandZhang, JingandSun, JunandZhang, Chongyang



Research question: Most anomaly detection models are learned in an unsupervised way from normal samples only, which can lead to ambiguous decision boundaries and insufficient discriminability.
Motivation: A few anomaly samples are often available in real-world applications, so the valuable knowledge of known anomalies should be exploited effectively; however, training with only a few known anomalies may bias the model toward them and fail to generalize to unseen anomalies.
Method: We propose a novel explicit boundary guided semi-push-pull contrastive learning mechanism to detect both seen and unseen anomalies, built on two core designs: first, find an explicit and compact separating boundary as guidance for further feature learning; second, a boundary-guided semi-push-pull loss that pulls normal features together while pushing abnormal features away from the separating boundary beyond a certain margin region.
Results: Experiments show that the method forms a more explicit and discriminative decision boundary, distinguishing both known and unseen anomalies from normal samples more effectively.

Most anomaly detection (AD) models are learned using only normal samples in an unsupervised way, which may result in ambiguous decision boundary and insufficient discriminability. In fact, a few anomaly samples are often available in real-world applications, the valuable knowledge of known anomalies should also be effectively exploited. However, utilizing a few known anomalies during training may cause another issue that the model may be biased by those known anomalies and fail to generalize to unseen anomalies. In this paper, we tackle supervised anomaly detection, i.e., we learn AD models using a few available anomalies with the objective to detect both the seen and unseen anomalies. We propose a novel explicit boundary guided semi-push-pull contrastive learning mechanism, which can enhance model's discriminability while mitigating the bias issue. Our approach is based on two core designs: First, we find an explicit and compact separating boundary as the guidance for further feature learning. As the boundary only relies on the normal feature distribution, the bias problem caused by a few known anomalies can be alleviated. Second, a boundary guided semi-push-pull loss is developed to only pull the normal features together while pushing the abnormal features apart from the separating boundary beyond a certain margin region. In this way, our model can form a more explicit and discriminative decision boundary to distinguish known and also unseen anomalies from normal samples more effectively. Code will be available at https://github.com/xcyao00/BGAD.
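The semi-push-pull idea can be illustrated with 1-D scores standing in for normality scores: normal scores are pulled up to an explicit boundary b, while abnormal scores are pushed below b minus a margin. This hinge-style formulation is a guess at the general shape of such a loss, not the paper's exact definition.

```python
import numpy as np

def semi_push_pull_loss(normal_scores, abnormal_scores, boundary, margin):
    """Hinge-style sketch: pull normal scores up to the boundary b, and
    push abnormal scores below b - margin; only the region beyond the
    margin is free of abnormal-side loss, hence 'semi' push-pull."""
    pull = np.maximum(0.0, boundary - np.asarray(normal_scores)).mean()
    push = np.maximum(0.0,
                      np.asarray(abnormal_scores) - (boundary - margin)).mean()
    return pull + push
```

Normal scores already above the boundary and abnormal scores already beyond the margin incur zero loss; scores inside the margin region are penalized.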

Style Projected Clustering for Domain Generalized Semantic Segmentation
Huang, WeiandChen, ChangandLi, YongandLi, JiachengandLi, ChengandSong, FenglongandYan, YouliangandXiong, Zhiwei



Research question: Existing semantic segmentation methods improve generalization by normalizing diverse images into a canonical feature space, which inevitably weakens the representation.
Motivation: In contrast to existing methods, we exploit the differences between images to build a better representation space, in which distinct style features are extracted and stored as the bases of representation.
Method: We realize style projection as a weighted combination of the stored bases, with similarity distances adopted as the weighting factors. Following the same concept, we extend this process to the decision part of the model to promote the generalization of semantic prediction.
Results: Comprehensive experiments show the advantage of the proposed method over the state of the art, with up to a 3.6% average mIoU improvement on unseen scenarios.

Existing semantic segmentation methods improve generalization capability, by regularizing various images to a canonical feature space. While this process contributes to generalization, it weakens the representation inevitably. In contrast to existing methods, we instead utilize the difference between images to build a better representation space, where the distinct style features are extracted and stored as the bases of representation. Then, the generalization to unseen image styles is achieved by projecting features to this known space. Specifically, we realize the style projection as a weighted combination of stored bases, where the similarity distances are adopted as the weighting factors. Based on the same concept, we extend this process to the decision part of model and promote the generalization of semantic prediction. By measuring the similarity distances to semantic bases (i.e., prototypes), we replace the common deterministic prediction with semantic clustering. Comprehensive experiments demonstrate the advantage of proposed method to the state of the art, up to 3.6% mIoU improvement in average on unseen scenarios.
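A toy sketch of style projection as a similarity-weighted combination of stored style bases; the cosine similarity and softmax weighting below are illustrative choices standing in for the paper's similarity distances, not necessarily its exact formulation.

```python
import numpy as np

def style_project(feature, bases):
    """Express a (possibly unseen-style) feature in the known style
    space: similarities to each stored basis act as the weights of a
    weighted combination of the bases."""
    f = feature / np.linalg.norm(feature)
    B = bases / np.linalg.norm(bases, axis=1, keepdims=True)
    sims = B @ f                              # similarity to each basis
    w = np.exp(sims) / np.exp(sims).sum()     # softmax weighting
    return w @ bases
```

A feature aligned with one basis yields a projection dominated by that basis, so unseen styles are always expressed inside the stored space.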

DIP: Dual Incongruity Perceiving Network for Sarcasm Detection
Wen, ChangsongandJia, GuoliandYang, Jufeng



Research question: This paper studies the task of multi-modal sarcasm detection, given the popularity and complementarity of image and text data.
Motivation: Unlike other multi-modal tasks, sarcastic data carry an intrinsic incongruity between image and text, as demonstrated in psychological theories.
Method: We propose a Dual Incongruity Perceiving (DIP) network with two branches that mine sarcastic information at the factual and affective levels. For the factual aspect, we introduce a channel-wise reweighting strategy to obtain semantically discriminative embeddings and use a Gaussian distribution to model the uncertain correlation caused by incongruity. For the affective aspect, we use siamese layers with shared parameters to learn cross-modal sentiment information.
Results: Extensive experiments show that the proposed method outperforms state-of-the-art approaches. Our code is released on GitHub.

Sarcasm indicates the literal meaning is contrary to the real attitude. Considering the popularity and complementarity of image-text data, we investigate the task of multi-modal sarcasm detection. Different from other multi-modal tasks, for the sarcastic data, there exists intrinsic incongruity between a pair of image and text as demonstrated in psychological theories. To tackle this issue, we propose a Dual Incongruity Perceiving (DIP) network consisting of two branches to mine the sarcastic information from factual and affective levels. For the factual aspect, we introduce a channel-wise reweighting strategy to obtain semantically discriminative embeddings, and leverage gaussian distribution to model the uncertain correlation caused by the incongruity. The distribution is generated from the latest data stored in the memory bank, which can adaptively model the difference of semantic similarity between sarcastic and non-sarcastic data. For the affective aspect, we utilize siamese layers with shared parameters to learn cross-modal sentiment information. Furthermore, we use the polarity value to construct a relation graph for the mini-batch, which forms the continuous contrastive loss to acquire affective embeddings. Extensive experiments demonstrate that our proposed method performs favorably against state-of-the-art approaches. Our code is released on https://github.com/downdric/MSD.

PA\&DA: Jointly Sampling Path and Data for Consistent NAS
Lu, ShunandHu, YuandYang, LongxingandSun, ZihaoandMei, JilinandTan, JianchaoandSong, Chengru



Research question: When training a supernet, existing one-shot NAS methods suffer from different gradient descent directions caused by weight sharing, and the large gradient variance during training degrades the supernet's ranking consistency.
Motivation: To address this, we propose to explicitly minimize the gradient variance of supernet training by optimizing the sampling distributions of PAth and DAta (PA&DA).
Method: We theoretically derive the relationship between the gradient variance and the sampling distributions, finding that the optimal sampling probability is proportional to the normalized gradient norm of paths and training data. We therefore use the normalized gradient norm as the importance indicator for paths and data and adopt an importance sampling strategy during supernet training.
Results: Compared with other improved approaches across various search spaces, the method delivers more reliable ranking performance and higher accuracy of searched architectures, demonstrating its effectiveness.

Based on the weight-sharing mechanism, one-shot NAS methods train a supernet and then inherit the pre-trained weights to evaluate sub-models, largely reducing the search cost. However, several works have pointed out that the shared weights suffer from different gradient descent directions during training. And we further find that large gradient variance occurs during supernet training, which degrades the supernet ranking consistency. To mitigate this issue, we propose to explicitly minimize the gradient variance of the supernet training by jointly optimizing the sampling distributions of PAth and DAta (PA&DA). We theoretically derive the relationship between the gradient variance and the sampling distributions, and reveal that the optimal sampling probability is proportional to the normalized gradient norm of path and training data. Hence, we use the normalized gradient norm as the importance indicator for path and training data, and adopt an importance sampling strategy for the supernet training. Our method only requires negligible computation cost for optimizing the sampling distributions of path and data, but achieves lower gradient variance during supernet training and better generalization performance for the supernet, resulting in a more consistent NAS. We conduct comprehensive comparisons with other improved approaches in various search spaces. Results show that our method surpasses others with more reliable ranking performance and higher accuracy of searched architectures, showing the effectiveness of our method. Code is available at https://github.com/ShunLu91/PA-DA.
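The importance-sampling rule the abstract derives, a sampling probability proportional to each candidate's normalized gradient norm, can be sketched directly. This is a toy illustration of that rule, not the released PA&DA implementation; the function names are ours.

```python
import numpy as np

def sampling_probs(grad_norms):
    """Sampling probability proportional to the normalized gradient
    norm of each candidate (paths or training samples alike)."""
    g = np.asarray(grad_norms, dtype=float)
    return g / g.sum()

def sample_paths(grad_norms, size, seed=0):
    """Draw path/data indices under the importance distribution."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(grad_norms), size=size, p=sampling_probs(grad_norms))
```

Candidates with larger gradient norms are sampled proportionally more often, which is what reduces the variance of the stochastic gradient.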

Bias Mimicking: A Simple Sampling Approach for Bias Mitigation
Qraitem, MaanandSaenko, KateandPlummer, BryanA.



Research question: Visual recognition datasets frequently underrepresent bias groups (e.g., female) within class labels, which can lead models to learn spurious correlations between class labels and bias groups such as age, gender, or race.
Motivation: Current remedies require significant architectural changes or additional loss functions with more hyperparameter tuning; data-sampling baselines, though simple and hyperparameter-free, have clear shortcomings.
Method: We propose a new class-conditioned sampling method, Bias Mimicking, based on the observation that if the bias distribution of a class c is mimicked across every class c' != c, then Y and B are statistically independent. Through a novel training procedure, BM ensures that the model is exposed to the entire distribution each epoch without repeating samples.
Results: Experiments show that BM improves the accuracy of sampling methods on underrepresented groups by 3% over four benchmarks while maintaining, and sometimes exceeding, the performance of non-sampling methods.

Prior work has shown that Visual Recognition datasets frequently underrepresent bias groups B (e.g. Female) within class labels Y (e.g. Programmers). This dataset bias can lead to models that learn spurious correlations between class labels and bias groups such as age, gender, or race. Most recent methods that address this problem require significant architectural changes or additional loss functions requiring more hyper-parameter tuning. Alternatively, data sampling baselines from the class imbalance literature (e.g., Undersampling, Upweighting), which can often be implemented in a single line of code and often have no hyperparameters, offer a cheaper and more efficient solution. However, these methods suffer from significant shortcomings. For example, Undersampling drops a significant part of the input distribution per epoch while Oversampling repeats samples, causing overfitting. To address these shortcomings, we introduce a new class-conditioned sampling method: Bias Mimicking. The method is based on the observation that if a class c bias distribution, i.e., P_D(B|Y=c) is mimicked across every c' != c, then Y and B are statistically independent. Using this notion, BM, through a novel training procedure, ensures that the model is exposed to the entire distribution per epoch without repeating samples. Consequently, Bias Mimicking improves underrepresented groups' accuracy of sampling methods by 3% over four benchmarks while maintaining and sometimes improving performance over non-sampling methods. Code: https://github.com/mqraitem/Bias-Mimicking
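A toy sketch of the mimicking step: for a chosen class c, every other class c' is subsampled so that its empirical bias distribution matches P_D(B|Y=c). The function name and the exact subsampling bookkeeping are illustrative assumptions; the paper's training procedure additionally rotates the mimicked class so the full distribution is seen each epoch.

```python
import numpy as np

def mimic_bias(labels, biases, target_class, seed=0):
    """Subsample every class c' != target_class so the empirical bias
    distribution P(B | Y=c') mimics P(B | Y=target_class); returns the
    indices of retained examples (a sketch, not the released code)."""
    rng = np.random.default_rng(seed)
    labels, biases = np.asarray(labels), np.asarray(biases)
    tgt = np.flatnonzero(labels == target_class)
    bias_vals, counts = np.unique(biases[tgt], return_counts=True)
    dist = counts / counts.sum()              # P(B | Y = target_class)
    keep = list(tgt)                          # target class kept whole
    for c in np.unique(labels):
        if c == target_class:
            continue
        idx = np.flatnonzero(labels == c)
        # largest subsample size every bias group can still supply
        n = int(min(np.sum(biases[idx] == b) / p
                    for b, p in zip(bias_vals, dist) if p > 0) + 1e-9)
        for b, p in zip(bias_vals, dist):
            grp = idx[biases[idx] == b]
            take = min(len(grp), int(round(n * p)))
            keep.extend(rng.choice(grp, size=take, replace=False))
    return np.sort(np.array(keep))
```

After mimicking, each retained class shares the target class's bias distribution, making class label and bias group statistically independent in the subsample.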

Efficient Loss Function by Minimizing the Detrimental Effect of Floating-Point Errors on Gradient-Based Attacks
Yu, YunruiandXu, Cheng-Zhong



Research question: Existing deep networks can be deceived by attackers adding human-imperceptible perturbations to the input data, revealing the vulnerability and weak robustness of current deep neural networks.
Motivation: Many attack techniques have been proposed to evaluate model robustness, but gradient-based attacks fail by severely overestimating it. This paper identifies the relative error in computed gradients caused by floating-point errors (including underflow and rounding errors) as a fundamental reason why gradient-based attacks cannot accurately assess model robustness.
Method: Although the relative error in the gradients is hard to eliminate, its effect on gradient-based attacks can be controlled. We therefore propose an efficient loss function that minimizes the detrimental impact of floating-point errors on the attacks.
Results: Experiments show that it is more efficient and reliable than other loss functions when examined across a wide range of defence mechanisms.

Attackers can deceive neural networks by adding human imperceptive perturbations to their input data; this reveals the vulnerability and weak robustness of current deep-learning networks. Many attack techniques have been proposed to evaluate the model's robustness. Gradient-based attacks suffer from severely overestimating the robustness. This paper identifies that the relative error in calculated gradients caused by floating-point errors, including floating-point underflow and rounding errors, is a fundamental reason why gradient-based attacks fail to accurately assess the model's robustness. Although it is hard to eliminate the relative error in the gradients, we can control its effect on the gradient-based attacks. Correspondingly, we propose an efficient loss function by minimizing the detrimental impact of the floating-point errors on the attacks. Experimental results show that it is more efficient and reliable than other loss functions when examined across a wide range of defence mechanisms.
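The underflow phenomenon the paper identifies can be reproduced in a few lines: the cross-entropy gradient with respect to the logits is softmax(z) - onehot(y), and for a confidently classified input the non-target softmax mass underflows to exact zero in float32, leaving the attacker no gradient signal. This is a minimal demonstration of the failure mode only, unrelated to the paper's proposed loss.

```python
import numpy as np

def ce_grad_wrt_logits(logits, target):
    """Cross-entropy gradient w.r.t. logits: softmax(z) - onehot(target).
    Computed in float32 to expose the underflow of tiny probabilities."""
    z = np.asarray(logits, dtype=np.float32)
    p = np.exp(z - z.max())      # numerically stable softmax
    p /= p.sum()
    onehot = np.zeros_like(p)
    onehot[target] = 1.0
    return p - onehot
```

With a logit gap of 200, exp(-200) is below the smallest float32 subnormal, so the probability of the non-target class and hence the entire gradient collapses to exact zeros, and a gradient-based attack stalls while the model only appears robust.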

Revisiting Prototypical Network for Cross Domain Few-Shot Learning
Zhou, FeiandWang, PengandZhang, LeiandWei, WeiandZhang, Yanning



Research question: Address the dramatic performance drop of prototypical networks when facing new domains.
Motivation: The problem stems from the simplicity-bias pitfall of neural networks: the network tends to focus on shortcut features such as color and shape that distinguish only a few classes and fail to generalize across domains.
Method: We propose a Local-global Distillation Prototypical Network (LDP-net) that builds two branches to classify a query image and its random local crops, then conducts knowledge distillation between the two branches to enforce consistency of their class affiliation.
Results: Experiments show that the approach effectively improves generalization and achieves state-of-the-art results on eight cross-domain few-shot classification benchmarks.

Prototypical Network is a popular few-shot solver that aims at establishing a feature metric generalizable to novel few-shot classification (FSC) tasks using deep neural networks. However, its performance drops dramatically when generalizing to the FSC tasks in new domains. In this study, we revisit this problem and argue that the devil lies in the simplicity bias pitfall in neural networks. In specific, the network tends to focus on some biased shortcut features (e.g., color, shape, etc.) that are exclusively sufficient to distinguish very few classes in the meta-training tasks within a pre-defined domain, but fail to generalize across domains as some desirable semantic features. To mitigate this problem, we propose a Local-global Distillation Prototypical Network (LDP-net). Different from the standard Prototypical Network, we establish a two-branch network to classify the query image and its random local crops, respectively. Then, knowledge distillation is conducted among these two branches to enforce their class affiliation consistency. The rationale behind is that since such global-local semantic relationship is expected to hold regardless of data domains, the local-global distillation is beneficial to exploit some cross-domain transferable semantic features for feature metric establishment. Moreover, such local-global semantic consistency is further enforced among different images of the same class to reduce the intra-class semantic variation of the resultant feature. In addition, we propose to update the local branch as Exponential Moving Average (EMA) over training episodes, which makes it possible to better distill cross-episode knowledge and further enhance the generalization performance. Experiments on eight cross-domain FSC benchmarks empirically clarify our argument and show the state-of-the-art results of LDP-net. Code is available in https://github.com/NWPUZhoufei/LDP-Net

Perception and Semantic Aware Regularization for Sequential Confidence Calibration
Peng, ZhenghuaandLuo, YuandChen, TianshuiandXu, KekeandHuang, Shuangping



Research problem: Deep sequence recognition (DSR) models are attracting increasing attention across applications, but most models use only the target sequence as supervision without considering other related sequences, leading to over-confident predictions.
Motivation: Current DSR models mitigate over-confidence by regularizing labels, smoothing each token equally and independently. However, they ignore token/sequence correlations that could provide more effective information for training regularization, leading to sub-optimal performance.
Method: We propose a perception- and semantic-aware sequence regularization framework that exploits tokens/sequences with high perceptual and semantic correlation to the target sequence for regularization. Specifically, we introduce a semantic context-free recognition model and a language model to acquire similar sequences with high perceptual similarity and semantic correlation, respectively. Moreover, since sample difficulty varies, so does the degree of over-confidence; we therefore design an adaptive calibration intensity module that computes a difficulty score for each sample to obtain finer-grained regularization.
Results: Extensive experiments on canonical sequence recognition tasks such as scene text and speech recognition show that our method achieves new state-of-the-art results.

Deep sequence recognition (DSR) models receive increasing attention due to their successful application in various scenarios. Most DSR models use merely the target sequences as supervision without considering other related sequences, leading to over-confidence in their predictions. DSR models trained with label smoothing regularize labels by equally and independently smoothing each token, reallocating a small value to other tokens to mitigate overconfidence. However, they do not consider token/sequence correlations that may provide more effective information to regularize training, and thus lead to sub-optimal performance. In this work, we find that tokens/sequences with high perceptual and semantic correlations with the target ones contain more correlated and effective information and thus facilitate more effective regularization. To this end, we propose a Perception and Semantic aware Sequence Regularization framework, which explores perceptively and semantically correlated tokens/sequences as regularization. Specifically, we introduce a semantic context-free recognition model and a language model to acquire similar sequences with high perceptual similarity and semantic correlation, respectively. Moreover, the degree of over-confidence varies across samples according to their difficulties. Thus, we further design an adaptive calibration intensity module to compute a difficulty score for each sample to obtain finer-grained regularization. Extensive experiments on canonical sequence recognition tasks, including scene text and speech recognition, demonstrate that our method sets novel state-of-the-art results. Code is available at https://github.com/husterpzh/PSSR.
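For reference, the uniform, token-independent label smoothing that this work improves upon can be sketched as follows (a standard technique; the class count and smoothing factor are illustrative):

```python
# Uniform label smoothing: keep (1 - eps) on the true class and spread eps
# equally over the other classes, independently per token. This is the
# baseline regularizer; the paper instead reallocates mass toward
# perceptually and semantically correlated tokens/sequences.

def smooth_label(true_idx, num_classes, eps=0.1):
    off = eps / (num_classes - 1)
    return [1.0 - eps if i == true_idx else off
            for i in range(num_classes)]

target = smooth_label(true_idx=2, num_classes=5)
print(target)  # [0.025, 0.025, 0.9, 0.025, 0.025]
```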

A Practical Upper Bound for the Worst-Case Attribution Deviations
Wang, FanandKong, AdamsWai-Kin



Research problem: Existing interpretability methods for deep models are vulnerable to attacks that generate images with dramatically different attributions but identical classification results.
Motivation: To improve model robustness against such attacks, the maximum deviation of attributions needs to be quantified.
Method: Via a constrained optimization problem, an upper-bound computation based on Euclidean distance and cosine similarity is proposed to measure the largest attribution dissimilarity when any noise within a certain region is added while the classification result remains unchanged.
Results: Experiments validate the proposed upper bounds on various datasets and two types of attacks (the PGD attack and the IFIA attribution attack). Over 10 million attacks show that the proposed upper bounds effectively quantify model robustness based on worst-case attribution dissimilarity.

Model attribution is a critical component of deep neural networks (DNNs) for its interpretability to complex models. Recent studies have brought attention to the security of attribution methods, as they are vulnerable to attribution attacks that generate similar images with dramatically different attributions. Existing works have been investigating empirically improving the robustness of DNNs against those attacks; however, none of them explicitly quantifies the actual deviations of attributions. In this work, for the first time, a constrained optimization problem is formulated to derive an upper bound that measures the largest dissimilarity of attributions after the samples are perturbed by any noises within a certain region while the classification results remain the same. Based on the formulation, different practical approaches are introduced to bound the attributions from above using Euclidean distance and cosine similarity under both L2 and Linf-norm perturbation constraints. The bounds developed by our theoretical study are validated on various datasets and two different types of attacks (PGD attack and IFIA attribution attack). Over 10 million attacks in the experiments indicate that the proposed upper bounds effectively quantify the robustness of models based on the worst-case attribution dissimilarities.

Exploring and Exploiting Uncertainty for Incomplete Multi-View Classification
Xie, MengyaoandHan, ZongboandZhang, ChangqingandBai, YichenandHu, Qinghua



Research problem: How to effectively classify incomplete multi-view data.
Motivation: Arbitrary missing views are ubiquitous in real-world applications, and existing incomplete multi-view methods struggle to obtain trustworthy predictions due to the high uncertainty of missing views.
Method: An Uncertainty-induced Incomplete Multi-View Data Classification (UIMC) model is proposed, which characterizes the uncertainty of missing views by constructing a distribution and sampling it multiple times, and adaptively exploits the samples according to their quality. Specifically, each missing view is modeled as a distribution conditioned on the available views to introduce uncertainty, and an evidence-based fusion strategy is employed to guarantee trustworthy fusion of the imputed views.
Results: Extensive experiments on multiple benchmark datasets show state-of-the-art results in both performance and trustworthiness.

Classifying incomplete multi-view data is inevitable since arbitrary view missing widely exists in real-world applications. Although great progress has been achieved, existing incomplete multi-view methods still find it difficult to obtain a trustworthy prediction due to the relatively high uncertainty of missing views. First, the missing view is of high uncertainty, and thus it is not reasonable to provide a single deterministic imputation. Second, the quality of the imputed data itself is of high uncertainty. To explore and exploit the uncertainty, we propose an Uncertainty-induced Incomplete Multi-View Data Classification (UIMC) model to classify the incomplete multi-view data under a stable and reliable framework. We construct a distribution and sample multiple times to characterize the uncertainty of missing views, and adaptively utilize them according to the sampling quality. Accordingly, the proposed method realizes more perceivable imputation and controllable fusion. Specifically, we model each missing data point with a distribution conditioned on the available views, thus introducing uncertainty. Then an evidence-based fusion strategy is employed to guarantee the trustworthy integration of the imputed views. Extensive experiments are conducted on multiple benchmark datasets, and our method establishes state-of-the-art results in terms of both performance and trustworthiness.

Learning Transformations To Reduce the Geometric Shift in Object Detection
Vidit, ViditandEngilberge, MartinandSalzmann, Mathieu



Research problem: Modern object detectors degrade when the test distribution differs from the training one.
Motivation: Most methods addressing this focus on object-appearance changes caused by, e.g., different illumination conditions or the gap between synthetic and real images; this paper instead tackles geometric shifts arising from variations in the image capture process or from environmental constraints that change the apparent geometry of the content itself.
Method: A self-training approach is introduced that learns a set of geometric transformations to minimize these shifts, without using any labeled data in the new domain or any information about the cameras.
Results: The method is evaluated on two different shifts, a camera field-of-view (FoV) change and a viewpoint change; the results show that learning geometric transformations helps detectors perform better in the target domain.

The performance of modern object detectors drops when the test distribution differs from the training one. Most of the methods that address this focus on object appearance changes caused by, e.g., different illumination conditions, or gaps between synthetic and real images. Here, by contrast, we tackle geometric shifts emerging from variations in the image capture process, or due to the constraints of the environment causing differences in the apparent geometry of the content itself. We introduce a self-training approach that learns a set of geometric transformations to minimize these shifts without leveraging any labeled data in the new domain, nor any information about the cameras. We evaluate our method on two different shifts, i.e., a camera's field of view (FoV) change and a viewpoint change. Our results evidence that learning geometric transformations helps detectors to perform better in the target domains.

Revisiting Rotation Averaging: Uncertainties and Robust Losses
Zhang, GanlinandLarsson, ViktorandBarath, Daniel



Research problem: This paper revisits the rotation averaging problem as applied in global Structure-from-Motion (SfM) pipelines.
Motivation: The main problem of current methods is that the cost function they minimize is only weakly connected to the input data, via the estimated epipolar geometries.
Method: We propose to better model the underlying noise distributions by directly propagating the uncertainty of point correspondences into rotation averaging. These uncertainties are obtained from the Jacobians of two-view refinements. We also explore integrating a variant of the MAGSAC loss into the rotation averaging problem instead of the classical robust losses used in current frameworks.
Results: The proposed method outperforms baselines in terms of accuracy on large-scale public benchmarks.

In this paper, we revisit the rotation averaging problem applied in global Structure-from-Motion pipelines. We argue that the main problem of current methods is the minimized cost function that is only weakly connected with the input data via the estimated epipolar geometries. We propose to better model the underlying noise distributions by directly propagating the uncertainty from the point correspondences into the rotation averaging. Such uncertainties are obtained for free by considering the Jacobians of two-view refinements. Moreover, we explore integrating a variant of the MAGSAC loss into the rotation averaging problem, instead of using classical robust losses employed in current frameworks. The proposed method leads to results superior to baselines, in terms of accuracy, on large-scale public benchmarks. The code is public. https://github.com/zhangganlin/GlobalSfMpy
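The intuition behind propagating correspondence uncertainty into the averaging can be illustrated with scalar inverse-variance weighting (a toy analogy, not the paper's SO(3) formulation; the angles and variances below are assumptions):

```python
# Intuition sketch: measurements with lower uncertainty should pull the
# estimate harder. Inverse-variance weighting of two (toy, scalar) relative
# rotation estimates shows why per-pair uncertainties matter in averaging.
# Angles and variances are illustrative assumptions.

def weighted_average(values, variances):
    """Inverse-variance weighted mean of scalar measurements."""
    weights = [1.0 / v for v in variances]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

angles = [10.0, 14.0]   # two noisy estimates of the same angle (degrees)
variances = [1.0, 4.0]  # the second estimate is far less certain

est = weighted_average(angles, variances)
print(est)  # pulled toward the more certain estimate
```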

Model Barrier: A Compact Un-Transferable Isolation Domain for Model Intellectual Property Protection
Wang, LianyuandWang, MengandZhang, DaoqiangandFu, Huazhu



Research problem: How to effectively protect the intellectual property of pre-trained models and prevent their use on unauthorized domains.
Motivation: To mobilize the enthusiasm of model owners and creators, model IP protection is needed for the scientific and technological achievements produced by human intellectual labor and computation cost.
Method: A novel compact un-transferable isolation domain (CUTI-domain) is proposed, acting as a barrier that blocks illegal transfer of models from the authorized domain to unauthorized domains.
Results: Comprehensive experiments on four digit datasets, CIFAR10 & STL10, and the VisDA-2017 dataset show that the CUTI-domain can be easily implemented with different backbones as a plug-and-play module and provides an efficient solution for model IP protection.

Well-trained models are scientific and technological achievements produced by human intellectual labor and computation cost, so model intellectual property (IP) protection, which refers to preventing the usage of a well-trained model on an unauthorized domain, deserves further attention in order to effectively mobilize the enthusiasm of model owners and creators. To this end, we propose a novel compact un-transferable isolation domain (CUTI-domain), which acts as a model barrier to block illegal transferring from the authorized domain to the unauthorized domain. Specifically, the CUTI-domain blocks cross-domain transferring by highlighting the private style features of the authorized domain, leading to recognition failure on unauthorized domains that contain irrelevant private style features. Furthermore, depending on whether the unauthorized domain is known or not, two solutions for using the CUTI-domain are provided: a target-specified CUTI-domain and a target-free CUTI-domain. Comprehensive experimental results on four digit datasets, CIFAR10 & STL10, and the VisDA-2017 dataset demonstrate that our CUTI-domain can be easily implemented with different backbones as a plug-and-play module and provides an efficient solution for model IP protection.

Bootstrap Your Own Prior: Towards Distribution-Agnostic Novel Class Discovery
Yang, MuliandWang, LianchengandDeng, ChengandZhang, Hanwang



Research problem: This paper addresses discovering unknown classes without any annotation by exploiting knowledge transferred from known classes.
Motivation: Existing methods assume a uniform distribution over novel classes, ignoring the imbalanced nature of real-world data.
Method: A new challenging task, distribution-agnostic NCD, is proposed, which allows data from arbitrary unknown class distributions, together with a new method, "Bootstrapping Your Own Prior (BYOP)", which iteratively estimates the class prior.
Results: Experiments show that existing methods perform poorly under imbalanced class distributions, while BYOP obtains more accurate pseudo-labels by encouraging sharper predictions for less-confident samples and performs well across various distribution scenarios.

Novel Class Discovery (NCD) aims to discover unknown classes without any annotation, by exploiting the transferable knowledge already learned from a base set of known classes. Existing works hold an impractical assumption that the novel class distribution prior is uniform, yet neglect the imbalanced nature of real-world data. In this paper, we relax this assumption by proposing a new challenging task: distribution-agnostic NCD, which allows data drawn from arbitrary unknown class distributions and thus renders existing methods useless or even harmful. We tackle this challenge by proposing a new method, dubbed "Bootstrapping Your Own Prior (BYOP)", which iteratively estimates the class prior based on the model prediction itself. At each iteration, we devise a dynamic temperature technique that better estimates the class prior by encouraging sharper predictions for less-confident samples. Thus, BYOP obtains more accurate pseudo-labels for the novel samples, which are beneficial for the next training iteration. Extensive experiments show that existing methods suffer from imbalanced class distributions, while BYOP outperforms them by clear margins, demonstrating its effectiveness across various distribution scenarios.
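The idea of bootstrapping a class prior with a dynamic temperature can be sketched as follows (an illustrative toy, not the authors' code; the temperature rule and the probabilities are assumptions):

```python
# Toy sketch of prior bootstrapping: sharpen each predicted distribution
# with a per-sample temperature (less confident -> sharper), then average
# the sharpened predictions into a class-prior estimate. The temperature
# rule (T = max probability) and the numbers are illustrative assumptions.

def sharpen(probs, temperature):
    """Raise probabilities to the power 1/T and renormalize (T < 1 sharpens)."""
    powered = [p ** (1.0 / temperature) for p in probs]
    z = sum(powered)
    return [x / z for x in powered]

def estimate_prior(preds):
    prior = [0.0] * len(preds[0])
    for p in preds:
        t = max(p)  # dynamic temperature: low confidence -> small T -> sharper
        sp = sharpen(p, t)
        prior = [a + b / len(preds) for a, b in zip(prior, sp)]
    return prior

preds = [[0.7, 0.2, 0.1],    # confident sample, sharpened mildly
         [0.4, 0.35, 0.25]]  # less confident sample, sharpened harder
prior = estimate_prior(preds)
print(prior)  # a non-uniform prior estimate summing to 1
```

In an actual training loop, this estimated prior would then reweight the pseudo-label assignment for the next iteration.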

MOT: Masked Optimal Transport for Partial Domain Adaptation
Luo, You-WeiandRen, Chuan-Xian



Research problem: How to apply optimal transport (OT) models more effectively in real-world scenarios, especially in challenging settings such as partial domain adaptation.
Motivation: Existing OT models suffer from strict prior assumptions and implicit alignment in practical applications, which may bias the learned transport plan and cause negative transfer.
Method: A rigorous OT formulation for conditional distribution matching and label-shift correction, the masked OT (MOT) method, is proposed by defining a mask operation with label information.
Results: The theoretical equivalence between conditional OT and MOT is proved, showing that the well-defined MOT serves as a computation-friendly proxy. Extensive experiments validate the theoretical results and the effectiveness of the proposed model.

As an important methodology to measure distribution discrepancy, optimal transport (OT) has been successfully applied to learn generalizable visual models under changing environments. However, there are still limitations, including strict prior assumption and implicit alignment, for current OT modeling in challenging real-world scenarios like partial domain adaptation, where the learned transport plan may be biased and negative transfer is inevitable. Thus, it is necessary to explore a more feasible OT methodology for real-world applications. In this work, we focus on the rigorous OT modeling for conditional distribution matching and label shift correction. A novel masked OT (MOT) methodology on conditional distributions is proposed by defining a mask operation with label information. Further, a relaxed and reweighting formulation is proposed to improve the robustness of OT in extreme scenarios. We prove the theoretical equivalence between conditional OT and MOT, which implies the well-defined MOT serves as a computation-friendly proxy. Extensive experiments validate the effectiveness of theoretical results and proposed model.
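The mask operation at the core of MOT can be illustrated on a toy cost matrix (a sketch under assumed labels and costs, not the paper's implementation): entries whose labels disagree receive an infinite cost, so no transport mass can flow between them.

```python
# Toy mask operation for label-aware optimal transport: block cost-matrix
# entries whose source/target labels disagree by assigning them infinite
# cost, restricting transport to within-class pairings. Labels and costs
# are illustrative assumptions.

BLOCKED = float("inf")

def masked_cost(cost, src_labels, tgt_labels):
    """Keep costs where labels match; block all cross-label entries."""
    return [[c if ls == lt else BLOCKED
             for c, lt in zip(row, tgt_labels)]
            for row, ls in zip(cost, src_labels)]

cost = [[0.1, 0.9, 0.4],
        [0.8, 0.2, 0.5]]
src_labels = ["cat", "dog"]
tgt_labels = ["cat", "dog", "dog"]

m = masked_cost(cost, src_labels, tgt_labels)
print(m)  # only same-label entries keep their finite costs
```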

Adaptive Sparse Pairwise Loss for Object Re-Identification
Zhou, XiaoandZhong, YujieandCheng, ZhenandLiang, FanandMa, Lin



Research problem: Object re-identification (ReID) aims to find instances with the same identity as a given probe in a large gallery.
Motivation: Pairwise losses play an important role in training ReID networks. Existing pairwise losses densely use every instance as an anchor and sample its triplets within a mini-batch. This dense sampling inevitably introduces positive pairs that share few visual similarities, which can be harmful to training.
Method: We propose a new loss paradigm, the Sparse Pairwise (SP) loss, which exploits only a few appropriate pairs per class in a mini-batch, and empirically show that this is sufficient for ReID tasks. Based on the proposed loss framework, we propose an adaptive positive mining strategy that can dynamically adapt to diverse intra-class variations.
Results: Experiments show that the SP loss and its adaptive variant, the AdaSP loss, outperform other pairwise losses and achieve state-of-the-art performance on several ReID benchmarks. Code is available at https://github.com/Astaxanthin/AdaSP.

Object re-identification (ReID) aims to find instances with the same identity as the given probe from a large gallery. Pairwise losses play an important role in training a strong ReID network. Existing pairwise losses densely exploit each instance as an anchor and sample its triplets in a mini-batch. This dense sampling mechanism inevitably introduces positive pairs that share few visual similarities, which can be harmful to the training. To address this problem, we propose a novel loss paradigm termed Sparse Pairwise (SP) loss that only leverages few appropriate pairs for each class in a mini-batch, and empirically demonstrate that it is sufficient for the ReID tasks. Based on the proposed loss framework, we propose an adaptive positive mining strategy that can dynamically adapt to diverse intra-class variations. Extensive experiments show that SP loss and its adaptive variant AdaSP loss outperform other pairwise losses, and achieve state-of-the-art performance across several ReID benchmarks. Code is available at https://github.com/Astaxanthin/AdaSP.
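The sparse-pair idea can be illustrated by selecting a single informative positive pair per class in a mini-batch (a simplified sketch; the actual AdaSP mining is adaptive and more involved, and the similarity values below are assumptions):

```python
# Simplified sketch of sparse pair selection: instead of using every
# instance as an anchor, keep only one pair per class in the batch -- here
# the least-similar same-class pair (the "hardest positive"). The
# similarity matrix is an illustrative assumption.

from itertools import combinations

def hardest_positive_pair(indices, sim):
    """Return the same-class index pair with the lowest similarity."""
    return min(combinations(indices, 2), key=lambda ij: sim[ij[0]][ij[1]])

sim = [[1.0, 0.8, 0.3, 0.0],
       [0.8, 1.0, 0.5, 0.0],
       [0.3, 0.5, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]
class_a = [0, 1, 2]  # batch indices belonging to one identity

pair = hardest_positive_pair(class_a, sim)
print(pair)  # the least-similar pair within the class
```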

Progressive Open Space Expansion for Open-Set Model Attribution
Yang, TianyunandWang, DandingandTang, FanandZhao, XinyingandCao, JuanandTang, Sheng



Research problem: Despite remarkable progress in generative technology, the twin issues of intellectual property protection and malicious content supervision have arisen.
Motivation: Current work manages synthetic images mainly by attributing them to a set of potential source models, but this closed-set classification setting limits practical application to content generated by arbitrary models.
Method: This work focuses on a challenging task, Open-Set Model Attribution (OSMA), which simultaneously attributes images to known models and identifies images from unknown ones. Since the distinction between images from known and unknown models may lie only in visually imperceptible traces, a Progressive Open Space Expansion (POSE) solution is proposed, which simulates open-set samples that keep the same semantics as closed-set samples but carry different imperceptible traces.
Results: Experiments on a constructed OSMA benchmark show that POSE is superior to both existing model attribution methods and off-the-shelf open-set recognition (OSR) methods.

Despite the remarkable progress in generative technology, the Janus-faced issues of intellectual property protection and malicious content supervision have arisen. Efforts have been made to manage synthetic images by attributing them to a set of potential source models. However, the closed-set classification setting limits the application in real-world scenarios for handling contents generated by arbitrary models. In this study, we focus on a challenging task, namely Open-Set Model Attribution (OSMA), to simultaneously attribute images to known models and identify those from unknown ones. Compared to existing open-set recognition (OSR) tasks focusing on semantic novelty, OSMA is more challenging as the distinction between images from known and unknown models may only lie in visually imperceptible traces. To this end, we propose a Progressive Open Space Expansion (POSE) solution, which simulates open-set samples that maintain the same semantics as closed-set samples but are embedded with different imperceptible traces. Guided by a diversity constraint, the open space is simulated progressively by a set of lightweight augmentation models. We consider three real-world scenarios and construct an OSMA benchmark dataset, including unknown models trained with different random seeds, architectures, and datasets from known ones. Extensive experiments on the dataset demonstrate POSE is superior to both existing model attribution methods and off-the-shelf OSR methods.

Improving Generalization With Domain Convex Game
Lv, FangruiandLiang, JianandLi, ShuangandZhang, JinmingandLiu, Di



Research problem: This paper addresses the poor generalization of deep networks when facing different source domains.
Motivation: Although it is widely believed that diversifying source domains improves model generalization, this belief lacks mathematical support.
Method: The authors propose a new perspective that recasts domain generalization as a convex game between domains. A regularization term based on supermodularity is designed to encourage each diversified domain to enhance model generalization, and a sample filter is constructed to eliminate low-quality samples.
Results: Formal analysis, heuristic analysis, and extensive experiments demonstrate the rationality and effectiveness of the framework.

Domain generalization (DG) tends to alleviate the poor generalization capability of deep neural networks by learning a model with multiple source domains. A classical solution to DG is domain augmentation, the common belief of which is that diversifying source domains will be conducive to out-of-distribution generalization. However, these claims are understood intuitively, rather than mathematically. Our explorations empirically reveal that the correlation between model generalization and the diversity of domains may not be strictly positive, which limits the effectiveness of domain augmentation. This work therefore aims to guarantee and further enhance the validity of this strand. To this end, we propose a new perspective on DG that recasts it as a convex game between domains. We first encourage each diversified domain to enhance model generalization by elaborately designing a regularization term based on supermodularity. Meanwhile, a sample filter is constructed to eliminate low-quality samples, thereby avoiding the impact of potentially harmful information. Our framework presents a new avenue for the formal analysis of DG; heuristic analysis and extensive experiments demonstrate its rationality and effectiveness.

Unsupervised Deep Probabilistic Approach for Partial Point Cloud Registration
Mei, GuofengandTang, HaoandHuang, XiaoshuiandWang, WeijieandLiu, JuanandZhang, JianandVanGool, LucandWu, Qiang



Research problem: Point cloud registration methods face the challenges of partial overlap and reliance on labeled data.
Motivation: To address these issues, we propose UDPReg, an unsupervised deep probabilistic registration framework for partially overlapping point clouds.
Method: First, a network learns posterior probability distributions of Gaussian mixture models (GMMs) from the point clouds. Then, to handle partial registration, the Sinkhorn algorithm is applied to predict distribution-level correspondences under the constraint of the GMM mixing weights. Finally, to enable unsupervised learning, three distribution-consistency-based losses are designed: self-consistency, cross-consistency, and local contrastive losses.
Results: UDPReg achieves competitive performance on the 3DMatch/3DLoMatch and ModelNet/ModelLoNet benchmarks.

Deep point cloud registration methods face challenges to partial overlaps and rely on labeled data. To address these issues, we propose UDPReg, an unsupervised deep probabilistic registration framework for point clouds with partial overlaps. Specifically, we first adopt a network to learn posterior probability distributions of Gaussian mixture models (GMMs) from point clouds. To handle partial point cloud registration, we apply the Sinkhorn algorithm to predict the distribution-level correspondences under the constraint of the mixing weights of GMMs. To enable unsupervised learning, we design three distribution consistency-based losses: self-consistency, cross-consistency, and local contrastive. The self-consistency loss is formulated by encouraging GMMs in Euclidean and feature spaces to share identical posterior distributions. The cross-consistency loss derives from the fact that the points of two partially overlapping point clouds belonging to the same clusters share the cluster centroids. The cross-consistency loss allows the network to flexibly learn a transformation-invariant posterior distribution of two aligned point clouds. The local contrastive loss facilitates the network to extract discriminative local features. Our UDPReg achieves competitive performance on the 3DMatch/3DLoMatch and ModelNet/ModelLoNet benchmarks.
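The Sinkhorn projection used to obtain distribution-level correspondences can be sketched as follows (a generic entropic-OT iteration with assumed marginals standing in for the GMM mixing weights, not the UDPReg code):

```python
import math

# Generic entropic Sinkhorn iteration: alternately rescale the rows and
# columns of exp(-cost/eps) so the transport plan's marginals match the
# prescribed weights (for UDPReg, the GMM mixing weights of the two point
# clouds). Cost matrix, weights, eps, and iteration count are illustrative.

def sinkhorn(cost, row_w, col_w, eps=0.1, iters=200):
    k = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * len(row_w)
    v = [1.0] * len(col_w)
    for _ in range(iters):
        u = [rw / sum(kij * vj for kij, vj in zip(row, v))
             for rw, row in zip(row_w, k)]
        v = [cw / sum(k[i][j] * u[i] for i in range(len(u)))
             for j, cw in enumerate(col_w)]
    # Transport plan: diag(u) @ K @ diag(v)
    return [[u[i] * k[i][j] * v[j] for j in range(len(v))]
            for i in range(len(u))]

plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
print(plan)  # nearly diagonal: mass follows the low-cost pairings
```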

Learning Adaptive Dense Event Stereo From the Image Domain
Cho, HoonheeandCho, JegyeongandYoon, Kuk-Jin



Research problem: Existing event-based stereo matching suffers severe performance degradation under domain shift.
Motivation: Traditional unsupervised domain adaptation still requires input event data with ground truth in the source domain, which is more challenging and costly to obtain than image data.
Method: A novel unsupervised domain Adaptive Dense Event Stereo (ADES) framework is proposed, which trains the network on the target domain through image reconstruction while an auxiliary network trained on the source domain removes intermittent artifacts from the reconstructed images.
Results: Experiments show that the approach achieves remarkable results in adapting event-based stereo matching from the image domain.

Recently, event-based stereo matching has been studied due to its robustness in poor light conditions. However, existing event-based stereo networks suffer severe performance degradation when domains shift. Unsupervised domain adaptation (UDA) aims at resolving this problem without using the target domain ground-truth. However, traditional UDA still needs the input event data with ground-truth in the source domain, which is more challenging and costly to obtain than image data. To tackle this issue, we propose a novel unsupervised domain Adaptive Dense Event Stereo (ADES) framework, which resolves gaps between the different domains and input modalities. The proposed ADES framework adapts event-based stereo networks from abundant image datasets with ground-truth on the source domain to event datasets without ground-truth on the target domain, which is a more practical setup. First, we propose a self-supervision module that trains the network on the target domain through image reconstruction, while an artifact prediction network trained on the source domain assists in removing intermittent artifacts in the reconstructed image. Secondly, we utilize a feature-level normalization scheme to align the extracted features along the epipolar line. Finally, we present a motion-invariant consistency module to impose consistent outputs between perturbed motions. Our experiments demonstrate that our approach achieves remarkable results in the adaptation ability of event-based stereo matching from the image domain.

Conjugate Product Graphs for Globally Optimal 2D-3D Shape Matching
Roetzer, PaulandL\"ahner, ZorahandBernard, Florian



Research problem: Finding a continuous and non-rigid matching between a 2D contour and a 3D mesh.
Motivation: Existing solutions rely heavily on unrealistic prior assumptions to avoid degenerate solutions, such as knowing which region of the 3D shape each point of the 2D contour matches.
Method: A novel 2D-3D shape matching formalism based on the conjugate product graph of the 2D contour and the 3D shape is proposed, which for the first time considers higher-order costs defined on edge chains rather than on single edges.
Results: The method finds globally optimal and continuous 2D-3D matchings, has the same asymptotic complexity as previous solutions, produces state-of-the-art shape matching results, and can even match partial shapes.

We consider the problem of finding a continuous and non-rigid matching between a 2D contour and a 3D mesh. While such problems can be solved to global optimality by finding a shortest path in the product graph between both shapes, existing solutions heavily rely on unrealistic prior assumptions to avoid degenerate solutions (e.g. knowledge to which region of the 3D shape each point of the 2D contour is matched). To address this, we propose a novel 2D-3D shape matching formalism based on the conjugate product graph between the 2D contour and the 3D shape. Doing so allows us for the first time to consider higher-order costs, i.e. defined for edge chains, as opposed to costs defined for single edges. This offers substantially more flexibility, which we utilise to incorporate a local rigidity prior. By doing so, we effectively circumvent degenerate solutions and thereby obtain smoother and more realistic matchings, even when using only a one-dimensional feature descriptor. Overall, our method finds globally optimal and continuous 2D-3D matchings, has the same asymptotic complexity as previous solutions, produces state-of-the-art results for shape matching and is even capable of matching partial shapes. Our code is publicly available (https://github.com/paul0noah/sm-2D3D).

Train/Test-Time Adaptation With Retrieval
Zancato, LucaandAchille, AlessandroandLiu, TianYuandTrager, MatthewandPerera, PramudithaandSoatto, Stefano



Research problem: How can models be adapted at both train and test time via a retrieval module and a searchable pool of external samples?
Motivation: Existing adaptation methods mainly rely on synthetic data augmentation to compensate for the lack of adaptation data, whereas T3AR adapts the model with retrieved real images, improving feature adaptation.
Method: T3AR employs a retrieval module and a searchable pool of external samples, using retrieved real samples to improve feature adaptation on the target data manifold; before inference, the given model is adapted to the downstream task with refined pseudo-labels and a self-supervised contrastive objective.
Results: Experiments show that at training time T3AR improves downstream fine-grained classification, especially when adaptation data are scarce (up to 13%); at test time, exploiting an external image pool makes the model outperform existing methods on DomainNet-126 and VISDA-C, especially when few adaptation data are available (up to 8%).

We introduce Train/Test-Time Adaptation with Retrieval (T3AR), a method to adapt models both at train and test time by means of a retrieval module and a searchable pool of external samples. Before inference, T3AR adapts a given model to the downstream task using refined pseudo-labels and a self-supervised contrastive objective function whose noise distribution leverages retrieved real samples to improve feature adaptation on the target data manifold. The retrieval of real images is key to T3AR since it does not rely solely on synthetic data augmentations to compensate for the lack of adaptation data, as typically done by other adaptation algorithms. Furthermore, thanks to the retrieval module, our method gives the user or service provider the possibility to improve model adaptation on the downstream task by incorporating further relevant data or to fully remove samples that may no longer be available due to changes in user preference after deployment. First, we show that T3AR can be used at training time to improve downstream fine-grained classification over standard fine-tuning baselines, and the fewer the adaptation data the higher the relative improvement (up to 13%). Second, we apply T3AR for test-time adaptation and show that exploiting a pool of external images at test-time leads to more robust representations over existing methods on DomainNet-126 and VISDA-C, especially when few adaptation data are available (up to 8%).

Best of Both Worlds: Multimodal Contrastive Learning With Tabular and Imaging Data
Hager, PaulandMenten, MartinJ.andRueckert, Daniel



Research problem: How to train unimodal encoders via self-supervised contrastive learning on images and tabular data.
Motivation: Medical datasets and biobanks contain extensive and rich clinical information, but clinicians have limited data in scale and diversity and annotation is expensive; a self-supervised method that pretrains multimodally and predicts unimodally is therefore needed.
Method: The first self-supervised contrastive learning framework exploiting both images and tabular data is proposed, combining two leading contrastive strategies, SimCLR and SCARF.
Results: The method is demonstrated by predicting risks of myocardial infarction and coronary artery disease from cardiac MR images and 120 clinical features of 40,000 UK Biobank subjects, and its generalizability to natural images is shown on the DVM car advertisement dataset. Experiments further reveal that morphometric tabular features are of outsized importance during contrastive learning and improve the quality of the learned embeddings. Finally, a new form of supervised contrastive learning, appending the ground-truth label as a tabular feature, outperforms all supervised contrastive baselines.

Medical datasets, and especially biobanks, often contain extensive tabular data with rich clinical information in addition to images. In practice, clinicians typically have less data, both in terms of diversity and scale, but still wish to deploy deep learning solutions. Combined with increasing medical dataset sizes and expensive annotation costs, the necessity for unsupervised methods that can pretrain multimodally and predict unimodally has risen. To address these needs, we propose the first self-supervised contrastive learning framework that takes advantage of images and tabular data to train unimodal encoders. Our solution combines SimCLR and SCARF, two leading contrastive learning strategies, and is simple and effective. In our experiments, we demonstrate the strength of our framework by predicting risks of myocardial infarction and coronary artery disease (CAD) using cardiac MR images and 120 clinical features from 40,000 UK Biobank subjects. Furthermore, we show the generalizability of our approach to natural images using the DVM car advertisement dataset. We take advantage of the high interpretability of tabular data and through attribution and ablation experiments find that morphometric tabular features, describing size and shape, have outsized importance during the contrastive learning process and improve the quality of the learned embeddings. Finally, we introduce a novel form of supervised contrastive learning, label as a feature (LaaF), by appending the ground truth label as a tabular feature during multimodal pretraining, outperforming all supervised contrastive baselines.

Masked Images Are Counterfactual Samples for Robust Fine-Tuning
Xiao, YaoandTang, ZiyiandWei, PengxuandLiu, CongandLin, Liang



Research problem: Deep models face distribution shift between training and test data, in particular the trade-off between in-distribution (ID) performance and out-of-distribution (OOD) robustness during fine-tuning.
Motivation: Existing methods do not explicitly address the OOD robustness problem, so a new fine-tuning method is needed.
Method: A novel fine-tuning method is proposed that uses masked images as counterfactual samples to improve the robustness of the fine-tuned model. Specifically, semantics-related or semantics-unrelated patches are masked according to the class activation map to break spurious correlations, and the masked patches are refilled with patches from other images. The resulting counterfactual samples are used in feature-based distillation with the pre-trained model.
Results: Experiments verify that regularizing fine-tuning with the proposed masked images achieves a better trade-off between ID and OOD performance, surpassing previous methods on OOD performance.

Deep learning models are challenged by the distribution shift between the training data and test data. Recently, the large models pre-trained on diverse data have demonstrated unprecedented robustness to various distribution shifts. However, fine-tuning these models can lead to a trade-off between in-distribution (ID) performance and out-of-distribution (OOD) robustness. Existing methods for tackling this trade-off do not explicitly address the OOD robustness problem. In this paper, based on causal analysis of the aforementioned problems, we propose a novel fine-tuning method, which uses masked images as counterfactual samples that help improve the robustness of the fine-tuning model. Specifically, we mask either the semantics-related or semantics-unrelated patches of the images based on class activation map to break the spurious correlation, and refill the masked patches with patches from other images. The resulting counterfactual samples are used in feature-based distillation with the pre-trained model. Extensive experiments verify that regularizing the fine-tuning with the proposed masked images can achieve a better trade-off between ID and OOD performance, surpassing previous methods on the OOD performance. Our code is available at https://github.com/Coxy7/robust-finetuning.
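The CAM-guided masking-and-refilling step can be illustrated on a toy patch grid (a sketch only; the patch contents, activation scores, and threshold are assumptions, and real images would use tensors rather than strings):

```python
# Toy sketch of counterfactual masking: patches whose class-activation
# score exceeds a threshold are treated as semantics-related, masked out,
# and refilled with the corresponding patches of a donor image. The patch
# grid, scores, and threshold are illustrative assumptions.

def counterfactual_patches(patches, cam_scores, filler, threshold=0.5):
    """Replace high-activation patches with the donor image's patches."""
    return [f if s > threshold else p
            for p, s, f in zip(patches, cam_scores, filler)]

image_a = ["a0", "a1", "a2", "a3"]  # four patches of image A
cam     = [0.9, 0.2, 0.7, 0.1]      # per-patch class-activation scores
image_b = ["b0", "b1", "b2", "b3"]  # donor image B

out = counterfactual_patches(image_a, cam, image_b)
print(out)  # semantics-related patches of A replaced by B's patches
```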

CLIP the Gap: A Single Domain Generalization Approach for Object Detection
Vidit, ViditandEngilberge, MartinandSalzmann, Mathieu



Research problem: How to train a model on a single source domain so that it generalizes to any unseen target domain.
Motivation: Although single domain generalization (SDG) has been well studied for image classification, the literature on SDG for object detection is almost non-existent. To address the challenge of simultaneously learning robust object localization and representation, a pre-trained vision-language model is leveraged to introduce semantic domain concepts via textual prompts.
Method: This is achieved via a semantic augmentation strategy acting on the features extracted by the detector backbone, together with a text-based classification loss.
Results: Experiments demonstrate the benefits of the approach, outperforming the only existing SDG object detection method, Single-DGOD[49], by 10% on their diverse weather-driving benchmark.

Single Domain Generalization (SDG) tackles the problem of training a model on a single source domain so that it generalizes to any unseen target domain. While this has been well studied for image classification, the literature on SDG object detection remains almost non-existent. To address the challenges of simultaneously learning robust object localization and representation, we propose to leverage a pre-trained vision-language model to introduce semantic domain concepts via textual prompts. We achieve this via a semantic augmentation strategy acting on the features extracted by the detector backbone, as well as a text-based classification loss. Our experiments evidence the benefits of our approach, outperforming by 10% the only existing SDG object detection method, Single-DGOD[49], on their own diverse weather-driving benchmark.

Unbalanced Optimal Transport: A Unified Framework for Object Detection
DePlaen, HenriandDePlaen, Pierre-Fran\c{c}ois



Research problem: How to effectively match predicted bounding boxes and their associated classification scores to the ground truth in order to optimize the training of object detection models.
Motivation: Popular matching strategies include matching to the closest ground-truth box and matching via the Hungarian algorithm, each with its own strengths and weaknesses.
Method: Unbalanced Optimal Transport is proposed to unify these different approaches and to open a whole continuum of new methods between them.
Results: Experiments show that object detection models trained with Unbalanced Optimal Transport reach the state of the art in both Average Precision and Average Recall, converge faster, and are well suited to GPU implementation for large-scale models.

During training, supervised object detection tries to correctly match the predicted bounding boxes and associated classification scores to the ground truth. This is essential to determine which predictions are to be pushed towards which solutions, or to be discarded. Popular matching strategies include matching to the closest ground truth box (mostly used in combination with anchors), or matching via the Hungarian algorithm (mostly used in anchor-free methods). Each of these strategies comes with its own properties, underlying losses, and heuristics. We show how Unbalanced Optimal Transport unifies these different approaches and opens a whole continuum of methods in between. This allows for a finer selection of the desired properties. Experimentally, we show that training an object detection model with Unbalanced Optimal Transport is able to reach the state-of-the-art both in terms of Average Precision and Average Recall as well as to provide a faster initial convergence. The approach is well suited for GPU implementation, which proves to be an advantage for large-scale models.

MMANet: Margin-Aware Distillation and Modality-Aware Regularization for Incomplete Multimodal Learning
Wei, Shicai and Luo, Chunbo and Luo, Yang



Research question: Multimodal learning holds great promise in many scenarios, but in practice it often encounters missing modality data, which causes severe performance degradation.
Motivation: To address this problem, we propose a general framework called MMANet to assist incomplete multimodal learning.
Method: MMANet consists of three components: a deployment network used for inference, a teacher network that transfers comprehensive multimodal information to the deployment network, and a regularization network that guides the deployment network to balance weak modality combinations. We further propose a novel margin-aware distillation (MAD) method that assists the information transfer by weighing each sample's contribution with its classification uncertainty, and design a modality-aware regularization (MAR) algorithm that mines weak modality combinations and guides the regularization network to compute prediction losses for them.
Results: Extensive experiments on multimodal classification and segmentation tasks show that MMANet significantly outperforms the state of the art.

Multimodal learning has shown great potentials in numerous scenes and attracts increasing interest recently. However, it often encounters the problem of missing modality data and thus suffers severe performance degradation in practice. To this end, we propose a general framework called MMANet to assist incomplete multimodal learning. It consists of three components: the deployment network used for inference, the teacher network transferring comprehensive multimodal information to the deployment network, and the regularization network guiding the deployment network to balance weak modality combinations. Specifically, we propose a novel margin-aware distillation (MAD) to assist the information transfer by weighing the sample contribution with the classification uncertainty. This encourages the deployment network to focus on the samples near decision boundaries and acquire the refined inter-class margin. Besides, we design a modality-aware regularization (MAR) algorithm to mine the weak modality combinations and guide the regularization network to calculate prediction loss for them. This forces the deployment network to improve its representation ability for the weak modality combinations adaptively. Finally, extensive experiments on multimodal classification and segmentation tasks demonstrate that our MMANet outperforms the state-of-the-art significantly.
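The uncertainty-weighted transfer in MAD can be sketched as a per-sample KL distillation term scaled by the teacher's normalized prediction entropy, so samples near decision boundaries contribute more. This is a schematic reading, not the paper's exact formulation; the entropy-based weight is an assumption:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def margin_aware_distill_loss(student_logits, teacher_logits):
    """Toy margin-aware distillation: per-sample KL(teacher || student),
    weighted by the teacher's normalized entropy so that uncertain samples
    (near decision boundaries) contribute more to the transfer."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(-1)
    ent = -(p_t * np.log(p_t + 1e-12)).sum(-1)
    w = ent / np.log(p_t.shape[-1])  # normalized uncertainty in [0, 1]
    return (w * kl).mean()

# One confident and one uncertain teacher prediction (values illustrative).
loss = margin_aware_distill_loss(np.zeros((2, 3)),
                                 np.array([[10.0, 0.0, 0.0],
                                           [0.1, 0.0, 0.0]]))
```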

Regularized Vector Quantization for Tokenized Image Synthesis
Zhang, Jiahui and Zhan, Fangneng and Theobalt, Christian and Lu, Shijian



Research question: How to quantize images into discrete representations, a fundamental problem in unified generative modeling.
Motivation: Predominant approaches quantize either deterministically by selecting the best-matching token or stochastically by sampling from a predicted distribution, and both have drawbacks.
Method: This paper proposes a regularized vector quantization framework that mitigates these issues through two forms of regularization: a prior distribution regularization that measures the discrepancy between a prior token distribution and the predicted token distribution to avoid codebook collapse and low codebook utilization, and a stochastic mask regularization that introduces randomness during quantization to strike a good balance between inference-stage misalignment and an unperturbed reconstruction objective. In addition, a probabilistic contrastive loss is designed as a calibrated metric to further mitigate the perturbed reconstruction objective.
Results: Extensive experiments show that the proposed quantization framework consistently outperforms prevailing vector quantizers across different generative models, including auto-regressive and diffusion models.

Quantizing images into discrete representations has been a fundamental problem in unified generative modeling. Predominant approaches learn the discrete representation either in a deterministic manner by selecting the best-matching token or in a stochastic manner by sampling from a predicted distribution. However, deterministic quantization suffers from severe codebook collapse and misaligned inference stage while stochastic quantization suffers from low codebook utilization and perturbed reconstruction objective. This paper presents a regularized vector quantization framework that allows to mitigate above issues effectively by applying regularization from two perspectives. The first is a prior distribution regularization which measures the discrepancy between a prior token distribution and predicted token distribution to avoid codebook collapse and low codebook utilization. The second is a stochastic mask regularization that introduces stochasticity during quantization to strike a good balance between inference stage misalignment and unperturbed reconstruction objective. In addition, we design a probabilistic contrastive loss which serves as a calibrated metric to further mitigate the perturbed reconstruction objective. Extensive experiments show that the proposed quantization framework outperforms prevailing vector quantizers consistently across different generative models including auto-regressive models and diffusion models.
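The deterministic/stochastic trade-off can be made concrete with a toy tokenizer in which a random subset of positions samples its token from a distance-softmax distribution while the rest take the nearest code. This is only an illustration of the masking idea; the `mask_ratio` and the distance-softmax sampler are assumptions, not the paper's regularizer:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(features, codebook, mask_ratio=0.3):
    """Toy tokenizer with stochastic masking: masked positions sample a
    token from a distance-softmax distribution; unmasked positions take
    the nearest code deterministically."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    hard = d2.argmin(1)  # deterministic: best-matching token
    logits = -d2 - (-d2).max(1, keepdims=True)  # stabilized softmax
    probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
    soft = np.array([rng.choice(len(codebook), p=p) for p in probs])
    use_soft = rng.random(len(features)) < mask_ratio
    tokens = np.where(use_soft, soft, hard)
    return tokens, codebook[tokens]

feats = rng.normal(size=(8, 4))
book = rng.normal(size=(16, 4))
tokens, quantized = quantize(feats, book)
```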

Deep Factorized Metric Learning
Wang, Chengkun and Zheng, Wenzhao and Li, Junlong and Zhou, Jie and Lu, Jiwen



Research question: How to learn a generalizable and comprehensive similarity metric that captures the semantic discrepancies between images.
Motivation: Existing methods learn an ensemble of embeddings with diverse objectives, but the backbone network still receives a mix of all the training signals.
Method: Propose a deep factorized metric learning method (DFML) that factorizes the training signal and uses different samples to train different parts of the backbone. The network is factorized into sub-blocks, and a learnable router adaptively allocates training samples to each sub-block with the objective of capturing the most information.
Results: DFML achieves state-of-the-art performance on three deep metric learning benchmarks: CUB-200-2011, Cars196, and Stanford Online Products. We also generalize DFML to image classification on ImageNet-1K and observe consistent improvements in the accuracy/computation trade-off.

Learning a generalizable and comprehensive similarity metric to depict the semantic discrepancies between images is the foundation of many computer vision tasks. While existing methods approach this goal by learning an ensemble of embeddings with diverse objectives, the backbone network still receives a mix of all the training signals. Differently, we propose a deep factorized metric learning method (DFML) to factorize the training signal and employ different samples to train various components of the backbone network. We factorize the network to different sub-blocks and devise a learnable router to adaptively allocate the training samples to each sub-block with the objective to capture the most information. The metric model trained by DFML captures different characteristics with different sub-blocks and constitutes a generalizable metric when using all the sub-blocks. The proposed DFML achieves state-of-the-art performance on all three benchmarks for deep metric learning including CUB-200-2011, Cars196, and Stanford Online Products. We also generalize DFML to the image classification task on ImageNet-1K and observe consistent improvement in accuracy/computation trade-off. Specifically, we improve the performance of ViT-B on ImageNet (+0.2% accuracy) with less computation load (-24% FLOPs).

Multi-Level Logit Distillation
Jin, Ying and Wang, Jiaqi and Lin, Dahua



Research question: How to distill knowledge from a large teacher model into a lightweight student using only logit outputs, closing the performance gap to feature distillation.
Motivation: Mainstream logit distillation is easy to implement but inferior in performance, while feature distillation is inapplicable in some practical settings due to privacy and safety concerns.
Method: This paper proposes an enhanced logit distillation method that aligns predictions at multiple levels, so the student simultaneously learns instance predictions, input correlations, and category correlations; a prediction augmentation mechanism based on model calibration further boosts performance.
Results: Experiments show that the method consistently outperforms existing logit distillation methods and even reaches performance competitive with mainstream feature distillation methods.

Knowledge Distillation (KD) aims at distilling the knowledge from the large teacher model to a lightweight student model. Mainstream KD methods can be divided into two categories, logit distillation, and feature distillation. The former is easy to implement, but inferior in performance, while the latter is not applicable to some practical circumstances due to concerns such as privacy and safety. Towards this dilemma, in this paper, we explore a stronger logit distillation method via making better utilization of logit outputs. Concretely, we propose a simple yet effective approach to logit distillation via multi-level prediction alignment. Through this framework, the prediction alignment is not only conducted at the instance level, but also at the batch and class level, through which the student model learns instance prediction, input correlation, and category correlation simultaneously. In addition, a prediction augmentation mechanism based on model calibration further boosts the performance. Extensive experiment results validate that our method enjoys consistently higher performance than previous logit distillation methods, and even reaches competitive performance with mainstream feature distillation methods. We promise to release our code and models to ensure reproducibility.
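The three alignment levels can be sketched with the batch prediction matrix: instance-level KL on each row, plus matching the Gram matrices that encode input correlations (rows) and category correlations (columns). This is an illustrative reading of the idea, not the paper's exact losses:

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def multi_level_loss(student_logits, teacher_logits):
    """Schematic multi-level alignment on the (batch, classes) prediction
    matrices: instance-level KL + batch-level and class-level Gram-matrix
    matching."""
    ps, pt = softmax(student_logits), softmax(teacher_logits)
    inst = (pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12))).sum(-1).mean()
    batch = ((ps @ ps.T - pt @ pt.T) ** 2).mean()  # input correlations
    cls = ((ps.T @ ps - pt.T @ pt) ** 2).mean()    # category correlations
    return inst + batch + cls

s = np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
t = np.array([[1.0, 0.5, 0.0], [0.0, 1.5, 0.5]])
```

With identical logits all three terms vanish, so the loss is zero exactly when the student reproduces the teacher at every level.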

Dual-Path Adaptation From Image to Video Transformers
Park, Jungin and Lee, Jiyoung and Sohn, Kwanghoon



Research question: How to effectively transfer the representation power of vision foundation models such as ViT and Swin to video understanding while adding only a few trainable parameters.
Motivation: Existing adaptation methods consider spatial and temporal modeling simultaneously but still fall short of fully leveraging the representative capabilities of image transformers.
Method: Propose a novel DUALPATH adaptation separated into spatial and temporal adaptation paths, with a lightweight bottleneck adapter in each transformer block. For temporal dynamics, consecutive frames are incorporated into a grid-like frameset to precisely imitate image transformers' ability to infer relationships between tokens.
Results: Experiments on four action recognition benchmarks show that image transformers pretrained with DUALPATH generalize effectively beyond the data domain.

In this paper, we efficiently transfer the surpassing representation power of the vision foundation models, such as ViT and Swin, for video understanding with only a few trainable parameters. Previous adaptation methods have simultaneously considered spatial and temporal modeling with a unified learnable module but still suffered from fully leveraging the representative capabilities of image transformers. We argue that the popular dual-path (two-stream) architecture in video models can mitigate this problem. We propose a novel DUALPATH adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block. Especially for temporal dynamic modeling, we incorporate consecutive frames into a grid-like frameset to precisely imitate vision transformers' capability that extrapolates relationships between tokens. In addition, we extensively investigate the multiple baselines from a unified perspective in video understanding and compare them with DUALPATH. Experimental results on four action recognition benchmarks prove that pretrained image transformers with DUALPATH can be effectively generalized beyond the data domain.
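The lightweight bottleneck adapter can be sketched as a residual module added next to each frozen transformer block: down-project, nonlinearity, up-project, residual add. The dimensions and the zero initialization of the up-projection (so the adapter starts as an identity) are illustrative conventions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

class BottleneckAdapter:
    """Minimal bottleneck adapter: only these two small matrices would be
    trained while the surrounding transformer block stays frozen."""
    def __init__(self, dim, bottleneck):
        self.down = rng.normal(0.0, 0.02, (dim, bottleneck))
        self.up = np.zeros((bottleneck, dim))  # zero-init: starts as identity

    def __call__(self, x):
        # residual add around down-project -> ReLU -> up-project
        return x + np.maximum(x @ self.down, 0.0) @ self.up

adapter = BottleneckAdapter(dim=8, bottleneck=2)
x = rng.normal(size=(4, 8))
y = adapter(x)
```

With `dim=768` and `bottleneck=64`, one such adapter adds roughly 100K parameters per block, versus ~7M for a full ViT-B block, which is the parameter-efficiency argument.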

Transfer Knowledge From Head to Tail: Uncertainty Calibration Under Long-Tailed Distribution
Chen, Jiahao and Su, Bing



Research question: How to estimate the uncertainty of a given model, a crucial problem.
Motivation: Existing calibration techniques assume the training data distribution is balanced, ignoring the fact that real-world data often follows a long-tailed distribution.
Method: Propose a knowledge-transferring-based calibration method that estimates importance weights for tail-class samples to realize long-tailed calibration.
Results: Extensive experiments on the CIFAR-10-LT, MNIST-LT, CIFAR-100-LT, and ImageNet-LT datasets demonstrate the effectiveness of the method.

How to estimate the uncertainty of a given model is a crucial problem. Current calibration techniques treat different classes equally and thus implicitly assume that the distribution of training data is balanced, but ignore the fact that real-world data often follows a long-tailed distribution. In this paper, we explore the problem of calibrating the model trained from a long-tailed distribution. Due to the difference between the imbalanced training distribution and balanced test distribution, existing calibration methods such as temperature scaling can not generalize well to this problem. Specific calibration methods for domain adaptation are also not applicable because they rely on unlabeled target domain instances which are not available. Models trained from a long-tailed distribution tend to be more overconfident to head classes. To this end, we propose a novel knowledge-transferring-based calibration method by estimating the importance weights for samples of tail classes to realize long-tailed calibration. Our method models the distribution of each class as a Gaussian distribution and views the source statistics of head classes as a prior to calibrate the target distributions of tail classes. We adaptively transfer knowledge from head classes to get the target probability density of tail classes. The importance weight is estimated by the ratio of the target probability density over the source probability density. Extensive experiments on CIFAR-10-LT, MNIST-LT, CIFAR-100-LT, and ImageNet-LT datasets demonstrate the effectiveness of our method.
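The density-ratio importance weight can be sketched in one dimension: model each class as a Gaussian, build a calibrated target distribution for a tail class by borrowing head-class statistics as a prior, and weight each tail sample by target density over source density. The mixing coefficient `alpha` and the linear mixing rule are assumptions for illustration, not the paper's adaptive transfer:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def importance_weights(x_tail, mu_src, var_src, mu_head, var_head, alpha=0.7):
    """Toy 1-D importance weights: target tail statistics are a blend of
    the tail's own (source) statistics and a head-class prior; the weight
    is the target/source density ratio."""
    mu_tgt = alpha * mu_src + (1 - alpha) * mu_head
    var_tgt = alpha * var_src + (1 - alpha) * var_head
    return gaussian_pdf(x_tail, mu_tgt, var_tgt) / gaussian_pdf(x_tail, mu_src, var_src)

x_tail = np.array([-1.0, 0.0, 1.0])  # illustrative tail-class samples
w = importance_weights(x_tail, mu_src=0.0, var_src=1.0,
                       mu_head=2.0, var_head=1.5)
```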

Class-Conditional Sharpness-Aware Minimization for Deep Long-Tailed Recognition
Zhou, Zhipeng and Li, Lanqing and Zhao, Peilin and Heng, Pheng-Ann and Gong, Wei



Research question: In deep long-tailed recognition (DLTR), the model must generalize equally well across all classes under a highly imbalanced label distribution, yet the flat-minima property of the loss landscape remains under-explored in this setting.
Motivation: Although deep models with flatter minima in the loss landscape generalize better, sharp minima are prevalent in long-tailed models, and naively integrating existing flattening operations into long-tailed learning algorithms brings little improvement.
Method: Propose a two-stage sharpness-aware optimization approach based on the decoupling paradigm. In the first stage, both the feature extractor and the classifier are trained under parameter perturbations at a class-conditioned scale; in the second stage, adversarial features are generated with class-balanced sampling to further robustify the classifier with the backbone frozen.
Results: Extensive experiments on multiple long-tailed visual recognition benchmarks show that the proposed Class-Conditional Sharpness-Aware Minimization (CC-SAM) achieves performance competitive with the state of the art.

It's widely acknowledged that deep learning models with flatter minima in its loss landscape tend to generalize better. However, such property is under-explored in deep long-tailed recognition (DLTR), a practical problem where the model is required to generalize equally well across all classes when trained on highly imbalanced label distribution. In this paper, through empirical observations, we argue that sharp minima are in fact prevalent in deep longtailed models, whereas naive integration of existing flattening operations into long-tailed learning algorithms brings little improvement. Instead, we propose an effective twostage sharpness-aware optimization approach based on the decoupling paradigm in DLTR. In the first stage, both the feature extractor and classifier are trained under parameter perturbations at a class-conditioned scale, which is theoretically motivated by the characteristic radius of flat minima under the PAC-Bayesian framework. In the second stage, we generate adversarial features with classbalanced sampling to further robustify the classifier with the backbone frozen. Extensive experiments on multiple longtailed visual recognition benchmarks show that, our proposed Class-Conditional Sharpness-Aware Minimization (CC-SAM), achieves competitive performance compared to the state-of-the-arts. Code is available at https:// github.com/zzpustc/CC-SAM.
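The class-conditioned perturbation of the first stage can be sketched as a SAM-style ascent step whose radius is scaled per class, so rarer classes are flattened with a larger radius. The inverse-square-root frequency rule below is an assumption for illustration; the paper derives its scale from a PAC-Bayesian characteristic radius:

```python
import numpy as np

def cc_sam_perturbation(grads, labels, class_counts, rho_base=0.05):
    """Sketch of a class-conditioned SAM ascent step: normalize each
    per-sample gradient and scale the perturbation radius by the inverse
    square root of the class frequency (illustrative scaling rule)."""
    freq = class_counts / class_counts.sum()
    rho = rho_base / np.sqrt(freq[labels])  # per-sample radius
    unit = grads / (np.linalg.norm(grads, axis=1, keepdims=True) + 1e-12)
    return rho[:, None] * unit  # perturbation to add before the descent step

# Head class (900 samples) vs tail class (100 samples), toy gradients.
eps = cc_sam_perturbation(np.ones((2, 4)),
                          labels=np.array([0, 1]),
                          class_counts=np.array([900.0, 100.0]))
```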

CUDA: Convolution-Based Unlearnable Datasets
Sadasivan, Vinu Sankar and Soltanolkotabi, Mahdi and Feizi, Soheil



Research question: How to make web data unlearnable to deep models by adding specially designed noise, addressing the potential unauthorized use of online data and data privacy concerns.
Motivation: Existing methods are vulnerable to adversarial training and/or computationally heavy; this paper therefore proposes a novel, model-free Convolution-based Unlearnable Dataset (CUDA) generation technique.
Method: CUDA is generated using controlled class-wise convolutions whose filters are randomly generated via a private key. This encourages the network to learn the relation between filters and labels rather than the informative features needed to classify clean data.
Results: Experiments show that CUDA is robust across datasets (CIFAR-10, CIFAR-100, ImageNet-100, and Tiny-ImageNet) and architectures (ResNet-18, VGG-16, Wide ResNet-34-10, DenseNet-121, DeIT, EfficientNetV2-S, and MobileNetV2). For example, a ResNet-18 trained on ImageNet-100 CUDA reaches clean test accuracies of only 8.96%, 40.08%, and 20.58% under empirical risk minimization (ERM), L_infinity adversarial training, and L_2 adversarial training, respectively. CUDA also exhibits the unlearnability effect with ERM even when only a fraction of the training set is perturbed.

Large-scale training of modern deep learning models heavily relies on publicly available data on the web. This potentially unauthorized usage of online data leads to concerns regarding data privacy. Recent works aim to make unlearnable data for deep learning models by adding small, specially designed noises to tackle this issue. However, these methods are vulnerable to adversarial training (AT) and/or are computationally heavy. In this work, we propose a novel, model-free, Convolution-based Unlearnable DAtaset (CUDA) generation technique. CUDA is generated using controlled class-wise convolutions with filters that are randomly generated via a private key. CUDA encourages the network to learn the relation between filters and labels rather than informative features for classifying the clean data. We develop some theoretical analysis demonstrating that CUDA can successfully poison Gaussian mixture data by reducing the clean data performance of the optimal Bayes classifier. We also empirically demonstrate the effectiveness of CUDA with various datasets (CIFAR-10, CIFAR-100, ImageNet-100, and Tiny-ImageNet), and architectures (ResNet-18, VGG-16, Wide ResNet-34-10, DenseNet-121, DeIT, EfficientNetV2-S, and MobileNetV2). Our experiments show that CUDA is robust to various data augmentations and training approaches such as smoothing, AT with different budgets, transfer learning, and fine-tuning. For instance, training a ResNet-18 on ImageNet-100 CUDA achieves only 8.96%, 40.08%, and 20.58% clean test accuracies with empirical risk minimization (ERM), L_infinity AT, and L_2 AT, respectively. Here, ERM on the clean training data achieves a clean test accuracy of 80.66%. CUDA exhibits unlearnability effect with ERM even when only a fraction of the training dataset is perturbed. Furthermore, we also show that CUDA is robust to adaptive defenses designed specifically to break it.
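The class-wise convolution idea can be sketched directly: a private key seeds one random blur-like filter per class, and every image is convolved with the filter of its label, tying filters to labels. For brevity the "images" here are 1-D signals and the uniform filters are an assumption; the principle is the same:

```python
import numpy as np

def make_unlearnable(images, labels, n_classes, key=0, ksize=3):
    """Sketch of class-wise convolutional poisoning: one random normalized
    filter per class, seeded by a private key; each image is convolved
    with the filter of its own label."""
    frng = np.random.default_rng(key)  # the private key seeds the filters
    filters = frng.uniform(0.0, 1.0, (n_classes, ksize))
    filters /= filters.sum(1, keepdims=True)  # normalize each filter
    return np.stack([np.convolve(img, filters[y], mode="same")
                     for img, y in zip(images, labels)])

rng = np.random.default_rng(1)
imgs = rng.normal(size=(4, 32))  # four 1-D "images"
lbls = np.array([0, 1, 0, 1])
poisoned = make_unlearnable(imgs, lbls, n_classes=2, key=42)
again = make_unlearnable(imgs, lbls, n_classes=2, key=42)
```

Because the key fully determines the filters, the same key reproduces the same poisoned dataset, while a classifier can "solve" the task by detecting which filter was applied instead of learning class content.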

No One Left Behind: Improving the Worst Categories in Long-Tailed Learning
Du, Yingxiao and Wu, Jianxin



Research question: Neural networks trained on imbalanced datasets show large per-class accuracy variation; how can this be improved?
Motivation: The convention in long-tailed recognition is to manually split all categories into three subsets and report the average accuracy within each, but under this evaluation some categories are inevitably sacrificed.
Method: Propose a simple plug-in method: re-train the classifier of an existing pre-trained model with our loss function, optionally using an ensemble trick that combines the predictions of the two classifiers. This yields a more uniform distribution of recall values across categories and thus a higher harmonic-mean accuracy.
Results: The effectiveness of the method is demonstrated on widely used benchmark datasets.

Unlike the case when using a balanced training dataset, the per-class recall (i.e., accuracy) of neural networks trained with an imbalanced dataset are known to vary a lot from category to category. The convention in long-tailed recognition is to manually split all categories into three subsets and report the average accuracy within each subset. We argue that under such an evaluation setting, some categories are inevitably sacrificed. On one hand, focusing on the average accuracy on a balanced test set incurs little penalty even if some worst performing categories have zero accuracy. On the other hand, classes in the "Few" subset do not necessarily perform worse than those in the "Many" or "Medium" subsets. We therefore advocate to focus more on improving the lowest recall among all categories and the harmonic mean of all recall values. Specifically, we propose a simple plug-in method that is applicable to a wide range of methods. By simply re-training the classifier of an existing pre-trained model with our proposed loss function and using an optional ensemble trick that combines the predictions of the two classifiers, we achieve a more uniform distribution of recall values across categories, which leads to a higher harmonic mean accuracy while the (arithmetic) average accuracy is still high. The effectiveness of our method is justified on widely used benchmark datasets.
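The harmonic mean the authors advocate indeed punishes a single failing category far more than the arithmetic average, which is the point of the metric. A quick illustration with hypothetical recall values:

```python
import numpy as np

def per_class_recall(y_true, y_pred, n_classes):
    # Recall of class c = fraction of class-c samples predicted as c.
    return np.array([(y_pred[y_true == c] == c).mean()
                     for c in range(n_classes)])

def harmonic_mean(recalls, eps=1e-12):
    # Dominated by the worst class: one near-zero recall drags the
    # harmonic mean toward zero, unlike the arithmetic mean.
    return len(recalls) / (1.0 / (recalls + eps)).sum()

recalls = np.array([1.0, 1.0, 0.0])  # one completely failing category
h = harmonic_mean(recalls)
a = recalls.mean()
```

Here the arithmetic mean still reports a comfortable 0.67 while the harmonic mean collapses to nearly zero, exposing the sacrificed category.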

Deep Fair Clustering via Maximizing and Minimizing Mutual Information: Theory, Algorithm and Metric
Zeng, Pengxin and Li, Yunfan and Hu, Peng and Peng, Dezhong and Lv, Jiancheng and Peng, Xi



Research question: How to divide data into distinct clusters while preventing sensitive attributes from dominating the clustering.
Motivation: Although many works have been conducted recently with great success, most are heuristic, and a unified theory for algorithm design is lacking.
Method: Develop a mutual information theory for deep fair clustering by maximizing and minimizing mutual information, and accordingly design a novel algorithm dubbed FCMI.
Results: The effectiveness of FCMI is verified on six benchmarks, including a single-cell RNA-seq atlas, against 11 state-of-the-art methods in terms of five metrics.

Fair clustering aims to divide data into distinct clusters while preventing sensitive attributes (e.g., gender, race, RNA sequencing technique) from dominating the clustering. Although a number of works have been conducted and achieved huge success recently, most of them are heuristical, and there lacks a unified theory for algorithm design. In this work, we fill this blank by developing a mutual information theory for deep fair clustering and accordingly designing a novel algorithm, dubbed FCMI. In brief, through maximizing and minimizing mutual information, FCMI is designed to achieve four characteristics highly expected by deep fair clustering, i.e., compact, balanced, and fair clusters, as well as informative features. Besides the contributions to theory and algorithm, another contribution of this work is proposing a novel fair clustering metric built upon information theory as well. Unlike existing evaluation metrics, our metric measures the clustering quality and fairness as a whole instead of separate manner. To verify the effectiveness of the proposed FCMI, we conduct experiments on six benchmarks including a single-cell RNA-seq atlas compared with 11 state-of-the-art methods in terms of five metrics. The code could be accessed from https://pengxi.me.

COT: Unsupervised Domain Adaptation With Clustering and Optimal Transport
Liu, Yang and Zhou, Zhipeng and Sun, Baigui



Research question: How to transfer knowledge from a labeled source domain to an unlabeled target domain, particularly under class imbalance and the computational overhead of large-scale training.
Motivation: Existing unsupervised domain adaptation methods focus mainly on global-level distribution alignment and neglect instance-level local alignment, while existing optimal-transport-based methods incur heavy computation when handling class imbalance and large-scale training.
Method: Propose a Clustering-based Optimal Transport (COT) algorithm that formulates the alignment procedure as an optimal transport problem and constructs a mapping between the clustering centers of the source and target domains in an end-to-end manner.
Results: Experiments show that COT achieves state-of-the-art performance on several authoritative benchmark datasets while mitigating the negative effect of class imbalance and the computational overhead of large-scale training.

Unsupervised domain adaptation (UDA) aims to transfer the knowledge from a labeled source domain to an unlabeled target domain. Typically, to guarantee desirable knowledge transfer, aligning the distribution between source and target domain from a global perspective is widely adopted in UDA. Recent researchers further point out the importance of local-level alignment and propose to construct instance-pair alignment by leveraging on Optimal Transport (OT) theory. However, existing OT-based UDA approaches are limited to handling class imbalance challenges and introduce a heavy computation overhead when considering a large-scale training situation. To cope with two aforementioned issues, we propose a Clustering-based Optimal Transport (COT) algorithm, which formulates the alignment procedure as an Optimal Transport problem and constructs a mapping between clustering centers in the source and target domain via an end-to-end manner. With this alignment on clustering centers, our COT eliminates the negative effect caused by class imbalance and reduces the computation cost simultaneously. Empirically, our COT achieves state-of-the-art performance on several authoritative benchmark datasets.

TIPI: Test Time Adaptation With Transformation Invariance
Nguyen, A. Tuan and Nguyen-Tang, Thanh and Lim, Ser-Nam and Torr, Philip H.S.



Research question: When deploying a machine learning model to a new environment, we often face distribution shift: the target data distribution differs from the model's training distribution.
Motivation: How to adapt the model to the new data distribution when labels are unavailable in the new domain and the source data cannot be stored (e.g., for privacy reasons).
Method: Propose a test-time adaptation method (TIPI) that uses a transformation-invariance regularizer as the surrogate loss at test time.
Results: Extensive experiments show that TIPI is robust to small batch sizes and consistently outperforms TENT in all settings.

When deploying a machine learning model to a new environment, we often encounter the distribution shift problem -- meaning the target data distribution is different from the model's training distribution. In this paper, we assume that labels are not provided for this new domain, and that we do not store the source data (e.g., for privacy reasons). It has been shown that even small shifts in the data distribution can affect the model's performance severely. Test Time Adaptation offers a means to combat this problem, as it allows the model to adapt during test time to the new data distribution, using only unlabeled test data batches. To achieve this, the predominant approach is to optimize a surrogate loss on the test-time unlabeled target data. In particular, minimizing the prediction's entropy on target samples has received much interest as it is task-agnostic and does not require altering the model's training phase (e.g., does not require adding a self-supervised task during training on the source domain). However, as the target data's batch size is often small in real-world scenarios (e.g., autonomous driving models process each few frames in real-time), we argue that this surrogate loss is not optimal since it often collapses with small batch sizes. To tackle this problem, in this paper, we propose to use an invariance regularizer as the surrogate loss during test-time adaptation, motivated by our theoretical results regarding the model's performance under input transformations. The resulting method (TIPI -- Test tIme adaPtation with transformation Invariance) is validated with extensive experiments in various benchmarks (Cifar10-C, Cifar100-C, ImageNet-C, DIGITS, and VisDA17). Remarkably, TIPI is robust against small batch sizes (as small as 2 in our experiments), and consistently outperforms TENT in all settings. Our code is released at https://github.com/atuannguyen/TIPI.
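The two surrogate losses the abstract contrasts can be written down in a few lines: the entropy objective used by TENT-style methods, and an invariance regularizer in the spirit of TIPI that penalizes divergence between predictions on a batch and on a transformed view of it. The KL divergence and the choice of transformation are assumptions for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def entropy_loss(logits):
    """TENT-style surrogate: mean prediction entropy on the test batch."""
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(-1).mean()

def invariance_loss(logits, logits_transformed):
    """Transformation-invariance surrogate: KL between predictions on the
    batch and on a transformed view of it (schematic version of TIPI)."""
    p, q = softmax(logits), softmax(logits_transformed)
    return (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(-1).mean()
```

Unlike the entropy term, the invariance term stays zero for a model whose predictions already agree across views, which is one intuition for why it is less prone to collapse on tiny batches.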

CFA: Class-Wise Calibrated Fair Adversarial Training
Wei, Zeming and Wang, Yifei and Guo, Yiwen and Wang, Yisen



Research question: How to improve the adversarial robustness of deep neural networks against adversarial examples while achieving fairness across classes.
Motivation: Most existing works focus on overall model robustness and treat every class equally in both training and testing, overlooking the differing preferences of classes for adversarial configurations, including perturbation margin, regularization, and weight averaging.
Method: The first work to theoretically and empirically study the preferences of different classes for adversarial configurations, and accordingly propose a Class-wise calibrated Fair Adversarial training framework (CFA) that automatically customizes specific training configurations for each class.
Results: Experiments show that the proposed CFA notably improves both overall robustness and fairness over other state-of-the-art methods.

Adversarial training has been widely acknowledged as the most effective method to improve the adversarial robustness against adversarial examples for Deep Neural Networks (DNNs). So far, most existing works focus on enhancing the overall model robustness, treating each class equally in both the training and testing phases. Although revealing the disparity in robustness among classes, few works try to make adversarial training fair at the class level without sacrificing overall robustness. In this paper, we are the first to theoretically and empirically investigate the preference of different classes for adversarial configurations, including perturbation margin, regularization, and weight averaging. Motivated by this, we further propose a Class-wise calibrated Fair Adversarial training framework, named CFA, which customizes specific training configurations for each class automatically. Experiments on benchmark datasets demonstrate that our proposed CFA can improve both overall robustness and fairness notably over other state-of-the-art methods. Code is available at https://github.com/PKU-ML/CFA.

Glocal Energy-Based Learning for Few-Shot Open-Set Recognition
Wang, Haoyu and Pang, Guansong and Wang, Peng and Zhang, Lei and Wei, Wei and Zhang, Yanning



Research question: This paper addresses few-shot open-set recognition (FSOR): categorizing a sample into pre-defined closed-set classes illustrated by only a few examples while rejecting samples from unknown classes.
Motivation: FSOR is a challenging task of great practical value, and existing methods often struggle to detect open-set samples holistically.
Method: This paper proposes a novel energy-based hybrid model with two branches: a classification branch that learns a metric to classify samples into closed-set classes, and an energy branch that explicitly estimates the open-set probability. To detect open-set samples holistically, the model leverages class-wise and pixel-wise features to learn a global energy score and a local energy score, respectively.
Results: Experiments show that the proposed energy-based hybrid model achieves superior performance on three standard FSOR datasets.

Few-shot open-set recognition (FSOR) is a challenging task of great practical value. It aims to categorize a sample to one of the pre-defined, closed-set classes illustrated by few examples while being able to reject the sample from unknown classes. In this work, we approach the FSOR task by proposing a novel energy-based hybrid model. The model is composed of two branches, where a classification branch learns a metric to classify a sample to one of closed-set classes and the energy branch explicitly estimates the open-set probability. To achieve holistic detection of open-set samples, our model leverages both class-wise and pixel-wise features to learn a glocal energy-based score, in which a global energy score is learned using the class-wise features, while a local energy score is learned using the pixel-wise features. The model is enforced to assign large energy scores to samples that are deviated from the few-shot examples in either the class-wise features or the pixel-wise features, and to assign small energy scores otherwise. Experiments on three standard FSOR datasets show the superior performance of our model.

AutoLabel: CLIP-Based Framework for Open-Set Video Domain Adaptation
Zara, Giacomo and Roy, Subhankar and Rota, Paolo and Ricci, Elisa



Research question: How to adapt an action recognition model from a labelled source domain to an unlabelled target domain that contains "target-private" categories.
Motivation: Existing open-set unsupervised video domain adaptation methods require a specialized open-set classifier or weighted adversarial learning; we instead propose to use the pre-trained language-and-vision model CLIP.
Method: We propose AutoLabel, which automatically discovers and generates object-centric compositional candidate target-private class names, enabling CLIP to reject target-private instances and thereby better align the shared classes of the two domains.
Results: Experiments show that CLIP equipped with AutoLabel satisfactorily rejects target-private instances, leading to better domain adaptation.

Open-set Unsupervised Video Domain Adaptation (OUVDA) deals with the task of adapting an action recognition model from a labelled source domain to an unlabelled target domain that contains "target-private" categories, which are present in the target but absent in the source. In this work we deviate from the prior work of training a specialized open-set classifier or weighted adversarial learning by proposing to use pre-trained Language and Vision Models (CLIP). The CLIP is well suited for OUVDA due to its rich representation and the zero-shot recognition capabilities. However, rejecting target-private instances with the CLIP's zero-shot protocol requires oracle knowledge about the target-private label names. To circumvent the impossibility of the knowledge of label names, we propose AutoLabel that automatically discovers and generates object-centric compositional candidate target-private class names. Despite its simplicity, we show that CLIP when equipped with AutoLabel can satisfactorily reject the target-private instances, thereby facilitating better alignment between the shared classes of the two domains. The code is available.

Instant Domain Augmentation for LiDAR Semantic Segmentation
Ryu, Kwonyoung and Hwang, Soonmin and Park, Jaesik



Research question: Perception algorithms using 3D LiDAR data struggle with the 'sensor-bias problem': performance drops significantly when an unseen LiDAR sensor specification is applied at test time, due to the domain discrepancy.
Motivation: To address this, the paper proposes a fast and flexible LiDAR augmentation method called 'LiDomAug', which enables instant domain augmentation.
Method: LiDomAug aggregates raw LiDAR scans and creates a LiDAR scan of any configuration while accounting for dynamic distortion and occlusion. The on-demand augmentation module runs at 330 FPS, so it can be seamlessly integrated into the learning framework's data loader.
Results: In experiments, learning-based approaches aided by LiDomAug are less affected by the sensor-bias problem and achieve new state-of-the-art domain adaptation performance on SemanticKITTI and nuScenes without using target-domain data. The paper also presents a sensor-agnostic model that works faithfully across various LiDAR configurations.

Despite the increasing popularity of LiDAR sensors, perception algorithms using 3D LiDAR data struggle with the 'sensor-bias problem'. Specifically, the performance of perception algorithms significantly drops when an unseen specification of LiDAR sensor is applied at test time due to the domain discrepancy. This paper presents a fast and flexible LiDAR augmentation method for the semantic segmentation task, called 'LiDomAug'. It aggregates raw LiDAR scans and creates a LiDAR scan of any configurations with the consideration of dynamic distortion and occlusion, resulting in instant domain augmentation. Our on-demand augmentation module runs at 330 FPS, so it can be seamlessly integrated into the data loader in the learning framework. In our experiments, learning-based approaches aided with the proposed LiDomAug are less affected by the sensor-bias issue and achieve new state-of-the-art domain adaptation performances on SemanticKITTI and nuScenes dataset without the use of the target domain data. We also present a sensor-agnostic model that faithfully works on the various LiDAR configurations.

Robust Test-Time Adaptation in Dynamic Scenarios
Yuan, Longhui and Xie, Binhui and Li, Shuang



Research question: How to adapt a pretrained model to the test distribution, especially in the dynamic scenarios of real-world applications.
Motivation: Existing TTA methods succeed on simple test data streams but can fail in real-world applications such as autonomous driving, where the environment changes gradually and the test data is correlatively sampled over time.
Method: Propose Robust Test-Time Adaptation (RoTTA) for practical test-time adaptation (PTTA) against complex data streams. Specifically: a robust batch normalization scheme estimates the normalization statistics; a memory bank samples category-balanced data with consideration of timeliness and uncertainty; and a time-aware reweighting strategy with a teacher-student model stabilizes the training procedure.
Results: Experiments show that RoTTA enables continual test-time adaptation on correlatively sampled data streams, and the method is easy to implement and well suited for rapid deployment.

Test-time adaptation (TTA) intends to adapt the pretrained model to test distributions with only unlabeled test data streams. Most of the previous TTA methods have achieved great success on simple test data streams such as independently sampled data from single or multiple distributions. However, these attempts may fail in dynamic scenarios of real-world applications like autonomous driving, where the environments gradually change and the test data is sampled correlatively over time. In this work, we explore such practical test data streams to deploy the model on the fly, namely practical test-time adaptation (PTTA). To do so, we elaborate a Robust Test-Time Adaptation (RoTTA) method against the complex data stream in PTTA. More specifically, we present a robust batch normalization scheme to estimate the normalization statistics. Meanwhile, a memory bank is utilized to sample category-balanced data with consideration of timeliness and uncertainty. Further, to stabilize the training procedure, we develop a time-aware reweighting strategy with a teacher-student model. Extensive experiments prove that RoTTA enables continual testtime adaptation on the correlatively sampled data streams. Our method is easy to implement, making it a good choice for rapid deployment. The code is publicly available at https://github.com/BIT-DA/RoTTA
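Why a robust normalization scheme matters is easy to see in miniature: a correlatively sampled batch gives biased statistics, so instead of trusting each batch, the running statistics are updated with a small exponential moving average. This is a simplified reading of RoTTA's scheme (the momentum value and the plain EMA rule are assumptions):

```python
import numpy as np

class RobustBN:
    """Test-time batch norm that updates running statistics with a small
    exponential moving average rather than trusting each (possibly
    correlated) test batch outright."""
    def __init__(self, dim, momentum=0.05):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.m = momentum

    def __call__(self, x):
        # Blend batch statistics into the running estimates, then normalize.
        self.mean = (1 - self.m) * self.mean + self.m * x.mean(0)
        self.var = (1 - self.m) * self.var + self.m * x.var(0)
        return (x - self.mean) / np.sqrt(self.var + 1e-5)

rng = np.random.default_rng(0)
bn = RobustBN(4)
for _ in range(400):  # stream of small test batches from a shifted domain
    bn(rng.normal(5.0, 1.0, size=(16, 4)))
```

After enough batches the running mean tracks the shifted test distribution while any single noisy or correlated batch moves it only slightly.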

Global and Local Mixture Consistency Cumulative Learning for Long-Tailed Visual Recognitions
Du, Fei and Yang, Peng and Jia, Qi and Nan, Fengtao and Chen, Xiaoting and Yang, Yun



Research question: Design a simple learning paradigm for long-tailed visual recognition that improves robustness, reduces training tricks and overhead, and mitigates the classifier's bias towards head classes.
Motivation: Existing approaches to long-tailed visual recognition face challenges such as insufficient robustness of the feature extractor and classifier bias towards head classes.
Method: Propose an efficient one-stage training strategy called Global and Local Mixture Consistency cumulative learning (GLMC), with two main parts: (1) a global and local mixture consistency loss that improves the robustness of the feature extractor; (2) a cumulative head-tail soft-label reweighted loss that mitigates the head-class bias.
Results: The method achieves state-of-the-art accuracy on the CIFAR10-LT, CIFAR100-LT, and ImageNet-LT datasets. Additional experiments on balanced ImageNet and CIFAR show that GLMC can significantly improve the generalization of backbones.

In this paper, our goal is to design a simple learning paradigm for long-tail visual recognition, which not only improves the robustness of the feature extractor but also alleviates the bias of the classifier towards head classes while reducing the training skills and overhead. We propose an efficient one-stage training strategy for long-tailed visual recognition called Global and Local Mixture Consistency cumulative learning (GLMC). Our core ideas are twofold: (1) a global and local mixture consistency loss improves the robustness of the feature extractor. Specifically, we generate two augmented batches by the global MixUp and local CutMix from the same batch data, respectively, and then use cosine similarity to minimize the difference. (2) A cumulative head-tail soft label reweighted loss mitigates the head class bias problem. We use empirical class frequencies to reweight the mixed label of the head-tail class for long-tailed data and then balance the conventional loss and the rebalanced loss with a coefficient accumulated by epochs. Our approach achieves state-of-the-art accuracy on CIFAR10-LT, CIFAR100-LT, and ImageNet-LT datasets. Additional experiments on balanced ImageNet and CIFAR demonstrate that GLMC can significantly improve the generalization of backbones. Code is made publicly available at https://github.com/ynu-yangpeng/GLMC
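The two augmented views and the consistency term can be sketched directly: a global MixUp view, a local CutMix view (patch area proportional to 1 - lambda), and a cosine-similarity consistency between their features. The 2-D single-channel images, the corner patch placement, and the raw-pixel "features" are simplifications for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def global_mixup(x1, x2, lam):
    """Global MixUp: convex combination of two images."""
    return lam * x1 + (1 - lam) * x2

def local_cutmix(x1, x2, lam):
    """Local CutMix: paste a patch of x2 into x1; patch area ~ (1 - lam)."""
    h, w = x1.shape
    ph, pw = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    out = x1.copy()
    out[:ph, :pw] = x2[:ph, :pw]
    return out

def consistency(f1, f2):
    """Cosine-distance consistency between the two views' features."""
    num = (f1 * f2).sum()
    return 1 - num / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-12)

img_a = rng.normal(size=(8, 8))
img_b = rng.normal(size=(8, 8))
mixed = global_mixup(img_a, img_b, lam=0.7)
cut = local_cutmix(img_a, img_b, lam=0.7)
```

In GLMC both views come from the same batch, and minimizing the cosine distance between their representations is what regularizes the feature extractor.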

MHPL: Minimum Happy Points Learning for Active Source Free Domain Adaptation
Wang, FanandHan, ZhongyiandZhang, ZhiyanandHe, RundongandYin, Yilong



Research question: How to transfer a pretrained source model to an unlabeled target domain without accessing the source data.
Motivation: The source-free domain adaptation (SFDA) setting faces a performance bottleneck due to the absence of source data and target supervision.
Method: Active source-free domain adaptation (ASFDA) explores and exploits a small set of informative samples via active learning. Minimum happy points learning (MHPL) is proposed to actively explore and exploit minimum happy (MH) points. Three unique strategies are designed to explore MH points: neighbor environment uncertainty, neighbor diversity relaxation, and one-shot querying. To fully exploit MH points during learning, a neighbor focal loss assigns the weighted neighbor purity to the cross-entropy loss of MH points, making the model focus more on them.
Results: Experiments show that MHPL remarkably exceeds various types of baselines and achieves significant performance gains at a small labeling cost.

Source free domain adaptation (SFDA) aims to transfer a trained source model to the unlabeled target domain without accessing the source data. However, the SFDA setting faces a performance bottleneck due to the absence of source data and target supervised information, as evidenced by the limited performance gains of the newest SFDA methods. Active source free domain adaptation (ASFDA) can break through the problem by exploring and exploiting a small set of informative samples via active learning. In this paper, we first find that those satisfying the properties of neighbor-chaotic, individual-different, and source-dissimilar are the best points to select. We define them as the minimum happy (MH) points challenging to explore with existing methods. We propose minimum happy points learning (MHPL) to explore and exploit MH points actively. We design three unique strategies: neighbor environment uncertainty, neighbor diversity relaxation, and one-shot querying, to explore the MH points. Further, to fully exploit MH points in the learning process, we design a neighbor focal loss that assigns the weighted neighbor purity to the cross entropy loss of MH points to make the model focus more on them. Extensive experiments verify that MHPL remarkably exceeds the various types of baselines and achieves significant performance gains at a small cost of labeling.

Diversity-Aware Meta Visual Prompting
Huang, QidongandDong, XiaoyiandChen, DongdongandZhang, WeimingandWang, FeifeiandHua, GangandYu, Nenghai



Research question: How to effectively transfer pretrained models to downstream tasks while keeping the model backbone frozen.
Motivation: A challenge in visual prompting is that image datasets sometimes have large data diversity, and a per-dataset generic prompt can hardly handle the complex distribution shift toward the original pretraining data distribution properly.
Method: Proposes Diversity-Aware Meta Visual Prompting (DAM-VP), which clusters the downstream dataset into small homogeneous subsets, each with its own optimized prompt, all initialized from a meta-prompt.
Results: Experiments show that DAM-VP clearly surpasses previous prompting methods across a series of downstream datasets for different pretrained models, with higher efficiency and effectiveness.

We present Diversity-Aware Meta Visual Prompting (DAM-VP), an efficient and effective prompting method for transferring pre-trained models to downstream tasks with frozen backbone. A challenging issue in visual prompting is that image datasets sometimes have a large data diversity whereas a per-dataset generic prompt can hardly handle the complex distribution shift toward the original pretraining data distribution properly. To address this issue, we propose a dataset Diversity-Aware prompting strategy whose initialization is realized by a Meta-prompt. Specifically, we cluster the downstream dataset into small homogeneous subsets in a diversity-adaptive way, with each subset having its own prompt optimized separately. Such a divide-and-conquer design reduces the optimization difficulty greatly and significantly boosts the prompting performance. Furthermore, all the prompts are initialized with a meta-prompt, which is learned across several datasets. It is a bootstrapped paradigm, with the key observation that the prompting knowledge learned from previous datasets could help the prompt to converge faster and perform better on a new dataset. During inference, we dynamically select a proper prompt for each input, based on the feature distance between the input and each subset. Through extensive experiments, our DAM-VP demonstrates superior efficiency and effectiveness, clearly surpassing previous prompting methods in a series of downstream datasets for different pretraining models. Our code is available at: https://github.com/shikiw/DAM-VP.

Real-Time Evaluation in Online Continual Learning: A New Hope
Ghunaim, YasirandBibi, AdelandAlhamoud, KumailandAlfarra, MotasemandAlKaderHammoud, HasanAbedandPrabhu, AmeyaandTorr, PhilipH.S.andGhanem, Bernard



Research question: Current evaluations of continual learning (CL) methods typically assume no constraint on training time and computation, which is unrealistic in real-world settings.
Motivation: We propose a practical, real-time evaluation of continual learning in which the data stream does not wait for the model to finish training before revealing the next data for prediction.
Method: We conduct extensive experiments on CLOC, a large-scale dataset containing 39 million time-stamped images with geolocation labels, evaluating existing CL methods with respect to their computational cost.
Results: A simple baseline performs best among all considered methods, suggesting that most of the existing CL literature is tailored to a specific, unrealistic class of streams. We hope this evaluation pushes the development of online continual learning methods that take computational cost into account.

Current evaluations of Continual Learning (CL) methods typically assume that there is no constraint on training time and computation. This is an unrealistic assumption for any real-world setting, which motivates us to propose: a practical real-time evaluation of continual learning, in which the stream does not wait for the model to complete training before revealing the next data for predictions. To do this, we evaluate current CL methods with respect to their computational costs. We conduct extensive experiments on CLOC, a large-scale dataset containing 39 million time-stamped images with geolocation labels. We show that a simple baseline outperforms state-of-the-art CL methods under this evaluation, questioning the applicability of existing methods in realistic settings. In addition, we explore various CL components commonly used in the literature, including memory sampling strategies and regularization approaches. We find that all considered methods fail to be competitive against our simple baseline. This surprisingly suggests that the majority of existing CL literature is tailored to a specific class of streams that is not practical. We hope that the evaluation we provide will be the first step towards a paradigm shift to consider the computational cost in the development of online continual learning methods.

Equiangular Basis Vectors
Shen, YangandSun, XuhaoandWei, Xiu-Shen



Research question: Proposes Equiangular Basis Vectors (EBVs) for classification tasks.
Motivation: Current deep neural network models typically handle different classification tasks with a k-way fully connected layer and softmax, while metric learning methods mainly aim to learn a transformation that maps training data points from the original space to a new space where similar points are closer and dissimilar points are farther apart.
Method: Unlike previous methods, EBVs generate normalized vector embeddings as "predefined classifiers" that are required not only to have equal status with each other but also to be as orthogonal as possible. By minimizing the spherical distance between an input's embedding and its categorical EBV during training, predictions can be obtained at inference by identifying the categorical EBV with the smallest distance.
Results: Various experiments on the ImageNet-1K dataset and other downstream tasks show that the method outperforms the general fully connected classifier without introducing large additional computation compared with classical metric learning methods. EBVs won first place in the 2022 DIGIX Global AI Challenge, and the code is open-source at https://github.com/NJUST-VIPGroup/Equiangular-Basis-Vectors.

We propose Equiangular Basis Vectors (EBVs) for classification tasks. In deep neural networks, models usually end with a k-way fully connected layer with softmax to handle different classification tasks. The learning objective of these methods can be summarized as mapping the learned feature representations to the samples' label space. While in metric learning approaches, the main objective is to learn a transformation function that maps training data points from the original space to a new space where similar points are closer while dissimilar points become farther apart. Different from previous methods, our EBVs generate normalized vector embeddings as "predefined classifiers" which are required to not only have equal status with each other, but also be as orthogonal as possible. By minimizing the spherical distance of the embedding of an input between its categorical EBV in training, the predictions can be obtained by identifying the categorical EBV with the smallest distance during inference. Various experiments on the ImageNet-1K dataset and other downstream tasks demonstrate that our method outperforms the general fully connected classifier while it does not introduce huge additional computation compared with classical metric learning methods. Our EBVs won the first place in the 2022 DIGIX Global AI Challenge, and our code is open-source and available at https://github.com/NJUST-VIPGroup/Equiangular-Basis-Vectors.
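The inference rule above — classify by the smallest spherical distance to a fixed, normalized class vector — can be sketched directly. The tiny 2-D vectors in the usage note are purely illustrative; the paper optimizes near-orthogonal, equal-status vectors in a high-dimensional space:

```python
# Sketch of inference with predefined normalized class vectors, in the
# spirit of Equiangular Basis Vectors: pick the class whose vector is at
# the smallest angular distance from the input embedding.
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def spherical_distance(u, v):
    """Angle between two unit vectors (dot clamped for float safety)."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    return math.acos(dot)

def predict(embedding, class_vectors):
    e = normalize(embedding)
    dists = [spherical_distance(e, normalize(c)) for c in class_vectors]
    return min(range(len(dists)), key=dists.__getitem__)
```

For example, with class vectors `[[1, 0], [0, 1]]`, an embedding pointing mostly along the first axis is assigned class 0.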

Rethinking Domain Generalization for Face Anti-Spoofing: Separability and Alignment
Sun, YiyouandLiu, YaojieandLiu, XiaomingandLi, YixuanandChu, Wen-Sheng



Research question: This paper studies the generalization of face anti-spoofing (FAS) models across domain gaps such as image resolution, blurriness, and sensor variations.
Motivation: Most prior work treats domain-specific signals as a negative influence and applies metric learning or adversarial losses to remove them from the feature representation. Although a domain-invariant feature space is achievable on the training data, feature shift still exists in unseen test domains, which backfires on the classifier's generalizability.
Method: Instead of constructing a domain-invariant feature space, domain separability is encouraged while the live-to-spoof transition (i.e., the trajectory from live to spoof) is aligned across all domains. This separability-and-alignment strategy (SA-FAS) is formulated as invariant risk minimization (IRM), learning domain-variant feature representations with a domain-invariant classifier.
Results: The effectiveness of SA-FAS is demonstrated on challenging cross-domain FAS datasets, establishing state-of-the-art performance.

This work studies the generalization issue of face anti-spoofing (FAS) models on domain gaps, such as image resolution, blurriness and sensor variations. Most prior works regard domain-specific signals as a negative impact, and apply metric learning or adversarial losses to remove it from feature representation. Though learning a domain-invariant feature space is viable for the training data, we show that the feature shift still exists in an unseen test domain, which backfires on the generalizability of the classifier. In this work, instead of constructing a domain-invariant feature space, we encourage domain separability while aligning the live-to-spoof transition (i.e., the trajectory from live to spoof) to be the same for all domains. We formulate this FAS strategy of separability and alignment (SA-FAS) as a problem of invariant risk minimization (IRM), and learn domain-variant feature representation but domain-invariant classifier. We demonstrate the effectiveness of SA-FAS on challenging cross-domain FAS datasets and establish state-of-the-art performance.

Learning Imbalanced Data With Vision Transformers
Xu, ZhengzhuoandLiu, RuikangandYang, ShuoandChai, ZenghaoandYuan, Chun



Research question: Real-world data tends to be heavily imbalanced, making long-tailed recognition (LTR) a major challenge.
Motivation: Existing LTR methods seldom train vision transformers (ViTs) with long-tailed (LT) data, and off-the-shelf pretrained ViT weights often lead to unfair comparisons.
Method: The paper systematically investigates the performance of ViTs in LTR and proposes LiVT, which trains ViTs from scratch using only LT data. Observing that ViTs suffer more severe LTR problems, masked generative pretraining (MGP) is conducted to learn generalized features, and a balanced binary cross entropy (Bal-BCE) loss is proposed to handle the imbalance.
Results: Extensive experiments show that, with MGP and balanced BCE, LiVT successfully trains ViTs without any additional data and significantly outperforms state-of-the-art methods without bells and whistles; for example, ViT-B reaches 81.0% Top-1 accuracy on iNaturalist 2018.

Real-world data tends to be heavily imbalanced and severely skews data-driven deep neural networks, which makes Long-Tailed Recognition (LTR) a massive challenging task. Existing LTR methods seldom train Vision Transformers (ViTs) with Long-Tailed (LT) data, while the off-the-shelf pretrained weights of ViTs always lead to unfair comparisons. In this paper, we systematically investigate the ViTs' performance in LTR and propose LiVT to train ViTs from scratch only with LT data. With the observation that ViTs suffer more severe LTR problems, we conduct Masked Generative Pretraining (MGP) to learn generalized features. With ample and solid evidence, we show that MGP is more robust than supervised manners. Although Binary Cross Entropy (BCE) loss performs well with ViTs, it struggles on the LTR tasks. We further propose the balanced BCE to ameliorate it with strong theoretical groundings. Specifically, we derive the unbiased extension of Sigmoid and compensate extra logit margins for deploying it. Our Bal-BCE contributes to the quick convergence of ViTs in just a few epochs. Extensive experiments demonstrate that with MGP and Bal-BCE, LiVT successfully trains ViTs well without any additional data and outperforms comparable state-of-the-art methods significantly, e.g., our ViT-B achieves 81.0% Top-1 accuracy in iNaturalist 2018 without bells and whistles. Code is available at https://github.com/XuZhengzhuo/LiVT.
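One standard way to "compensate extra logit margins" under imbalance is to shift each class logit by its log prior before the sigmoid. A minimal sketch of that idea; the exact margin in the paper's Bal-BCE derivation may differ from this simplified form, and the function name is an assumption:

```python
# Sketch of class-prior logit compensation for binary cross entropy,
# in the spirit of LiVT's balanced BCE. Simplified to a single scalar
# logit; the paper's exact margin derivation may differ.
import math

def balanced_bce(logit, target, class_count, total_count):
    """BCE on a logit shifted by the log class prior (target in {0, 1})."""
    prior = class_count / total_count
    z = logit + math.log(prior)              # compensate the class margin
    p = 1.0 / (1.0 + math.exp(-z))           # sigmoid
    p = min(max(p, 1e-12), 1.0 - 1e-12)      # numerical safety
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))
```

A rare class (small `class_count`) gets a large negative margin, so a raw logit of zero is no longer read as 50% probability, which counteracts the head-class bias.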

LINe: Out-of-Distribution Detection by Leveraging Important Neurons
Ahn, YongHyunandPark, Gyeong-MoonandKim, SeongTae



Research question: How to quantify the uncertainty of input samples, especially in mission-critical domains such as autonomous driving and healthcare, where failed predictions on out-of-distribution (OOD) data can cause serious problems.
Motivation: The OOD detection problem fundamentally stems from the model's inability to express what it does not know. Post-hoc OOD detection approaches are widely explored because they require no additional retraining.
Method: Starting from the view that neurons in a model's deep layers represent high-level features, the paper introduces a new way to analyze the difference in model outputs between in-distribution and OOD data, and proposes Leveraging Important Neurons (LINe) for post-hoc OOD detection. Shapley-value-based pruning selects only the neurons with high contribution to predicting a given class and masks the rest, reducing the effect of noisy outputs. Activation clipping fixes all values above a threshold to the same value, so that LINe treats all class-specific features equally and considers only the difference in the number of activated features between in-distribution and OOD data.
Results: Comprehensive experiments verify the effectiveness of the proposed method, which outperforms state-of-the-art post-hoc OOD detection methods on the CIFAR-10, CIFAR-100, and ImageNet datasets.

It is important to quantify the uncertainty of input samples, especially in mission-critical domains such as autonomous driving and healthcare, where failure predictions on out-of-distribution (OOD) data are likely to cause big problems. OOD detection problem fundamentally begins in that the model cannot express what it is not aware of. Post-hoc OOD detection approaches are widely explored because they do not require an additional re-training process which might degrade the model's performance and increase the training cost. In this study, from the perspective of neurons in the deep layer of the model representing high-level features, we introduce a new aspect for analyzing the difference in model outputs between in-distribution data and OOD data. We propose a novel method, Leveraging Important Neurons (LINe), for post-hoc Out of distribution detection. Shapley value-based pruning reduces the effects of noisy outputs by selecting only high-contribution neurons for predicting specific classes of input data and masking the rest. Activation clipping fixes all values above a certain threshold into the same value, allowing LINe to treat all the class-specific features equally and just consider the difference between the number of activated feature differences between in-distribution and OOD data. Comprehensive experiments verify the effectiveness of the proposed method by outperforming state-of-the-art post-hoc OOD detection methods on CIFAR-10, CIFAR-100, and ImageNet datasets.
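The two post-hoc operations in the abstract — masking low-contribution neurons and clipping the survivors — compose into one small transform. A sketch with illustrative names; the contribution scores here stand in for the paper's Shapley-value estimates, which are computed separately:

```python
# Sketch of LINe-style activation processing: mask neurons whose
# (externally supplied) contribution score is low, and clip the
# remaining activations to a common ceiling. Names are illustrative.

def line_transform(activations, contributions, keep_threshold, clip_value):
    out = []
    for a, c in zip(activations, contributions):
        if c < keep_threshold:      # prune: low-contribution neuron -> 0
            out.append(0.0)
        else:                       # clip: large activations all read the same
            out.append(min(a, clip_value))
    return out
```

After clipping, two very different large activations contribute identically, so only the *count* of activated important features distinguishes in-distribution from OOD inputs, as the abstract describes.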

Exploring Data Geometry for Continual Learning
Gao, ZhiandXu, ChenandLi, FengandJia, YundeandHarandi, MehrtashandWu, Yuwei



Research question: This paper addresses continual learning on non-stationary data streams by exploring data geometry while preventing forgetting of old data.
Motivation: In many practical applications, data complies with non-Euclidean geometry, so the commonly used Euclidean space cannot gracefully capture the non-Euclidean geometric structure of the data, leading to inferior results.
Method: The method dynamically expands the geometry of the underlying space to match the growing geometric structure induced by new data, and prevents forgetting by taking the geometric structure of old data into account. To this end, it uses a mixed-curvature space and proposes an incremental search scheme to encode the growing geometric structure. An angular-regularization loss and a neighbor-robustness loss are then introduced to train the model, penalizing changes in both global and local geometric structure.
Results: Experiments show that the method outperforms baseline methods designed in Euclidean space.

Continual learning aims to efficiently learn from a non-stationary stream of data while avoiding forgetting the knowledge of old data. In many practical applications, data complies with non-Euclidean geometry. As such, the commonly used Euclidean space cannot gracefully capture non-Euclidean geometric structures of data, leading to inferior results. In this paper, we study continual learning from a novel perspective by exploring data geometry for the non-stationary stream of data. Our method dynamically expands the geometry of the underlying space to match growing geometric structures induced by new data, and prevents forgetting by keeping geometric structures of old data into account. In doing so, we make use of the mixed-curvature space and propose an incremental search scheme, through which the growing geometric structures are encoded. Then, we introduce an angular-regularization loss and a neighbor-robustness loss to train the model, capable of penalizing the change of global geometric structures and local geometric structures. Experiments show that our method achieves better performance than baseline methods designed in Euclidean space.

Visual DNA: Representing and Comparing Images Using Distributions of Neuron Activations
Ramtoula, BenjaminandGadd, MatthewandNewman, PaulandDeMartini, Daniele



Research question: Selecting appropriate datasets is critical in modern computer vision, yet no general-purpose tool exists to evaluate the extent to which two datasets differ.
Motivation: To address this, the paper proposes representing images, and by extension datasets, using Distributions of Neuron Activations (DNAs).
Method: Images are passed through a pretrained feature extractor, frozen across all datasets, and DNAs fit distributions to the resulting neuron activations. By comparing two DNAs, the extent to which two datasets differ can be evaluated, with granular control over the comparison attributes of interest.
Results: The applicability of DNAs is demonstrated across tasks, including conditional dataset comparison, synthetic image evaluation, and transfer learning, and across diverse datasets. DNAs are compact, representing datasets of any size in less than 15 megabytes.

Selecting appropriate datasets is critical in modern computer vision. However, no general-purpose tools exist to evaluate the extent to which two datasets differ. For this, we propose representing images -- and by extension datasets -- using Distributions of Neuron Activations (DNAs). DNAs fit distributions, such as histograms or Gaussians, to activations of neurons in a pre-trained feature extractor through which we pass the image(s) to represent. This extractor is frozen for all datasets, and we rely on its generally expressive power in feature space. By comparing two DNAs, we can evaluate the extent to which two datasets differ with granular control over the comparison attributes of interest, providing the ability to customise the way distances are measured to suit the requirements of the task at hand. Furthermore, DNAs are compact, representing datasets of any size with less than 15 megabytes. We demonstrate the value of DNAs by evaluating their applicability on several tasks, including conditional dataset comparison, synthetic image evaluation, and transfer learning, and across diverse datasets, ranging from synthetic cat images to celebrity faces and urban driving scenes.
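The histogram variant of a DNA is easy to sketch end to end: one normalized histogram per neuron over a dataset's activations, compared by an average per-neuron distance. The bin count, activation range, and L1 distance below are choices made for illustration, not prescribed by the paper:

```python
# Sketch of histogram-based DNAs: fit a per-neuron histogram to the
# activations a frozen extractor produces on a dataset, and compare two
# datasets by the mean per-neuron histogram distance. All hyperparameters
# here are illustrative.

def histogram(values, bins=4, lo=0.0, hi=1.0):
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        i = min(int((v - lo) / width), bins - 1)   # clamp top edge into last bin
        counts[i] += 1
    total = len(values)
    return [c / total for c in counts]

def dna(per_neuron_activations, bins=4):
    """One normalized histogram per neuron over a whole dataset."""
    return [histogram(acts, bins) for acts in per_neuron_activations]

def dna_distance(dna_a, dna_b):
    """Mean L1 distance between matching per-neuron histograms."""
    dists = [sum(abs(x - y) for x, y in zip(ha, hb))
             for ha, hb in zip(dna_a, dna_b)]
    return sum(dists) / len(dists)
```

Restricting the comparison to a subset of neurons (e.g., one layer) is what gives the "granular control over comparison attributes" the abstract mentions.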

Neuron Structure Modeling for Generalizable Remote Physiological Measurement
Lu, HaoandYu, ZitongandNiu, XuesongandChen, Ying-Cong



Research question: Remote photoplethysmography (rPPG) has drawn increasing attention in recent years, but because the blood volume pulse signal is easily affected by environmental changes, existing methods generalize poorly to unseen domains.
Motivation: To address this, the paper proposes NEuron STructure modeling (NEST), a domain-label-free method that improves generalization by maximizing the coverage of feature space during training.
Method: NEST reduces the chance of under-optimized feature activation during inference, and enriches and enhances domain-invariant features across multiple domains.
Results: Experiments show that NEST outperforms state-of-the-art methods in both cross-dataset and intra-dataset settings.

Remote photoplethysmography (rPPG) technology has drawn increasing attention in recent years. It can extract Blood Volume Pulse (BVP) from facial videos, making many applications like health monitoring and emotional analysis more accessible. However, as the BVP signal is easily affected by environmental changes, existing methods struggle to generalize well for unseen domains. In this paper, we systematically address the domain shift problem in the rPPG measurement task. We show that most domain generalization methods do not work well in this problem, as domain labels are ambiguous in complicated environmental changes. In light of this, we propose a domain-label-free approach called NEuron STructure modeling (NEST). NEST improves the generalization capacity by maximizing the coverage of feature space during training, which reduces the chance for under-optimized feature activation during inference. Besides, NEST can also enrich and enhance domain invariant features across multi-domain. We create and benchmark a large-scale domain generalization protocol for the rPPG measurement task. Extensive experiments show that our approach outperforms the state-of-the-art methods on both cross-dataset and intra-dataset settings.

Enhancing Multiple Reliability Measures via Nuisance-Extended Information Bottleneck
Jeong, JongheonandYu, SihyunandLee, HankookandShin, Jinwoo



Research question: How to improve model robustness when training data is limited, preventing the model from co-adapting to bias signals introduced during data acquisition.
Motivation: Such "shortcut" signals make models fragile under various distribution shifts; an adversarial threat model under a mutual information constraint can cover a wider class of perturbations during training.
Method: The standard information bottleneck is extended to additionally model nuisance information. The objective is implemented with autoencoder-based training, together with practical encoder designs that support the proposed hybrid discriminative-generative training for both convolutional and Transformer-based architectures.
Results: Experiments show the method improves the robustness of learned representations (remarkably, without any domain-specific knowledge) across multiple challenging reliability measures. For example, it advances novelty detection AUROC on the recent challenging OBJECTS benchmark from 78.4% to 87.2%.

In practical scenarios where training data is limited, many predictive signals in the data can be rather from some biases in data acquisition (i.e., less generalizable), so that one cannot prevent a model from co-adapting on such (so-called) "shortcut" signals: this makes the model fragile in various distribution shifts. To bypass such failure modes, we consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training. This motivates us to extend the standard information bottleneck to additionally model the nuisance information. We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training concerning both convolutional- and Transformer-based architectures. Our experimental results show that the proposed scheme improves robustness of learned representations (remarkably without using any domain-specific knowledge), with respect to multiple challenging reliability measures. For example, our model could advance the state-of-the-art on a recent challenging OBJECTS benchmark in novelty detection by 78.4% -> 87.2% in AUROC, while simultaneously enjoying improved corruption, background and (certified) adversarial robustness. Code is available at https://github.com/jh-jeong/nuisance_ib.

Image Quality-Aware Diagnosis via Meta-Knowledge Co-Embedding
Che, HaoxuanandChen, SiyuandChen, Hao



Research question: Medical images often suffer from degradation in clinical practice, leading to decreased performance of deep learning-based models.
Motivation: Most previous work has focused on filtering out degradation-causing low-quality images while ignoring their potential value for models.
Method: By effectively learning and leveraging knowledge of degradations, models can better resist their adverse effects and avoid misdiagnosis. This paper raises the problem of image quality-aware diagnosis, which aims to leverage low-quality images and image quality labels for a more accurate and robust diagnosis, and proposes a novel meta-knowledge co-embedding network consisting of two subnets: Task Net and Meta Learner.
Results: Experiments on five datasets covering four widely used medical imaging modalities demonstrate superior performance and generalizability.

Medical images usually suffer from image degradation in clinical practice, leading to decreased performance of deep learning-based models. To resolve this problem, most previous works have focused on filtering out degradation-causing low-quality images while ignoring their potential value for models. Through effectively learning and leveraging the knowledge of degradations, models can better resist their adverse effects and avoid misdiagnosis. In this paper, we raise the problem of image quality-aware diagnosis, which aims to take advantage of low-quality images and image quality labels to achieve a more accurate and robust diagnosis. However, the diversity of degradations and superficially unrelated targets between image quality assessment and disease diagnosis makes it still quite challenging to effectively leverage quality labels to assist diagnosis. Thus, to tackle these issues, we propose a novel meta-knowledge co-embedding network, consisting of two subnets: Task Net and Meta Learner. Task Net constructs an explicit quality information utilization mechanism to enhance diagnosis via knowledge co-embedding features, while Meta Learner ensures the effectiveness and constrains the semantics of these features via meta-learning and joint-encoding masking. Superior performance on five datasets with four widely-used medical imaging modalities demonstrates the effectiveness and generalizability of our method.

Domain Generalized Stereo Matching via Hierarchical Visual Transformation
Chang, TianyuandYang, XunandZhang, TianzhuandWang, Meng



Research question: Existing deep stereo matching networks are prone to learning dataset-dependent shortcuts and fail to generalize well to unseen realistic datasets.
Motivation: To address this, the paper proposes training robust models for the domain-generalized stereo matching task, focusing on learning shortcut-invariant representations from synthetic data to alleviate domain shift.
Method: Specifically, a Hierarchical Visual Transformation (HVT) network first transforms training samples hierarchically into new domains at three levels (global, local, and pixel), then maximizes the visual discrepancy between the source and new domains while minimizing cross-domain feature inconsistency, in order to capture domain-invariant features.
Results: Integrating the HVT network with state-of-the-art stereo matching networks and evaluating on several public benchmark datasets, extensive experiments clearly show that HVT substantially enhances the synthetic-to-realistic domain generalization of existing stereo matching networks.

Recently, deep Stereo Matching (SM) networks have shown impressive performance and attracted increasing attention in computer vision. However, existing deep SM networks are prone to learn dataset-dependent shortcuts, which fail to generalize well on unseen realistic datasets. This paper takes a step towards training robust models for the domain generalized SM task, which mainly focuses on learning shortcut-invariant representation from synthetic data to alleviate the domain shifts. Specifically, we propose a Hierarchical Visual Transformation (HVT) network to 1) first transform the training sample hierarchically into new domains with diverse distributions from three levels: Global, Local, and Pixel, 2) then maximize the visual discrepancy between the source domain and new domains, and minimize the cross-domain feature inconsistency to capture domain-invariant features. In this way, we can prevent the model from exploiting the artifacts of synthetic stereo images as shortcut features, thereby estimating the disparity maps more effectively based on the learned robust and shortcut-invariant representation. We integrate our proposed HVT network with SOTA SM networks and evaluate its effectiveness on several public SM benchmark datasets. Extensive experiments clearly show that the HVT network can substantially enhance the performance of existing SM networks in synthetic-to-realistic domain generalization.

Deep Semi-Supervised Metric Learning With Mixed Label Propagation
Zhuang, FurenandMoulin, Pierre



Research question: How to perform effective metric learning with unlabeled data, in particular when seeking far-apart similar pairs and close dissimilar pairs.
Motivation: Traditional metric learning methods struggle to find far-apart similar pairs and close dissimilar pairs in unlabeled data, because nearby pairs are typically assumed to be similar.
Method: Proposes a novel metric learning method that obtains dissimilarity labels by removing the edge linking a pair of data points in the affinity matrix and rerunning label propagation, so that hard negative pairs can be identified even when they are close.
Results: This significantly improves label propagation's ability to identify far-apart positive pairs and close negative pairs, improving semi-supervised metric learning performance as measured by recall, precision, and normalized mutual information (NMI) on content-based information retrieval (CBIR) applications.

Metric learning requires the identification of far-apart similar pairs and close dissimilar pairs during training, and this is difficult to achieve with unlabeled data because pairs are typically assumed to be similar if they are close. We present a novel metric learning method which circumvents this issue by identifying hard negative pairs as those which obtain dissimilar labels via label propagation (LP), when the edge linking the pair of data is removed in the affinity matrix. In so doing, the negative pairs can be identified despite their proximity, and we are able to utilize this information to significantly improve LP's ability to identify far-apart positive pairs and close negative pairs. This results in a considerable improvement in semi-supervised metric learning performance as evidenced by recall, precision and Normalized Mutual Information (NMI) performance metrics on Content-based Information Retrieval (CBIR) applications.
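The edge-removal test in the abstract can be demonstrated on a toy graph: run label propagation on the affinity matrix, then rerun it with the edge linking a candidate pair cut; if the pair's propagated labels now disagree, treat the pair as a hard negative despite its proximity. The graph, alpha, and iteration count below are illustrative, and this simple power iteration stands in for whatever solver the paper uses:

```python
# Toy sketch of hard-negative mining via label propagation (LP) with
# edge removal. labels[i] is +1/-1 for labeled nodes, 0 for unlabeled.

def propagate(W, labels, alpha=0.9, iters=100):
    n = len(W)
    f = list(labels)
    for _ in range(iters):
        new_f = []
        for i in range(n):
            s = sum(W[i][j] * f[j] for j in range(n))
            d = sum(W[i]) or 1.0            # row-normalize; avoid 0-division
            new_f.append(alpha * s / d + (1 - alpha) * labels[i])
        f = new_f
    return f

def is_hard_negative(W, labels, i, j):
    W2 = [row[:] for row in W]
    W2[i][j] = W2[j][i] = 0.0               # cut the edge linking the pair
    f = propagate(W2, labels)
    return f[i] * f[j] < 0                  # opposite signs -> dissimilar labels
```

On a 4-node chain with +1 and -1 anchors at the ends, the two middle nodes sit close together; cutting their shared edge lets each collapse onto its own anchor's label, exposing them as a close negative pair.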

Unpaired Image-to-Image Translation With Shortest Path Regularization
Xie, ShaoanandXu, YanwuandGong, MingmingandZhang, Kun



Research question: How unpaired image-to-image translation can learn proper mappings from one domain to another while preserving the content of the input image.
Motivation: Existing methods treat the two domains as discrete and propose different assumptions to address the problem. This paper starts from a different perspective and considers the paths connecting the two domains.
Method: Assuming that the optimal path length between an input and output image should be the shortest among all possible paths, a new method is proposed that allows generating images along the path, together with a simple way to encourage the network to find the shortest path without paired information.
Results: Extensive experiments on various tasks demonstrate the superiority of the approach.

Unpaired image-to-image translation aims to learn proper mappings that can map images from one domain to another domain while preserving the content of the input image. However, with large enough capacities, the network can learn to map the inputs to any random permutation of images in another domain. Existing methods treat two domains as discrete and propose different assumptions to address this problem. In this paper, we start from a different perspective and consider the paths connecting the two domains. We assume that the optimal path length between the input and output image should be the shortest among all possible paths. Based on this assumption, we propose a new method to allow generating images along the path and present a simple way to encourage the network to find the shortest path without pair information. Extensive experiments on various tasks demonstrate the superiority of our approach.

MotionDiffuser: Controllable Multi-Agent Motion Prediction Using Diffusion
Jiang, Chiyu "Max" et al.



Research question: How to effectively predict the future motion trajectories of multiple agents.
Motivation: Existing models for multi-agent motion prediction tend to learn unimodal distributions, depend on trajectory anchors, and cannot learn multi-agent motion in a permutation-invariant way.
Method: Proposes MotionDiffuser, a diffusion-based representation of motion that learns a highly multimodal distribution capturing diverse future outcomes. It uses a simple predictor design requiring only a single L2 training loss, does not depend on trajectory anchors, and learns the joint distribution of multiple agents' motion in a permutation-invariant manner. Trajectories are compressed via PCA, which improves model performance and allows efficient computation of exact sample probabilities. A general constrained sampling framework further enables controlled trajectory sampling based on differentiable cost functions.
Results: MotionDiffuser achieves state-of-the-art results for multi-agent motion prediction on the Waymo Open Motion Dataset.

We present MotionDiffuser, a diffusion based representation for the joint distribution of future trajectories over multiple agents. Such representation has several key advantages: first, our model learns a highly multimodal distribution that captures diverse future outcomes. Second, the simple predictor design requires only a single L2 loss training objective, and does not depend on trajectory anchors. Third, our model is capable of learning the joint distribution for the motion of multiple agents in a permutation-invariant manner. Furthermore, we utilize a compressed trajectory representation via PCA, which improves model performance and allows for efficient computation of the exact sample log probability. Subsequently, we propose a general constrained sampling framework that enables controlled trajectory sampling based on differentiable cost functions. This strategy enables a host of applications such as enforcing rules and physical priors, or creating tailored simulation scenarios. MotionDiffuser can be combined with existing backbone architectures to achieve top motion forecasting results. We obtain state-of-the-art results for multi-agent motion prediction on the Waymo Open Motion Dataset.

TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization
Liu, ZiquanandXu, YiandJi, XiangyangandChan, AntoniB.



Research question: How to better exploit the potential of adversarially pretrained models, particularly when fine-tuning them for various classification tasks.
Motivation: Existing research shows that, since a robust pretrained model has already learned a robust feature extractor, the crucial question is how to maintain the pretrained model's robustness while learning the downstream task.
Method: Model-based and data-based approaches to this goal are studied, and the two common approaches are found unable to improve generalization and adversarial robustness at the same time. A novel statistics-based approach is therefore proposed: the TWINS fine-tuning framework, consisting of two neural networks, one of which keeps the population means and variances of the pretraining data in its batch normalization layers.
Results: TWINS not only transfers robust information effectively but also increases the effective learning rate, because the relationship between weight norms and gradient norms in standard batch normalization layers is broken, leading to a faster escape from sub-optimal initialization and alleviating robust overfitting. TWINS proves effective on a wide range of image classification datasets in terms of both generalization and robustness.

Recent years have seen the ever-increasing importance of pre-trained models and their downstream training in deep learning research and applications. At the same time, the defense for adversarial examples has been mainly investigated in the context of training from random initialization on simple classification tasks. To better exploit the potential of pre-trained models in adversarial robustness, this paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks. Existing research has shown that since the robust pre-trained model has already learned a robust feature extractor, the crucial question is how to maintain the robustness in the pre-trained model when learning the downstream task. We study the model-based and data-based approaches for this goal and find that the two common approaches cannot achieve the objective of improving both generalization and adversarial robustness. Thus, we propose a novel statistics-based approach, Two-WIng NormliSation (TWINS) fine-tuning framework, which consists of two neural networks where one of them keeps the population means and variances of pre-training data in the batch normalization layers. Besides the robust information transfer, TWINS increases the effective learning rate without hurting the training stability since the relationship between a weight norm and its gradient norm in standard batch normalization layer is broken, resulting in a faster escape from the sub-optimal initialization and alleviating the robust overfitting. Finally, TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.

Open-Set Semantic Segmentation for Point Clouds via Adversarial Prototype Framework
Li, JiananandDong, Qiulei



Research question: How to identify 3D object classes that do not appear in the training set while maintaining segmentation performance on seen classes.
Motivation: Most existing work assumes that training and testing point clouds share the same object classes, which is often invalid in many real-world scenarios.
Method: Proposes an Adversarial Prototype Framework (APF) for the open-set 3D semantic segmentation task, consisting of a feature extraction module, a prototypical constraint module, and a feature adversarial module.
Results: Experimental results show that the proposed APF outperforms comparative methods by a large margin in most cases.

Recently, point cloud semantic segmentation has attracted much attention in computer vision. Most of the existing works in literature assume that the training and testing point clouds have the same object classes, but they are generally invalid in many real-world scenarios for identifying the 3D objects whose classes are not seen in the training set. To address this problem, we propose an Adversarial Prototype Framework (APF) for handling the open-set 3D semantic segmentation task, which aims to identify 3D unseen-class points while maintaining the segmentation performance on seen-class points. The proposed APF consists of a feature extraction module for extracting point features, a prototypical constraint module, and a feature adversarial module. The prototypical constraint module is designed to learn prototypes for each seen class from point features. The feature adversarial module utilizes generative adversarial networks to estimate the distribution of unseen-class features implicitly, and the synthetic unseen-class features are utilized to prompt the model to learn more effective point features and prototypes for discriminating unseen-class samples from the seen-class ones. Experimental results on two public datasets demonstrate that the proposed APF outperforms the comparative methods by a large margin in most cases.

CR-FIQA: Face Image Quality Assessment by Learning Sample Relative Classifiability
Boutros, FadiandFang, MeilingandKlemt, MarcelandFu, BiyingandDamer, Naser



Research question: This paper proposes a novel face image quality assessment (FIQA) method that estimates the quality of a face image by learning to predict the sample's relative classifiability.
Motivation: Existing FIQA methods fail to accurately reflect an image's utility for achieving reliable and accurate recognition performance.
Method: The method learns the correlation between face image quality and the allocation of a training sample's feature representation in angular space, relative to its class center and the nearest negative class center, by probing internal network observations during training and using them to predict the quality of unseen samples.
Results: Extensive evaluation experiments on eight benchmarks and four face recognition models demonstrate that the proposed CR-FIQA outperforms state-of-the-art FIQA algorithms.

Face image quality assessment (FIQA) estimates the utility of the captured image in achieving reliable and accurate recognition performance. This work proposes a novel FIQA method, CR-FIQA, that estimates the face image quality of a sample by learning to predict its relative classifiability. This classifiability is measured based on the allocation of the training sample feature representation in angular space with respect to its class center and the nearest negative class center. We experimentally illustrate the correlation between the face image quality and the sample relative classifiability. As such property is only observable for the training dataset, we propose to learn this property by probing internal network observations during the training process and utilizing it to predict the quality of unseen samples. Through extensive evaluation experiments on eight benchmarks and four face recognition models, we demonstrate the superiority of our proposed CR-FIQA over state-of-the-art (SOTA) FIQA algorithms.

MetaViewer: Towards a Unified Multi-View Representation
Wang, Ren and Sun, Haoliang and Ma, Yuling and Xi, Xiaoming and Yin, Yilong



Research problem: Existing multi-view representation learning methods typically follow a specific-to-uniform pipeline, extracting latent features per view and then fusing or aligning them into a unified object representation. However, manually pre-specified fusion functions and alignment criteria can degrade the quality of the resulting representation.
Motivation: To overcome this, we propose a novel uniform-to-specific multi-view learning framework from a meta-learning perspective, in which the unified representation no longer involves manual manipulation but is generated automatically by a meta-learner named MetaViewer.
Method: We formulate the extraction and fusion of view-specific latent features as a nested optimization problem and solve it with a bi-level optimization scheme. In this way, MetaViewer automatically fuses view-specific features into a unified representation and learns the optimal fusion scheme by observing uniform-to-specific reconstruction processes across all views.
Results: Extensive experiments on downstream classification and clustering tasks demonstrate the efficiency and effectiveness of the proposed method.

Existing multi-view representation learning methods typically follow a specific-to-uniform pipeline, extracting latent features from each view and then fusing or aligning them to obtain the unified object representation. However, the manually pre-specified fusion functions and aligning criteria could potentially degrade the quality of the derived representation. To overcome them, we propose a novel uniform-to-specific multi-view learning framework from a meta-learning perspective, where the unified representation no longer involves manual manipulation but is automatically derived from a meta-learner named MetaViewer. Specifically, we formulated the extraction and fusion of view-specific latent features as a nested optimization problem and solved it by using a bi-level optimization scheme. In this way, MetaViewer automatically fuses view-specific features into a unified one and learns the optimal fusion scheme by observing reconstruction processes from the unified to the specific over all views. Extensive experimental results in downstream classification and clustering tasks demonstrate the efficiency and effectiveness of the proposed method.

Unsupervised Sampling Promoting for Stochastic Human Trajectory Prediction
Chen, Guangyi and Chen, Zhenhao and Fan, Shunxing and Zhang, Kun



Research problem: The indeterminate nature of human motion requires trajectory prediction systems to use probabilistic models to formulate the multi-modality phenomenon and infer a finite set of future trajectories.
Motivation: The inference process of most existing methods relies on Monte Carlo random sampling, which, due to the long-tail effect of the predicted distribution, is insufficient to cover realistic paths with finite samples.
Method: We propose a new method called BOsampler, which adaptively mines potential paths with Bayesian optimization in an unsupervised manner, as a sequential design strategy in which each new prediction depends on the previously drawn samples.
Results: Experiments on various baseline methods demonstrate the effectiveness of our method. The source code has been released at the provided link.

The indeterminate nature of human motion requires trajectory prediction systems to use a probabilistic model to formulate the multi-modality phenomenon and infer a finite set of future trajectories. However, the inference processes of most existing methods rely on Monte Carlo random sampling, which is insufficient to cover the realistic paths with finite samples, due to the long tail effect of the predicted distribution. To promote the sampling process of stochastic prediction, we propose a novel method, called BOsampler, to adaptively mine potential paths with Bayesian optimization in an unsupervised manner, as a sequential design strategy in which new prediction is dependent on the previously drawn samples. Specifically, we model the trajectory sampling as a Gaussian process and construct an acquisition function to measure the potential sampling value. This acquisition function applies the original distribution as prior and encourages exploring paths in the long-tail region. This sampling method can be integrated with existing stochastic predictive models without retraining. Experimental results on various baseline methods demonstrate the effectiveness of our method. The source code is released in this link.
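The acquisition idea above, a prior score plus an exploration bonus that pushes new draws toward under-sampled long-tail regions, can be sketched without a full Gaussian process. The greedy selector below is a minimal illustration only, not the paper's BOsampler: the distance-based bonus and the `beta` trade-off are assumptions standing in for the learned acquisition function.

```python
import numpy as np

def sequential_sample(candidates, prior_logp, n_draws, beta=1.0):
    """Greedy sequential selection: each new draw trades off the prior
    probability of a candidate path against its distance to the paths
    already drawn (an exploration bonus), loosely mimicking an
    acquisition function in Bayesian optimization."""
    drawn = []
    for _ in range(n_draws):
        if drawn:
            # exploration bonus: min distance to any already-drawn path
            d = np.min(
                np.linalg.norm(
                    candidates[:, None, :] - candidates[drawn][None, :, :],
                    axis=-1,
                ),
                axis=1,
            )
        else:
            d = np.zeros(len(candidates))
        score = prior_logp + beta * d
        score[drawn] = -np.inf          # never pick the same path twice
        drawn.append(int(np.argmax(score)))
    return drawn
```

With `beta=0` this reduces to picking the most likely paths first; increasing `beta` spreads draws into the tails of the distribution.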

Robust Generalization Against Photon-Limited Corruptions via Worst-Case Sharpness Minimization
Huang, Zhuo and Zhu, Miaoxi and Xia, Xiaobo and Shen, Li and Yu, Jun and Gong, Chen and Han, Bo and Du, Bo and Liu, Tongliang



Research problem: How to achieve robust generalization against the most challenging data distributions, which are rare in the training set and contain severe noise, i.e., photon-limited corruptions.
Motivation: Common solutions such as distributionally robust optimization (DRO) focus on the worst-case empirical risk, but because the over-parameterized model is optimized on scarce worst-case data, DRO fails to produce a smooth loss landscape and struggles to generalize well to the test set.
Method: Instead of worst-case risk minimization, this paper proposes SharpDRO, which penalizes the sharpness of the worst-case distribution, measuring the loss changes around the neighborhood of the learned parameters. Considering whether distribution annotations are available, SharpDRO is applied to two problem settings with a worst-case selection process for robust generalization.
Results: Simulating photon-limited corruptions on CIFAR10/100 and ImageNet30, SharpDRO exhibits strong generalization against severe corruptions and exceeds well-known baseline methods by large performance gains.

Robust generalization aims to tackle the most challenging data distributions which are rare in the training set and contain severe noises, i.e., photon-limited corruptions. Common solutions such as distributionally robust optimization (DRO) focus on the worst-case empirical risk to ensure low training error on the uncommon noisy distributions. However, due to the over-parameterized model being optimized on scarce worst-case data, DRO fails to produce a smooth loss landscape, thus struggling on generalizing well to the test set. Therefore, instead of focusing on the worst-case risk minimization, we propose SharpDRO by penalizing the sharpness of the worst-case distribution, which measures the loss changes around the neighbor of learning parameters. Through worst-case sharpness minimization, the proposed method successfully produces a flat loss curve on the corrupted distributions, thus achieving robust generalization. Moreover, by considering whether the distribution annotation is available, we apply SharpDRO to two problem settings and design a worst-case selection process for robust generalization. Theoretically, we show that SharpDRO has a great convergence guarantee. Experimentally, we simulate photon-limited corruptions using CIFAR10/100 and ImageNet30 datasets and show that SharpDRO exhibits a strong generalization ability against severe corruptions and exceeds well-known baseline methods with large performance gains.

NICO++: Towards Better Benchmarking for Domain Generalization
Zhang, Xingxuan and He, Yue and Xu, Renzhe and Yu, Han and Shen, Zheyan and Cui, Peng



Research problem: Despite the remarkable performance of modern deep neural networks on independent and identically distributed data, they can crash under distribution shifts.
Motivation: Most current evaluation methods for domain generalization (DG) adopt the leave-one-out strategy as a compromise on the limited number of domains.
Method: A large-scale benchmark with extensively labeled domains, NICO++, is proposed together with more rational evaluation methods for comprehensively evaluating DG algorithms.
Results: Through extensive experiments, NICO++ shows superior evaluation capability over current DG datasets and contributes to alleviating the unfairness caused by the leak of oracle knowledge in model selection.

Despite the remarkable performance that modern deep neural networks have achieved on independent and identically distributed (I.I.D.) data, they can crash under distribution shifts. Most current evaluation methods for domain generalization (DG) adopt the leave-one-out strategy as a compromise on the limited number of domains. We propose a large-scale benchmark with extensive labeled domains named NICO++ along with more rational evaluation methods for comprehensively evaluating DG algorithms. To evaluate DG datasets, we propose two metrics to quantify covariate shift and concept shift, respectively. Two novel generalization bounds from the perspective of data construction are proposed to prove that limited concept shift and significant covariate shift favor the evaluation capability for generalization. Through extensive experiments, NICO++ shows its superior evaluation capability compared with current DG datasets and its contribution in alleviating unfairness caused by the leak of oracle knowledge in model selection.

Neural Dependencies Emerging From Learning Massive Categories
Feng, Ruili and Zheng, Kecheng and Zhu, Kai and Shen, Yujun and Zhao, Jian and Huang, Yukun and Zhao, Deli and Zhou, Jingren and Jordan, Michael and Zha, Zheng-Jun



Research problem: This paper presents two surprising findings on neural networks trained for large-scale image classification: the existence of neural dependencies, both within a single model and across models.
Motivation: The authors find that in a well-trained model, the predictions for some categories can be obtained directly by linearly combining the predictions of a few other categories, a phenomenon they call neural dependency; such dependencies exist not only within a single model but also between two independently trained models.
Method: By showing that identifying neural dependencies is equivalent to solving a Covariance Lasso (CovLasso) regression problem, the authors analyze the phenomenon theoretically. Investigating the properties of the problem's solution confirms that neural dependency is guaranteed by a redundant logit covariance matrix.
Results: Experiments show the potential of neural dependencies for understanding internal data correlations, generalizing models to unseen categories, and improving model robustness with a dependency-derived regularizer. The authors also plan to publicly release code that exactly reproduces the results of this work.

This work presents two astonishing findings on neural networks learned for large-scale image classification. 1) Given a well-trained model, the logits predicted for some category can be directly obtained by linearly combining the predictions of a few other categories, which we call neural dependency. 2) Neural dependencies exist not only within a single model, but even between two independently learned models, regardless of their architectures. Towards a theoretical analysis of such phenomena, we demonstrate that identifying neural dependencies is equivalent to solving the Covariance Lasso (CovLasso) regression problem proposed in this paper. Through investigating the properties of the problem solution, we confirm that neural dependency is guaranteed by a redundant logit covariance matrix, a condition easily met given massive categories, and that neural dependency is sparse, implying that one category relates to only a few others. We further empirically show the potential of neural dependencies in understanding internal data correlations, generalizing models to unseen categories, and improving model robustness with a dependency-derived regularizer. Code to exactly reproduce the results in this work will be released publicly.
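Per category, the CovLasso problem amounts to a sparse (l1-regularized) regression of one category's logits on the others'. A minimal ISTA solver in plain NumPy illustrates how such sparse dependencies could be recovered from logit data; the solver, step size, and `lam` value are generic lasso machinery, not the paper's exact formulation.

```python
import numpy as np

def lasso_ista(X, y, lam=0.1, lr=None, n_iter=1000):
    """Plain ISTA for the lasso: argmin_w 0.5*||Xw - y||^2 + lam*||w||_1.
    Used here as a stand-in for recovering a sparse linear dependency
    of one category's logits (y) on the other categories' logits (X)."""
    n, d = X.shape
    if lr is None:
        lr = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)               # gradient of the smooth part
        w = w - lr * grad
        # soft-thresholding handles the l1 term
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w
```

On synthetic logits where one category truly is a combination of two others, the recovered weight vector is sparse with nonzeros only at those two indices.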

Constrained Evolutionary Diffusion Filter for Monocular Endoscope Tracking
Luo, Xiongbiao



Research problem: How to remedy the imbalance between exploration and exploitation that existing stochastic filtering methods suffer from on nonlinear optimization problems.
Motivation: Due to particle degeneracy and impoverishment, existing stochastic filtering methods fall into local optima, so the exploration-exploitation balance needs to be addressed.
Method: A new constrained evolutionary diffusion filter is proposed, which develops spatial state constraints and adaptive history-recall differential evolution embedded evolutionary stochastic diffusion to resolve the degeneracy and impoverishment problem.
Results: Applied to monocular endoscope 3-D tracking, experiments show the proposed filter significantly improves the balance between exploration and exploitation and outperforms recent 3-D tracking methods, reducing the surgical tracking error from 4.03 mm to 2.59 mm.

Stochastic filtering is widely used to deal with nonlinear optimization problems such as 3-D and visual tracking in various computer vision and augmented reality applications. Many current methods suffer from an imbalance between exploration and exploitation due to their particle degeneracy and impoverishment, resulting in local optimums. To address this imbalance, this work proposes a new constrained evolutionary diffusion filter for nonlinear optimization. Specifically, this filter develops spatial state constraints and adaptive history-recall differential evolution embedded evolutionary stochastic diffusion instead of sequential resampling to resolve the degeneracy and impoverishment problem. With application to monocular endoscope 3-D tracking, the experimental results show that the proposed filtering significantly improves the balance between exploration and exploitation and certainly works better than recent 3-D tracking methods. Particularly, the surgical tracking error was reduced from 4.03 mm to 2.59 mm.

Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective
Zhu, Jinjing and Bai, Haotian and Wang, Lin



Research problem: How to tackle the unsupervised domain adaptation (UDA) task effectively.
Motivation: Existing ViT-based UDA methods become much less effective when the pseudo labels for target samples are of low quality.
Method: A new model named PMTrans is proposed, which bridges the source and target domains with an intermediate domain. Specifically, a novel ViT-based module called PatchMix builds the intermediate domain, i.e., a probability distribution, by learning to sample patches from both domains based on game-theoretical models.
Results: Extensive experiments on four benchmark datasets show that PMTrans significantly surpasses ViT-based and CNN-based state-of-the-art methods by +3.6% on Office-Home, +1.4% on Office-31, and +17.7% on DomainNet, respectively.

Endeavors have been recently made to leverage the vision transformer (ViT) for the challenging unsupervised domain adaptation (UDA) task. They typically adopt the cross-attention in ViT for direct domain alignment. However, as the performance of cross-attention highly relies on the quality of pseudo labels for targeted samples, it becomes less effective when the domain gap becomes large. We solve this problem from a game theory's perspective with the proposed model dubbed as PMTrans, which bridges source and target domains with an intermediate domain. Specifically, we propose a novel ViT-based module called PatchMix that effectively builds up the intermediate domain, i.e., probability distribution, by learning to sample patches from both domains based on the game-theoretical models. This way, it learns to mix the patches from the source and target domains to maximize the cross entropy (CE), while exploiting two semi-supervised mixup losses in the feature and label spaces to minimize it. As such, we interpret the process of UDA as a min-max CE game with three players, including the feature extractor, classifier, and PatchMix, to find the Nash Equilibria. Moreover, we leverage attention maps from ViT to re-weight the label of each patch by its importance, making it possible to obtain more domain-discriminative feature representations. We conduct extensive experiments on four benchmark datasets, and the results show that PMTrans significantly surpasses the ViT-based and CNN-based SoTA methods by +3.6% on Office-Home, +1.4% on Office-31, and +17.7% on DomainNet, respectively. https://vlis2022.github.io/cvpr23/PMTrans
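As a toy illustration of the patch-mixing mechanism (not the learned, game-theoretic sampler in PMTrans), the sketch below builds an intermediate-domain image by drawing each patch from either the source or the target image with a fixed Bernoulli probability `p_src`, and returns the source fraction as a mixed-label weight.

```python
import numpy as np

def patch_mix(src, tgt, patch=8, p_src=0.5, rng=None):
    """Build an intermediate-domain image by choosing each patch from
    either the source or the target image. `p_src` stands in for the
    learned patch-sampling distribution of PatchMix (a simplification:
    here it is a fixed Bernoulli probability)."""
    rng = rng or np.random.default_rng()
    h, w = src.shape[:2]
    out = tgt.copy()
    mask = rng.random((h // patch, w // patch)) < p_src
    for i in range(h // patch):
        for j in range(w // patch):
            if mask[i, j]:
                out[i*patch:(i+1)*patch, j*patch:(j+1)*patch] = \
                    src[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
    # the fraction of source patches doubles as the mixed-label weight
    return out, mask.mean()
```

The returned weight is what a semi-supervised mixup loss would use to interpolate the source and target labels.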

Improving Selective Visual Question Answering by Learning From Your Peers
Dancette, Corentin and Whitehead, Spencer and Maheshwary, Rishabh and Vedantam, Ramakrishna and Scherer, Stefan and Chen, Xinlei and Cord, Matthieu and Rohrbach, Marcus



Research problem: Despite advances in Visual Question Answering (VQA), the ability of models to assess their own correctness remains underexplored.
Motivation: Recent work shows that VQA models often have difficulty abstaining from answering when they are wrong. This option to abstain, called Selective Prediction, is highly relevant when deploying systems to users, e.g., VQA assistants for users with visual impairments. In such scenarios abstention is especially important because users may provide out-of-distribution (OOD) or adversarial inputs that make incorrect answers more likely.
Method: This work explores Selective VQA in both in-distribution (ID) and OOD scenarios, where models are presented with mixtures of ID and OOD data. A simple yet effective Learning from Your Peers (LYP) approach is proposed for training multimodal selection functions to make abstention decisions. It uses predictions from models trained on distinct subsets of the training data as targets for optimizing the Selective VQA model, requiring no additional manual labels or held-out data.
Results: In extensive evaluations, the method reaches 32.92% coverage at 1% risk of error (C@1%) on the selective prediction metric, doubling the previous best coverage of 15.79% on this task. For mixed ID/OOD, using the models' softmax confidences for abstention performs very poorly, answering fewer than 5% of questions at 1% risk of error even with only 10% OOD examples, while a selection function learned with LYP raises that to 25.38% C@1%.

Despite advances in Visual Question Answering (VQA), the ability of models to assess their own correctness remains underexplored. Recent work has shown that VQA models, out-of-the-box, can have difficulties abstaining from answering when they are wrong. The option to abstain, also called Selective Prediction, is highly relevant when deploying systems to users who must trust the system's output (e.g., VQA assistants for users with visual impairments). For such scenarios, abstention can be especially important as users may provide out-of-distribution (OOD) or adversarial inputs that make incorrect answers more likely. In this work, we explore Selective VQA in both in-distribution (ID) and OOD scenarios, where models are presented with mixtures of ID and OOD data. The goal is to maximize the number of questions answered while minimizing the risk of error on those questions. We propose a simple yet effective Learning from Your Peers (LYP) approach for training multimodal selection functions for making abstention decisions. Our approach uses predictions from models trained on distinct subsets of the training data as targets for optimizing a Selective VQA model. It does not require additional manual labels or held-out data and provides a signal for identifying examples that are easy/difficult to generalize to. In our extensive evaluations, we show this benefits a number of models across different architectures and scales. Overall, for ID, we reach 32.92% in the selective prediction metric coverage at 1% risk of error (C@1%) which doubles the previous best coverage of 15.79% on this task. For mixed ID/OOD, using models' softmax confidences for abstention decisions performs very poorly, answering <5% of questions at 1% risk of error even when faced with only 10% OOD examples, but a learned selection function with LYP can increase that to 25.38% C@1%.
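The C@1% numbers quoted above can be computed directly from per-question confidences and correctness: answer questions in decreasing confidence order and report the largest coverage whose running error rate stays within the risk budget. A minimal NumPy version of that metric (the function name is ours, not the paper's):

```python
import numpy as np

def coverage_at_risk(confidence, correct, max_risk=0.01):
    """Coverage at risk: answer the most-confident questions first and
    return the largest fraction answerable while the error rate among
    answered questions stays <= max_risk."""
    confidence = np.asarray(confidence, float)
    correct = np.asarray(correct, float)
    order = np.argsort(-confidence)            # most confident first
    correct = correct[order]
    errors = np.cumsum(1.0 - correct)
    answered = np.arange(1, len(correct) + 1)
    risk = errors / answered                   # running error rate
    ok = risk <= max_risk
    return answered[ok].max() / len(correct) if ok.any() else 0.0
```

Replacing the softmax confidence with a learned selection score is exactly what moves this number from <5% to 25.38% in the mixed ID/OOD setting described above.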

On Calibrating Semantic Segmentation Models: Analyses and an Algorithm
Wang, Dongdong and Gong, Boqing and Wang, Liqiang



Research problem: The calibration of semantic segmentation models.
Motivation: Although many solutions exist for confidence miscalibration in image classification, research on confidence calibration for semantic segmentation remains limited.
Method: We propose a simple yet effective approach, selective scaling, which separates correct and incorrect predictions for scaling and focuses more on misprediction logit smoothing.
Results: Across a variety of benchmarks, on both in-domain and domain-shift calibration, selective scaling consistently outperforms other methods.

We study the problem of semantic segmentation calibration. Lots of solutions have been proposed to approach model miscalibration of confidence in image classification. However, to date, confidence calibration research on semantic segmentation is still limited. We provide a systematic study on the calibration of semantic segmentation models and propose a simple yet effective approach. First, we find that model capacity, crop size, multi-scale testing, and prediction correctness have impact on calibration. Among them, prediction correctness, especially misprediction, is more important to miscalibration due to over-confidence. Next, we propose a simple, unifying, and effective approach, namely selective scaling, by separating correct/incorrect prediction for scaling and more focusing on misprediction logit smoothing. Then, we study popular existing calibration methods and compare them with selective scaling on semantic segmentation calibration. We conduct extensive experiments with a variety of benchmarks on both in-domain and domain-shift calibration and show that selective scaling consistently outperforms other methods.
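A minimal sketch of the selective-scaling idea: predictions judged correct keep a mild temperature, while mispredictions get a stronger smoothing temperature to curb over-confidence. In the paper a separate mechanism decides correctness; here the flag `pred_correct` is simply given, and the temperature values are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def selective_scale(logits, pred_correct, t_correct=1.0, t_wrong=2.0):
    """Selective scaling, sketched: apply a mild temperature to logits
    judged correct and a stronger (smoothing) temperature to logits
    judged incorrect, reducing over-confidence on mispredictions."""
    t = np.where(pred_correct, t_correct, t_wrong)[:, None]
    return softmax(logits / t)
```

The per-pixel split into correct/incorrect groups, rather than one global temperature, is what distinguishes this from standard temperature scaling.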

Transductive Few-Shot Learning With Prototype-Based Label Propagation by Iterative Graph Refinement
Zhu, Hao and Koniusz, Piotr



Research problem: How to improve prototype-based and graph-based methods in few-shot learning (FSL) so that they adapt better to novel classes.
Motivation: Existing prototype-based and graph-based methods suffer from inaccurate prototype estimation and sub-optimal graph construction, respectively, which hurts performance.
Method: A novel prototype-based label propagation is proposed: the graph is constructed from relations between prototypes and samples rather than between samples, and the graph changes as the prototypes are updated. The label of each prototype is also estimated, instead of treating a prototype as the class center.
Results: On the mini-ImageNet, tiered-ImageNet, CIFAR-FS, and CUB datasets, the method outperforms other state-of-the-art methods in transductive FSL and in semi-supervised FSL when unlabeled data accompanies the novel few-shot task.

Few-shot learning (FSL) is popular due to its ability to adapt to novel classes. Compared with inductive few-shot learning, transductive models typically perform better as they leverage all samples of the query set. The two existing classes of methods, prototype-based and graph-based, have the disadvantages of inaccurate prototype estimation and sub-optimal graph construction with kernel functions, respectively. In this paper, we propose a novel prototype-based label propagation to solve these issues. Specifically, our graph construction is based on the relation between prototypes and samples rather than between samples. As prototypes are being updated, the graph changes. We also estimate the label of each prototype instead of considering a prototype to be the class centre. On mini-ImageNet, tiered-ImageNet, CIFAR-FS and CUB datasets, we show the proposed method outperforms other state-of-the-art methods in transductive FSL and semi-supervised FSL when some unlabeled data accompanies the novel few-shot task.
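The sample-to-prototype graph idea can be sketched as alternating propagation: samples pass soft labels to prototypes through affinity edges, prototypes pass them back, and support labels stay clamped. The NumPy sketch below follows that loop under simplifying assumptions (fixed prototypes, Gaussian affinities), not the paper's exact iterative refinement.

```python
import numpy as np

def prototype_label_propagation(feats, proto, y_support, n_iter=10, tau=1.0):
    """Label propagation on a sample-to-prototype graph (a sketch of
    the idea): edge weights come from sample-prototype similarity;
    prototype labels and query labels are refined alternately while
    support labels stay clamped. y_support uses -1 for unlabeled."""
    n, k = len(feats), len(proto)
    d2 = ((feats[:, None, :] - proto[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / tau)
    A = A / A.sum(1, keepdims=True)            # row-stochastic affinities
    y = np.zeros((n, k))
    labeled = y_support >= 0
    y[labeled] = np.eye(k)[y_support[labeled]]
    for _ in range(n_iter):
        proto_lab = A.T @ y                    # prototypes gather labels
        proto_lab /= proto_lab.sum(1, keepdims=True)
        y = A @ proto_lab                      # samples gather them back
        y[labeled] = np.eye(k)[y_support[labeled]]   # clamp supports
    return y.argmax(1)
```

Note that the graph has only n*k edges (samples to prototypes) instead of n*n sample-to-sample edges, which is the structural change the paper argues for.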

Alias-Free Convnets: Fractional Shift Invariance via Polynomial Activations
Michaeli, Hagay and Michaeli, Tomer and Soudry, Daniel



Research problem: Although convolutional neural networks (CNNs) are believed to be translation invariant, recent work shows this is not the case due to aliasing effects that stem from down-sampling layers.
Motivation: Existing architectural solutions for preventing aliasing are partial, since they do not address the aliasing that originates in non-linear layers.
Method: We propose an extended anti-aliasing method that tackles both down-sampling and non-linear layers, thus creating truly alias-free, shift-invariant CNNs.
Results: We show the model is invariant to integer as well as fractional (i.e., sub-pixel) translations, outperforming other shift-invariant methods in robustness to adversarial translations.

Although CNNs are believed to be invariant to translations, recent works have shown this is not the case due to aliasing effects that stem from down-sampling layers. The existing architectural solutions to prevent the aliasing effects are partial since they do not solve those effects that originate in non-linearities. We propose an extended anti-aliasing method that tackles both down-sampling and non-linear layers, thus creating truly alias-free, shift-invariant CNNs. We show that the presented model is invariant to integer as well as fractional (i.e., sub-pixel) translations, thus outperforming other shift-invariant methods in terms of robustness to adversarial translations.

Initialization Noise in Image Gradients and Saliency Maps
Woerl, Ann-Christin and Disselhoff, Jan and Wand, Michael



Research problem: This paper examines gradients of the logits of image classification CNNs with respect to input pixel values.
Motivation: We observe that these gradients fluctuate considerably with training randomness, such as the random initialization of the networks.
Method: We extend the study to gradients of intermediate layers, obtained via GradCAM, and to popular network saliency estimators such as DeepLIFT, SHAP, LIME, Integrated Gradients, and SmoothGrad.
Results: While empirical noise levels vary, all of these can produce qualitatively different attributions to image features, which has implications for interpreting such attributions, particularly when seeking data-driven explanations. Finally, we demonstrate that the observed artifacts can be removed by marginalizing over the initialization distribution via simple stochastic integration.

In this paper, we examine gradients of logits of image classification CNNs by input pixel values. We observe that these fluctuate considerably with training randomness, such as the random initialization of the networks. We extend our study to gradients of intermediate layers, obtained via GradCAM, as well as popular network saliency estimators such as DeepLIFT, SHAP, LIME, Integrated Gradients, and SmoothGrad. While empirical noise levels vary, qualitatively different attributions to image features are still possible with all of these, which comes with implications for interpreting such attributions, in particular when seeking data-driven explanations of the phenomenon generating the data. Finally, we demonstrate that the observed artefacts can be removed by marginalization over the initialization distribution by simple stochastic integration.

Curricular Object Manipulation in LiDAR-Based Object Detection
Zhu, Ziyue and Meng, Qiang and Wang, Xiao and Wang, Ke and Yan, Liujiang and Yang, Jian



Research problem: This paper explores the potential of curriculum learning in LiDAR-based 3D object detection.
Motivation: A curricular object manipulation (COM) framework is proposed that embeds the curricular training strategy into both the loss design and the augmentation process to improve model performance and generalization.
Method: For the loss design, COMLoss is proposed to dynamically predict object-level difficulties and emphasize objects of different difficulties according to the training stage. On top of GT-Aug, a widely used augmentation technique in LiDAR detection tasks, a novel COMAug strategy is proposed that first clusters objects in the ground-truth database with well-designed heuristics, then predicts and updates group-level difficulties during training for stable results.
Results: Sampling and augmenting progressively more difficult objects into the training points improves model performance and generalization. Extensive experiments and ablation studies reveal the superiority and generality of the proposed framework.

This paper explores the potential of curriculum learning in LiDAR-based 3D object detection by proposing a curricular object manipulation (COM) framework. The framework embeds the curricular training strategy into both the loss design and the augmentation process. For the loss design, we propose the COMLoss to dynamically predict object-level difficulties and emphasize objects of different difficulties based on training stages. On top of the widely-used augmentation technique called GT-Aug in LiDAR detection tasks, we propose a novel COMAug strategy which first clusters objects in ground-truth database based on well-designed heuristics. Group-level difficulties rather than individual ones are then predicted and updated during training for stable results. Model performance and generalization capabilities can be improved by sampling and augmenting progressively more difficult objects into the training points. Extensive experiments and ablation studies reveal the superior and generality of the proposed framework. The code is available at https://github.com/ZZY816/COM.

Learning With Noisy Labels via Self-Supervised Adversarial Noisy Masking
Tu, Yuanpeng and Zhang, Boshen and Li, Yuxi and Liu, Liang and Li, Jian and Zhang, Jiangning and Wang, Yabiao and Wang, Chengjie and Zhao, CaiRong



Research problem: How to effectively handle the noisy labels that inevitably arise when annotating data for training deep models.
Motivation: Current approaches mainly identify and remove noisy samples or correct their labels according to statistical properties (e.g., loss values) of the training samples, but their effectiveness is limited.
Method: A novel robust training approach termed adversarial noisy masking is proposed. It regularizes deep features with a label-quality-guided masking scheme that adaptively modulates the input data and labels simultaneously, preventing the model from overfitting noisy samples. An auxiliary task is further designed to reconstruct the input data, providing noise-free self-supervised signals that reinforce the generalization ability of deep models.
Results: Tested on both synthetic and real-world noisy datasets, the method achieves significant improvements over previous state-of-the-art methods.

Collecting large-scale datasets is crucial for training deep models, annotating the data, however, inevitably yields noisy labels, which poses challenges to deep learning algorithms. Previous efforts tend to mitigate this problem via identifying and removing noisy samples or correcting their labels according to the statistical properties (e.g., loss values) among training samples. In this paper, we aim to tackle this problem from a new perspective, delving into the deep feature maps, we empirically find that models trained with clean and mislabeled samples manifest distinguishable activation feature distributions. From this observation, a novel robust training approach termed adversarial noisy masking is proposed. The idea is to regularize deep features with a label quality guided masking scheme, which adaptively modulates the input data and label simultaneously, preventing the model to overfit noisy samples. Further, an auxiliary task is designed to reconstruct input data, it naturally provides noise-free self-supervised signals to reinforce the generalization ability of deep models. The proposed method is simple and flexible, it is tested on both synthetic and real-world noisy datasets, where significant improvements are achieved over previous state-of-the-art methods.

Instance-Aware Domain Generalization for Face Anti-Spoofing
Zhou, Qianyu and Zhang, Ke-Yue and Yao, Taiping and Lu, Xuequan and Yi, Ran and Ding, Shouhong and Ma, Lizhuang



Research problem: How to improve the generalization of face anti-spoofing systems to unseen scenarios.
Motivation: Existing domain-generalization-based face anti-spoofing methods mainly rely on manually annotated domain labels to align each domain's distribution, but such labels are coarse-grained and subjective and cannot accurately reflect the real domain distributions.
Method: A new perspective is proposed that aligns features at the instance level without the need for domain labels. Specifically, an Instance-Aware Domain Generalization framework is proposed to learn generalizable features by weakening their sensitivity to instance-specific styles.
Results: Experimental results and analysis demonstrate the superiority of the method over state-of-the-art competitors.

Face anti-spoofing (FAS) based on domain generalization (DG) has been recently studied to improve the generalization on unseen scenarios. Previous methods typically rely on domain labels to align the distribution of each domain for learning domain-invariant representations. However, artificial domain labels are coarse-grained and subjective, which cannot reflect real domain distributions accurately. Besides, such domain-aware methods focus on domain-level alignment, which is not fine-grained enough to ensure that learned representations are insensitive to domain styles. To address these issues, we propose a novel perspective for DG FAS that aligns features on the instance level without the need for domain labels. Specifically, Instance-Aware Domain Generalization framework is proposed to learn the generalizable feature by weakening the features' sensitivity to instance-specific styles. Concretely, we propose Asymmetric Instance Adaptive Whitening to adaptively eliminate the style-sensitive feature correlation, boosting the generalization. Moreover, Dynamic Kernel Generator and Categorical Style Assembly are proposed to first extract the instance-specific features and then generate the style-diversified features with large style shifts, respectively, further facilitating the learning of style-insensitive features. Extensive experiments and analysis demonstrate the superiority of our method over state-of-the-art competitors. Code will be publicly available at this link: https://github.com/qianyuzqy/IADG.

Towards Domain Generalization for Multi-View 3D Object Detection in Bird-Eye-View
Wang, Shuo and Zhao, Xinhai and Xu, Hai-Ming and Chen, Zehui and Yu, Dameng and Chang, Jiahao and Yang, Zhen and Zhao, Feng



Research problem: How to mitigate the performance degradation of multi-view 3D object detection (MV3D-Det) when the domain of input images differs from that of training.
Motivation: Most existing camera-only 3D object detection algorithms risk drastic performance drops when the input image domain differs from the training domain.
Method: Robust depth prediction is obtained by decoupling depth estimation from the camera's intrinsic parameters (i.e., the focal length) and performing dynamic perspective augmentation to increase the diversity of the extrinsic parameters (i.e., camera poses). In addition, focal length values are modified to create multiple pseudo-domains, and an adversarial training loss encourages more domain-invariant feature representations.
Results: The approach successfully alleviates the performance drop on unseen target domains without impairing accuracy on the source domain. Extensive experiments on Waymo, nuScenes, and Lyft demonstrate its generalization and effectiveness.

Multi-view 3D object detection (MV3D-Det) in Bird-Eye-View (BEV) has drawn extensive attention due to its low cost and high efficiency. Although new algorithms for camera-only 3D object detection have been continuously proposed, most of them may risk drastic performance degradation when the domain of input images differs from that of training. In this paper, we first analyze the causes of the domain gap for the MV3D-Det task. Based on the covariate shift assumption, we find that the gap mainly attributes to the feature distribution of BEV, which is determined by the quality of both depth estimation and 2D image's feature representation. To acquire a robust depth prediction, we propose to decouple the depth estimation from the intrinsic parameters of the camera (i.e. the focal length) through converting the prediction of metric depth to that of scale-invariant depth and perform dynamic perspective augmentation to increase the diversity of the extrinsic parameters (i.e. the camera poses) by utilizing homography. Moreover, we modify the focal length values to create multiple pseudo-domains and construct an adversarial training loss to encourage the feature representation to be more domain-agnostic. Without bells and whistles, our approach, namely DG-BEV, successfully alleviates the performance drop on the unseen target domain without impairing the accuracy of the source domain. Extensive experiments on Waymo, nuScenes, and Lyft, demonstrate the generalization and effectiveness of our approach.

Robust and Scalable Gaussian Process Regression and Its Applications
Lu, Yifan and Ma, Jiayi and Fang, Leyuan and Tian, Xin and Jiang, Junjun



Research problem: How to apply Gaussian process regression (GPR) models to large-scale real data, particularly data contaminated by outliers.
Motivation: Existing GPR models struggle with real-world data that is both large-scale and outlier-contaminated.
Method: This paper proposes a robust and scalable GPR model via variational learning. A mixture likelihood model handles outliers, and a variational formulation is derived that jointly infers each datum's mode (inlier or outlier) and the hyperparameters by maximizing a lower bound of the true log marginal likelihood.
Results: On two challenging real-world applications, feature matching and dense gene expression imputation, experiments show clear advantages over existing robust GPR models in robustness and speed. Notably, when matching 4k feature points, inference completes in milliseconds with almost no false matches.

This paper introduces a robust and scalable Gaussian process regression (GPR) model via variational learning. This enables the application of Gaussian processes to a wide range of real data, which are often large-scale and contaminated by outliers. Towards this end, we employ a mixture likelihood model where outliers are assumed to be sampled from a uniform distribution. We next derive a variational formulation that jointly infers the mode of data, i.e., inlier or outlier, as well as hyperparameters by maximizing a lower bound of the true log marginal likelihood. Compared to previous robust GPR, our formulation approximates the exact posterior distribution. The inducing variable approximation and stochastic variational inference are further introduced to our variational framework, extending our model to large-scale data. We apply our model to two challenging real-world applications, namely feature matching and dense gene expression imputation. Extensive experiments demonstrate the superiority of our model in terms of robustness and speed. Notably, when matching 4k feature points, its inference is completed in milliseconds with almost no false matches. The code is at https://github.com/YifanLu2000/Robust-Scalable-GPR.
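The heart of the mixture likelihood is the per-point posterior probability of being an inlier (Gaussian residual) versus an outlier (uniform density), which the variational updates then use to down-weight outliers. Below is a one-function sketch of that responsibility computation; the function name and the `pi_in` and `u_range` values are illustrative assumptions, not the paper's parameterization.

```python
import numpy as np

def inlier_responsibility(residual, sigma=1.0, pi_in=0.9, u_range=10.0):
    """E-step of a Gaussian-vs-uniform mixture on regression residuals:
    each point's posterior probability of being an inlier. Points with
    large residuals get responsibilities near zero and contribute
    little to the subsequent fit."""
    gauss = np.exp(-0.5 * (residual / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    unif = 1.0 / u_range                       # outliers: uniform density
    num = pi_in * gauss
    return num / (num + (1 - pi_in) * unif)
```

In the full model these responsibilities appear inside the variational lower bound rather than a plain EM loop, but the down-weighting mechanism is the same.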

Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization
Wang, Zifan and Ding, Nan and Levinboim, Tomer and Chen, Xi and Soricut, Radu



Research problem: Models trained against adversarial attacks exhibit higher robustness on the training set than on the test set, an overfitting-like phenomenon.
Motivation: Although previous work explained this phenomenon theoretically with a robust PAC-Bayesian bound over the adversarial test error, the related algorithmic derivations are only loosely connected to that bound, leaving a gap between their empirical success and our understanding of adversarial robustness theory.
Method: This paper considers a different form of the robust PAC-Bayesian bound and directly minimizes it with respect to the model posterior. The derivation of the optimal solution connects PAC-Bayesian learning to the geometry of the robust loss surface through a Trace of Hessian (TrH) regularizer that measures surface flatness.
Results: Experimental results show that TrH regularization improves ViT robustness, matching or surpassing previous state-of-the-art approaches while requiring less memory and computational cost.

Recent research in robust optimization has shown an overfitting-like phenomenon in which models trained against adversarial attacks exhibit higher robustness on the training set compared to the test set. Although previous work provided theoretical explanations for this phenomenon using a robust PAC-Bayesian bound over the adversarial test error, related algorithmic derivations are at best only loosely connected to this bound, which implies that there is still a gap between their empirical success and our understanding of adversarial robustness theory. To close this gap, in this paper we consider a different form of the robust PAC-Bayesian bound and directly minimize it with respect to the model posterior. The derivation of the optimal solution connects PAC-Bayesian learning to the geometry of the robust loss surface through a Trace of Hessian (TrH) regularizer that measures the surface flatness. In practice, we restrict the TrH regularizer to the top layer only, which results in an analytical solution to the bound whose computational cost does not depend on the network depth. Finally, we evaluate our TrH regularization approach over CIFAR-10/100 and ImageNet using Vision Transformers (ViT) and compare against baseline adversarial robustness algorithms. Experimental results show that TrH regularization leads to improved ViT robustness that either matches or surpasses previous state-of-the-art approaches while at the same time requires less memory and computational cost.
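A trace-of-Hessian penalty is typically estimated without materializing the Hessian, e.g. with Hutchinson's estimator over Hessian-vector products. The sketch below shows that estimator in NumPy on an explicit matrix; the paper instead derives an analytical top-layer solution, so treat this purely as an illustration of the quantity being regularized.

```python
import numpy as np

def hutchinson_trace(hvp, dim, n_probes=50, rng=None):
    """Hutchinson's estimator: tr(H) is approximated by the mean of
    v^T H v over random Rademacher vectors v, where `hvp` is any
    Hessian-vector-product function. This avoids ever forming H."""
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=dim)  # Rademacher probe
        total += v @ hvp(v)
    return total / n_probes
```

For a diagonal Hessian the estimate is exact for any single probe, since each v has v_i^2 = 1; off-diagonal contributions average out over probes.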

A Data-Based Perspective on Transfer Learning
Jain, Saachi and Salman, Hadi and Khaddaj, Alaa and Wong, Eric and Park, SungMin and Mądry, Aleksander



Research problem: The impact of pre-training data on transfer learning performance, and how changing the composition of the source dataset can improve it.
Motivation: Although it is commonly believed that more pre-training data leads to better transfer learning performance, recent evidence suggests that removing certain data from the source dataset can actually help too.
Method: A framework is proposed for probing the impact of the source dataset's composition on transfer learning performance. It can identify transfer learning brittleness and detect pathologies such as data leakage and misleading examples in the source dataset.
Results: Experiments show that removing detrimental datapoints identified by the framework improves transfer performance from ImageNet on a variety of transfer tasks.

It is commonly believed that more pre-training data leads to better transfer learning performance. However, recent evidence suggests that removing data from the source dataset can actually help too. In this work, we present a framework for probing the impact of the source dataset's composition on transfer learning performance. Our framework facilitates new capabilities such as identifying transfer learning brittleness and detecting pathologies such as data-leakage and the presence of misleading examples in the source dataset. In particular, we demonstrate that removing detrimental datapoints identified by our framework improves transfer performance from ImageNet on a variety of transfer tasks.

Improved Test-Time Adaptation for Domain Generalization
Chen, Liang and Zhang, Yong and Song, Yibing and Shan, Ying and Liu, Lingqiao



Research problem: The main challenge in domain generalization (DG) is handling the distribution shift between training and test data.
Motivation: Recent studies suggest that test-time training (TTT), which adapts the learned model with test data, might be a promising solution to this problem.
Method: This paper proposes an Improved Test-Time Adaptation (ITTA) method, which selects an appropriate auxiliary TTT task by defining a learnable consistency loss for it, and introduces additional adaptive parameters that are the only parameters updated during the test phase.
Results: Experiments show that both strategies benefit the learned model, and ITTA achieves performance superior to the current state of the art on several DG benchmarks.

The main challenge in domain generalization (DG) is to handle the distribution shift problem that lies between the training and test data. Recent studies suggest that test-time training (TTT), which adapts the learned model with test data, might be a promising solution to the problem. Generally, a TTT strategy hinges its performance on two main factors: selecting an appropriate auxiliary TTT task for updating and identifying reliable parameters to update during the test phase. Both previous arts and our experiments indicate that TTT may not improve but be detrimental to the learned model if those two factors are not properly considered. This work addresses those two factors by proposing an Improved Test-Time Adaptation (ITTA) method. First, instead of heuristically defining an auxiliary objective, we propose a learnable consistency loss for the TTT task, which contains learnable parameters that can be adjusted toward better alignment between our TTT task and the main prediction task. Second, we introduce additional adaptive parameters for the trained model, and we suggest only updating the adaptive parameters during the test phase. Through extensive experiments, we show that the proposed two strategies are beneficial for the learned model (see Figure 1), and ITTA could achieve superior performance to the current state-of-the-arts on several DG benchmarks.

Adjustment and Alignment for Unbiased Open Set Domain Adaptation
Li, Wuyang and Liu, Jie and Han, Bo and Yuan, Yixuan



Research question: How to transfer a model from a label-rich domain to an unlabeled domain containing novel-class samples, while avoiding the semantic bias that arises when novel-class samples are unavailable.
Motivation: Existing open set domain adaptation (OSDA) works overlook the abundant novel-class semantics hidden in the source domain, leading to biased model learning and poor transfer.
Method: A novel causality-driven solution is proposed and implemented with front-door adjustment theory, yielding a theoretically grounded framework named Adjustment and Alignment (ANNA) for unbiased OSDA. ANNA consists of Front-Door Adjustment (FDA) and Decoupled Causal Alignment (DCA): the former delves into fine-grained visual blocks to discover novel-class regions hidden in base-class images and corrects biased model optimization via causal debiasing; the latter disentangles base-class and novel-class regions with orthogonal masks and adapts the decoupled distribution for unbiased model transfer.
Results: Extensive experiments show that ANNA achieves state-of-the-art results.

Open Set Domain Adaptation (OSDA) transfers the model from a label-rich domain to a label-free one containing novel-class samples. Existing OSDA works overlook abundant novel-class semantics hidden in the source domain, leading to a biased model learning and transfer. Although the causality has been studied to remove the semantic-level bias, the non-available novel-class samples result in the failure of existing causal solutions in OSDA. To break through this barrier, we propose a novel causality-driven solution with the unexplored front-door adjustment theory, and then implement it with a theoretically grounded framework, coined AdjustmeNt aNd Alignment (ANNA), to achieve an unbiased OSDA. In a nutshell, ANNA consists of Front-Door Adjustment (FDA) to correct the biased learning in the source domain and Decoupled Causal Alignment (DCA) to transfer the model unbiasedly. On the one hand, FDA delves into fine-grained visual blocks to discover novel-class regions hidden in the base-class image. Then, it corrects the biased model optimization by implementing causal debiasing. On the other hand, DCA disentangles the base-class and novel-class regions with orthogonal masks, and then adapts the decoupled distribution for an unbiased model transfer. Extensive experiments show that ANNA achieves state-of-the-art results. The code is available at https://github.com/CityU-AIM-Group/Anna.

Balancing Logit Variation for Long-Tailed Semantic Segmentation
Wang, Yuchao and Fei, Jingjing and Wang, Haochen and Li, Wei and Bao, Tianpeng and Wu, Liwei and Zhao, Rui and Shen, Yujun



Research question: The long-tailed data distribution problem in semantic segmentation, where the imbalanced number of samples across categories squeezes the features of tail classes into narrow regions of the feature space.
Motivation: To address this, the paper introduces category-wise variation into the network predictions during training, so that an instance is projected to a small region rather than a single feature point.
Method: According to category scale, smaller variation is assigned to head classes and larger variation to tail classes, closing the gap between the feature areas of different categories and yielding a more balanced representation. Notably, the introduced variation is discarded at inference to enable confident predictions.
Results: Despite its simple implementation, the method generalizes strongly across datasets and task settings. Extensive experiments show that it plugs into a range of state-of-the-art approaches and boosts their performance.

Semantic segmentation usually suffers from a long tail data distribution. Due to the imbalanced number of samples across categories, the features of those tail classes may get squeezed into a narrow area in the feature space. Towards a balanced feature distribution, we introduce category-wise variation into the network predictions in the training phase such that an instance is no longer projected to a feature point, but a small region instead. Such a perturbation is highly dependent on the category scale, which appears as assigning smaller variation to head classes and larger variation to tail classes. In this way, we manage to close the gap between the feature areas of different categories, resulting in a more balanced representation. It is noteworthy that the introduced variation is discarded at the inference stage to facilitate a confident prediction. Despite its embarrassingly simple implementation, our method manifests itself in strong generalizability to various datasets and task settings. Extensive experiments suggest that our plug-in design lends itself well to a range of state-of-the-art approaches and boosts the performance on top of them.
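The category-wise variation scheme can be sketched in a few lines. This is an illustrative NumPy sketch, not the authors' implementation: the inverse-frequency schedule in `logit_variation_std`, the `base_std` knob, and adding the noise directly to the logits are all assumptions for exposition.

```python
import numpy as np

def logit_variation_std(class_counts, base_std=1.0):
    """Per-class noise scale: tail (rare) classes get larger variation.

    The inverse-frequency schedule and `base_std` are illustrative
    choices, not the paper's exact formulation.
    """
    counts = np.asarray(class_counts, dtype=float)
    freq = counts / counts.sum()
    # Head classes (high frequency) get small std, tail classes large.
    return base_std * (1.0 - freq)

def perturb_logits(logits, labels, stds, rng, training=True):
    """Add category-wise Gaussian noise to logits during training;
    the variation is discarded (identity) at inference."""
    logits = np.asarray(logits, dtype=float)
    if not training:
        return logits
    noise = rng.normal(size=logits.shape) * stds[labels][:, None]
    return logits + noise
```

At inference `training=False` drops the perturbation, matching the paper's note that the variation is discarded to facilitate confident predictions.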

Prompt-Guided Zero-Shot Anomaly Action Recognition Using Pretrained Deep Skeleton Features
Sato, Fumiaki and Hachiuma, Ryo and Sekii, Taiki



Research question: Unsupervised anomaly action recognition: identifying video-level abnormal human behavior events in an unsupervised manner, without abnormal samples.
Motivation: Conventional skeleton-based methods suffer from three limitations: target-domain-dependent DNN training, poor robustness against skeleton errors, and a lack of normal samples.
Method: A unified, user-prompt-guided zero-shot learning framework is proposed, using a target-domain-independent skeleton feature extractor pretrained on a large-scale action recognition dataset. During training on normal samples, the method models the distribution of skeleton features of normal actions while freezing the DNN weights, and estimates the anomaly score from this distribution at inference. To improve robustness against skeleton errors, a DNN architecture inspired by the point cloud deep learning paradigm sparsely propagates features between joints. Furthermore, to prevent unobserved normal actions from being misidentified as abnormal, a similarity score between user prompt embeddings and skeleton features aligned in a common space is incorporated into the anomaly score, indirectly supplementing normal actions.
Results: Experiments on two publicly available datasets test the effectiveness of the proposed method with respect to the above limitations.

This study investigates unsupervised anomaly action recognition, which identifies video-level abnormal-human-behavior events in an unsupervised manner without abnormal samples, and simultaneously addresses three limitations in the conventional skeleton-based approaches: target domain-dependent DNN training, robustness against skeleton errors, and a lack of normal samples. We present a unified, user prompt-guided zero-shot learning framework using a target domain-independent skeleton feature extractor, which is pretrained on a large-scale action recognition dataset. Particularly, during the training phase using normal samples, the method models the distribution of skeleton features of the normal actions while freezing the weights of the DNNs and estimates the anomaly score using this distribution in the inference phase. Additionally, to increase robustness against skeleton errors, we introduce a DNN architecture inspired by a point cloud deep learning paradigm, which sparsely propagates the features between joints. Furthermore, to prevent the unobserved normal actions from being misidentified as abnormal actions, we incorporate a similarity score between the user prompt embeddings and skeleton features aligned in the common space into the anomaly score, which indirectly supplements normal actions. On two publicly available datasets, we conduct experiments to test the effectiveness of the proposed method with respect to the abovementioned limitations.
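The scoring step, modeling normal-action skeleton features with a fitted distribution and offsetting the anomaly score by prompt similarity, can be sketched minimally. This is a hedged NumPy sketch: the Gaussian/Mahalanobis model, the cosine similarity, and the subtraction rule with weight `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def fit_normal_distribution(feats):
    """Fit a Gaussian to skeleton features of normal actions
    (feats: array of shape (n_samples, dim))."""
    mu = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
    return mu, np.linalg.inv(cov)

def anomaly_score(f, mu, cov_inv, prompt_embs, alpha=1.0):
    """Mahalanobis distance to the normal-action distribution, reduced
    by the best cosine similarity to user-prompt embeddings of normal
    actions (the combination rule is an illustrative choice)."""
    d = f - mu
    maha = float(np.sqrt(d @ cov_inv @ d))
    sims = prompt_embs @ f / (np.linalg.norm(prompt_embs, axis=1)
                              * np.linalg.norm(f) + 1e-12)
    return maha - alpha * float(sims.max())
```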

Dynamic Coarse-To-Fine Learning for Oriented Tiny Object Detection
Xu, Chang and Ding, Jian and Wang, Jinwang and Yang, Wen and Yu, Huai and Yu, Lei and Xia, Gui-Song



Research question: How to detect arbitrarily oriented tiny objects, particularly with respect to label assignment.
Motivation: Existing detectors suffer from severe mismatch and imbalance when facing the extreme geometry and limited features of oriented tiny objects.
Method: A dynamic prior with a coarse-to-fine assigner, dubbed DCFL, is proposed. It models the prior, label assignment, and object representation all dynamically to alleviate the mismatch problem, and leverages coarse prior matching with finer posterior constraints to dynamically assign labels, providing appropriate and relatively balanced supervision for diverse instances.
Results: Extensive experiments on six datasets show substantial improvements over the baseline, with state-of-the-art performance on the DOTA-v1.5, DOTA-v2.0, and DIOR-R datasets under single-scale training and testing.

Detecting arbitrarily oriented tiny objects poses intense challenges to existing detectors, especially for label assignment. Despite the exploration of adaptive label assignment in recent oriented object detectors, the extreme geometry shape and limited feature of oriented tiny objects still induce severe mismatch and imbalance issues. Specifically, the position prior, positive sample feature, and instance are mismatched, and the learning of extreme-shaped objects is biased and unbalanced due to little proper feature supervision. To tackle these issues, we propose a dynamic prior along with the coarse-to-fine assigner, dubbed DCFL. For one thing, we model the prior, label assignment, and object representation all in a dynamic manner to alleviate the mismatch issue. For another, we leverage the coarse prior matching and finer posterior constraint to dynamically assign labels, providing appropriate and relatively balanced supervision for diverse instances. Extensive experiments on six datasets show substantial improvements to the baseline. Notably, we obtain the state-of-the-art performance for one-stage detectors on the DOTA-v1.5, DOTA-v2.0, and DIOR-R datasets under single-scale training and testing. Codes are available at https://github.com/Chasel-Tsui/mmrotate-dcfl.

The Enemy of My Enemy Is My Friend: Exploring Inverse Adversaries for Improving Adversarial Training
Dong, Junhao and Moosavi-Dezfooli, Seyed-Mohsen and Lai, Jianhuang and Xie, Xiaohua



Research question: Current deep learning techniques excel at computer vision tasks yet remain vulnerable to adversarial examples.
Motivation: Adversarial training and its variants are considered the most effective defense against adversarial examples; however, regularizing toward a natural example can have a negative impact when that example is misclassified.
Method: A novel adversarial training scheme is proposed that encourages the model to produce similar output probabilities for an adversarial example and its "inverse adversarial" counterpart. In particular, the counterpart is generated by maximizing the likelihood in the neighborhood of the natural example.
Results: Extensive experiments on various vision datasets and architectures show state-of-the-art robustness and natural accuracy. Moreover, a universal version of inverse adversarial examples improves single-step adversarial training techniques at low computational cost.

Although current deep learning techniques have yielded superior performance on various computer vision tasks, they are still vulnerable to adversarial examples. Adversarial training and its variants have been shown to be the most effective approaches to defend against adversarial examples. A particular class of these methods regularize the difference between output probabilities for an adversarial and its corresponding natural example. However, it may have a negative impact if a natural example is misclassified. To circumvent this issue, we propose a novel adversarial training scheme that encourages the model to produce similar output probabilities for an adversarial example and its "inverse adversarial" counterpart. Particularly, the counterpart is generated by maximizing the likelihood in the neighborhood of the natural example. Extensive experiments on various vision datasets and architectures demonstrate that our training method achieves state-of-the-art robustness as well as natural accuracy among robust models. Furthermore, using a universal version of inverse adversarial examples, we improve the performance of single-step adversarial training techniques at a low computational cost.
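The "inverse adversarial" counterpart is generated by maximizing the likelihood near the natural example. A toy sketch for a linear softmax model follows; the linear model stands in for the paper's deep networks, and the PGD-style ascent with `eps`, `steps`, and `lr` is an illustrative choice.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def inverse_adversary(W, x, y, eps=0.1, steps=10, lr=0.05):
    """Inverse adversary for a toy linear softmax model logits = W @ x:
    ascend the true-class log-likelihood inside an L_inf ball of
    radius eps around the natural example x."""
    x_inv = x.astype(float).copy()
    for _ in range(steps):
        p = softmax(W @ x_inv)
        grad = W[y] - p @ W            # d log p_y / d x for a linear model
        x_inv = x_inv + lr * np.sign(grad)
        x_inv = np.clip(x_inv, x - eps, x + eps)  # project back to the ball
    return x_inv
```

The training scheme would then regularize the model's outputs on an adversarial example toward its outputs on this high-likelihood counterpart.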

Exploring Motion Ambiguity and Alignment for High-Quality Video Frame Interpolation
Zhou, Kun and Li, Wenbo and Han, Xiaoguang and Lu, Jiangbo



Research question: In video frame interpolation (VFI), existing deep learning methods rely too heavily on the ground-truth (GT) intermediate frames, ignoring that the motion judged from adjacent frames is not unique, which yields averaged solutions that are not sharp enough.
Motivation: To address this, the paper proposes a texture consistency loss (TCL) that relaxes the requirement that the reconstructed intermediate frame be as close to the GT as possible.
Method: TCL is built on the assumption that interpolated content should maintain structures similar to its counterparts in the given frames; predictions satisfying this constraint are encouraged even if they differ from the predefined GT. In addition, a simple, efficient, yet powerful O(N) guided cross-scale pyramid alignment (GCSPA) module is designed to fully exploit multi-scale information.
Results: Experiments demonstrate the efficiency and effectiveness of the strategy, which consistently improves existing VFI frameworks.

For video frame interpolation (VFI), existing deep-learning-based approaches strongly rely on the ground-truth (GT) intermediate frames, which sometimes ignore the non-unique nature of motion judged from the given adjacent frames. As a result, these methods tend to produce averaged solutions that are not clear enough. To alleviate this issue, we propose to relax the requirement of reconstructing an intermediate frame as close to the GT as possible. Towards this end, we develop a texture consistency loss (TCL) upon the assumption that the interpolated content should maintain similar structures with their counterparts in the given frames. Predictions satisfying this constraint are encouraged, though they may differ from the predefined GT. Without the bells and whistles, our plug-and-play TCL is capable of improving the performance of existing VFI frameworks consistently. On the other hand, previous methods usually adopt the cost volume or correlation map to achieve more accurate image or feature warping. However, the O(N^2) (N refers to the pixel count) computational complexity makes it infeasible for high-resolution cases. In this work, we design a simple, efficient O(N) yet powerful guided cross-scale pyramid alignment (GCSPA) module, where multi-scale information is highly exploited. Extensive experiments justify the efficiency and effectiveness of the proposed strategy.

Adaptive Annealing for Robust Geometric Estimation
Sidhartha, Chitturi and Manam, Lalit and Govindu, Venu Madhav



Research question: Geometric estimation problems in vision are typically solved by minimizing statistical loss functions that account for outliers in the observations; the corresponding energy landscape often has many local minima.
Motivation: Many approaches try to avoid local minima by annealing the scale parameter of the loss function with methods such as graduated non-convexity (GNC). However, the annealing schedule receives little attention and is usually fixed, resulting in a poor speed-accuracy trade-off and unreliable convergence to the global minimum.
Method: This paper proposes adaptively annealing the GNC scale by tracking the positive-definiteness (i.e., local convexity) of the Hessian of the cost function, illustrated on the classic problem of registering 3D correspondences under noise and outliers. Hessian approximations that significantly speed up the method are also developed.
Results: The approach is validated against state-of-the-art 3D registration methods on a range of synthetic and real datasets. It is accurate and efficient, and converges to the global solution more reliably than the state of the art.

Geometric estimation problems in vision are often solved via minimization of statistical loss functions which account for the presence of outliers in the observations. The corresponding energy landscape often has many local minima. Many approaches attempt to avoid local minima by annealing the scale parameter of loss functions using methods such as graduated non-convexity (GNC). However, little attention has been paid to the annealing schedule, which is often carried out in a fixed manner, resulting in a poor speed-accuracy trade-off and unreliable convergence to the global minimum. In this paper, we propose a principled approach for adaptively annealing the scale for GNC by tracking the positive-definiteness (i.e. local convexity) of the Hessian of the cost function. We illustrate our approach using the classic problem of registering 3D correspondences in the presence of noise and outliers. We also develop approximations to the Hessian that significantly speed up our method. The effectiveness of our approach is validated by comparing its performance with state-of-the-art 3D registration approaches on a number of synthetic and real datasets. Our approach is accurate and efficient and converges to the global solution more reliably than the state-of-the-art methods.
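The core idea, shrinking the GNC scale only while the cost stays locally convex (Hessian positive) at the current estimate, can be illustrated on robust 1D mean estimation with a Geman-McClure loss. This is a hedged sketch: the paper's setting is 3D registration, and the decay factors and damped-Newton inner loop here are illustrative choices.

```python
import numpy as np

def gm_grad_hess(theta, x, sigma):
    """Gradient and (scalar) Hessian of the Geman-McClure cost
    sum_i r_i^2 / (r_i^2 + sigma^2) with residuals r_i = x_i - theta."""
    r = x - theta
    d = r**2 + sigma**2
    grad = -np.sum(2.0 * r * sigma**2 / d**2)
    hess = np.sum(2.0 * sigma**2 * (sigma**2 - 3.0 * r**2) / d**3)
    return grad, hess

def adaptive_gnc_mean(x, sigma0=10.0, sigma_min=0.5, decay=0.5,
                      lr=0.5, inner=50):
    """Robust mean via GNC: shrink sigma fast only while the cost stays
    locally convex (Hessian > 0) at the current estimate; otherwise
    fall back to a milder decay."""
    theta, sigma = float(np.median(x)), sigma0
    while sigma > sigma_min:
        for _ in range(inner):
            g, h = gm_grad_hess(theta, x, sigma)
            step = g / h if h > 1e-8 else g  # damped Newton if convex
            theta -= lr * step
        cand = sigma * decay
        _, h = gm_grad_hess(theta, x, cand)
        sigma = cand if h > 0 else sigma * np.sqrt(decay)
    return theta
```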

Upcycling Models Under Domain and Category Shift
Qu, Sanqing and Zou, Tianpei and Röhrbein, Florian and Lu, Cewu and Chen, Guang and Tao, Dacheng and Jiang, Changjun



Research question: Deep neural networks perform poorly under domain and category shift; how to upcycle them and adapt them to the target task remains an important open problem.
Motivation: Existing unsupervised domain adaptation (UDA) techniques, especially the recently proposed source-free domain adaptation (SFDA), are promising. However, most existing SFDA methods require the source and target domains to share the same label space and thus apply only to the vanilla closed-set setting.
Method: The paper goes a step further and explores Source-free Universal Domain Adaptation (SF-UniDA): identifying "known" data samples under both domain and category shift and rejecting "unknown" data samples (absent from the source classes), given only a standard pre-trained source model. To this end, an innovative Global and Local Clustering learning technique (GLC) is introduced.
Results: GLC is evaluated under multiple category-shift scenarios, including partial-set, open-set, and open-partial-set DA. Notably, in the most challenging open-partial-set DA scenario, GLC outperforms UMAD by 14.8% on the VisDA benchmark.

Deep neural networks (DNNs) often perform poorly in the presence of domain shift and category shift. How to upcycle DNNs and adapt them to the target task remains an important open problem. Unsupervised Domain Adaptation (UDA), especially recently proposed Source-free Domain Adaptation (SFDA), has become a promising technology to address this issue. Nevertheless, most existing SFDA methods require that the source domain and target domain share the same label space, consequently being only applicable to the vanilla closed-set setting. In this paper, we take one step further and explore the Source-free Universal Domain Adaptation (SF-UniDA). The goal is to identify "known" data samples under both domain and category shift, and reject those "unknown" data samples (not present in source classes), with only the knowledge from a standard pre-trained source model. To this end, we introduce an innovative global and local clustering learning technique (GLC). Specifically, we design a novel, adaptive one-vs-all global clustering algorithm to achieve the distinction across different target classes and introduce a local k-NN clustering strategy to alleviate negative transfer. We examine the superiority of our GLC on multiple benchmarks with different category shift scenarios, including partial-set, open-set, and open-partial-set DA. More remarkably, in the most challenging open-partial-set DA scenario, GLC outperforms UMAD by 14.8% on the VisDA benchmark.

Single Domain Generalization for LiDAR Semantic Segmentation
Kim, Hyeonseong and Kang, Yoonsu and Oh, Changgyoon and Yoon, Kuk-Jin



Research question: How to make deep learning models perform well in unseen domains, particularly for LiDAR semantic segmentation.
Motivation: Existing 3D deep learning models perform well in the trained source domain but degrade in unseen domains (e.g., different LiDAR sensor configurations and scene distributions), revealing a clear domain gap.
Method: A single domain generalization method for LiDAR semantic segmentation (DGLSS) is proposed, ensuring good performance in both the source domain and unseen domains while learning only on the source domain. The training domain is augmented to simulate unseen domains, and two constraints enable generalizable feature learning: sparsity invariant feature consistency (SIFC) and semantic correlation consistency (SCC).
Results: Experiments show improved performance in unseen domains compared to other baselines. Even without access to the target domain, the method outperforms a domain adaptation method.

With the success of the 3D deep learning models, various perception technologies for autonomous driving have been developed in the LiDAR domain. While these models perform well in the trained source domain, they struggle in unseen domains with a domain gap. In this paper, we propose a single domain generalization method for LiDAR semantic segmentation (DGLSS) that aims to ensure good performance not only in the source domain but also in the unseen domain by learning only on the source domain. We mainly focus on generalizing from a dense source domain and target the domain shift from different LiDAR sensor configurations and scene distributions. To this end, we augment the domain to simulate the unseen domains by randomly subsampling the LiDAR scans. With the augmented domain, we introduce two constraints for generalizable representation learning: sparsity invariant feature consistency (SIFC) and semantic correlation consistency (SCC). The SIFC aligns sparse internal features of the source domain with the augmented domain based on the feature affinity. For SCC, we constrain the correlation between class prototypes to be similar for every LiDAR scan. We also establish a standardized training and evaluation setting for DGLSS. With the proposed evaluation setting, our method showed improved performance in the unseen domains compared to other baselines. Even without access to the target domain, our method performed better than the domain adaptation method. The code is available at https://github.com/gzgzys9887/DGLSS.
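The domain-augmentation step, simulating sparser unseen LiDAR domains by subsampling scans, can be sketched minimally. Note the paper subsamples at the scan/beam level; the uniform point-level `keep_ratio` below is a simplification for illustration.

```python
import numpy as np

def simulate_sparse_domain(points, keep_ratio=0.5, rng=None):
    """Augment a dense LiDAR scan (n_points, 3) into a sparser
    'unseen-domain' scan by randomly subsampling points without
    replacement."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = points.shape[0]
    idx = rng.choice(n, size=max(1, int(n * keep_ratio)), replace=False)
    return points[idx]
```

The SIFC constraint would then align internal features between the original scan and this augmented sparse view.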

Balanced Energy Regularization Loss for Out-of-Distribution Detection
Choi, Hyunjun and Jeong, Hawook and Choi, Jin Young



Research question: How to effectively handle the imbalanced distribution of auxiliary data in OOD detection.
Motivation: Existing methods treat all auxiliary data equally and cannot address the class imbalance.
Method: A balanced energy regularization loss is proposed that uses class-wise prior probabilities to regularize auxiliary data to different degrees; the main idea is to regularize auxiliary samples from majority classes more heavily.
Results: The method outperforms the prior energy regularization loss for OOD detection in semantic segmentation, long-tailed image classification, and image classification, and achieves state-of-the-art performance on OOD detection in semantic segmentation and long-tailed image classification.

In the field of out-of-distribution (OOD) detection, a previous method that uses auxiliary data as OOD data has shown promising performance. However, the method provides an equal loss to all auxiliary data to differentiate them from inliers. Yet, based on our observation, in various tasks, there is a general imbalance in the distribution of the auxiliary OOD data across classes. We propose a balanced energy regularization loss that is simple but generally effective for a variety of tasks. Our balanced energy regularization loss utilizes class-wise different prior probabilities for auxiliary data to address the class imbalance in OOD data. The main concept is to regularize auxiliary samples from majority classes more heavily than those from minority classes. Our approach performs better for OOD detection in semantic segmentation, long-tailed image classification, and image classification than the prior energy regularization loss. Furthermore, our approach achieves state-of-the-art performance in two tasks: OOD detection in semantic segmentation and long-tailed image classification.
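The class-prior-weighted regularization can be sketched with a free-energy score and a hinge penalty. A hedged NumPy sketch: the hinge form, the `margin`, and weighting by the predicted class's prior are illustrative assumptions rather than the paper's exact loss.

```python
import numpy as np

def energy_score(logits):
    """Free energy E(x) = -logsumexp(logits); lower = more in-distribution."""
    logits = np.asarray(logits, dtype=float)
    m = logits.max(axis=1, keepdims=True)
    return -(m.squeeze(1) + np.log(np.exp(logits - m).sum(axis=1)))

def balanced_energy_reg(aux_logits, class_prior, margin=-5.0):
    """Hinge-style energy regularizer on auxiliary OOD samples, weighted
    by the prior probability of each sample's predicted class, so that
    majority-class auxiliary samples are regularized more heavily."""
    aux_logits = np.asarray(aux_logits, dtype=float)
    e = energy_score(aux_logits)
    pred = aux_logits.argmax(axis=1)
    w = np.asarray(class_prior)[pred]
    # Push OOD energies up above the margin; heavier weight = majority class.
    return float(np.mean(w * np.maximum(0.0, margin - e) ** 2))
```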

SLACK: Stable Learning of Augmentations With Cold-Start and KL Regularization
Marrie, Juliette and Arbel, Michael and Larlus, Diane and Mairal, Julien



Research question: How to automate data augmentation to improve the generalization of neural networks without relying on a manually selected set of transformations.
Motivation: Most existing automatic data augmentation methods rely on prior information, such as pretrained networks or forcing manually selected default transformations into the learned policy.
Method: This paper learns the augmentation policy directly without such prior knowledge. It parameterizes magnitudes as continuous distributions and adopts a successive cold-start strategy with KL-divergence regularization to handle the larger search space and inherent instability of the resulting bilevel optimization problem.
Results: Despite the more challenging setting, the method achieves competitive results on standard benchmarks and generalizes beyond natural images.

Data augmentation is known to improve the generalization capabilities of neural networks, provided that the set of transformations is chosen with care, a selection often performed manually. Automatic data augmentation aims at automating this process. However, most recent approaches still rely on some prior information; they start from a small pool of manually-selected default transformations that are either used to pretrain the network or forced to be part of the policy learned by the automatic data augmentation algorithm. In this paper, we propose to directly learn the augmentation policy without leveraging such prior knowledge. The resulting bilevel optimization problem becomes more challenging due to the larger search space and the inherent instability of bilevel optimization algorithms. To mitigate these issues (i) we follow a successive cold-start strategy with a Kullback-Leibler regularization, and (ii) we parameterize magnitudes as continuous distributions. Our approach leads to competitive results on standard benchmarks despite a more challenging setting, and generalizes beyond natural images.

Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization
Zhang, Xingxuan and Xu, Renzhe and Yu, Han and Zou, Hao and Cui, Peng



Research question: How optimizers seeking minima of the loss can better discriminate minima with low generalization error from those with high generalization error.
Motivation: Existing flatness definitions (as in SAM) focus on zeroth-order flatness, i.e., the worst-case loss within a given perturbation radius, which can be insufficient to discriminate minima with low and high generalization error.
Method: First-order flatness is proposed as a stronger flatness measure: the maximal gradient norm within a perturbation radius, which bounds both the maximal Hessian eigenvalue at local minima and SAM's regularization function. A new training procedure, Gradient norm Aware Minimization (GAM), is designed to seek minima with uniformly small curvature across all directions.
Results: Experiments show that GAM improves the generalization of models trained with existing optimizers such as SGD and AdamW across various datasets and networks. Moreover, GAM helps SAM find flatter minima and achieve better generalization.

Recently, flat minima have been proven to be effective for improving generalization, and sharpness-aware minimization (SAM) achieves state-of-the-art performance. Yet the current definition of flatness discussed in SAM and its follow-ups is limited to the zeroth-order flatness (i.e., the worst-case loss within a perturbation radius). We show that the zeroth-order flatness can be insufficient to discriminate minima with low generalization error from those with high generalization error both when there is a single minimum or multiple minima within the given perturbation radius. Thus we present first-order flatness, a stronger measure of flatness focusing on the maximal gradient norm within a perturbation radius which bounds both the maximal eigenvalue of Hessian at local minima and the regularization function of SAM. We also present a novel training procedure named Gradient norm Aware Minimization (GAM) to seek minima with uniformly small curvature across all directions. Experimental results show that GAM improves the generalization of models trained with current optimizers such as SGD and AdamW on various datasets and networks. Furthermore, we show that GAM can help SAM find flatter minima and achieve better generalization.
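First-order flatness, the maximal gradient norm within a perturbation radius, can be estimated naively by sampling perturbations. A toy sketch: the random-direction sampling and finite-difference gradients below simplify the paper's ascent-based estimate, and `rho`/`alpha` are illustrative knobs.

```python
import numpy as np

def first_order_flatness(f, theta, rho=0.1, n_dirs=32, eps=1e-4, rng=None):
    """Monte-Carlo estimate of the maximal gradient norm of f within an
    L2 ball of radius rho around theta (first-order flatness)."""
    if rng is None:
        rng = np.random.default_rng(0)
    theta = np.asarray(theta, dtype=float)
    best = 0.0
    for _ in range(n_dirs):
        d = rng.normal(size=theta.shape)
        p = theta + rho * d / np.linalg.norm(d)
        # Central-difference gradient of f at the perturbed point.
        g = np.array([(f(p + eps * e) - f(p - eps * e)) / (2 * eps)
                      for e in np.eye(theta.size)])
        best = max(best, np.linalg.norm(g))
    return best

def gam_objective(f, theta, rho=0.1, alpha=1.0):
    """GAM-style objective: loss plus a penalty on first-order flatness."""
    theta = np.asarray(theta, dtype=float)
    return f(theta) + alpha * first_order_flatness(f, theta, rho)
```

A sharp quadratic basin has a much larger gradient norm near its minimum than a flat one, so the penalty prefers the flat minimum even when both losses are zero.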

GraVoS: Voxel Selection for 3D Point-Cloud Detection
Shrout, Oren and Ben-Shabat, Yizhak and Tal, Ayellet



Research question: 3D object detection in large 3D scenes is challenging, not only because of sparse and irregular point clouds, but also because of the extreme foreground-background scene imbalance and class imbalance.
Motivation: The paper proposes to modify scenes by removing elements (voxels), rather than adding them, to address both types of dataset imbalance.
Method: "Meaningful" voxels are selected in a way that addresses both imbalances; the approach can be applied to any voxel-based detector, though a voxel's meaningfulness is network-dependent.
Results: The voxel selection is shown to improve the performance of several prominent 3D detection methods.

3D object detection within large 3D scenes is challenging not only due to the sparse and irregular 3D point clouds, but also due to both the extreme foreground-background scene imbalance and class imbalance. A common approach is to add ground-truth objects from other scenes. Differently, we propose to modify the scenes by removing elements (voxels), rather than adding ones. Our approach selects the "meaningful" voxels, in a manner that addresses both types of dataset imbalance. The approach is general and can be applied to any voxel-based detector, yet the meaningfulness of a voxel is network-dependent. Our voxel selection is shown to improve the performance of several prominent 3D detection methods.

Rethinking Image Super Resolution From Long-Tailed Distribution Learning Perspective
Gou, Yuanbiao and Hu, Peng and Lv, Jiancheng and Zhu, Hongyuan and Peng, Xi



Research question: Why the resolution of low-frequency image regions is easier to enhance than that of high-frequency regions in image super resolution (SR).
Motivation: Although plentiful works have tried to alleviate this problem, little understanding has been offered to explain it; this paper attributes it to the twin fitting problem caused by the long-tailed pixel distribution in natural images.
Method: With this explanation, SR is reformulated as a long-tailed distribution learning problem and solved by bridging the gaps between low- and high-level vision tasks. The resulting solution rebalances the gradients from pixels in the low- and high-frequency regions by introducing a static and a learnable structure prior, so the learned SR model fits both regions in a more balanced way.
Results: Evaluations on four CNN-based and one Transformer-based SR models across six datasets and three tasks demonstrate the superiority of the solution.

Existing studies have empirically observed that the resolution of the low-frequency region is easier to enhance than that of the high-frequency one. Although plentiful works have been devoted to alleviating this problem, little understanding is given to explain it. In this paper, we try to give a feasible answer from a machine learning perspective, i.e., the twin fitting problem caused by the long-tailed pixel distribution in natural images. With this explanation, we reformulate image super resolution (SR) as a long-tailed distribution learning problem and solve it by bridging the gaps of the problem between low- and high-level vision tasks. As a result, we design a long-tailed distribution learning solution that rebalances the gradients from the pixels in the low- and high-frequency regions by introducing a static and a learnable structure prior. The learned SR model achieves better balance on the fitting of the low- and high-frequency region so that the overall performance is improved. In the experiments, we evaluate the solution on four CNN- and one Transformer-based SR models w.r.t. six datasets and three tasks, and experimental results demonstrate its superiority.

On the Pitfall of Mixup for Uncertainty Calibration
Wang, Deng-Bao and Li, Lanqing and Zhao, Peilin and Heng, Pheng-Ann and Zhang, Min-Ling



Research question: This paper addresses the problem that mixup training can make models less calibratable.
Motivation: Although mixup training has been shown to improve predictive accuracy and to perform well on uncertainty calibration, the authors find that it usually reduces a model's calibratability, which can negatively affect post-hoc calibration.
Method: The mixup process is decomposed into data transformation and random perturbation, and a training strategy named mixup inference in training is proposed, which adopts a simple decoupling principle to recover the outputs of raw samples at the end of the forward network pass.
Results: Experiments show that this strategy properly solves mixup's calibration issue without sacrificing predictive performance, and even improves accuracy over vanilla mixup training.

By simply taking convex combinations between pairs of samples and their labels, mixup training has been shown to easily improve predictive accuracy. It has been recently found that models trained with mixup also perform well on uncertainty calibration. However, in this study, we found that mixup training usually makes models less calibratable than vanilla empirical risk minimization, which means that it would harm uncertainty estimation when post-hoc calibration is considered. By decomposing the mixup process into data transformation and random perturbation, we suggest that the confidence penalty nature of the data transformation is the reason for calibration degradation. To mitigate this problem, we first investigate the mixup inference strategy and found that although it improves calibration for mixup, this ensemble-like strategy does not necessarily outperform a simple ensemble. Then, we propose a general strategy named mixup inference in training, which adopts a simple decoupling principle for recovering the outputs of raw samples at the end of the forward network pass. By embedding the mixup inference, models can be learned from the original one-hot labels and hence avoid the negative impact of the confidence penalty. Our experiments show this strategy properly solves mixup's calibration issue without sacrificing the predictive performance, and even improves accuracy over vanilla mixup.
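The decoupling idea can be illustrated on the mixup transformation itself: if the network behaved linearly on mixed inputs, raw-sample outputs could be recovered from two mixed views of the same pair. A hedged sketch; the linearity assumption and the two-view solve are illustrative, not the paper's training procedure.

```python
import numpy as np

def mixup(x1, x2, y1, y2, lam):
    """Standard mixup: convex combination of inputs and one-hot labels."""
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def recover_raw_outputs(z1, z2, lam):
    """Decouple raw-sample outputs from two mixed views of the same
    pair, assuming the network behaves linearly on mixed inputs
    (requires lam != 0.5):
        z1 = lam*f(x1) + (1-lam)*f(x2)
        z2 = (1-lam)*f(x1) + lam*f(x2)
    """
    denom = 2 * lam - 1
    f1 = (lam * z1 - (1 - lam) * z2) / denom
    f2 = (lam * z2 - (1 - lam) * z1) / denom
    return f1, f2
```

With the raw outputs recovered, the model can be supervised with the original one-hot labels, avoiding the confidence-penalty effect of mixed labels.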

C-SFDA: A Curriculum Learning Aided Self-Training Framework for Efficient Source Free Domain Adaptation
Karim, Nazmul and Mithun, Niluthpol Chowdhury and Rajvanshi, Abhinav and Chiu, Han-pang and Samarasekera, Supun and Rahnavard, Nazanin



Research question: How to adapt a model trained on a labeled source domain to an unlabeled target domain when source data are no longer available during adaptation (source-free domain adaptation, SFDA).
Motivation: Recent state-of-the-art SFDA methods mostly rely on pseudo-label-refinement-based self-training, which suffers from two issues: noisy pseudo-labels that can cause early-training memorization, and refinement processes that require a memory bank, a significant burden in resource-constrained scenarios.
Method: C-SFDA, a curriculum learning aided self-training framework, adapts efficiently and reliably via selective pseudo-labeling: a curriculum scheme promotes learning from a restricted set of pseudo-labels selected by reliability, preventing label-noise propagation and eliminating costly memory-bank-based refinement.
Results: Extensive evaluations on image recognition and semantic segmentation confirm the method's effectiveness; C-SFDA also applies to online test-time domain adaptation and outperforms previous state-of-the-art methods on that task.

Unsupervised domain adaptation (UDA) approaches focus on adapting models trained on a labeled source domain to an unlabeled target domain. In contrast to UDA, source-free domain adaptation (SFDA) is a more practical setup as access to source data is no longer required during adaptation. Recent state-of-the-art (SOTA) methods on SFDA mostly focus on pseudo-label refinement based self-training which generally suffers from two issues: i) inevitable occurrence of noisy pseudo-labels that could lead to early training time memorization, ii) refinement process requires maintaining a memory bank which creates a significant burden in resource constraint scenarios. To address these concerns, we propose C-SFDA, a curriculum learning aided self-training framework for SFDA that adapts efficiently and reliably to changes across domains based on selective pseudo-labeling. Specifically, we employ a curriculum learning scheme to promote learning from a restricted amount of pseudo labels selected based on their reliabilities. This simple yet effective step successfully prevents label noise propagation during different stages of adaptation and eliminates the need for costly memory-bank based label refinement. Our extensive experimental evaluations on both image recognition and semantic segmentation tasks confirm the effectiveness of our method. C-SFDA is also applicable to online test-time domain adaptation and outperforms previous SOTA methods in this task.

Improving Zero-Shot Generalization and Robustness of Multi-Modal Models
Ge, Yunhao and Ren, Jie and Gallagher, Andrew and Wang, Yuxiao and Yang, Ming-Hsuan and Adam, Hartwig and Itti, Laurent and Lakshminarayanan, Balaji and Zhao, Jiaping



Research question: Multi-modal image-text models such as CLIP and LiT perform impressively on image classification benchmarks, yet a significant gap remains in their zero-shot generalization.
Motivation: While the top-5 zero-shot accuracies of these models are very high, the top-1 accuracies are much lower (over a 25% gap in some cases). The investigation finds that many failure cases are caused by ambiguity in the text prompts.
Method: A simple and effective zero-shot post-hoc method identifies images whose top-1 prediction is likely to be incorrect by measuring the consistency of predictions across multiple prompts and image transformations. A further method leverages the WordNet hierarchy to improve accuracy on such uncertain images.
Results: Experiments on CLIP and LiT models show that the method is effective and significantly improves top-1 accuracy.

Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of these models are very high, the top-1 accuracies are much lower (over 25% gap in some cases). We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts. First, we develop a simple and efficient zero-shot post-hoc method to identify images whose top-1 prediction is likely to be incorrect, by measuring consistency of the predictions w.r.t. multiple prompts and image transformations. We show that our procedure better predicts mistakes, outperforming the popular max logit baseline on selective prediction tasks. Next, we propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy; specifically we augment the original class by incorporating its parent and children from the semantic label hierarchy, and plug the augmentation into text prompts. We conduct experiments on both CLIP and LiT models with five different ImageNet-based datasets. For CLIP, our method improves the top-1 accuracy by 17.13% on the uncertain subset and 3.6% on the entire ImageNet validation set. We also show that our method improves across ImageNet shifted datasets, four other datasets, and other model architectures such as LiT. Our proposed method is hyperparameter-free, requires no additional model training and can be easily scaled to other large multi-modal architectures. Code is available at https://github.com/gyhandy/Hierarchy-CLIP.
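The post-hoc uncertainty test, measuring agreement of top-1 predictions across prompts and image transformations, can be sketched as follows. The majority-vote consistency measure and the `threshold` value are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def prediction_consistency(logits_per_view):
    """Fraction of views agreeing with the majority top-1 prediction.

    `logits_per_view`: array of shape (n_views, n_classes) holding the
    model's logits for one image under different prompts / transforms.
    """
    preds = np.asarray(logits_per_view).argmax(axis=1)
    counts = np.bincount(preds)
    return counts.max() / preds.size

def flag_uncertain(logits_per_view, threshold=0.8):
    """Mark an image as a likely top-1 error when the views disagree."""
    return prediction_consistency(logits_per_view) < threshold
```

Flagged images would then be re-scored with WordNet-augmented prompts (parent and child classes) as the paper describes.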

Modeling the Distributional Uncertainty for Salient Object Detection Models
Tian, Xinyu and Zhang, Jing and Xiang, Mochu and Dai, Yuchao



Research question: Existing salient object detection (SOD) models focus on improving overall model performance without explicitly explaining the discrepancy between the training and testing distributions.
Motivation: This paper investigates a particular type of epistemic uncertainty for salient object detection, namely distributional uncertainty.
Method: For the first time, existing class-aware distribution-gap exploration techniques, i.e., long-tail learning, single-model uncertainty modeling, and test-time strategies, are explored and adapted to model the distributional uncertainty for this class-agnostic task.
Results: Extensive experiments verify the effectiveness of existing distribution-gap modeling techniques for SOD, concluding that train-time single-model uncertainty estimation techniques and weight-regularization solutions that prevent model activations from drifting too much are promising directions for modeling distributional uncertainty in SOD.

Most of the existing salient object detection (SOD) models focus on improving the overall model performance, without explicitly explaining the discrepancy between the training and testing distributions. In this paper, we investigate a particular type of epistemic uncertainty, namely distributional uncertainty, for salient object detection. Specifically, for the first time, we explore the existing class-aware distribution gap exploration techniques, i.e. long-tail learning, single-model uncertainty modeling and test-time strategies, and adapt them to model the distributional uncertainty for our class-agnostic task. We define a test sample that is dissimilar to the training dataset as an "out-of-distribution" (OOD) sample. Different from the conventional OOD definition, where OOD samples are those not belonging to the closed-world training categories, OOD samples for SOD are those that break the basic priors of saliency, e.g. center prior, color contrast prior, compactness prior, etc., indicating that OOD is "continuous" instead of discrete for our task. We carry out extensive experiments to verify the effectiveness of existing distribution gap modeling techniques for SOD, and conclude that both train-time single-model uncertainty estimation techniques and weight-regularization solutions that prevent model activations from drifting too much are promising directions for modeling distributional uncertainty for SOD.

Robust Model-Based Face Reconstruction Through Weakly-Supervised Outlier Segmentation
Li, Chunlu and Morel-Forster, Andreas and Vetter, Thomas and Egger, Bernhard and Kortylewski, Adam



Research question: This paper aims to enhance model-based face reconstruction by avoiding fitting the model to outliers, such as occluders or make-up.
Motivation: The high variability of outliers and the difficulty of annotating them make outlier localization challenging.
Method: A joint Face-autoencoder and outlier segmentation approach (FOCUS) is proposed. Exploiting the fact that a high-quality face model cannot fit outliers well, outliers can be localized well given a good model fit. An EM-type training strategy trains the face autoencoder jointly with an outlier segmentation network, so the segmentation network prevents the face encoder from fitting to outliers and improves reconstruction quality.
Results: Experiments on the NoW test set show that FOCUS achieves state-of-the-art 3D face reconstruction among all baselines trained without 3D annotation. Moreover, results on CelebA-HQ and the AR database show that the segmentation network localizes occluders accurately, even though it is trained without any segmentation annotation.

In this work, we aim to enhance model-based face reconstruction by avoiding fitting the model to outliers, i.e. regions that cannot be well-expressed by the model such as occluders or make-up. The core challenge for localizing outliers is that they are highly variable and difficult to annotate. To overcome this challenging problem, we introduce a joint Face-autoencoder and outlier segmentation approach (FOCUS). In particular, we exploit the fact that the outliers cannot be fitted well by the face model and hence can be localized well given a high-quality model fitting. The main challenge is that the model fitting and the outlier segmentation are mutually dependent on each other, and need to be inferred jointly. We resolve this chicken-and-egg problem with an EM-type training strategy, where a face autoencoder is trained jointly with an outlier segmentation network. This leads to a synergistic effect, in which the segmentation network prevents the face encoder from fitting to the outliers, enhancing the reconstruction quality. The improved 3D face reconstruction, in turn, enables the segmentation network to better predict the outliers. To resolve the ambiguity between outliers and regions that are difficult to fit, such as eyebrows, we build a statistical prior from synthetic data that measures the systematic bias in model fitting. Experiments on the NoW testset demonstrate that FOCUS achieves SOTA 3D face reconstruction performance among all baselines that are trained without 3D annotation. Moreover, our results on CelebA-HQ and the AR database show that the segmentation network can localize occluders accurately despite being trained without any segmentation annotation.

Exploring Incompatible Knowledge Transfer in Few-Shot Image Generation
Zhao, Yunqing and Du, Chao and Abdollahzadeh, Milad and Pang, Tianyu and Lin, Min and Yan, Shuicheng and Cheung, Ngai-Man



Research question: This paper addresses incompatible knowledge transfer in few-shot image generation (FSIG), where, during the transfer from a source generator to a target generator, the least significant filters significantly degrade the realism of synthesized samples.
Motivation: Existing FSIG methods learn a target generator by selecting, preserving and transferring prior knowledge from a source generator pretrained on a related domain, but this process suffers from an overlooked problem, incompatible knowledge transfer, which severely harms the realism of generated images.
Method: To solve this, the paper proposes knowledge truncation, a complementary operation to knowledge preservation, implemented by a lightweight pruning-based method.
Results: Experiments show that knowledge truncation is simple and effective, achieving state-of-the-art performance under various challenging setups, including those where the source and target domains are distant.

Few-shot image generation (FSIG) learns to generate diverse and high-fidelity images from a target domain using a few (e.g., 10) reference samples. Existing FSIG methods select, preserve and transfer prior knowledge from a source generator (pretrained on a related domain) to learn the target generator. In this work, we investigate an underexplored issue in FSIG, dubbed as incompatible knowledge transfer, which would significantly degrade the realisticness of synthetic samples. Empirical observations show that the issue stems from the least significant filters from the source generator. To this end, we propose knowledge truncation to mitigate this issue in FSIG, which is a complementary operation to knowledge preservation and is implemented by a lightweight pruning-based method. Extensive experiments show that knowledge truncation is simple and effective, consistently achieving state-of-the-art performance, including challenging setups where the source and target domains are more distant. Project Page: https://yunqing-me.github.io/RICK.
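The "lightweight pruning-based method" can be illustrated by zeroing out the least significant filters of a convolution layer. The L1-norm importance proxy and the function name `truncate_filters` below are common pruning conventions used purely for illustration; the paper's own estimate of filter significance may differ.

```python
import numpy as np

def truncate_filters(conv_weight, prune_ratio=0.1):
    """Knowledge-truncation sketch: zero the least significant filters of a
    conv layer, ranked here by per-filter L1 weight norm (a standard pruning
    proxy). conv_weight has shape (out_channels, in_channels, kH, kW)."""
    W = np.array(conv_weight, dtype=float)            # copy, keep input intact
    importance = np.abs(W).sum(axis=(1, 2, 3))        # per-filter L1 norm
    n_prune = int(len(importance) * prune_ratio)
    pruned_idx = np.argsort(importance)[:n_prune]     # least significant first
    W[pruned_idx] = 0.0
    return W, pruned_idx

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4, 3, 3))
W[3] *= 0.01                                          # make filter 3 the weakest
Wt, idx = truncate_filters(W, prune_ratio=0.125)      # prune 1 of 8 filters
```

With the deliberately shrunken filter 3, it is the one selected and zeroed while the other filters keep their weights.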

Zero-Shot Generative Model Adaptation via Image-Specific Prompt Learning
Guo, Jiayi and Wang, Chaofei and Wu, You and Zhang, Eric and Wang, Kai and Xu, Xingqian and Song, Shiji and Shi, Humphrey and Huang, Gao



Research question: How to improve the quality and diversity of cross-domain image generation and address the mode collapse issue.
Motivation: Existing cross-domain generation methods are limited in quality and diversity and are prone to mode collapse, mainly because a fixed adaptation direction is applied to all cross-domain image pairs, yielding identical supervision signals.
Method: An Image-specific Prompt Learning (IPL) method is proposed, which learns a specific prompt vector for each source-domain image, producing a more precise adaptation direction for every cross-domain image pair and endowing the target-domain generator with greater flexibility.
Results: Experiments demonstrate that IPL effectively improves the quality and diversity of generated images and alleviates mode collapse, and it is independent of the structure of the generative model, e.g., generative adversarial networks or diffusion models.

Recently, CLIP-guided image synthesis has shown appealing performance on adapting a pre-trained source-domain generator to an unseen target domain. It does not require any target-domain samples but only the textual domain labels. The training is highly efficient, e.g., a few minutes. However, existing methods still have some limitations in the quality of generated images and may suffer from the mode collapse issue. A key reason is that a fixed adaptation direction is applied for all cross-domain image pairs, which leads to identical supervision signals. To address this issue, we propose an Image-specific Prompt Learning (IPL) method, which learns specific prompt vectors for each source-domain image. This produces a more precise adaptation direction for every cross-domain image pair, endowing the target-domain generator with greatly enhanced flexibility. Qualitative and quantitative evaluations on various domains demonstrate that IPL effectively improves the quality and diversity of synthesized images and alleviates the mode collapse. Moreover, IPL is independent of the structure of the generative model, such as generative adversarial networks or diffusion models. Code is available at https://github.com/Picsart-AI-Research/IPL-Zero-Shot-Generative-Model-Adaptation.

Hard Patches Mining for Masked Image Modeling
Wang, Haochen and Song, Kaiyou and Fan, Junsong and Wang, Yuxi and Xie, Jin and Zhang, Zhaoxiang



Research question: This paper addresses the over-reliance of masked image modeling (MIM) pre-training on pre-defined mask strategies when predicting masked image content.
Motivation: Current MIM models mainly focus on predicting the specific contents of masked patches, and their performance is highly tied to the pre-defined mask strategy. The authors argue that a model should not only solve given problems but also act as a teacher that produces more challenging problems by itself.
Method: A new pre-training framework, Hard Patches Mining (HPM), is proposed. An auxiliary loss predictor estimates the reconstruction loss of each patch and decides where to mask next; a relative relationship learning strategy prevents overfitting to exact reconstruction loss values.
Results: Experiments show that HPM is highly effective in constructing masked images. Introducing the loss prediction objective alone already yields strong representations, verifying the benefit of being aware of which regions are hard to reconstruct.

Masked image modeling (MIM) has attracted much research attention due to its promising potential for learning scalable visual representations. In typical approaches, models usually focus on predicting specific contents of masked patches, and their performances are highly related to pre-defined mask strategies. Intuitively, this procedure can be considered as training a student (the model) on solving given problems (predict masked patches). However, we argue that the model should not only focus on solving given problems, but also stand in the shoes of a teacher to produce a more challenging problem by itself. To this end, we propose Hard Patches Mining (HPM), a brand-new framework for MIM pre-training. We observe that the reconstruction loss can naturally be the metric of the difficulty of the pre-training task. Therefore, we introduce an auxiliary loss predictor, predicting patch-wise losses first and deciding where to mask next. It adopts a relative relationship learning strategy to prevent overfitting to exact reconstruction loss values. Experiments under various settings demonstrate the effectiveness of HPM in constructing masked images. Furthermore, we empirically find that solely introducing the loss prediction objective leads to powerful representations, verifying the efficacy of being aware of which regions are hard to reconstruct.
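The "deciding where to mask next" step can be sketched in a few lines: given the predictor's patch-wise losses, mask the highest-ranked patches, with only the ranking mattering (echoing the relative relationship learning). The names `select_mask`, `mask_ratio` and `hard_fraction` are illustrative, not from the paper, which also blends in random masking rather than using hard patches alone.

```python
import numpy as np

def select_mask(predicted_losses, mask_ratio=0.75, hard_fraction=0.5, rng=None):
    """Mask a fraction of the hardest patches (highest predicted
    reconstruction loss) and fill the rest of the mask budget randomly."""
    rng = rng or np.random.default_rng(0)
    n = len(predicted_losses)
    n_mask = int(n * mask_ratio)
    n_hard = int(n_mask * hard_fraction)
    order = np.argsort(predicted_losses)[::-1]            # hardest first
    hard = order[:n_hard]
    rest = rng.permutation(order[n_hard:])[: n_mask - n_hard]
    mask = np.zeros(n, dtype=bool)
    mask[hard] = True
    mask[rest] = True
    return mask

losses = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6])
m = select_mask(losses, mask_ratio=0.5, hard_fraction=1.0)
# with hard_fraction=1.0, exactly the 4 highest-loss patches are masked
```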

GKEAL: Gaussian Kernel Embedded Analytic Learning for Few-Shot Class Incremental Task
Zhuang, Huiping and Weng, Zhenyu and He, Run and Lin, Zhiping and Zeng, Ziqian



Research question: This paper addresses catastrophic forgetting in few-shot class-incremental learning.
Motivation: In the few-shot learning setting, class-incremental learning suffers from catastrophic forgetting.
Method: Analytic learning is adopted to convert network training into a linear problem; its recursive implementation and weight-identical property are used to avoid catastrophic forgetting, yielding the Gaussian kernel embedded analytic learning (GKEAL) method.
Results: Experiments show that GKEAL achieves state-of-the-art performance on several benchmark datasets.

Few-shot class incremental learning (FSCIL) aims to address catastrophic forgetting during class incremental learning in a few-shot learning setting. In this paper, we approach the FSCIL by adopting analytic learning, a technique that converts network training into linear problems. This is inspired by the fact that the recursive implementation (batch-by-batch learning) of analytic learning gives identical weights to that produced by training on the entire dataset at once. The recursive implementation and the weight-identical property highly resemble the FSCIL setting (phase-by-phase learning) and its goal of avoiding catastrophic forgetting. By bridging the FSCIL with the analytic learning, we propose a Gaussian kernel embedded analytic learning (GKEAL) for FSCIL. The key components of GKEAL include the kernel analytic module which allows the GKEAL to conduct FSCIL in a recursive manner, and the augmented feature concatenation module that balances the preference between old and new tasks especially effectively under the few-shot setting. Our experiments show that the GKEAL gives state-of-the-art performance on several benchmark datasets.
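The weight-identical property that motivates GKEAL can be checked with ordinary recursive least squares: batch-by-batch updates reproduce the full-batch ridge solution exactly. The sketch below shows only that generic equivalence, not GKEAL itself, which additionally embeds features with Gaussian kernels before the analytic solve; `rls_fit` and `lam` are illustrative names.

```python
import numpy as np

def rls_fit(batches, dim, lam=1e-2):
    """Recursive (batch-by-batch) ridge regression. With P initialized to
    (lam*I)^-1 and W to zero, the final W equals the full-batch solution
    (X^T X + lam*I)^-1 X^T Y -- the weight-identical property."""
    W = np.zeros((dim, 1))
    P = np.eye(dim) / lam
    for X, y in batches:
        S = np.eye(len(X)) + X @ P @ X.T       # Woodbury innovation term
        K = P @ X.T @ np.linalg.inv(S)         # gain for this batch
        W = W + K @ (y - X @ W)                # correct with new batch only
        P = P - K @ X @ P                      # update inverse covariance
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5)); y = rng.normal(size=(20, 1))
W_rec = rls_fit([(X[:10], y[:10]), (X[10:], y[10:])], dim=5)
W_full = np.linalg.solve(X.T @ X + 1e-2 * np.eye(5), X.T @ y)
# W_rec matches W_full, although the second batch never revisits the first
```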

Active Exploration of Multimodal Complementarity for Few-Shot Action Recognition
Wanyan, Yuyang and Yang, Xiaoshan and Chen, Chaofan and Xu, Changsheng



Research question: How to effectively exploit multimodal information for few-shot action recognition.
Motivation: Despite remarkable progress in few-shot action recognition, most methods mainly rely on limited unimodal data (e.g., RGB frames), leaving multimodal information relatively underexplored.
Method: A novel Active Multimodal Few-shot Action Recognition (AMFAR) framework is proposed, which actively finds the reliable modality for each sample based on task-dependent context information to improve few-shot inference. In meta-training, an Active Sample Selection (ASS) module organizes query samples with large differences in modality reliability into different groups according to modality-specific posterior distributions, and an Active Mutual Distillation (AMD) module captures discriminative task-specific knowledge from the reliable modality to improve the representation learning of the unreliable modality via bidirectional knowledge distillation. In meta-test, an Adaptive Multimodal Inference (AMI) module adaptively fuses the modality-specific posterior distributions with a larger weight on the reliable modality.
Results: Extensive experiments on four public benchmarks show that the model achieves significant improvements over existing unimodal and multimodal methods.

Recently, few-shot action recognition receives increasing attention and achieves remarkable progress. However, previous methods mainly rely on limited unimodal data (e.g., RGB frames) while the multimodal information remains relatively underexplored. In this paper, we propose a novel Active Multimodal Few-shot Action Recognition (AMFAR) framework, which can actively find the reliable modality for each sample based on task-dependent context information to improve few-shot reasoning procedure. In meta-training, we design an Active Sample Selection (ASS) module to organize query samples with large differences in the reliability of modalities into different groups based on modality-specific posterior distributions. In addition, we design an Active Mutual Distillation (AMD) module to capture discriminative task-specific knowledge from the reliable modality to improve the representation learning of unreliable modality by bidirectional knowledge distillation. In meta-test, we adopt Adaptive Multimodal Inference (AMI) module to adaptively fuse the modality-specific posterior distributions with a larger weight on the reliable modality. Extensive experimental results on four public benchmarks demonstrate that our model achieves significant improvements over existing unimodal and multimodal methods.

Deep Learning of Partial Graph Matching via Differentiable Top-K
Wang, Runzhong and Guo, Ziao and Jiang, Shaofei and Yang, Xiaokang and Yan, Junchi



Research question: This paper addresses the NP-hard graph matching (GM) problem, particularly in the presence of outliers.
Motivation: Outliers are ubiquitous in graph matching, especially for vision problems. Existing affinity-maximization-based methods lack a principled scheme to suppress false matches and rely on handcrafted thresholds to dismiss outliers.
Method: Partial graph matching is formulated as a top-k selection task with a given/estimated number of inliers k. A differentiable top-k module enables effective gradient descent over the optimal-transport layer and plugs readily into SOTA deep GM pipelines, including the quadratic matching network NGMv2 and the linear matching network GCAN. Attention-fused aggregation layers are further developed to estimate k, enabling automatic outlier-robust matching in the wild.
Results: Experiments show that the method outperforms other partial matching schemes on popular benchmarks.

Graph matching (GM) aims at discovering node matching between graphs, by maximizing the node- and edge-wise affinities between the matched elements. As an NP-hard problem, its challenge is further pronounced in the existence of outlier nodes in both graphs which is ubiquitous in practice, especially for vision problems. However, popular affinity-maximization-based paradigms often lack a principled scheme to suppress the false matching and resort to handcrafted thresholding to dismiss the outliers. This limitation is also inherited by the neural GM solvers though they have shown superior performance in the ideal no-outlier setting. In this paper, we propose to formulate the partial GM problem as the top-k selection task with a given/estimated number of inliers k. Specifically, we devise a differentiable top-k module that enables effective gradient descent over the optimal-transport layer, which can be readily plugged into SOTA deep GM pipelines including the quadratic matching network NGMv2 as well as the linear matching network GCAN. Meanwhile, the attention-fused aggregation layers are developed to estimate k to enable automatic outlier-robust matching in the wild. Last but not least, we remake and release a new benchmark called IMC-PT-SparseGM, originating from the IMC-PT stereo-matching dataset. The new benchmark involves more scale-varying graphs and partial matching instances from the real world. Experiments show that our methods outperform other partial matching schemes on popular benchmarks.
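A differentiable top-k over an optimal-transport layer can be sketched as entropic transport of n scores into two bins, "selected" (capacity k) and "not selected" (capacity n-k), solved with Sinkhorn iterations so gradients flow through the selection. The quadratic cost design and the name `soft_topk` are illustrative assumptions; the paper's module may use a different parameterization.

```python
import numpy as np

def soft_topk(scores, k, eps=0.01, n_iter=500):
    """Soft top-k selection as entropic optimal transport: move n normalized
    scores to two bins with capacities n-k and k via Sinkhorn scaling.
    As eps -> 0 the result approaches a hard top-k indicator."""
    s = np.asarray(scores, dtype=float)
    s = (s - s.min()) / (s.max() - s.min() + 1e-12)   # normalize to [0, 1]
    n = len(s)
    C = np.stack([s**2, (s - 1.0)**2], axis=1)        # cost to bins 0 and 1
    K = np.exp(-C / eps)
    a = np.full(n, 1.0 / n)                           # row marginals (items)
    b = np.array([(n - k) / n, k / n])                # bin capacities
    u, v = np.ones(n), np.ones(2)
    for _ in range(n_iter):                           # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    gamma = u[:, None] * K * v[None, :]
    return gamma[:, 1] * n                            # ~1 for top-k items

sel = soft_topk([0.9, 0.1, 0.8, 0.2, 0.7], k=2)
# the two highest scores (indices 0 and 2) receive weights near 1
```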

Super-CLEVR: A Virtual Benchmark To Diagnose Domain Robustness in Visual Reasoning
Li, Zhuowan and Wang, Xingrui and Stengel-Eskin, Elias and Kortylewski, Adam and Ma, Wufei and Van Durme, Benjamin and Yuille, Alan L.



Research question: Visual question answering (VQA) models perform poorly on out-of-distribution data and struggle with domain generalization.
Motivation: Due to the multimodal nature of the VQA task, multiple factors of variation are intertwined, making generalization hard to analyze. A virtual benchmark, Super-CLEVR, is therefore introduced to isolate different factors in VQA domain shifts so that their effects can be studied independently.
Method: Super-CLEVR considers four factors: visual complexity, question redundancy, concept distribution and concept compositionality. With controllably generated data, Super-CLEVR tests VQA methods in situations where the test data differs from the training data along each of these axes. Four existing methods are studied, including two neural-symbolic methods, NSCL and NSVQA, and two non-symbolic methods, FiLM and mDETR, together with the proposed probabilistic NSVQA (P-NSVQA), which extends NSVQA with uncertainty reasoning. P-NSVQA outperforms the other methods on three of the four domain shift factors.
Results: The results suggest that disentangling reasoning and perception, combined with probabilistic uncertainty, forms a strong VQA model that is more robust to domain shifts. The dataset and code are released at https://github.com/Lizw14/Super-CLEVR.

Visual Question Answering (VQA) models often perform poorly on out-of-distribution data and struggle on domain generalization. Due to the multi-modal nature of this task, multiple factors of variation are intertwined, making generalization difficult to analyze. This motivates us to introduce a virtual benchmark, Super-CLEVR, where different factors in VQA domain shifts can be isolated in order that their effects can be studied independently. Four factors are considered: visual complexity, question redundancy, concept distribution and concept compositionality. With controllably generated data, Super-CLEVR enables us to test VQA methods in situations where the test data differs from the training data along each of these axes. We study four existing methods, including two neural symbolic methods NSCL and NSVQA, and two non-symbolic methods FiLM and mDETR; and our proposed method, probabilistic NSVQA (P-NSVQA), which extends NSVQA with uncertainty reasoning. P-NSVQA outperforms other methods on three of the four domain shift factors. Our results suggest that disentangling reasoning and perception, combined with probabilistic uncertainty, form a strong VQA model that is more robust to domain shifts. The dataset and code are released at https://github.com/Lizw14/Super-CLEVR.

Domain Expansion of Image Generators
Nitzan, Yotam and Gharbi, Michaël and Zhang, Richard and Park, Taesung and Zhu, Jun-Yan and Cohen-Or, Daniel and Shechtman, Eli



Research question: Can new concepts be injected into an already trained generative model while respecting its existing structure and knowledge?
Motivation: To address this question, a new task, domain expansion, is proposed.
Method: A new method is proposed that "repurposes" unused, "dormant" axes of the pretrained generator's latent space to represent new domains without perturbing the original representation.
Results: Experiments show that pretrained generators have the capacity to add several, even hundreds of, new domains. With this expansion technique, one "expanded" model can supersede numerous domain-specific models without increasing model size; moreover, a single expanded generator natively supports smooth transitions between, and compositions of, domains.

Can one inject new concepts into an already trained generative model, while respecting its existing structure and knowledge? We propose a new task -- domain expansion -- to address this. Given a pretrained generator and novel (but related) domains, we expand the generator to jointly model all domains, old and new, harmoniously. First, we note the generator contains a meaningful, pretrained latent space. Is it possible to minimally perturb this hard-earned representation, while maximally representing the new domains? Interestingly, we find that the latent space offers unused, "dormant" axes, which do not affect the output. This provides an opportunity -- by "repurposing" these axes, we are able to represent new domains, without perturbing the original representation. In fact, we find that pretrained generators have the capacity to add several -- even hundreds -- of new domains! Using our expansion technique, one "expanded" model can supersede numerous domain-specific models, without expanding model size. Additionally, using a single, expanded generator natively supports smooth transitions between and composition of domains.

FREDOM: Fairness Domain Adaptation Approach to Semantic Scene Understanding
Truong, Thanh-Dat and Le, Ngan and Raj, Bhiksha and Cothren, Jackson and Luu, Khoa



Research question: Although domain adaptation for semantic scene segmentation has improved markedly in recent years, its fairness concerns have yet to be well defined and addressed.
Motivation: Fairness is one of the most critical aspects when deploying segmentation models in human-related real-world applications such as autonomous driving, since any unfair prediction could affect human safety.
Method: A novel Fairness Domain Adaptation (FREDOM) approach to semantic scene segmentation is proposed. Starting from the formulated fairness objective, a new adaptation framework is introduced based on the fair treatment of class distributions. To generally model the context of structural dependency, a new conditional structural constraint is introduced to enforce the consistency of predicted segmentations.
Results: Ablation studies show that the method improves the performance of segmentation models and promotes fairness in model predictions. Experiments on two standard benchmarks, SYNTHIA -> Cityscapes and GTA5 -> Cityscapes, show that the method achieves state-of-the-art performance.

Although Domain Adaptation in Semantic Scene Segmentation has shown impressive improvement in recent years, the fairness concerns in the domain adaptation have yet to be well defined and addressed. In addition, fairness is one of the most critical aspects when deploying the segmentation models into human-related real-world applications, e.g., autonomous driving, as any unfair predictions could influence human safety. In this paper, we propose a novel Fairness Domain Adaptation (FREDOM) approach to semantic scene segmentation. In particular, from the proposed formulated fairness objective, a new adaptation framework will be introduced based on the fair treatment of class distributions. Moreover, to generally model the context of structural dependency, a new conditional structural constraint is introduced to impose the consistency of predicted segmentation. Thanks to the proposed Conditional Structure Network, the self-attention mechanism has sufficiently modeled the structural information of segmentation. Through the ablation studies, the proposed method has shown the performance improvement of the segmentation models and promoted fairness in the model predictions. The experimental results on the two standard benchmarks, i.e., SYNTHIA -> Cityscapes and GTA5 -> Cityscapes, have shown that our method achieved State-of-the-Art (SOTA) performance.

SimpleNet: A Simple Network for Image Anomaly Detection and Localization
Liu, Zhikang and Zhou, Yiming and Xu, Yuansheng and Wang, Zilei



Research question: Propose a simple and practical network (called SimpleNet) for detecting and localizing anomalies.
Motivation: Current networks for anomaly detection and localization suffer from issues such as large training-data requirements and heavy computational cost.
Method: SimpleNet consists of four components: a pre-trained feature extractor that generates local features; a shallow feature adapter that transfers local features to the target domain; a simple anomaly feature generator that counterfeits anomaly features by adding Gaussian noise to normal features; and a binary anomaly discriminator that distinguishes anomaly features from normal ones. The anomaly feature generator is discarded during inference.
Results: Experiments show that, despite its simplicity, SimpleNet outperforms previous methods quantitatively and qualitatively. On the MVTec AD benchmark, SimpleNet achieves a 99.6% anomaly detection AUROC, reducing the error by 55.5% compared with the next best model. SimpleNet is also faster than existing methods, running at 77 FPS on a 3080ti GPU, and shows significant improvements on the one-class novelty detection task.

We propose a simple and application-friendly network (called SimpleNet) for detecting and localizing anomalies. SimpleNet consists of four components: (1) a pre-trained Feature Extractor that generates local features, (2) a shallow Feature Adapter that transfers local features towards target domain, (3) a simple Anomaly Feature Generator that counterfeits anomaly features by adding Gaussian noise to normal features, and (4) a binary Anomaly Discriminator that distinguishes anomaly features from normal features. During inference, the Anomaly Feature Generator would be discarded. Our approach is based on three intuitions. First, transforming pre-trained features to target-oriented features helps avoid domain bias. Second, generating synthetic anomalies in feature space is more effective, as defects may not have much commonality in the image space. Third, a simple discriminator is much more efficient and practical. In spite of simplicity, SimpleNet outperforms previous methods quantitatively and qualitatively. On the MVTec AD benchmark, SimpleNet achieves an anomaly detection AUROC of 99.6%, reducing the error by 55.5% compared to the next best performing model. Furthermore, SimpleNet is faster than existing methods, with a high frame rate of 77 FPS on a 3080ti GPU. Additionally, SimpleNet demonstrates significant improvements in performance on the One-Class Novelty Detection task. Code: https://github.com/DonaldRR/SimpleNet.
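Components (3) and (4) can be sketched end to end: counterfeit anomaly features by adding Gaussian noise to normal features, then train a small binary discriminator on the two sets and use its output as the anomaly score. The 4-d features, noise scale, and one-hidden-layer discriminator below are illustrative stand-ins, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for adapted local features of normal training images.
normal = rng.normal(scale=0.2, size=(400, 4))
# Anomaly Feature Generator: counterfeit anomalies = normal + Gaussian noise.
fake_anom = normal + rng.normal(scale=1.0, size=normal.shape)

# Binary Anomaly Discriminator: a tiny one-hidden-layer MLP trained with
# logistic loss by full-batch gradient descent.
X = np.vstack([normal, fake_anom])
y = np.concatenate([np.zeros(len(normal)), np.ones(len(fake_anom))])
W1 = rng.normal(scale=0.5, size=(4, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=16);      b2 = 0.0
lr = 0.2
for _ in range(3000):
    h = np.maximum(X @ W1 + b1, 0.0)              # ReLU hidden layer
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))      # sigmoid output
    g = (p - y) / len(X)                          # dLoss/dlogit
    gW2 = h.T @ g; gb2 = g.sum()
    gh = np.outer(g, W2) * (h > 0)                # backprop through ReLU
    W1 -= lr * (X.T @ gh); b1 -= lr * gh.sum(axis=0)
    W2 -= lr * gW2;        b2 -= lr * gb2

def anomaly_score(f):
    """Higher = more anomalous; the generator is discarded at inference."""
    h = np.maximum(f @ W1 + b1, 0.0)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
```

After training, counterfeited anomalies score higher on average than normal features, and a far-off test feature scores higher than a typical normal one.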

Decoupling MaxLogit for Out-of-Distribution Detection
Zhang, Zihan and Xiang, Xiang



Research question: In machine learning, standard training yields anomalously high confidence for both in-distribution (ID) and out-of-distribution (OOD) data, so the ability to detect OOD samples is critical for model deployment.
Motivation: A logit-based scoring function, MaxLogit, was proposed for this problem, but its performance is found to be encumbered by MaxNorm.
Method: The logit is reformulated into cosine similarity and logit norm, motivating the use of MaxCosine and MaxNorm. A Decoupling MaxLogit (DML) method is proposed to flexibly balance MaxCosine and MaxNorm, and is extended to DML+ to further embody the core idea.
Results: Experiments show that the proposed logit-based OOD detection methods are highly effective on CIFAR-10, CIFAR-100 and ImageNet, establishing state-of-the-art performance.

In machine learning, it is often observed that standard training outputs anomalously high confidence for both in-distribution (ID) and out-of-distribution (OOD) data. Thus, the ability to detect OOD samples is critical to the model deployment. An essential step for OOD detection is post-hoc scoring. MaxLogit is one of the simplest scoring functions which uses the maximum logits as OOD score. To provide a new viewpoint to study the logit-based scoring function, we reformulate the logit into cosine similarity and logit norm and propose to use MaxCosine and MaxNorm. We empirically find that MaxCosine is a core factor in the effectiveness of MaxLogit. And the performance of MaxLogit is encumbered by MaxNorm. To tackle the problem, we propose the Decoupling MaxLogit (DML) for flexibility to balance MaxCosine and MaxNorm. To further embody the core of our method, we extend DML to DML+ based on the new insights that fewer hard samples and compact feature space are the key components to make logit-based methods effective. We demonstrate the effectiveness of our logit-based OOD detection methods on CIFAR-10, CIFAR-100 and ImageNet and establish state-of-the-art performance.
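The reformulation rests on the identity w_c . f = ||w_c|| ||f|| cos(theta_c), which splits the maximum logit into a cosine term and a norm term. The sketch below recombines them with a balancing weight `lam`; the paper's exact normalization and the DML+ variant are not reproduced here, so treat the function as a decoupled-score illustration rather than the paper's formula.

```python
import numpy as np

def dml_score(feature, class_weights, lam=1.0):
    """Decoupled MaxLogit-style ID score: lam * MaxCosine + feature norm.
    MaxCosine is the largest cosine similarity between the feature and any
    class weight vector; higher score = more in-distribution."""
    f = np.asarray(feature, dtype=float)
    W = np.asarray(class_weights, dtype=float)
    cos = W @ f / (np.linalg.norm(W, axis=1) * np.linalg.norm(f) + 1e-12)
    max_cosine = cos.max()
    max_norm = np.linalg.norm(f)
    return lam * max_cosine + max_norm

W = np.array([[1.0, 0.0], [0.0, 1.0]])   # two class weight vectors
id_feat = np.array([3.0, 0.1])           # aligned with class 0, large norm
ood_feat = np.array([0.5, 0.5])          # between classes, small norm
# the ID-like feature receives a higher score than the OOD-like one
```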

Few-Shot Class-Incremental Learning via Class-Aware Bilateral Distillation
Zhao, Linglan and Lu, Jing and Xu, Yunlu and Cheng, Zhanzhan and Guo, Dashan and Niu, Yi and Fang, Xiangzhong



Research question: This paper tackles the data-scarcity challenge of few-shot class-incremental learning (FSCIL), i.e., how to continually learn novel classes from only a few training samples.
Motivation: When applied to FSCIL, existing class-incremental learning methods tend to overfit the novel classes due to data scarcity.
Method: A novel knowledge distillation structure is proposed that draws knowledge from two complementary teachers: a model trained on abundant base-class data, which eases overfitting to the current novel classes, and the updated model from the last incremental session, which contains the adapted knowledge of previous novel classes and mitigates their forgetting. An adaptive strategy conditioned on class-wise semantic similarities is introduced to combine the two guidances.
Results: Extensive experiments on three popular FSCIL datasets, mini-ImageNet, CIFAR100 and CUB200, show that the method surpasses existing works by a significant margin, validating its effectiveness.

Few-Shot Class-Incremental Learning (FSCIL) aims to continually learn novel classes based on only few training samples, which poses a more challenging task than the well-studied Class-Incremental Learning (CIL) due to data scarcity. While knowledge distillation, a prevailing technique in CIL, can alleviate the catastrophic forgetting of older classes by regularizing outputs between current and previous model, it fails to consider the overfitting risk of novel classes in FSCIL. To adapt the powerful distillation technique for FSCIL, we propose a novel distillation structure, by taking the unique challenge of overfitting into account. Concretely, we draw knowledge from two complementary teachers. One is the model trained on abundant data from base classes that carries rich general knowledge, which can be leveraged for easing the overfitting of current novel classes. The other is the updated model from last incremental session that contains the adapted knowledge of previous novel classes, which is used for alleviating their forgetting. To combine the guidances, an adaptive strategy conditioned on the class-wise semantic similarities is introduced. Besides, for better preserving base class knowledge when accommodating novel concepts, we adopt a two-branch network with an attention-based aggregation module to dynamically merge predictions from two complementary branches. Extensive experiments on 3 popular FSCIL datasets: mini-ImageNet, CIFAR100 and CUB200 validate the effectiveness of our method by surpassing existing works by a significant margin.

Detection of Out-of-Distribution Samples Using Binary Neuron Activation Patterns
Olber, Bart{\l



Research question: Deep neural networks excel in various applications but face the challenge of out-of-distribution (OOD) samples.
Motivation: Identifying and handling previously unseen inputs is crucial in safety-critical applications such as self-driving cars, unmanned aerial vehicles and robots.
Method: A novel OOD detection method is proposed, motivated by a theoretical analysis of neuron activation patterns in ReLU-based architectures; binary representations of the activation patterns are extracted from convolutional layers, avoiding high computational overhead.
Results: Extensive empirical evaluation shows high performance across various deep neural network architectures and seven image datasets.

Deep neural networks (DNN) have outstanding performance in various applications. Despite numerous efforts of the research community, out-of-distribution (OOD) samples remain a significant limitation of DNN classifiers. The ability to identify previously unseen inputs as novel is crucial in safety-critical applications such as self-driving cars, unmanned aerial vehicles, and robots. Existing approaches to detect OOD samples treat a DNN as a black box and evaluate the confidence score of the output predictions. Unfortunately, this method frequently fails, because DNNs are not trained to reduce their confidence for OOD inputs. In this work, we introduce a novel method for OOD detection. Our method is motivated by theoretical analysis of neuron activation patterns (NAP) in ReLU-based architectures. The proposed method does not introduce a high computational overhead due to the binary representation of the activation patterns extracted from convolutional layers. The extensive empirical evaluation proves its high performance on various DNN architectures and seven image datasets.
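The binary activation-pattern idea can be sketched directly: record which neurons fire (post-ReLU activations above zero) on training data, then score a test input by the Hamming distance of its pattern to the nearest stored pattern. This is a simplified stand-in for the paper's extraction and scoring details; `binary_pattern` and `nap_ood_score` are illustrative names.

```python
import numpy as np

def binary_pattern(activations):
    """Binarize post-ReLU activations: 1 where the neuron fired."""
    return (np.asarray(activations) > 0).astype(np.uint8)

def nap_ood_score(test_act, train_patterns):
    """Hamming distance from the test activation pattern to the nearest
    pattern observed during training; larger = more OOD."""
    t = binary_pattern(test_act)
    return int(min(np.count_nonzero(t != p) for p in train_patterns))

train = [binary_pattern(a) for a in ([1.2, 0.0, 3.1, 0.0],
                                     [0.7, 0.0, 2.0, 0.1])]
in_dist = nap_ood_score([0.9, 0.0, 1.5, 0.0], train)   # same firing pattern
far_out = nap_ood_score([0.0, 2.2, 0.0, 1.3], train)   # fires other neurons
```

The in-distribution input reproduces a stored pattern (distance 0), while the input that fires a disjoint set of neurons sits far from every stored pattern.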

Enlarging Instance-Specific and Class-Specific Information for Open-Set Action Recognition
Cen, Jun and Zhang, Shiwei and Wang, Xiang and Pei, Yixuan and Qing, Zhiwu and Zhang, Yingya and Chen, Qifeng



Research question: This paper addresses open-set action recognition, i.e., how to reject unknown human actions outside the distribution of the training set.
Motivation: Existing methods mainly focus on learning better uncertainty scores but overlook the importance of feature representations. The authors find that, under the same uncertainty scores, features with richer semantic diversity can significantly improve open-set performance.
Method: The paper first analyzes feature representation behavior in the open-set action recognition (OSAR) problem based on the information bottleneck theory and proposes enlarging the instance-specific (IS) and class-specific (CS) information in features. To this end, a novel Prototypical Similarity Learning (PSL) framework is proposed to keep the instance variance within the same class and thus retain more IS information. Observing that unknown samples with appearances similar to known samples are easily misclassified as known classes, video shuffling is further introduced in PSL to learn distinct temporal information between original and shuffled samples, which enlarges the CS information.
Results: Extensive experiments show that PSL significantly boosts both open-set and closed-set performance and achieves state-of-the-art results on multiple benchmarks. Code is available at https://github.com/Jun-CEN/PSL.

Open-set action recognition is to reject unknown human action cases which are out of the distribution of the training set. Existing methods mainly focus on learning better uncertainty scores but dismiss the importance of feature representations. We find that features with richer semantic diversity can significantly improve the open-set performance under the same uncertainty scores. In this paper, we begin with analyzing the feature representation behavior in the open-set action recognition (OSAR) problem based on the information bottleneck (IB) theory, and propose to enlarge the instance-specific (IS) and class-specific (CS) information contained in the feature for better performance. To this end, a novel Prototypical Similarity Learning (PSL) framework is proposed to keep the instance variance within the same class to retain more IS information. Besides, we notice that unknown samples sharing similar appearances to known samples are easily misclassified as known classes. To alleviate this issue, video shuffling is further introduced in our PSL to learn distinct temporal information between original and shuffled samples, which we find enlarges the CS information. Extensive experiments demonstrate that the proposed PSL can significantly boost both the open-set and closed-set performance and achieves state-of-the-art results on multiple benchmarks. Code is available at https://github.com/Jun-CEN/PSL.

Deep Hashing With Minimal-Distance-Separated Hash Centers
Wang, Liangdao and Pan, Yan and Liu, Cong and Lai, Hanjiang and Yin, Jian and Liu, Ye



Research question: How to improve the efficiency and effectiveness of large-scale image retrieval.
Motivation: Existing supervised deep hashing methods suffer from low training efficiency, insufficient coverage of the data distribution, and pair imbalance problems.
Method: An optimization method is proposed that uses the Gilbert-Varshamov bound from coding theory to obtain a large minimal distance between hash centers while ensuring the empirical feasibility of the optimization. With these clearly separated hash centers, each assigned to one image class, several effective loss functions are proposed to train deep hashing networks.
Results: Extensive experiments on three image retrieval datasets show that the method outperforms state-of-the-art deep hashing methods in retrieval performance.

Deep hashing is an appealing approach for large-scale image retrieval. Most existing supervised deep hashing methods learn hash functions using pairwise or triple image similarities in randomly sampled mini-batches. They suffer from low training efficiency, insufficient coverage of data distribution, and pair imbalance problems. Recently, central similarity quantization (CSQ) attacks the above problems by using "hash centers" as a global similarity metric, which encourages the hash codes of similar images to approach their common hash center and distance themselves from other hash centers. Although achieving SOTA retrieval performance, CSQ falls short of a worst-case guarantee on the minimal distance between its constructed hash centers, i.e. the hash centers can be arbitrarily close. This paper presents an optimization method that finds hash centers with a constraint on the minimal distance between any pair of hash centers, which is non-trivial due to the non-convex nature of the problem. More importantly, we adopt the Gilbert-Varshamov bound from coding theory, which helps us to obtain a large minimal distance while ensuring the empirical feasibility of our optimization approach. With these clearly separated hash centers, each assigned to one image class, we propose several effective loss functions to train deep hashing networks. Extensive experiments on three datasets for image retrieval demonstrate that the proposed method achieves superior retrieval performance over the state-of-the-art deep hashing methods.
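The quantity being guaranteed, the minimal pairwise Hamming distance between hash centers, is easy to compute, and the Gilbert-Varshamov regime can be illustrated with a naive rejection-sampling constructor. The `greedy_centers` routine below is a stand-in for the paper's actual optimization, used only to show that codes meeting a distance constraint exist and can be checked.

```python
import itertools
import numpy as np

def min_pairwise_hamming(centers):
    """Smallest Hamming distance between any pair of binary hash centers --
    the worst case that CSQ leaves unbounded and this paper constrains."""
    return min(np.count_nonzero(a != b)
               for a, b in itertools.combinations(centers, 2))

def greedy_centers(n_classes, n_bits, d_min, seed=0):
    """Rejection-sample random codes, keeping those at least d_min away
    from all kept centers. The Gilbert-Varshamov bound guarantees such a
    code exists when n_classes is small enough; this greedy version may
    simply need retries rather than solving an optimization problem."""
    rng = np.random.default_rng(seed)
    centers = []
    while len(centers) < n_classes:
        c = rng.integers(0, 2, size=n_bits)
        if all(np.count_nonzero(c != e) >= d_min for e in centers):
            centers.append(c)
    return centers

C = greedy_centers(n_classes=10, n_bits=32, d_min=10)
# every pair of the 10 centers is at Hamming distance >= 10
```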

GEN: Pushing the Limits of Softmax-Based Out-of-Distribution Detection
Liu, Xixi and Lochman, Yaroslava and Zach, Christopher



Research question: How to perform effective OOD detection for neural networks, especially on large-scale datasets.
Motivation: OOD detection is required for the successful deployment of neural networks, particularly in safety-critical applications; performing it on large-scale datasets is closer to reality but also more challenging.
Method: A Generalized ENtropy score (GEN) is proposed, a simple but effective entropy-based scoring function that can be applied to any pre-trained softmax-based classifier.
Results: On the large-scale ImageNet-1k OOD detection benchmark, GEN performs strongly, improving the average AUROC by at least 3.5% over state-of-the-art post-hoc methods across six commonly used CNN-based and vision transformer classifiers.

Out-of-distribution (OOD) detection has been extensively studied in order to successfully deploy neural networks, in particular, for safety-critical applications. Moreover, performing OOD detection on large-scale datasets is closer to reality, but is also more challenging. Several approaches need to either access the training data for score design or expose models to outliers during training. Some post-hoc methods are able to avoid the aforementioned constraints, but are less competitive. In this work, we propose Generalized ENtropy score (GEN), a simple but effective entropy-based score function, which can be applied to any pre-trained softmax-based classifier. Its performance is demonstrated on the large-scale ImageNet-1k OOD detection benchmark. It consistently improves the average AUROC across six commonly-used CNN-based and visual transformer classifiers over a number of state-of-the-art post-hoc methods. The average AUROC improvement is at least 3.5%. Furthermore, we used GEN on top of feature-based enhancing methods as well as methods using training statistics to further improve the OOD detection performance. The code is available at: https://github.com/XixiLiu95/GEN.
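An entropy-based post-hoc score needs only the softmax outputs of the pre-trained classifier. The sketch below uses plain Shannon entropy as the simplest member of this family; GEN itself uses a generalized entropy evaluated on the top classes, so treat this as the idea rather than the paper's exact formula.

```python
import numpy as np

def entropy_ood_score(logits):
    """Entropy of the softmax distribution as an OOD score: flatter
    (higher-entropy) predictions indicate OOD. Higher score = more OOD."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()                          # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

confident = entropy_ood_score([8.0, 0.0, 0.0, 0.0])   # peaked: low entropy
uncertain = entropy_ood_score([1.0, 1.0, 1.0, 1.0])   # flat: entropy = ln(4)
```

Being purely post-hoc, such a score needs no access to training data and no exposure to outliers, which is the constraint the abstract highlights.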

Sample-Level Multi-View Graph Clustering
Tan, Yuze and Liu, Yixi and Huang, Shudong and Feng, Wentao and Lv, Jiancheng



Research question: This paper tackles key challenges in multi-view clustering, namely ignoring the topological structure of data and failing to fully preserve the consistency of local structures across views when exploring the clustering structure.
Motivation: Existing multi-view clustering algorithms often ignore the topological structure in data, and they cannot fully preserve the consistency of local structures between different views while exploring the clustering structure.
Method: A multi-view clustering method is proposed that learns the topological structure of data to exploit the implied data manifold. Considering that the consistency of multiple views is mainly manifested in generally similar local structures, while inconsistent structures are the minority, the intersections of multiple views are further explored at the sample level to better maintain cross-view consistency.
Results: Experimental results show that the method is effective on various multi-view datasets and outperforms other state-of-the-art approaches.

Multi-view clustering has been extensively studied due to its effectiveness in dealing with heterogeneous data. Despite the empirical success made by recent works, there still exist several severe challenges. Particularly, previous multi-view clustering algorithms seldom consider the topological structure in data, which is essential for clustering data on a manifold. Moreover, existing methods cannot fully preserve the consistency of local structures between different views as they explore the clustering structure in a view-wise manner. In this paper, we propose to exploit the implied data manifold by learning the topological structure of data. Besides, considering that the consistency of multiple views is manifested in the generally similar local structure while the inconsistent structures are the minority, we further explore the intersections of multiple views at the sample level such that the cross-view consistency can be better maintained. We model the above concerns in a unified framework and design an efficient algorithm to solve the corresponding optimization problem. Experimental results on various multi-view datasets certify the effectiveness of the proposed method and verify its superiority over other SOTA approaches.

Curricular Contrastive Regularization for Physics-Aware Single Image Dehazing
Zheng, Yu and Zhan, Jiahui and He, Shengfeng and Dong, Junyu and Du, Yong



Research question: How to improve the performance and interpretability of image dehazing models.
Motivation: Contrastive regularization for dehazing introduces negative samples as a lower bound, but the negatives are usually represented far from the clear (i.e., positive) image, so the solution space remains under-constrained. Moreover, the interpretability of deep dehazing models with respect to the physics of the hazing process is underexplored.
Method: A novel curricular contrastive regularization is proposed, targeting a consensual contrastive space rather than a non-consensual one. The negatives, which provide better lower-bound constraints, are assembled from 1) the hazy image and 2) corresponding restorations by other existing methods. Because the similarities between the clear image's embedding and the negatives differ, the learning difficulty of the multiple components is intrinsically imbalanced; a curriculum learning strategy is therefore customized to reweight the importance of different negatives. In addition, a physics-aware dual-branch unit is built according to the atmospheric scattering model to improve interpretability in the feature space.
Results: With this unit and the curricular contrastive regularization, the dehazing network C2PNet is established. Extensive experiments show that C2PNet significantly outperforms state-of-the-art methods, with maximum PSNR gains of 3.94 dB and 1.50 dB on the SOTS-indoor and SOTS-outdoor datasets, respectively. Code is available at https://github.com/YuZheng9/C2PNet.

Considering the ill-posed nature, contrastive regularization has been developed for single image dehazing, introducing the information from negative images as a lower bound. However, the contrastive samples are nonconsensual, as the negatives are usually represented distantly from the clear (i.e., positive) image, leaving the solution space still under-constricted. Moreover, the interpretability of deep dehazing models is underexplored towards the physics of the hazing process. In this paper, we propose a novel curricular contrastive regularization targeted at a consensual contrastive space as opposed to a non-consensual one. Our negatives, which provide better lower-bound constraints, can be assembled from 1) the hazy image, and 2) corresponding restorations by other existing methods. Further, due to the different similarities between the embeddings of the clear image and negatives, the learning difficulty of the multiple components is intrinsically imbalanced. To tackle this issue, we customize a curriculum learning strategy to reweight the importance of different negatives. In addition, to improve the interpretability in the feature space, we build a physics-aware dual-branch unit according to the atmospheric scattering model. With the unit, as well as curricular contrastive regularization, we establish our dehazing network, named C2PNet. Extensive experiments demonstrate that our C2PNet significantly outperforms state-of-the-art methods, with extreme PSNR boosts of 3.94dB and 1.50dB, respectively, on SOTS-indoor and SOTS-outdoor datasets. Code is available at https://github.com/YuZheng9/C2PNet.

Learning From Noisy Labels With Decoupled Meta Label Purifier
Tu, Yuanpeng and Zhang, Boshen and Li, Yuxi and Liu, Liang and Li, Jian and Wang, Yabiao and Wang, Chengjie and Zhao, CaiRong



Research question: When training deep neural networks with noisy labels, models easily memorize inaccurate labels, leading to poor generalization.
Motivation: To tackle this problem, meta-learning-based label correction strategies identify and correct potential noisy labels with the help of a small set of clean validation data. However, this bi-level optimization over model weights and hyperparameters limits the representation ability of the model and the accuracy of the corrected labels.
Method: This paper proposes DMLP, a novel multi-stage label purifier that decouples the label correction process into label-free representation learning and a simple meta label purifier, so that DMLP can focus on extracting discriminative features and correcting labels in two distinct stages.
Results: Experiments show that DMLP achieves state-of-the-art results on several synthetic and real-world noisy datasets, especially under high noise levels.

Training deep neural networks (DNNs) with noisy labels is challenging since DNNs can easily memorize inaccurate labels, leading to poor generalization ability. Recently, the meta-learning based label correction strategy has been widely adopted to tackle this problem by identifying and correcting potential noisy labels with the help of a small set of clean validation data. Although training with purified labels can effectively improve performance, solving the meta-learning problem inevitably involves a nested loop of bi-level optimization between model weights and hyperparameters (i.e., label distribution). As a compromise, previous methods resort to a coupled learning process with alternating updates. In this paper, we empirically find that such simultaneous optimization over both model weights and label distribution cannot achieve an optimal routine, consequently limiting the representation ability of the backbone and the accuracy of the corrected labels. From this observation, a novel multi-stage label purifier named DMLP is proposed. DMLP decouples the label correction process into label-free representation learning and a simple meta label purifier. In this way, DMLP can focus on extracting discriminative features and correcting labels in two distinctive stages. DMLP is a plug-and-play label purifier: the purified labels can be directly reused in naive end-to-end network retraining or other robust learning methods, where state-of-the-art results are obtained on several synthetic and real-world noisy datasets, especially under high noise levels.

Sharpness-Aware Gradient Matching for Domain Generalization
Wang, Pengfei and Zhang, Zhaoxiang and Lei, Zhen and Zhang, Lei



Research question: How to improve the generalization capability of a model learned from a source domain to unseen domains.
Motivation: Although existing Sharpness-Aware Minimization (SAM) methods perform well on domain generalization (DG), they do not always converge to the desired flat region with a small loss value.
Method: Two conditions are presented to ensure that the model converges to a flat minimum with a small loss, and the Sharpness-Aware Gradient Matching (SAGM) algorithm is proposed to satisfy both conditions and enhance generalization.
Results: Experimental results show that SAGM consistently outperforms state-of-the-art methods on five DG benchmarks: PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet.

The goal of domain generalization (DG) is to enhance the generalization capability of the model learned from a source domain to other unseen domains. The recently developed Sharpness-Aware Minimization (SAM) method aims to achieve this goal by minimizing the sharpness measure of the loss landscape. Though SAM and its variants have demonstrated impressive DG performance, they may not always converge to the desired flat region with a small loss value. In this paper, we present two conditions to ensure that the model could converge to a flat minimum with a small loss, and present an algorithm, named Sharpness-Aware Gradient Matching (SAGM), to meet the two conditions for improving model generalization capability. Specifically, the optimization objective of SAGM will simultaneously minimize the empirical risk, the perturbed loss (i.e., the maximum loss within a neighborhood in the parameter space), and the gap between them. By implicitly aligning the gradient directions between the empirical risk and the perturbed loss, SAGM improves the generalization capability over SAM and its variants without increasing the computational cost. Extensive experimental results show that our proposed SAGM method consistently outperforms the state-of-the-art methods on five DG benchmarks, including PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet. Codes are available at https://github.com/Wang-pengfei/SAGM.
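The optimization objective above (empirical risk, perturbed loss, and their gap) can be illustrated on a toy 1-D quadratic risk. The sketch below uses a SAM-style ascent perturbation and descends on the sum of the two gradients; the gap term is handled only implicitly by summing them, and `rho`/`lr` are illustrative choices rather than the paper's settings:

```python
# Toy illustration of the SAGM-style objective on a 1-D quadratic risk.
def loss(w):
    return 0.5 * (w - 3.0) ** 2        # empirical risk with minimum at w = 3

def grad(w):
    return w - 3.0

def sagm_step(w, rho=0.05, lr=0.1):
    g = grad(w)
    eps = rho * g / (abs(g) + 1e-12)   # SAM-style perturbation toward higher loss
    g_pert = grad(w + eps)             # gradient of the perturbed loss L(w + eps)
    return w - lr * (g + g_pert)       # joint descent on both objectives

w = 0.0
for _ in range(200):
    w = sagm_step(w)
```

On this convex toy problem the iterate settles near the shared minimum; the interesting behavior of SAGM (preferring flat minima) only manifests on non-convex landscapes with minima of different sharpness.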

Local Connectivity-Based Density Estimation for Face Clustering
Shin, Junho and Lee, Hyo-Jun and Kim, Hyunseop and Baek, Jong-Hyeon and Kim, Daehyun and Koh, YeongJun



Research question: Existing graph-based face clustering methods predict the connectivity of an enormous number of edges, including false-positive edges that link nodes of different classes.
Motivation: These false-positive edges may merge different clusters when their connectivity is incorrectly estimated, degrading face clustering performance.
Method: This paper proposes a novel face clustering method that employs density-based clustering and maintains higher-density edges. To this end, a reliable density estimation algorithm based on the local connectivity among K nearest neighbors (KNN) is proposed.
Results: Experimental results show that the proposed clustering method significantly outperforms state-of-the-art methods on large-scale face clustering datasets and fashion image clustering datasets.

Recent graph-based face clustering methods predict the connectivity of enormous numbers of edges, including false positive edges that link nodes of different classes. Those false positive edges, which connect negative node pairs, risk merging different clusters when their connectivity is incorrectly estimated. This paper proposes a novel face clustering method to address this problem. The proposed clustering method employs density-based clustering, which maintains edges that have higher density. For this purpose, we propose a reliable density estimation algorithm based on local connectivity between K nearest neighbors (KNN). We effectively exclude negative pairs from the KNN graph based on the reliable density while maintaining sufficient positive pairs. Furthermore, we develop a pairwise connectivity estimation network to predict the connectivity of the selected edges. Experimental results demonstrate that the proposed clustering method significantly outperforms the state-of-the-art clustering methods on large-scale face clustering datasets and fashion image clustering datasets. Our code is available at https://github.com/illian01/LCE-PCENet
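A density estimate based on local KNN connectivity could be sketched as follows: a node's density is the summed cosine similarity over its mutual (reciprocal) K-nearest-neighbour links, so points whose neighbourhoods agree score high while stragglers score low. The mutual-link rule and similarity weighting are illustrative choices, not the paper's exact algorithm:

```python
import numpy as np

def knn_density(feats, k=3):
    """Density from reciprocal KNN links (a sketch, not the paper's exact rule)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)                 # exclude self-matches
    nbrs = np.argsort(-sim, axis=1)[:, :k]         # K nearest neighbours per node
    nbr_sets = [set(row) for row in nbrs]
    dens = np.zeros(len(feats))
    for i in range(len(feats)):
        for j in nbrs[i]:
            if i in nbr_sets[j]:                   # reciprocal link => reliable edge
                dens[i] += sim[i, j]
    return dens

# Two tight clusters plus one isolated point: cluster members get higher density.
pts = np.array([[1.0, 0.0], [0.99, 0.1], [0.98, -0.1],
                [0.0, 1.0], [0.1, 0.99], [-0.1, 0.98],
                [-1.0, -1.0]])
d = knn_density(pts, k=2)
```

Edges incident to low-density nodes (like the isolated point here) would then be candidates for exclusion before connectivity estimation.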

Deep Deterministic Uncertainty: A New Simple Baseline
Mukhoti, Jishnu and Kirsch, Andreas and van Amersfoort, Joost and Torr, Philip H.S. and Gal, Yarin



Research question: Seeking reliable uncertainty from deterministic single-forward-pass models, since conventional uncertainty quantification methods are computationally expensive.
Motivation: Examine two complex single-forward-pass uncertainty methods, DUQ and SNGP, to see whether they mainly rely on a well-regularized feature space.
Method: A single softmax neural network with such a regularized feature space, achieved via residual connections and spectral normalization, outperforms the epistemic uncertainty predictions of DUQ and SNGP by simply using Gaussian Discriminant Analysis post-training as a separate feature-space density estimator, without their more complex uncertainty estimation machinery.
Results: The conceptually simple Deep Deterministic Uncertainty (DDU) baseline can also disentangle aleatoric and epistemic uncertainty, and performs well on several OoD benchmarks (CIFAR-10/100 vs. SVHN/Tiny-ImageNet, ImageNet vs. ImageNet-O), in active learning settings across model architectures, and in large-scale vision tasks such as semantic segmentation, while being computationally cheaper.

Reliable uncertainty from deterministic single-forward-pass models is sought after because conventional methods of uncertainty quantification are computationally expensive. We take two complex single-forward-pass uncertainty approaches, DUQ and SNGP, and examine whether they mainly rely on a well-regularized feature space. Crucially, without using their more complex methods for estimating uncertainty, we find that a single softmax neural net with such a regularized feature space, achieved via residual connections and spectral normalization, outperforms DUQ and SNGP's epistemic uncertainty predictions using simple Gaussian Discriminant Analysis post-training as a separate feature-space density estimator---without fine-tuning on OoD data, feature ensembling, or input pre-processing. Our conceptually simple Deep Deterministic Uncertainty (DDU) baseline can also be used to disentangle aleatoric and epistemic uncertainty and performs as well as Deep Ensembles, the state-of-the-art for uncertainty prediction, on several OoD benchmarks (CIFAR-10/100 vs SVHN/Tiny-ImageNet, ImageNet vs ImageNet-O), active learning settings across different model architectures, as well as in large-scale vision tasks like semantic segmentation, while being computationally cheaper.
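The GDA step described above is simple enough to sketch: fit one Gaussian per class on the (well-regularized) feature vectors, then score a test feature by its log-density under the class mixture, with low density signalling high epistemic uncertainty. The toy 2-D features and the small covariance jitter are illustrative assumptions:

```python
import numpy as np

def fit_gda(feats, labels):
    """Fit class-conditional Gaussians (GDA) on feature vectors."""
    params = []
    for c in np.unique(labels):
        x = feats[labels == c]
        mu = x.mean(axis=0)
        cov = np.cov(x, rowvar=False) + 1e-3 * np.eye(x.shape[1])  # jitter for stability
        params.append((mu, np.linalg.inv(cov), np.linalg.slogdet(cov)[1],
                       len(x) / len(feats)))
    return params

def log_density(params, x):
    """Log marginal density of feature x under the fitted mixture (log-sum-exp)."""
    logps = []
    for mu, prec, logdet, prior in params:
        d = x - mu
        logps.append(np.log(prior)
                     - 0.5 * (logdet + d @ prec @ d + len(x) * np.log(2 * np.pi)))
    m = max(logps)
    return m + np.log(sum(np.exp(lp - m) for lp in logps))

rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(6.0, 1.0, (200, 2))])
labels = np.array([0] * 200 + [1] * 200)
gda = fit_gda(feats, labels)
```

Per the paper's split, this density would serve as the epistemic signal while softmax entropy on the same forward pass covers the aleatoric part.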

Towards Realistic Long-Tailed Semi-Supervised Learning: Consistency Is All You Need
Wei, Tong and Gan, Kai



Research question: Existing long-tailed semi-supervised learning (LTSSL) algorithms typically assume that the class distributions of labeled and unlabeled data are nearly identical, and suffer severely when they are mismatched.
Motivation: To address this, a new simple method is proposed that effectively utilizes unlabeled data of unknown class distributions by introducing an adaptive consistency regularizer (ACR).
Method: ACR estimates the true class distribution of the unlabeled data and dynamically refines pseudo-labels for various distributions in a unified formula.
Results: Experiments show that ACR achieves state-of-the-art performance on a variety of standard LTSSL benchmarks, e.g., an average 10% absolute gain in test accuracy over existing algorithms when the class distributions of labeled and unlabeled data are mismatched. Even when the class distributions are identical, ACR consistently outperforms many sophisticated LTSSL algorithms.

While long-tailed semi-supervised learning (LTSSL) has received tremendous attention in many real-world classification problems, existing LTSSL algorithms typically assume that the class distributions of labeled and unlabeled data are almost identical. Those LTSSL algorithms built upon the assumption can severely suffer when the class distributions of labeled and unlabeled data are mismatched since they utilize biased pseudo-labels from the model. To alleviate this issue, we propose a new simple method that can effectively utilize unlabeled data of unknown class distributions by introducing the adaptive consistency regularizer (ACR). ACR realizes the dynamic refinery of pseudo-labels for various distributions in a unified formula by estimating the true class distribution of unlabeled data. Despite its simplicity, we show that ACR achieves state-of-the-art performance on a variety of standard LTSSL benchmarks, e.g., an averaged 10% absolute increase of test accuracy against existing algorithms when the class distributions of labeled and unlabeled data are mismatched. Even when the class distributions are identical, ACR consistently outperforms many sophisticated LTSSL algorithms. We carry out extensive ablation studies to tease apart the factors that are most important to ACR's success. Source code is available at https://github.com/Gank0078/ACR.
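The mechanism behind such pseudo-label refinement can be sketched with a logit-adjustment-style rule: estimate the class distribution of the unlabeled data, then temper the model's pseudo-label probabilities with that prior. The paper's unified formula differs from this; `est_prior` and `tau` here are purely illustrative:

```python
import numpy as np

def refine_pseudo_labels(probs, est_prior, tau=1.0):
    """Divide out an estimated class prior so head classes stop dominating
    pseudo-labels (a sketch of the idea, not ACR's exact formula)."""
    adj = probs / np.power(est_prior, tau)
    return adj / adj.sum(axis=1, keepdims=True)

probs = np.array([[0.6, 0.3, 0.1]])    # head-class-biased prediction
est_prior = np.array([0.7, 0.2, 0.1])  # estimated long-tailed class distribution
refined = refine_pseudo_labels(probs, est_prior)
```

After dividing out the prior, the pseudo-label flips from the over-represented head class to the class the model actually rates above its base rate.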

PartMix: Regularization Strategy To Learn Part Discovery for Visible-Infrared Person Re-Identification
Kim, Minsu and Kim, Seungryong and Park, Jungin and Park, Seongheon and Sohn, Kwanghoon



Research question: This paper explores a data augmentation technique suitable for part-based visible-infrared person re-identification (VI-ReID) models.
Motivation: Existing data augmentation techniques prevent overfitting in various computer vision applications, but a proper technique tailored to part-based VI-ReID models remains unexplored.
Method: A novel data augmentation technique, PartMix, is proposed, which synthesizes augmented samples by mixing part descriptors across modalities to improve the performance of part-based VI-ReID models. An entropy-based mining strategy is also presented to weaken the adverse impact of unreliable positive and negative samples.
Results: Experimental results show that PartMix consistently boosts existing part-based VI-ReID models and outperforms existing VI-ReID methods.

Modern data augmentation using a mixture-based technique can regularize models against overfitting to the training data in various computer vision applications, but a proper data augmentation technique tailored for part-based Visible-Infrared person Re-IDentification (VI-ReID) models remains unexplored. In this paper, we present a novel data augmentation technique, dubbed PartMix, that synthesizes the augmented samples by mixing the part descriptors across the modalities to improve the performance of part-based VI-ReID models. Especially, we synthesize the positive and negative samples within the same and across different identities and regularize the backbone model through contrastive learning. In addition, we also present an entropy-based mining strategy to weaken the adverse impact of unreliable positive and negative samples. When incorporated into existing part-based VI-ReID models, PartMix consistently boosts their performance. We conduct experiments to demonstrate the effectiveness of our PartMix over the existing VI-ReID methods and provide ablation studies.

Learning Sample Relationship for Exposure Correction
Huang, Jie and Zhao, Feng and Zhou, Man and Xiao, Jie and Zheng, Naishan and Zheng, Kaiwen and Xiong, Zhiwei



Research question: This paper addresses the optimization inconsistency in existing exposure correction methods.
Motivation: Despite great progress, existing exposure correction methods are usually trained with mini-batches mixing underexposed and overexposed samples, without exploring the relationship between them to resolve the optimization inconsistency.
Method: This paper introduces a new perspective that conjoins the optimization processes by correlating and constraining the relationship of the correction procedure within a mini-batch. The core design consists of two steps: 1) formulating the exposure relationship of samples across the batch dimension via a context-irrelevant pretext task; 2) delivering the above sample relationship design as a regularization term in the loss function to promote optimization consistency.
Results: Extensive experiments on multiple representative exposure correction benchmarks demonstrate consistent performance gains from the sample relationship design.

The exposure correction task aims to correct both underexposed and overexposed images to normal exposure within a single network. As is well recognized, the optimization flows for the two are opposite. Despite the great advancement, existing exposure correction methods are usually trained with a mini-batch of mixed underexposure and overexposure samples and have not explored the relationship between them to solve the optimization inconsistency. In this paper, we introduce a new perspective to conjoin their optimization processes by correlating and constraining the relationship of the correction procedure in a mini-batch. The core designs of our framework consist of two steps: 1) formulating the exposure relationship of samples across the batch dimension via a context-irrelevant pretext task; 2) delivering the above sample relationship design as the regularization term within the loss function to promote optimization consistency. The proposed sample relationship design, as a general term, can be easily integrated into existing exposure correction methods without any computational burden at inference time. Extensive experiments over multiple representative exposure correction benchmarks demonstrate consistent performance gains by introducing our sample relationship design.

TTA-COPE: Test-Time Adaptation for Category-Level Object Pose Estimation
Lee, Taeyeop and Tremblay, Jonathan and Blukis, Valts and Wen, Bowen and Lee, Byeong-Uk and Shin, Inkyu and Birchfield, Stan and Kweon, In So and Yoon, Kuk-Jin



Research question: How to bridge the source-to-target domain gap by gradually updating the model without requiring labels on the target data.
Motivation: Addressing unsupervised domain adaptation for category-level object pose estimation.
Method: A test-time adaptation method named TTA-COPE is proposed, which designs a pose ensemble approach with a self-training loss using pose-aware confidence.
Results: Experimental results show that the proposed pose ensemble and self-training loss improve category-level object pose performance under both semi-supervised and unsupervised settings.

Test-time adaptation methods have been gaining attention recently as a practical solution for addressing source-to-target domain gaps by gradually updating the model without requiring labels on the target data. In this paper, we propose a method of test-time adaptation for category-level object pose estimation called TTA-COPE. We design a pose ensemble approach with a self-training loss using pose-aware confidence. Unlike previous unsupervised domain adaptation methods for category-level object pose estimation, our approach processes the test data in a sequential, online manner, and it does not require access to the source domain at runtime. Extensive experimental results demonstrate that the proposed pose ensemble and the self-training loss improve category-level object pose performance during test time under both semi-supervised and unsupervised settings.

Geometry and Uncertainty-Aware 3D Point Cloud Class-Incremental Semantic Segmentation
Yang, Yuwei and Hayat, Munawar and Jin, Zhao and Ren, Chao and Lei, Yinjie



Research question: Despite significant recent progress, current 3D point cloud semantic segmentation methods require training data for all classes at once and are unsuitable for real-world scenarios where new categories are continuously discovered.
Motivation: To continually learn new categories using previous knowledge, a class-incremental semantic segmentation method for 3D point clouds is proposed. Since 3D point clouds are disordered and unstructured, storing and transferring knowledge is difficult, especially when previous data is unavailable.
Method: A geometry-aware distillation module is designed to transfer the geometric characteristics of point feature associations. To counter forgetting caused by semantic shift, an uncertainty-aware pseudo-labelling scheme is developed that eliminates noise in uncertain pseudo-labels via label propagation within a local neighborhood.
Results: Extensive class-incremental experiments on S3DIS and ScanNet show impressive results, comparable to the joint training strategy (upper bound).

Despite the significant recent progress made on 3D point cloud semantic segmentation, the current methods require training data for all classes at once, and are not suitable for real-life scenarios where new categories are being continuously discovered. Substantial memory storage and expensive re-training is required to update the model to sequentially arriving data for new concepts. In this paper, to continually learn new categories using previous knowledge, we introduce class-incremental semantic segmentation of 3D point cloud. Unlike 2D images, 3D point clouds are disordered and unstructured, making it difficult to store and transfer knowledge especially when the previous data is not available. We further face the challenge of semantic shift, where previous/future classes are indiscriminately collapsed and treated as the background in the current step, causing a dramatic performance drop on past classes. We exploit the structure of point cloud and propose two strategies to address these challenges. First, we design a geometry-aware distillation module that transfers point-wise feature associations in terms of their geometric characteristics. To counter forgetting caused by the semantic shift, we further develop an uncertainty-aware pseudo-labelling scheme that eliminates noise in uncertain pseudo-labels by label propagation within a local neighborhood. Our extensive experiments on S3DIS and ScanNet in a class-incremental setting show impressive results comparable to the joint training strategy (upper bound). Code is available at: https://github.com/leolyj/3DPC-CISS

Decompose, Adjust, Compose: Effective Normalization by Playing With Frequency for Domain Generalization
Lee, Sangrok and Bae, Jongseong and Kim, Ha Young



Research question: Evaluating the robustness of computer vision models, particularly on the domain generalization (DG) task.
Motivation: Existing normalization-based DG methods suffer from content variation when removing style, because the boundary between content and style is unclear.
Method: From a frequency-domain perspective, amplitude and phase are regarded as style and content, respectively. A novel normalization method, PCNorm, is proposed, which eliminates style only while preserving content through spectral decomposition. Advanced variants, CCNorm and SCNorm, adjust the degrees of variation in content and style, respectively, so as to learn domain-agnostic representations for DG.
Results: With these normalization methods, the ResNet-variant models DAC-P and DAC-SC are proposed, which are robust to the domain gap. They outperform other recent DG methods, and DAC-SC achieves a state-of-the-art average performance of 65.6% on five datasets: PACS, VLCS, Office-Home, DomainNet, and TerraIncognita.

Domain generalization (DG) is a principal task to evaluate the robustness of computer vision models. Many previous studies have used normalization for DG. In normalization, statistics and normalized features are regarded as style and content, respectively. However, it has a content variation problem when removing style because the boundary between content and style is unclear. This study addresses this problem from the frequency domain perspective, where amplitude and phase are considered as style and content, respectively. First, we verify the quantitative phase variation of normalization through the mathematical derivation of the Fourier transform formula. Then, based on this, we propose a novel normalization method, PCNorm, which eliminates style only as the preserving content through spectral decomposition. Furthermore, we propose advanced PCNorm variants, CCNorm and SCNorm, which adjust the degrees of variations in content and style, respectively. Thus, they can learn domain-agnostic representations for DG. With the normalization methods, we propose ResNet-variant models, DAC-P and DAC-SC, which are robust to the domain gap. The proposed models outperform other recent DG methods. The DAC-SC achieves an average state-of-the-art performance of 65.6% on five datasets: PACS, VLCS, Office-Home, DomainNet, and TerraIncognita.
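The amplitude/phase reading of style and content can be sketched with a 2-D FFT: recompose the amplitude of the instance-normalized map (style removed) with the phase of the original map (content kept). This is an illustrative reading of the abstract, not the exact PCNorm layer:

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    return (x - x.mean()) / (x.std() + eps)

def pcnorm_like(x):
    """Recompose normalized amplitude ("style" removed) with original phase
    ("content" kept) -- a sketch of the spectral-decomposition idea."""
    amp_norm = np.abs(np.fft.fft2(instance_norm(x)))   # amplitude from normalized map
    phase = np.angle(np.fft.fft2(x))                   # phase from original map
    return np.fft.ifft2(amp_norm * np.exp(1j * phase)).real

rng = np.random.default_rng(1)
x = rng.normal(2.0, 3.0, (8, 8))   # feature map with non-trivial mean/std ("style")
y = pcnorm_like(x)
```

The output drops the style statistics (its mean is driven to zero via the DC component) while remaining almost perfectly correlated with the input pattern, i.e., the content survives.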

Multilateral Semantic Relations Modeling for Image Text Retrieval
Wang, Zheng and Gao, Zhenwei and Guo, Kangshuai and Yang, Yang and Wang, Xiaoming and Shen, Heng Tao



Research question: This paper addresses the one-to-many correspondence problem in image-text retrieval, i.e., bridging vision and language through fine-grained alignment.
Motivation: Although existing solutions such as multi-point mapping, probabilistic distributions, and geometric embeddings have made promising progress, one-to-many correspondence remains under-explored.
Method: A Multilateral Semantic Relations Modeling (MSRM) method is proposed, which captures the one-to-many correspondence between multiple samples and a given query via hypergraph modeling. Specifically, a given query is first mapped to a probabilistic embedding to learn its true semantic distribution based on the Mahalanobis distance; each candidate instance is then regarded as a hypergraph node with its mean semantics, while the Gaussian query is modeled as a hyperedge to capture the semantic correlations between candidate points and the query.
Results: Comprehensive experimental results on two widely used datasets show that MSRM outperforms existing methods in handling multiple matches while maintaining comparable performance on instance-level matching.

Image-text retrieval is a fundamental task that bridges vision and language by exploiting various strategies for fine-grained alignment between regions and words. It remains tough mainly because of one-to-many correspondence, where a set of matches from another modality can be accessed by a random query. While existing solutions to this problem, including multi-point mapping, probabilistic distribution, and geometric embedding, have made promising progress, one-to-many correspondence is still under-explored. In this work, we develop Multilateral Semantic Relations Modeling (termed MSRM) for image-text retrieval to capture the one-to-many correspondence between multiple samples and a given query via hypergraph modeling. Specifically, a given query is first mapped to a probabilistic embedding to learn its true semantic distribution based on the Mahalanobis distance. Then each candidate instance in a mini-batch is regarded as a hypergraph node with its mean semantics, while a Gaussian query is modeled as a hyperedge to capture the semantic correlations beyond the pair between candidate points and the query. Comprehensive experimental results on two widely used datasets demonstrate that our MSRM method outperforms state-of-the-art methods in handling multiple matches while still maintaining comparable performance on instance-level matching. Our codes and checkpoints will be released soon.

Novel Class Discovery for 3D Point Cloud Semantic Segmentation
Riz, Luigi and Saltori, Cristiano and Ricci, Elisa and Poiesi, Fabio



Research question: How to semantically segment unlabelled novel classes using supervision only from labelled base classes.
Motivation: The novel class discovery (NCD) problem has been addressed for 2D image data, but remains unsolved for 3D point cloud data.
Method: A new NCD method based on online clustering is proposed, which exploits uncertainty quantification to produce prototypes for pseudo-labelling the points of novel classes.
Results: A thorough evaluation on the SemanticKITTI and SemanticPOSS datasets shows that the method significantly outperforms the baseline.

Novel class discovery (NCD) for semantic segmentation is the task of learning a model that can segment unlabelled (novel) classes using only the supervision from labelled (base) classes. This problem has recently been pioneered for 2D image data, but no work exists for 3D point cloud data. In fact, the assumptions made for 2D are loosely applicable to 3D in this case. This paper is presented to advance the state of the art on point cloud data analysis in four directions. Firstly, we address the new problem of NCD for point cloud semantic segmentation. Secondly, we show that the transposition of the only existing NCD method for 2D semantic segmentation to 3D data is suboptimal. Thirdly, we present a new method for NCD based on online clustering that exploits uncertainty quantification to produce prototypes for pseudo-labelling the points of the novel classes. Lastly, we introduce a new evaluation protocol to assess the performance of NCD for point cloud semantic segmentation. We thoroughly evaluate our method on SemanticKITTI and SemanticPOSS datasets, showing that it can significantly outperform the baseline. Project page: https://github.com/LuigiRiz/NOPS.

Normalizing Flow Based Feature Synthesis for Outlier-Aware Object Detection
Kumar, Nishant and \v{S



Research question: How to improve an object detector's ability to recognize outlier objects.
Motivation: General-purpose object detectors such as Faster R-CNN tend to give overconfident predictions for outlier objects, which is critical for applications such as autonomous driving.
Method: A novel outlier-aware object detection framework is proposed that distinguishes outliers from inlier objects by learning the joint data distribution of all inlier classes with an invertible normalizing flow.
Results: The method significantly outperforms state-of-the-art outlier-aware object detection techniques on both image and video datasets.

Real-world deployment of reliable object detectors is crucial for applications such as autonomous driving. However, general-purpose object detectors like Faster R-CNN are prone to providing overconfident predictions for outlier objects. Recent outlier-aware object detection approaches estimate the density of instance-wide features with class-conditional Gaussians and train on synthesized outlier features from their low-likelihood regions. However, this strategy does not guarantee that the synthesized outlier features will have a low likelihood according to the other class-conditional Gaussians. We propose a novel outlier-aware object detection framework that distinguishes outliers from inlier objects by learning the joint data distribution of all inlier classes with an invertible normalizing flow. The appropriate sampling of the flow model ensures that the synthesized outliers have a lower likelihood than inliers of all object classes, thereby modeling a better decision boundary between inlier and outlier objects. Our approach significantly outperforms the state-of-the-art for outlier-aware object detection on both image and video datasets.

DivClust: Controlling Diversity in Deep Clustering
Metaxas, Ioannis Maniadis and Tzimiropoulos, Georgios and Patras, Ioannis



Research question: Existing deep clustering methods cannot efficiently produce multiple, diverse partitionings of a given dataset.
Motivation: A diverse set of base clusterings is necessary for consensus clustering, which produces better and more robust results, but existing methods cannot generate such sets effectively.
Method: DivClust is proposed, a diversity-controlling loss that can be incorporated into existing deep clustering frameworks to produce multiple clusterings with the desired degree of diversity.
Results: Experiments show that the method effectively controls diversity across frameworks and datasets, that its clusterings significantly outperform single-clustering baselines, and that, with an off-the-shelf consensus clustering algorithm, DivClust produces consensus solutions that consistently outperform single-clustering baselines, effectively improving the performance of the base deep clustering framework.

Clustering has been a major research topic in the field of machine learning, one to which Deep Learning has recently been applied with significant success. However, an aspect of clustering that is not addressed by existing deep clustering methods, is that of efficiently producing multiple, diverse partitionings for a given dataset. This is particularly important, as a diverse set of base clusterings are necessary for consensus clustering, which has been found to produce better and more robust results than relying on a single clustering. To address this gap, we propose DivClust, a diversity controlling loss that can be incorporated into existing deep clustering frameworks to produce multiple clusterings with the desired degree of diversity. We conduct experiments with multiple datasets and deep clustering frameworks and show that: a) our method effectively controls diversity across frameworks and datasets with very small additional computational cost, b) the sets of clusterings learned by DivClust include solutions that significantly outperform single-clustering baselines, and c) using an off-the-shelf consensus clustering algorithm, DivClust produces consensus clustering solutions that consistently outperform single-clustering baselines, effectively improving the performance of the base deep clustering framework.
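A diversity-controlling loss of this kind can be sketched between two soft cluster-assignment matrices A and B (N samples by K clusters): measure inter-clustering similarity and penalize only the excess above a target upper bound d. The column-matching aggregation below is a simplification of the paper's loss:

```python
import numpy as np

def clustering_similarity(A, B):
    """Mean best-match cosine similarity between cluster columns of A and B."""
    An = A / (np.linalg.norm(A, axis=0, keepdims=True) + 1e-12)
    Bn = B / (np.linalg.norm(B, axis=0, keepdims=True) + 1e-12)
    S = An.T @ Bn                   # K x K cosine similarities between clusters
    return S.max(axis=1).mean()     # how redundant B is with respect to A

def diversity_loss(A, B, d=0.5):
    # hinge: punish only similarity exceeding the desired upper bound d
    return max(0.0, clustering_similarity(A, B) - d)

rng = np.random.default_rng(0)
A = rng.random((100, 4))
A /= A.sum(axis=1, keepdims=True)
identical_loss = diversity_loss(A, A.copy(), d=0.5)   # maximally redundant pair
```

Two identical clusterings have similarity 1 and incur the full excess over the bound, while sufficiently dissimilar pairs incur no penalty, which is how the desired degree of diversity is dialed in.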

Bi-Directional Distribution Alignment for Transductive Zero-Shot Learning
Wang, Zhicai and Hao, Yanbin and Mu, Tingting and Li, Ouxiang and Wang, Shuo and He, Xiangnan



Research question: Zero-shot learning (ZSL) suffers from a mismatch between the true and learned data distributions for unseen classes.
Motivation: Although transductive ZSL (TZSL) attempts to improve this by allowing unlabelled examples from the unseen classes, a high level of distribution shift remains.
Method: A novel TZSL model, Bi-VAEGAN, is proposed, which largely reduces the shift by strengthening the distribution alignment between the visual and auxiliary spaces. The key design proposals include: (1) a bi-directional distribution alignment; (2) a simple but effective L_2-norm based feature normalization approach; (3) a more sophisticated unseen-class prior estimation approach.
Results: In benchmark evaluations on four datasets, Bi-VAEGAN achieves a new state of the art under both the standard and generalized TZSL settings. Code can be found at https://github.com/Zhicaiwww/Bi-VAEGAN.

It is well-known that zero-shot learning (ZSL) can suffer severely from the problem of domain shift, where the true and learned data distributions for the unseen classes do not match. Although transductive ZSL (TZSL) attempts to improve this by allowing the use of unlabelled examples from the unseen classes, there is still a high level of distribution shift. We propose a novel TZSL model (named Bi-VAEGAN), which largely reduces the shift by a strengthened distribution alignment between the visual and auxiliary spaces. The key proposals of the model design include (1) a bi-directional distribution alignment, (2) a simple but effective L_2-norm based feature normalization approach, and (3) a more sophisticated unseen class prior estimation approach. In benchmark evaluation using four datasets, Bi-VAEGAN achieves a new state of the art under both the standard and generalized TZSL settings. Code can be found at https://github.com/Zhicaiwww/Bi-VAEGAN.
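Ingredient (2), the "simple but effective" L_2-norm based feature normalization, admits a plausible minimal form: rescale every feature vector to a shared norm (here the batch-average norm, an assumption for illustration; the paper's exact target norm may differ):

```python
import numpy as np

def l2_feature_norm(feats):
    """Rescale each feature vector to the batch-average L2 norm (a sketch)."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / norms * norms.mean()

x = np.array([[3.0, 4.0], [0.3, 0.4]])   # norms 5.0 and 0.5
y = l2_feature_norm(x)                   # both rows rescaled to norm 2.75
```

Equalizing the norms removes magnitude disparities between real and generated features, which is what makes the distribution alignment between spaces easier.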

Adaptive Graph Convolutional Subspace Clustering
Wei, Lai and Chen, Zhengwei and Yin, Jun and Zhu, Changming and Zhou, Rigui and Liu, Jin



Research question: This paper develops an adaptive graph convolutional subspace clustering (AGCSC) algorithm to improve the performance of existing spectral-type subspace clustering algorithms.
Motivation: Existing spectral-type subspace clustering algorithms mainly focus on either designing constraints for the reconstruction coefficient matrix or feature extraction methods for finding latent features of the original data samples.
Method: Inspired by graph convolutional networks, graph convolution is used to develop a feature extraction method and a coefficient matrix constraint simultaneously. In the proposed algorithm, the graph convolutional operator is updated iteratively and adaptively.
Results: Extensive subspace clustering experiments support these conclusions and show that AGCSC outperforms related methods as well as some deep models.

Spectral-type subspace clustering algorithms have shown excellent performance in many subspace clustering applications. The existing spectral-type subspace clustering algorithms focus either on designing constraints for the reconstruction coefficient matrix or on feature extraction methods for finding latent features of original data samples. In this paper, inspired by graph convolutional networks, we use the graph convolution technique to develop a feature extraction method and a coefficient matrix constraint simultaneously, and the graph convolutional operator is updated iteratively and adaptively in our proposed algorithm. Hence, we call the proposed method adaptive graph convolutional subspace clustering (AGCSC). We claim that, by using AGCSC, the aggregated feature representation of original data samples is suitable for subspace clustering, and the coefficient matrix can reveal the subspace structure of the original data set more faithfully. Finally, extensive subspace clustering experiments support our conclusions and show that AGCSC outperforms some related methods as well as some deep models.

Exploring the Relationship Between Architectural Design and Adversarially Robust Generalization
Liu, Aishan and Tang, Shiyu and Liang, Siyuan and Gong, Ruihao and Wu, Boxi and Liu, Xianglong and Tao, Dacheng



Research question: Adversarial training is one of the most effective defenses against adversarial examples, but it often suffers a huge robustness generalization gap on unseen attackers, i.e., the adversarially robust generalization problem.
Motivation: Despite preliminary understandings of adversarially robust generalization, little is known from the architectural perspective. To fill this gap, this paper presents the first systematic investigation of the relationship between adversarially robust generalization and architectural design.
Method: The 20 most representative adversarially trained architectures are comprehensively evaluated on the ImageNette and CIFAR-10 datasets against multiple l_p-norm adversarial attacks.
Results: Experiments show that, under aligned settings, Vision Transformers (e.g., PVT, CoAtNet) yield better adversarially robust generalization, while CNNs tend to overfit specific attacks and fail to generalize across multiple adversaries. Theoretical analysis reveals that higher weight sparsity, often achieved by specially designed attention blocks, contributes significantly to the better adversarially robust generalization of Transformers.

Adversarial training has been demonstrated to be one of the most effective remedies for defending adversarial examples, yet it often suffers from the huge robustness generalization gap on unseen testing adversaries, deemed as the adversarially robust generalization problem. Despite the preliminary understandings devoted to adversarially robust generalization, little is known from the architectural perspective. To bridge the gap, this paper for the first time systematically investigated the relationship between adversarially robust generalization and architectural design. In particular, we comprehensively evaluated 20 most representative adversarially trained architectures on ImageNette and CIFAR-10 datasets towards multiple l_p-norm adversarial attacks. Based on the extensive experiments, we found that, under aligned settings, Vision Transformers (e.g., PVT, CoAtNet) often yield better adversarially robust generalization while CNNs tend to overfit on specific attacks and fail to generalize on multiple adversaries. To better understand the nature behind it, we conduct theoretical analysis via the lens of Rademacher complexity. We revealed the fact that the higher weight sparsity contributes significantly towards the better adversarially robust generalization of Transformers, which can be often achieved by the specially-designed attention blocks. We hope our paper could help to better understand the mechanism for designing robust DNNs. Our model weights can be found at http://robust.art.

FCC: Feature Clusters Compression for Long-Tailed Visual Recognition
Li, Jian and Meng, Ziyao and Shi, Daqian and Song, Rui and Diao, Xiaolei and Wang, Jingwen and Xu, Hao



Research question: Deep neural networks are limited on long-tailed data because they under-represent minority classes.
Motivation: Existing remedies for this problem ignore the impact of the density of backbone features (BFs).
Method: A simple and generic method, Feature Clusters Compression (FCC), is proposed, which increases the density of BFs by compressing backbone feature clusters. It only multiplies the original BFs by a scaling factor during training, establishing a linear compression relationship between the original and multiplied features and forcing the network to map the former into denser clusters.
Results: Extensive experiments fully verify the effectiveness and generality of the method.

Deep Neural Networks (DNNs) are rather restrictive in long-tailed data, since they commonly exhibit an under-representation for minority classes. Various remedies have been proposed to tackle this problem from different perspectives, but they ignore the impact of the density of Backbone Features (BFs) on this issue. Through representation learning, DNNs can map BFs into dense clusters in feature space, while the features of minority classes often show sparse clusters. In practical applications, these features are discretely mapped or even cross the decision boundary resulting in misclassification. Inspired by this observation, we propose a simple and generic method, namely Feature Clusters Compression (FCC), to increase the density of BFs by compressing backbone feature clusters. The proposed FCC can be easily achieved by only multiplying original BFs by a scaling factor in training phase, which establishes a linear compression relationship between the original and multiplied features, and forces DNNs to map the former into denser clusters. In test phase, we directly feed original features without multiplying the factor to the classifier, such that BFs of test samples are mapped closer together and do not easily cross the decision boundary. Meanwhile, FCC can be friendly combined with existing long-tailed methods and further boost them. We apply FCC to numerous state-of-the-art methods and evaluate them on widely used long-tailed benchmark datasets. Extensive experiments fully verify the effectiveness and generality of our method. Code is available at https://github.com/lijian16/FCC.
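The train/test asymmetry described above is small enough to show directly: multiply the backbone features by a factor tau before the classifier during training, and feed the original, unscaled features at test time. `tau = 0.5` is an illustrative value, not the paper's tuned setting:

```python
import numpy as np

def classifier_logits(feats, W, b):
    return feats @ W + b

def fcc_train_logits(feats, W, b, tau=0.5):
    # FCC: compression by a scaling factor is applied in the training phase only
    return classifier_logits(tau * feats, W, b)

rng = np.random.default_rng(0)
feats = rng.normal(0.0, 1.0, (4, 8))
W, b = rng.normal(0.0, 1.0, (8, 3)), np.zeros(3)
train_logits = fcc_train_logits(feats, W, b, tau=0.5)
test_logits = classifier_logits(feats, W, b)   # no scaling at inference
```

With a linear classifier and zero bias the compressed and original logits differ exactly by the factor tau, which is the linear compression relationship the abstract refers to; the network compensates by learning denser feature clusters.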

Multi-Centroid Task Descriptor for Dynamic Class Incremental Inference
Cai, Tenghao and Zhang, Zhizhong and Tan, Xin and Qu, Yanyun and Jiang, Guannan and Wang, Chengjie and Xie, Yuan



Research problem: This paper addresses class- and task-incremental learning, whose main difference is whether the task ID is given during evaluation.
Motivation: The authors find that task information is a strong prior that can significantly improve class-incremental learning.
Method: A gate network is proposed to predict the task ID for class-incremental inference, together with a multi-centroid task descriptor that handles the absence of explicit semantic relationships between tasks.
Results: Experiments show the method achieves 72.41% average accuracy on CIFAR100-B0S50, outperforming DER by 3.40%.

Incremental learning could be roughly divided into two categories, i.e., class- and task-incremental learning. The main difference is whether the task ID is given during evaluation. In this paper, we show this task information is indeed a strong prior knowledge, which will bring significant improvement over class-incremental learning baseline, e.g., DER. Based on this observation, we propose a gate network to predict the task ID for class incremental inference. This is challenging as there is no explicit semantic relationship between categories in the concept of task. Therefore, we propose a multi-centroid task descriptor by assuming the data within a task can form multiple clusters. The cluster centers are optimized by pulling relevant sample-centroid pairs while pushing others away, which ensures that there is at least one centroid close to a given sample. To select relevant pairs, we use class prototypes as proxies and solve a bipartite matching problem, making the task descriptor representative yet not degenerate to uni-modal. As a result, our dynamic inference network is trained independently of baseline and provides a flexible, efficient solution to distinguish between tasks. Extensive experiments show our approach achieves state-of-the-art results, e.g., we achieve 72.41% average accuracy on CIFAR100-B0S50, outperforming DER by 3.40%.

Open-Set Likelihood Maximization for Few-Shot Learning
Boudiaf, Malik and Bennequin, Etienne and Tami, Myriam and Toubhans, Antoine and Piantanida, Pablo and Hudelot, Celine and Ben Ayed, Ismail



Research problem: Few-Shot Open-Set Recognition (FSOSR): classifying instances among a set of classes with only a few labeled samples, while simultaneously detecting instances that belong to no known class.
Motivation: Existing transductive methods perform poorly in open-set scenarios, motivating a generalization of the maximum likelihood principle that introduces latent scores to down-weight the influence of potential outliers.
Method: Open-Set Likelihood Optimization (OSLO) co-optimizes the latent scores and the parametric model so that each benefits from the other.
Results: Extensive experiments show the method surpasses existing inductive and transductive methods on both aspects of open-set recognition, namely inlier classification and outlier detection.

We tackle the Few-Shot Open-Set Recognition (FSOSR) problem, i.e. classifying instances among a set of classes for which we only have a few labeled samples, while simultaneously detecting instances that do not belong to any known class. We explore the popular transductive setting, which leverages the unlabelled query instances at inference. Motivated by the observation that existing transductive methods perform poorly in open-set scenarios, we propose a generalization of the maximum likelihood principle, in which latent scores down-weighing the influence of potential outliers are introduced alongside the usual parametric model. Our formulation embeds supervision constraints from the support set and additional penalties discouraging overconfident predictions on the query set. We proceed with a block-coordinate descent, with the latent scores and parametric model co-optimized alternately, thereby benefiting from each other. We call our resulting formulation Open-Set Likelihood Optimization (OSLO). OSLO is interpretable and fully modular; it can be applied on top of any pre-trained model seamlessly. Through extensive experiments, we show that our method surpasses existing inductive and transductive methods on both aspects of open-set recognition, namely inlier classification and outlier detection. Code is available at https://github.com/ebennequin/few-shot-open-set.

DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection
Ma, Jiawei and Niu, Yulei and Xu, Jincheng and Huang, Shiyuan and Han, Guangxing and Chang, Shih-Fu



Research problem: Generalized few-shot object detection: achieving precise detection on both base classes with abundant annotations and novel classes with limited training data.
Motivation: Existing methods either sacrifice base-class performance to improve few-shot generalization, or maintain high base-class precision with limited improvement in novel-class adaptation.
Method: A new training framework, DiGeo, learns geometry-aware features with inter-class separation and intra-class compactness. Specifically, an offline simplex equiangular tight frame (ETF) classifier is derived whose weights serve as class centers and are maximally and equally separated; adaptive class-specific margins are added to the classification loss to tighten the features of each class.
Results: Experiments on two few-shot benchmarks (PASCAL VOC, MSCOCO) and one long-tail dataset (LVIS) show the method effectively improves generalization on novel classes without hurting base-class detection.

Generalized few-shot object detection aims to achieve precise detection on both base classes with abundant annotations and novel classes with limited training data. Existing approaches enhance few-shot generalization with the sacrifice of base-class performance, or maintain high precision in base-class detection with limited improvement in novel-class adaptation. In this paper, we point out the reason is insufficient Discriminative feature learning for all of the classes. As such, we propose a new training framework, DiGeo, to learn Geometry-aware features of inter-class separation and intra-class compactness. To guide the separation of feature clusters, we derive an offline simplex equiangular tight frame (ETF) classifier whose weights serve as class centers and are maximally and equally separated. To tighten the cluster for each class, we include adaptive class-specific margins into the classification loss and encourage the features close to the class centers. Experimental studies on two few-shot benchmark datasets (PASCAL VOC, MSCOCO) and one long-tail dataset (LVIS) demonstrate that, with a single model, our method can effectively improve generalization on novel classes without hurting the detection of base classes.
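The simplex ETF classifier mentioned above has a closed form: for K classes, the weight vectors can be taken as the rows of sqrt(K/(K-1)) * (I - (1/K) * 1·1ᵀ), which are unit vectors with pairwise cosine similarity exactly -1/(K-1), i.e. maximally and equally separated. A small sketch (the embedding-dimension handling of the actual method is omitted):

```python
import math

def simplex_etf(K):
    """Return K rows forming a simplex equiangular tight frame in R^K:
    each row is a unit vector, and any two distinct rows have cosine
    similarity exactly -1/(K-1)."""
    s = math.sqrt(K / (K - 1))
    return [[s * ((1.0 if i == j else 0.0) - 1.0 / K) for j in range(K)]
            for i in range(K)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

W = simplex_etf(4)  # 4 fixed, equally separated class centers
```

Because the weights are fixed offline, the features are pulled toward these pre-separated centers rather than the centers drifting during training.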

Fine-Grained Classification With Noisy Labels
Wei, Qi and Feng, Lei and Sun, Haoliang and Wang, Ren and Guo, Chenhui and Yin, Yilong



Research problem: Improving model generalization on fine-grained datasets in the presence of label noise.
Motivation: Existing learning methods perform poorly under label noise, especially when large inter-class ambiguity among fine-grained classes makes label noise more severe.
Method: A new framework, stochastic noise-tolerated supervised contrastive learning (SNSCL), confronts label noise by encouraging distinguishable representations. Specifically, a noise-tolerated supervised contrastive loss incorporates a weighting mechanism for noisy-label correction and selective updating of momentum queue lists.
Results: Experiments show SNSCL significantly improves generalization on various fine-grained datasets and outperforms existing approaches.

Learning with noisy labels (LNL) aims to ensure model generalization given a label-corrupted training set. In this work, we investigate a rarely studied scenario of LNL on fine-grained datasets (LNL-FG), which is more practical and challenging as large inter-class ambiguities among fine-grained classes cause more noisy labels. We empirically show that existing methods that work well for LNL fail to achieve satisfying performance for LNL-FG, raising the practical need for effective solutions to LNL-FG. To this end, we propose a novel framework called stochastic noise-tolerated supervised contrastive learning (SNSCL) that confronts label noise by encouraging distinguishable representation. Specifically, we design a noise-tolerated supervised contrastive learning loss that incorporates a weight-aware mechanism for noisy label correction and selectively updating momentum queue lists. By this mechanism, we mitigate the effects of noisy anchors and avoid inserting noisy labels into the momentum-updated queue. Besides, to avoid manually-defined augmentation strategies in contrastive learning, we propose an efficient stochastic module that samples feature embeddings from a generated distribution, which can also enhance the representation ability of deep models. SNSCL is general and compatible with prevailing robust LNL strategies to improve their performance for LNL-FG. Extensive experiments demonstrate the effectiveness of SNSCL.

Hybrid Active Learning via Deep Clustering for Video Action Detection
Rana, Aayush J. and Rawat, Yogesh S.



Research problem: Reducing the annotation cost of video action detection, which requires expensive frame-wise dense annotations.
Motivation: Current annotation practice is costly, so an effective labeling strategy is needed to reduce it.
Method: A novel hybrid active learning strategy performs efficient labeling via both intra-sample and inter-sample selection, reducing the annotation cost from two different directions.
Results: Experiments on UCF-101-24 and J-HMDB-21 show the method effectively reduces annotation cost and consistently outperforms other baselines.

In this work, we focus on reducing the annotation cost for video action detection which requires costly frame-wise dense annotations. We study a novel hybrid active learning (AL) strategy which performs efficient labeling using both intra-sample and inter-sample selection. The intra-sample selection leads to labeling of fewer frames in a video as opposed to inter-sample selection which operates at video level. This hybrid strategy reduces the annotation cost from two different aspects leading to significant labeling cost reduction. The proposed approach utilizes Clustering-Aware Uncertainty Scoring (CLAUS), a novel label acquisition strategy which relies on both informativeness and diversity for sample selection. We also propose a novel Spatio-Temporal Weighted (STeW) loss formulation, which helps in model training under limited annotations. The proposed approach is evaluated on UCF-101-24 and J-HMDB-21 datasets demonstrating its effectiveness in significantly reducing the annotation cost where it consistently outperforms other baselines. Project details available at https://sites.google.com/view/activesparselabeling/home

Uncertainty-Aware Optimal Transport for Semantically Coherent Out-of-Distribution Detection
Lu, Fan and Zhu, Kai and Zhai, Wei and Zheng, Kecheng and Cao, Yang



Research problem: Discerning outliers from the intended data distribution (SCOOD detection).
Motivation: When in-distribution and out-of-distribution samples coexist without distinction, the model tends to overfit.
Method: A novel uncertainty-aware optimal transport scheme is proposed, comprising an energy-based transport mechanism that estimates the fluctuating cost of uncertainty, and an inter-cluster extension strategy that enhances the discrimination of semantic properties among different clusters by widening the corresponding margin distance.
Results: On two standard SCOOD benchmarks the method outperforms state-of-the-art OOD detection by margins of 27.69% and 34.4% on FPR@95, respectively.

Semantically coherent out-of-distribution (SCOOD) detection aims to discern outliers from the intended data distribution with access to an unlabeled extra set. The coexistence of in-distribution and out-of-distribution samples will exacerbate model overfitting when no distinction is made. To address this problem, we propose a novel uncertainty-aware optimal transport scheme. Our scheme consists of an energy-based transport (ET) mechanism that estimates the fluctuating cost of uncertainty to promote the assignment of semantic-agnostic representation, and an inter-cluster extension strategy that enhances the discrimination of semantic property among different clusters by widening the corresponding margin distance. Furthermore, a T-energy score is presented to mitigate the magnitude gap between the parallel transport and classifier branches. Extensive experiments on two standard SCOOD benchmarks demonstrate above-par OOD detection performance, outperforming the state-of-the-art methods by margins of 27.69% and 34.4% on FPR@95, respectively.

Hunting Sparsity: Density-Guided Contrastive Learning for Semi-Supervised Semantic Segmentation
Wang, Xiaoyang and Zhang, Bingfeng and Yu, Limin and Xiao, Jimin



Research problem: Improving model generalization in perturbation-invariant training by extracting adequate supervision directly from the geometry of the feature space.
Motivation: Existing semi-supervised semantic segmentation methods combine pseudo labeling and consistency regularization to enhance generalization, but the authors argue that adequate supervision can be extracted directly from the feature-space geometry.
Method: Inspired by density-based unsupervised clustering, feature density is used to locate sparse regions within the feature clusters defined by labels and pseudo labels, under the hypothesis that low-density features tend to be under-trained compared with densely gathered ones. The cluster structure is therefore regularized by tackling this sparsity to increase intra-class compactness. To this end, a Density-Guided Contrastive Learning (DGCL) strategy pushes anchor features in sparse regions toward cluster centers approximated by high-density positive keys. The core of the method is estimating feature density, defined as neighbor compactness; a multi-scale density estimation module obtains densities from multiple nearest-neighbor graphs for robust density modeling. In addition, a unified training framework combines label-guided self-training and density-guided geometry regularization to form complementary supervision on unlabeled data.
Results: Experiments under various semi-supervised settings on PASCAL VOC and Cityscapes show state-of-the-art performance.

Recent semi-supervised semantic segmentation methods combine pseudo labeling and consistency regularization to enhance model generalization from perturbation-invariant training. In this work, we argue that adequate supervision can be extracted directly from the geometry of feature space. Inspired by density-based unsupervised clustering, we propose to leverage feature density to locate sparse regions within feature clusters defined by label and pseudo labels. The hypothesis is that lower-density features tend to be under-trained compared with those densely gathered. Therefore, we propose to apply regularization on the structure of the cluster by tackling the sparsity to increase intra-class compactness in feature space. With this goal, we present a Density-Guided Contrastive Learning (DGCL) strategy to push anchor features in sparse regions toward cluster centers approximated by high-density positive keys. The heart of our method is to estimate feature density which is defined as neighbor compactness. We design a multi-scale density estimation module to obtain the density from multiple nearest-neighbor graphs for robust density modeling. Moreover, a unified training framework is proposed to combine label-guided self-training and density-guided geometry regularization to form complementary supervision on unlabeled data. Experimental results on PASCAL VOC and Cityscapes under various semi-supervised settings demonstrate that our proposed method achieves state-of-the-art performances.
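The "density as neighbor compactness" idea above can be sketched at a single scale: a feature's density is the inverse of its mean distance to its k nearest neighbors (the paper uses multiple nearest-neighbor graphs; k here is an assumed illustration value):

```python
import math

def knn_density(features, k=2):
    """Single-scale sketch of neighbor-compactness density: for each feature,
    density = 1 / (mean distance to its k nearest neighbors). Low values flag
    the sparse, likely under-trained regions DGCL targets."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    densities = []
    for i, f in enumerate(features):
        ds = sorted(dist(f, g) for j, g in enumerate(features) if j != i)
        densities.append(1.0 / (sum(ds[:k]) / k + 1e-8))
    return densities
```

Features with low density would then serve as anchors pulled toward high-density positive keys; the multi-scale module would repeat this over several values of k.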

Learning Analytical Posterior Probability for Human Mesh Recovery
Fang, Qi and Chen, Kang and Fan, Yinghui and Shuai, Qing and Li, Jiefeng and Zhang, Weidong



Research problem: Existing formulations for modeling uncertainty and ambiguity in human mesh recovery have limited precision, because joint rotations are either not constrained to SO(3) or difficult for neural networks to learn.
Motivation: To address this, a novel analytical formulation is derived for learning conditional probability distributions of human joint rotations in a Bayesian manner.
Method: Based on this formulation, a new posterior-guided framework for human mesh recovery is proposed; thanks to its Bayesian nature, it is flexible enough to seamlessly incorporate additional sensors.
Results: Experiments show the framework outperforms existing SOTA baselines on multiple benchmarks; the code is open-sourced on GitHub.

Despite various probabilistic methods for modeling the uncertainty and ambiguity in human mesh recovery, their overall precision is limited because existing formulations for joint rotations are either not constrained to SO(3) or difficult to learn for neural networks. To address such an issue, we derive a novel analytical formulation for learning posterior probability distributions of human joint rotations conditioned on bone directions in a Bayesian manner, and based on this, we propose a new posterior-guided framework for human mesh recovery. We demonstrate that our framework is not only superior to existing SOTA baselines on multiple benchmarks but also flexible enough to seamlessly incorporate with additional sensors due to its Bayesian nature. The code is available at https://github.com/NetEase-GameAI/ProPose.

An Erudite Fine-Grained Visual Classification Model
Chang, Dongliang and Tong, Yujun and Du, Ruoyi and Hospedales, Timothy and Song, Yi-Zhe and Ma, Zhanyu



Research problem: Current fine-grained visual classification (FGVC) models are isolated: one must first identify an object's coarse-grained label and then select the corresponding FGVC model for recognition, which hinders FGVC algorithms in real-life scenarios.
Motivation: To address this, an erudite FGVC model is jointly trained on several datasets and can efficiently and accurately predict an object's fine-grained label across the combined label space.
Method: First, a feature disentanglement module and a feature re-fusion module reduce negative transfer and boost positive transfer when training across different datasets. Then, a meta-learning-based dataset-agnostic spatial attention layer takes full advantage of the multi-dataset training data.
Results: Experiments on 11 mixed datasets built from four FGVC datasets demonstrate the method's effectiveness; it also combines easily with existing FGVC methods to obtain state-of-the-art results.

Current fine-grained visual classification (FGVC) models are isolated. In practice, we first need to identify the coarse-grained label of an object, then select the corresponding FGVC model for recognition. This hinders the application of the FGVC algorithm in real-life scenarios. In this paper, we propose an erudite FGVC model jointly trained by several different datasets, which can efficiently and accurately predict an object's fine-grained label across the combined label space. We found through a pilot study that positive and negative transfers co-occur when different datasets are mixed for training, i.e., the knowledge from other datasets is not always useful. Therefore, we first propose a feature disentanglement module and a feature re-fusion module to reduce negative transfer and boost positive transfer between different datasets. In detail, we reduce negative transfer by decoupling the deep features through many dataset-specific feature extractors. Subsequently, these are channel-wise re-fused to facilitate positive transfer. Finally, we propose a meta-learning based dataset-agnostic spatial attention layer to take full advantage of the multi-dataset training data, given that localisation is dataset-agnostic between different datasets. Experimental results across 11 different mixed-datasets built on four different FGVC datasets demonstrate the effectiveness of the proposed method. Furthermore, the proposed method can be easily combined with existing FGVC methods to obtain state-of-the-art results.

RONO: Robust Discriminative Learning With Noisy Labels for 2D-3D Cross-Modal Retrieval
Feng, Yanglin and Zhu, Hongyuan and Peng, Dezhong and Peng, Xi and Hu, Peng



Research problem: With the advent of the Metaverse and AI-generated content, 2D-3D cross-modal retrieval has become popular, but it is challenging due to heterogeneous structures and semantic discrepancies.
Motivation: Imperfect annotations are ubiquitous given the ambiguous 2D and 3D content, inevitably producing noisy labels that degrade learning performance.
Method: This paper proposes a robust 2D-3D retrieval framework (RONO) that learns robustly from noisy multimodal data. Specifically, a Robust Discriminative Center Learning mechanism (RDCL) adaptively distinguishes clean and noisy samples, giving them positive and negative optimization directions respectively, thereby mitigating the negative impact of noisy labels. In addition, a Shared Space Consistency Learning mechanism (SSCL) captures the intrinsic information in the noisy data by simultaneously minimizing the cross-modal and semantic discrepancies between the common space and the label space.
Results: Extensive experiments on four 3D-model multimodal datasets, comparing against 15 state-of-the-art methods, verify the effectiveness of the method.

Recently, with the advent of Metaverse and AI Generated Content, cross-modal retrieval becomes popular with a burst of 2D and 3D data. However, this problem is challenging given the heterogeneous structure and semantic discrepancies. Moreover, imperfect annotations are ubiquitous given the ambiguous 2D and 3D content, thus inevitably producing noisy labels to degrade the learning performance. To tackle the problem, this paper proposes a robust 2D-3D retrieval framework (RONO) to robustly learn from noisy multimodal data. Specifically, one novel Robust Discriminative Center Learning mechanism (RDCL) is proposed in RONO to adaptively distinguish clean and noisy samples for respectively providing them with positive and negative optimization directions, thus mitigating the negative impact of noisy labels. Besides, we present a Shared Space Consistency Learning mechanism (SSCL) to capture the intrinsic information inside the noisy data by minimizing the cross-modal and semantic discrepancy between common space and label space simultaneously. Comprehensive mathematical analyses are given to theoretically prove the noise tolerance of the proposed method. Furthermore, we conduct extensive experiments on four 3D-model multimodal datasets to verify the effectiveness of our method by comparing it with 15 state-of-the-art methods. Code is available at https://github.com/penghu-cs/RONO.

DISC: Learning From Noisy Labels via Dynamic Instance-Specific Selection and Correction
Li, Yifan and Han, Hu and Shan, Shiguang and Chen, Xilin



Research problem: Existing deep networks eventually memorize label noise; the memorization strength differs per instance and can be represented by a confidence value that grows larger and larger during training.
Motivation: Based on this, a Dynamic Instance-specific Selection and Correction method (DISC) for learning from noisy labels is proposed.
Method: First, a two-view-based classification backbone obtains a confidence for each image from two views. Then, a per-instance dynamic threshold, based on the momentum of each instance's memorization strength over previous epochs, selects and corrects noisily labeled data.
Results: Thanks to the dynamic threshold strategy and two-view learning, each instance can be effectively grouped into one of three subsets (clean, hard, purified) based on the prediction consistency and discrepancy of the two views at each epoch. Finally, different regularization strategies handle subsets with different degrees of label noise, improving the robustness of the whole network. On three controllable and four real-world noisy-label benchmarks, the method outperforms state-of-the-art methods, leveraging useful information in noisy data while alleviating label-noise pollution.

Existing studies indicate that deep neural networks (DNNs) can eventually memorize the label noise. We observe that the memorization strength of DNNs towards each instance is different and can be represented by the confidence value, which becomes larger and larger during the training process. Based on this, we propose a Dynamic Instance-specific Selection and Correction method (DISC) for learning from noisy labels (LNL). We first use a two-view-based backbone for image classification, obtaining confidence for each image from two views. Then we propose a dynamic threshold strategy for each instance, based on the momentum of each instance's memorization strength in previous epochs to select and correct noisy labeled data. Benefiting from the dynamic threshold strategy and two-view learning, we can effectively group each instance into one of the three subsets (i.e., clean, hard, and purified) based on the prediction consistency and discrepancy by two views at each epoch. Finally, we employ different regularization strategies to conquer subsets with different degrees of label noise, improving the whole network's robustness. Comprehensive evaluations on three controllable and four real-world LNL benchmarks show that our method outperforms the state-of-the-art (SOTA) methods to leverage useful information in noisy data while alleviating the pollution of label noise.
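The per-instance dynamic threshold above can be sketched as a momentum (EMA) update of each instance's memorization strength, plus a simple two-view agreement check; the momentum value and the selection rule here are illustrative assumptions, not the paper's exact formulation:

```python
def update_strength(prev_strength, confidence, momentum=0.9):
    """Momentum update of an instance's memorization strength across epochs:
    new strength = m * old + (1 - m) * current confidence."""
    return momentum * prev_strength + (1.0 - momentum) * confidence

def select_clean(conf_view1, conf_view2, threshold):
    """Illustrative selection: treat the label as clean only when both views
    are confident above the instance-specific dynamic threshold; disagreement
    or low confidence flags the instance for correction instead."""
    return conf_view1 >= threshold and conf_view2 >= threshold
```

The threshold thus tracks each instance's own confidence history rather than using one global cutoff for all samples.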

A Probabilistic Framework for Lifelong Test-Time Adaptation
Brahma, Dhanajit and Rai, Piyush



Research problem: Test-time adaptation (TTA): updating a pre-trained source model at inference time given test inputs from a different target domain.
Motivation: Most existing TTA methods assume a stationary target domain, i.e., that all test inputs come from a single target domain; in many practical settings, however, the test input distribution may shift continually over time. Existing TTA methods also lack reliable uncertainty estimates, which are crucial when distribution shifts occur between source and target domains.
Method: PETAL (Probabilistic lifElong Test-time Adaptation with seLf-training prior) solves lifelong TTA with a probabilistic approach, naturally yielding (1) a student-teacher framework in which the teacher model is an exponential moving average of the student model, and (2) regularization of inference-time model updates using the source model as a regularizer. To prevent model drift in the lifelong/continual TTA setting, a data-driven parameter restoration technique restores only the irrelevant parameters, reducing error accumulation while maintaining knowledge of recent domains.
Results: In terms of predictive error rate as well as uncertainty metrics such as Brier score and negative log-likelihood, the method outperforms the current state of the art for online lifelong TTA across various benchmarks, including the CIFAR-10C, CIFAR-100C, ImageNetC, and ImageNet3DCC datasets.

Test-time adaptation (TTA) is the problem of updating a pre-trained source model at inference time given test input(s) from a different target domain. Most existing TTA approaches assume the setting in which the target domain is stationary, i.e., all the test inputs come from a single target domain. However, in many practical settings, the test input distribution might exhibit a lifelong/continual shift over time. Moreover, existing TTA approaches also lack the ability to provide reliable uncertainty estimates, which is crucial when distribution shifts occur between the source and target domain. To address these issues, we present PETAL (Probabilistic lifElong Test-time Adaptation with seLf-training prior), which solves lifelong TTA using a probabilistic approach, and naturally results in (1) a student-teacher framework, where the teacher model is an exponential moving average of the student model, and (2) regularizing the model updates at inference time using the source model as a regularizer. To prevent model drift in the lifelong/continual TTA setting, we also propose a data-driven parameter restoration technique which contributes to reducing the error accumulation and maintaining the knowledge of recent domains by restoring only the irrelevant parameters. In terms of predictive error rate as well as uncertainty based metrics such as Brier score and negative log-likelihood, our method achieves better results than the current state-of-the-art for online lifelong test-time adaptation across various benchmarks, such as CIFAR-10C, CIFAR-100C, ImageNetC, and ImageNet3DCC datasets. The source code for our approach is accessible at https://github.com/dhanajitb/petal.
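The student-teacher component above hinges on an exponential moving average of parameters. A minimal sketch (the decay value is an assumption; real implementations apply this per tensor):

```python
def ema_update(teacher_params, student_params, decay=0.999):
    """Update the teacher as an exponential moving average of the student:
    teacher <- decay * teacher + (1 - decay) * student, applied elementwise.
    The slowly-moving teacher provides stable pseudo-targets at test time."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]

# After each adaptation step on a test batch, the teacher drifts slightly
# toward the student while smoothing out per-batch noise.
teacher = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.5)
```

A high decay keeps the teacher close to its history, which is what makes it a useful self-training prior under continually shifting test distributions.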

Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
Radenovic, Filip and Dubey, Abhimanyu and Kadian, Abhishek and Mihaylov, Todor and Vandenhende, Simon and Patel, Yash and Wen, Yi and Ramanathan, Vignesh and Mahajan, Dhruv



Research problem: Improving three main aspects of the contrastive pre-training pipeline: dataset noise, model initialization, and the training objective.
Motivation: Vision-language models contrastively trained on large-scale noisy data are increasingly popular for zero-shot recognition problems.
Method: A straightforward filtering strategy named Complexity, Action, and Text-spotting (CAT) significantly reduces dataset size while improving performance across zero-shot vision-language tasks. Concept Distillation then leverages strong unimodal representations for contrastive training without increasing training complexity while outperforming prior work. Finally, the traditional contrastive alignment objective is modified with a new importance-sampling approach that up-weights hard negatives without adding extra complexity.
Results: On an extensive zero-shot benchmark of 29 tasks, the Distilled and Hard-negative Training (DiHT) approach improves over the baseline on 20 tasks. For few-shot linear probing, a novel approach bridges the gap between zero-shot and few-shot performance, substantially improving over prior work.

Vision-language models trained with contrastive learning on large-scale noisy data are becoming increasingly popular for zero-shot recognition problems. In this paper we improve the following three aspects of the contrastive pre-training pipeline: dataset noise, model initialization and the training objective. First, we propose a straightforward filtering strategy titled Complexity, Action, and Text-spotting (CAT) that significantly reduces dataset size, while achieving improved performance across zero-shot vision-language tasks. Next, we propose an approach titled Concept Distillation to leverage strong unimodal representations for contrastive training that does not increase training complexity while outperforming prior work. Finally, we modify the traditional contrastive alignment objective, and propose an importance-sampling approach to up-sample the importance of hard-negatives without adding additional complexity. On an extensive zero-shot benchmark of 29 tasks, our Distilled and Hard-negative Training (DiHT) approach improves on 20 tasks compared to the baseline. Furthermore, for few-shot linear probing, we propose a novel approach that bridges the gap between zero-shot and few-shot performance, substantially improving over prior work. Models are available at github.com/facebookresearch/diht.
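The hard-negative up-sampling idea above can be sketched as an InfoNCE-style loss where each negative's contribution to the denominator is re-weighted by its similarity to the anchor; the weighting form exp(beta * sim), the beta, and the temperature tau here are illustrative assumptions, not DiHT's exact objective:

```python
import math

def hard_negative_weighted_loss(pos_sim, neg_sims, beta=1.0, tau=0.07):
    """InfoNCE-style loss with importance weights on negatives: negatives more
    similar to the anchor (harder) get weight exp(beta * sim), normalized so
    the denominator keeps a comparable scale. beta=0 recovers uniform
    weighting."""
    weights = [math.exp(beta * s) for s in neg_sims]
    mean_w = sum(weights) / len(weights)
    weights = [w / mean_w for w in weights]
    num = math.exp(pos_sim / tau)
    den = num + sum(w * math.exp(s / tau) for w, s in zip(weights, neg_sims))
    return -math.log(num / den)
```

Up-weighting hard negatives increases the loss whenever near-duplicates of the positive sit in the batch, sharpening the alignment objective without changing its complexity.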

Meta Omnium: A Benchmark for General-Purpose Learning-To-Learn
Bohdal, Ondrej and Tian, Yinbing and Zong, Yongshuo and Chavhan, Ruchika and Li, Da and Gouk, Henry and Guo, Li and Hospedales, Timothy



Research problem: Whether meta-learning and other few-shot learning methods can generalize across diverse vision tasks.
Motivation: Meta-learning and few-shot learning methods are widely used for tasks such as image recognition, but whether they apply to other vision tasks such as pose estimation and dense prediction remains to be studied.
Method: The paper introduces the Meta Omnium dataset, which spans multiple vision tasks including recognition, keypoint localization, semantic segmentation, and regression, and experiments with popular meta-learning baselines to test their ability to generalize across these tasks and transfer knowledge between them.
Results: Meta Omnium enables meta-learning researchers to evaluate model generalization across a much wider array of tasks than previously possible, and provides a unified framework for evaluating meta-learners across a range of vision applications.

Meta-learning and other approaches to few-shot learning are widely studied for image recognition, and are increasingly applied to other vision tasks such as pose estimation and dense prediction. This naturally raises the question of whether there is any few-shot meta-learning algorithm capable of generalizing across these diverse task types. To support the community in answering this question, we introduce Meta Omnium, a dataset-of-datasets spanning multiple vision tasks including recognition, keypoint localization, semantic segmentation and regression. We experiment with popular few-shot meta-learning baselines and analyze their ability to generalize across tasks and to transfer knowledge between them. Meta Omnium enables meta-learning researchers to evaluate model generalization to a much wider array of tasks than previously possible, and provides a single framework for evaluating meta-learners across a wide suite of vision applications in a consistent manner.

Change-Aware Sampling and Contrastive Learning for Satellite Images
Mall, Utkarsh and Hariharan, Bharath and Bala, Kavita



Research problem: How to perform effective self-supervised learning with unlabeled satellite imagery.
Motivation: Vast spatio-temporal satellite image data is readily available, but most of it is unlabeled and therefore of little use to supervised learning algorithms.
Method: Exploiting characteristics unique to satellite images, such as the temporal signal and the fact that the same location changes little over time, a new contrastive loss, the Change-Aware Contrastive (CACo) loss, is formulated, together with a novel method of sampling different geographical regions.
Results: The method performs better on diverse downstream tasks; for example, relative improvements of 6.5% on semantic segmentation and 8.5% on change detection over the best-performing baseline.

Automatic remote sensing tools can help inform many large-scale challenges such as disaster management, climate change, etc. While a vast amount of spatio-temporal satellite image data is readily available, most of it remains unlabelled. Without labels, this data is not very useful for supervised learning algorithms. Self-supervised learning instead provides a way to learn effective representations for various downstream tasks without labels. In this work, we leverage characteristics unique to satellite images to learn better self-supervised features. Specifically, we use the temporal signal to contrast images with long-term and short-term differences, and we leverage the fact that satellite images do not change frequently. Using these characteristics, we formulate a new contrastive loss called Change-Aware Contrastive (CACo) Loss. Further, we also present a novel method of sampling different geographical regions. We show that leveraging these properties leads to better performance on diverse downstream tasks. For example, we see a 6.5% relative improvement for semantic segmentation and an 8.5% relative improvement for change detection over the best-performing baseline with our method.
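The temporal contrast above can be sketched with a toy pairwise loss in which, because satellite scenes change slowly, a short-term-apart view of the same location is treated as a positive and a long-term-apart view as a (change-aware) negative. This is a simplified illustration, not CACo's full formulation; tau is an assumed temperature:

```python
import math

def caco_pairwise_loss(anchor, short_term, long_term, tau=0.1):
    """Toy change-aware contrast: pull the anchor toward its short-term view
    and push it away from its long-term view, using cosine similarity on
    plain feature lists."""
    def cos(u, v):
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return sum(a * b for a, b in zip(u, v)) / (nu * nv)
    pos = math.exp(cos(anchor, short_term) / tau)
    neg = math.exp(cos(anchor, long_term) / tau)
    return -math.log(pos / (pos + neg))
```

The loss is small when the short-term view is already closer to the anchor than the long-term one, encoding the prior that long-term differences reflect genuine change.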

Large-Scale Training Data Search for Object Re-Identification
Yao, Yue and Gedeon, Tom and Zheng, Liang



Research problem: How to construct an alternative training set from a large-scale data pool so that a competitive model can be obtained when on-the-fly training data annotation is unaffordable.
Motivation: The setting targets object re-identification (re-ID), which aims to match the same object captured by different cameras, without access to on-the-fly training annotations.
Method: A search and pruning (SnP) solution with two stages: the search stage identifies and merges clusters of source identities whose distributions resemble the target domain; the second stage, subject to a budget, selects identities and their images from the first stage's output to control the size of the resulting training set for efficient training.
Results: The resulting training sets are 80% smaller than the source pool while achieving similar or even higher re-ID accuracy, and outperform existing search methods such as random and greedy sampling under the same budget. If the budget is released, training sets from the first stage alone yield even higher re-ID accuracy.

We consider a scenario where we have access to the target domain, but cannot afford on-the-fly training data annotation, and instead would like to construct an alternative training set from a large-scale data pool such that a competitive model can be obtained. We propose a search and pruning (SnP) solution to this training data search problem, tailored to object re-identification (re-ID), an application aiming to match the same object captured by different cameras. Specifically, the search stage identifies and merges clusters of source identities which exhibit similar distributions with the target domain. The second stage, subject to a budget, then selects identities and their images from the Stage I output, to control the size of the resulting training set for efficient training. The two steps provide us with training sets 80% smaller than the source pool while achieving a similar or even higher re-ID accuracy. These training sets are also shown to be superior to a few existing search methods such as random sampling and greedy sampling under the same budget on training data size. If we release the budget, training sets resulting from the first stage alone allow even higher re-ID accuracy. We provide interesting discussions on the specificity of our method to the re-ID problem and particularly its role in bridging the re-ID domain gap. The code is available at https://github.com/yorkeyao/SnP.

Uncertainty-Aware Unsupervised Image Deblurring With Deep Residual Prior
Tang, Xiaole and Zhao, Xile and Liu, Jun and Wang, Jianli and Miao, Yuchun and Zeng, Tieyong



Research problem: How to design a suitable prior for the kernel (or induced) error, given that kernel uncertainty is inevitable in practice.
Motivation: Existing non-blind deblurring methods perform well under the accurate blur-kernel assumption, but in real applications their performance degrades due to kernel (or induced) error.
Method: A dataset-free deep residual prior for the kernel-induced error (termed the residual), expressed by a customized untrained deep neural network, flexibly adapts to different blurs and images in real scenarios. By organically integrating the respective strengths of deep priors and hand-crafted priors, an unsupervised semi-blind deblurring model is proposed that recovers the latent image from the blurry image and an inaccurate blur kernel.
Results: Experiments show favorable performance compared with model-driven and data-driven methods in terms of image quality and robustness to different types of kernel error.

Non-blind deblurring methods achieve decent performance under the accurate blur kernel assumption. Since the kernel uncertainty (i.e. kernel error) is inevitable in practice, semi-blind deblurring is suggested to handle it by introducing the prior of the kernel (or induced) error. However, how to design a suitable prior for the kernel (or induced) error remains challenging. Hand-crafted prior, incorporating domain knowledge, generally performs well but may lead to poor performance when kernel (or induced) error is complex. Data-driven prior, which excessively depends on the diversity and abundance of training data, is vulnerable to out-of-distribution blurs and images. To address this challenge, we suggest a dataset-free deep residual prior for the kernel induced error (termed as residual) expressed by a customized untrained deep neural network, which allows us to flexibly adapt to different blurs and images in real scenarios. By organically integrating the respective strengths of deep priors and hand-crafted priors, we propose an unsupervised semi-blind deblurring model which recovers the latent image from the blurry image and inaccurate blur kernel. To tackle the formulated model, an efficient alternating minimization algorithm is developed. Extensive experiments demonstrate the favorable performance of the proposed method as compared to model-driven and data-driven methods in terms of image quality and the robustness to different types of kernel error.

Bridging the Gap Between Model Explanations in Partially Annotated Multi-Label Classification
Kim, Youngwook and Kim, JaeMyung and Jeong, Jieun and Schmid, Cordelia and Akata, Zeynep and Lee, Jungwoo



Research problem: Partial annotation is increasingly common in multi-label classification due to high labeling costs; the question is how to reduce the impact of unobserved labels on model explanations and improve model performance.
Motivation: In partially annotated multi-label classification, unobserved labels are commonly treated as negatives, which introduces label noise in the form of false negatives. False negative labels affect the model's explanations and degrade its performance.
Method: Comparing the explanations of models trained with full and partial labels shows that both highlight similar regions but with different scaling, the latter with lower attribution scores. The proposed method therefore boosts the attribution scores of the model trained with partial labels so that its explanation resembles that of the model trained with full labels.
Results: The method improves multi-label classification performance by a large margin on three datasets in the single-positive-label setting and one large-scale partial-label setting.

Due to the expensive costs of collecting labels in multi-label classification datasets, partially annotated multi-label classification has become an emerging field in computer vision. One baseline approach to this task is to assume unobserved labels as negative labels, but this assumption induces label noise as a form of false negative. To understand the negative impact caused by false negative labels, we study how these labels affect the model's explanation. We observe that the explanation of two models, trained with full and partial labels each, highlights similar regions but with different scaling, where the latter tends to have lower attribution scores. Based on these findings, we propose to boost the attribution scores of the model trained with partial labels to make its explanation resemble that of the model trained with full labels. Even with the conceptually simple approach, the multi-label classification performance improves by a large margin in three different datasets on a single positive label setting and one on a large-scale partial label setting. Code is available at https://github.com/youngwk/BridgeGapExplanationPAMC.

Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning
Sun, Weixuan and Zhang, Jiayi and Wang, Jianyuan and Liu, Zheyuan and Zhong, Yiran and Feng, Tianpeng and Guo, Yandong and Zhang, Yanhao and Barnes, Nick



Research question: Existing self-supervised audio-visual source localization methods can be misled by false negative samples during training, harming the learned representations.
Motivation: To address this, the authors propose False Negative Aware Contrastive learning (FNAC), which uses intra-modal similarities to identify potentially similar samples and builds corresponding adjacency matrices to guide contrastive learning.
Method: Visual features of sound sources are further used to strengthen the role of true negatives and help distinguish authentic sounding regions.
Results: FNAC achieves state-of-the-art performance on Flickr-SoundNet, VGG-Sound, and AVSBench, demonstrating that the method effectively mitigates the false negative problem.

Self-supervised audio-visual source localization aims to locate sound-source objects in video frames without extra annotations. Recent methods often approach this goal with the help of contrastive learning, which assumes only the audio and visual contents from the same video are positive samples for each other. However, this assumption would suffer from false negative samples in real-world training. For example, for an audio sample, treating the frames from the same audio class as negative samples may mislead the model and therefore harm the learned representations (e.g., the audio of a siren wailing may reasonably correspond to the ambulances in multiple images). Based on this observation, we propose a new learning strategy named False Negative Aware Contrastive (FNAC) to mitigate the problem of misleading the training with such false negative samples. Specifically, we utilize the intra-modal similarities to identify potentially similar samples and construct corresponding adjacency matrices to guide contrastive learning. Further, we propose to strengthen the role of true negative samples by explicitly leveraging the visual features of sound sources to facilitate the differentiation of authentic sounding source regions. FNAC achieves state-of-the-art performances on Flickr-SoundNet, VGG-Sound, and AVSBench, which demonstrates the effectiveness of our method in mitigating the false negative issue. The code is available at https://github.com/OpenNLPLab/FNAC_AVL
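The adjacency-guided idea — not pushing apart negatives that are highly similar to the anchor within a modality — can be illustrated with a minimal pure-Python sketch. The hard threshold and all names below are illustrative assumptions, not the paper's exact formulation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def negative_weights(intra_modal_feats, threshold=0.9):
    """Build a weighting matrix for contrastive negatives: pairs whose
    intra-modal similarity exceeds the threshold are treated as potential
    false negatives and suppressed (weight 0)."""
    n = len(intra_modal_feats)
    w = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and cosine(intra_modal_feats[i], intra_modal_feats[j]) > threshold:
                w[i][j] = 0.0
    return w

# Two siren clips with near-identical audio features should not be
# pushed apart as negatives, while a genuinely different clip should.
audio = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
w = negative_weights(audio)
```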

Improving the Transferability of Adversarial Samples by Path-Augmented Method
Zhang, Jianping and Huang, Jen-tse and Wang, Wenxuan and Li, Yichen and Wu, Weibin and Wang, Xiaosen and Su, Yuxin and Lyu, Michael R.



Research question: Deep neural networks succeed on diverse vision tasks but are highly sensitive to adversarial noise imperceptible to humans, which limits their deployment in practical, especially security-related, scenarios.
Motivation: To evaluate the robustness of a target model in practice, transfer-based attacks craft adversarial samples with a local model and have drawn wide attention from researchers for their efficiency.
Method: The Path-Augmented Method (PAM) first constructs a pool of candidate augmentation paths, then selects the paths used during adversarial sample generation via greedy search. To avoid augmenting semantics-inconsistent images, a Semantics Predictor (SP) is trained to constrain the length of the augmentation path.
Results: Extensive experiments show that PAM improves the attack success rate by over 4.8% on average compared with state-of-the-art baselines.

Deep neural networks have achieved unprecedented success on diverse vision tasks. However, they are vulnerable to adversarial noise that is imperceptible to humans. This phenomenon negatively affects their deployment in real-world scenarios, especially security-related ones. To evaluate the robustness of a target model in practice, transfer-based attacks craft adversarial samples with a local model and have attracted increasing attention from researchers due to their high efficiency. The state-of-the-art transfer-based attacks are generally based on data augmentation, which typically augments multiple training images from a linear path when learning adversarial samples. However, such methods selected the image augmentation path heuristically and may augment images that are semantics-inconsistent with the target images, which harms the transferability of the generated adversarial samples. To overcome the pitfall, we propose the Path-Augmented Method (PAM). Specifically, PAM first constructs a candidate augmentation path pool. It then settles the employed augmentation paths during adversarial sample generation with greedy search. Furthermore, to avoid augmenting semantics-inconsistent images, we train a Semantics Predictor (SP) to constrain the length of the augmentation path. Extensive experiments confirm that PAM can achieve an improvement of over 4.8% on average compared with the state-of-the-art baselines in terms of the attack success rates.

Robust Mean Teacher for Continual and Gradual Test-Time Adaptation
Döbler, Mario and Marsden, Robert A. and Yang, Bin



Research question: How to handle domain shifts at test time and address the error accumulation that results.
Motivation: Domain shifts at test time are unavoidable in practice, so a method is needed to adapt to them.
Method: A robust mean teacher (RMT) method is proposed that uses the symmetric cross-entropy as consistency loss and pulls the test feature space closer to the source domain via contrastive learning.
Results: State-of-the-art results on the continual and gradual corruption benchmarks CIFAR10C, CIFAR100C, and ImageNet-C, plus strong performance on a new continual DomainNet-126 benchmark.

Since experiencing domain shifts during test-time is inevitable in practice, test-time adaptation (TTA) continues to adapt the model after deployment. Recently, the area of continual and gradual TTA emerged. In contrast to standard TTA, continual TTA considers not only a single domain shift, but a sequence of shifts. Gradual TTA further exploits the property that some shifts evolve gradually over time. Since in both settings long test sequences are present, error accumulation needs to be addressed for methods relying on self-training. In this work, we propose and show that in the setting of TTA, the symmetric cross-entropy is better suited as a consistency loss for mean teachers compared to the commonly used cross-entropy. This is justified by our analysis of the (symmetric) cross-entropy's gradient properties. To pull the test feature space closer to the source domain, where the pre-trained model is well posed, contrastive learning is leveraged. Since applications differ in their requirements, we address several settings, including having source data available and the more challenging source-free setting. We demonstrate the effectiveness of our proposed method "robust mean teacher" (RMT) on the continual and gradual corruption benchmarks CIFAR10C, CIFAR100C, and ImageNet-C. We further consider ImageNet-R and propose a new continual DomainNet-126 benchmark. State-of-the-art results are achieved on all benchmarks.
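The symmetric cross-entropy consistency term between student and (mean-)teacher predictions can be sketched as follows — a minimal illustration on probability vectors, with function names that are ours, not the authors':

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log q_i."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def symmetric_cross_entropy(p_student, p_teacher):
    """Symmetric CE consistency loss: H(p_t, p_s) + H(p_s, p_t).
    Unlike plain CE, it treats both prediction directions equally."""
    return (cross_entropy(p_teacher, p_student)
            + cross_entropy(p_student, p_teacher))

# Student and teacher class-probability vectors over 3 classes.
p = [0.7, 0.2, 0.1]
q = [0.6, 0.3, 0.1]
loss = symmetric_cross_entropy(p, q)
```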

Understanding Imbalanced Semantic Segmentation Through Neural Collapse
Zhong, Zhisheng and Cui, Jiequan and Yang, Yibo and Wu, Xiaoyang and Qi, Xiaojuan and Zhang, Xiangyu and Jia, Jiaya



Research question: This work explores the structure of the last-layer feature centers and classifiers in semantic segmentation.
Motivation: Semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular, maximally separated structure of neural collapse; yet that symmetric structure is beneficial to discriminating the minor classes.
Method: A regularizer on feature centers is introduced to encourage the network to learn features closer to the appealing symmetric structure under imbalanced semantic segmentation.
Results: Significant improvements on both 2D and 3D semantic segmentation benchmarks, ranking first and setting a new record (+6.8% mIoU) on the ScanNet200 test leaderboard.

A recent study has shown a phenomenon called neural collapse in that the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks first and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard.
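For intuition, a simplex equiangular tight frame over K classes has all pairwise cosines between class centers equal to -1/(K-1). A regularizer pulling feature centers toward that structure might be sketched as follows — an illustrative stand-in, not the paper's exact regularizer:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def etf_regularizer(centers):
    """Mean squared deviation of pairwise cosines of class feature centers
    from the simplex-ETF value -1/(K-1)."""
    k = len(centers)
    target = -1.0 / (k - 1)
    loss, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            loss += (cosine(centers[i], centers[j]) - target) ** 2
            pairs += 1
    return loss / pairs

# Three centers forming a perfect 2D simplex ETF incur zero penalty.
etf = [[1.0, 0.0], [-0.5, math.sqrt(3) / 2], [-0.5, -math.sqrt(3) / 2]]
```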

Generalized UAV Object Detection via Frequency Domain Disentanglement
Wang, Kunyu and Fu, Xueyang and Huang, Yukun and Cao, Chengzhi and Shi, Gege and Zha, Zheng-Jun



Research question: When Unmanned Aerial Vehicle object detection (UAV-OD) networks are deployed to complex, unseen real-world scenarios, their generalization ability usually degrades due to domain shift.
Motivation: To address this, the paper proposes a novel frequency-domain disentanglement method to improve UAV-OD generalization.
Method: It first verifies that different frequency bands of an image affect UAV-OD generalization differently. Based on this conclusion, two learnable filters are designed to extract the domain-invariant and domain-specific spectra; the former is used to train the UAV-OD network and improve its generalization. In addition, a new instance-level contrastive loss guides network training, helping the network focus on extracting the two spectra and achieving better disentanglement.
Results: Experiments on three unseen target domains show better generalization than both the baseline and state-of-the-art methods.

When deploying the Unmanned Aerial Vehicles object detection (UAV-OD) network to complex and unseen real-world scenarios, the generalization ability is usually reduced due to the domain shift. To address this issue, this paper proposes a novel frequency domain disentanglement method to improve the UAV-OD generalization. Specifically, we first verified that the spectrum of different bands in the image has different effects on UAV-OD generalization. Based on this conclusion, we design two learnable filters to extract domain-invariant spectrum and domain-specific spectrum, respectively. The former can be used to train the UAV-OD network and improve its capacity for generalization. In addition, we design a new instance-level contrastive loss to guide the network training. This loss enables the network to concentrate on extracting domain-invariant spectrum and domain-specific spectrum, so as to achieve better disentangling results. Experimental results on three unseen target domains demonstrate that our method has better generalization ability than both the baseline method and state-of-the-art methods.

Source-Free Adaptive Gaze Estimation by Uncertainty Reduction
Cai, Xin and Zeng, Jiabei and Shan, Shiguang and Chen, Xilin



Research question: How to train a gaze estimator usable in real, diverse environments while avoiding the privacy and efficiency problems of jointly training on source and target data.
Motivation: Training data are usually collected under controlled conditions while trained gaze estimators must work in real, diverse environments, so cross-domain gaze estimation needs to be explored.
Method: An unsupervised source-free domain adaptation approach is proposed that adapts to unlabeled target domains, without source data, by reducing both sample and model uncertainty.
Results: Extensive experiments on six cross-domain tasks show that the method outperforms other state-of-the-art cross-domain gaze estimation methods, both with and without source data.

Gaze estimation across domains has been explored recently because the training data are usually collected under controlled conditions while the trained gaze estimators are used in real and diverse environments. However, due to privacy and efficiency concerns, simultaneous access to annotated source data and to-be-predicted target data can be challenging. In light of this, we present an unsupervised source-free domain adaptation approach for gaze estimation, which adapts a source-trained gaze estimator to unlabeled target domains without source data. We propose the Uncertainty Reduction Gaze Adaptation (UnReGA) framework, which achieves adaptation by reducing both sample and model uncertainty. Sample uncertainty is mitigated by enhancing image quality and making the images gaze-estimation-friendly, whereas model uncertainty is reduced by minimizing prediction variance on the same inputs. Extensive experiments are conducted on six cross-domain tasks, demonstrating the effectiveness of UnReGA and its components. Results show that UnReGA outperforms other state-of-the-art cross-domain gaze estimation methods under both protocols, with and without source data.

SuperDisco: Super-Class Discovery Improves Visual Recognition for the Long-Tail
Du, Yingjun and Shen, Jiayi and Zhen, Xiantong and Snoek, Cees G. M.



Research question: How to remedy the performance drop of modern image classifiers on tail classes with only a few instances.
Motivation: Humans handle the long-tailed recognition challenge effortlessly, whereas existing image classifiers degrade markedly on tail classes.
Method: The SuperDisco algorithm discovers super-class representations for long-tailed recognition using a graph model. Via message passing on the super-class graph, image representations are rectified and refined by attending to the most relevant entities according to semantic similarity.
Results: Experiments on long-tailed CIFAR-100, ImageNet, Places, and iNaturalist show that the method effectively improves long-tailed recognition and achieves consistent state-of-the-art results.

Modern image classifiers perform well on populated classes while degrading considerably on tail classes with only a few instances. Humans, by contrast, effortlessly handle the long-tailed recognition challenge, since they can learn the tail representation based on different levels of semantic abstraction, making the learned tail features more discriminative. This phenomenon motivated us to propose SuperDisco, an algorithm that discovers super-class representations for long-tailed recognition using a graph model. We learn to construct the super-class graph to guide the representation learning to deal with long-tailed distributions. Through message passing on the super-class graph, image representations are rectified and refined by attending to the most relevant entities based on the semantic similarity among their super-classes. Moreover, we propose to meta-learn the super-class graph under the supervision of a prototype graph constructed from a small amount of imbalanced data. By doing so, we obtain a more robust super-class graph that further improves the long-tailed recognition performance. The consistent state-of-the-art experiments on the long-tailed CIFAR-100, ImageNet, Places, and iNaturalist demonstrate the benefit of the discovered super-class graph for dealing with long-tailed distributions.
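One round of graph message passing — refining each representation as a similarity-weighted average of related entities — can be sketched minimally with plain row-normalized aggregation. This is only an illustration of the mechanism; the actual SuperDisco graph is meta-learned and attends over super-classes:

```python
def message_pass(adj, feats):
    """One round of message passing: each node's representation becomes a
    weighted average of its neighbors' features, with weights given by the
    row-normalized adjacency matrix."""
    dim = len(feats[0])
    out = []
    for row in adj:
        total = sum(row)
        out.append([sum(a * f[d] for a, f in zip(row, feats)) / total
                    for d in range(dim)])
    return out

# A node connected to two entities is pulled toward their mean,
# while an isolated self-loop node keeps its feature.
adj = [[1.0, 1.0], [0.0, 1.0]]
feats = [[2.0, 0.0], [0.0, 2.0]]
refined = message_pass(adj, feats)
```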

Improving Generalization of Meta-Learning With Inverted Regularization at Inner-Level
Wang, Lianzhe and Zhou, Shiji and Zhang, Shanghang and Chu, Xu and Chang, Heng and Zhu, Wenwu



Research question: Generalization remains a key challenge in meta-learning. Existing work regularizes the meta-loss for meta-generalization to unseen tasks, but ignores that adapted models may fail to generalize to task domains at the adaptation level.
Motivation: The paper proposes a new regularization mechanism for meta-learning, Minimax-Meta Regularization, which applies inverted regularization in the inner loop and ordinary regularization in the outer loop during training.
Method: Specifically, the inner inverted regularization makes the adapted model harder to generalize to task domains; optimizing the outer-loop loss therefore forces the meta-model to learn meta-knowledge that generalizes better.
Results: Theoretically, inverted regularization is proven to improve meta-test performance by reducing generalization error. Extensive experiments on representative scenarios show that the method consistently improves the performance of meta-learning algorithms.

Despite the broad interest in meta-learning, the generalization problem remains one of the significant challenges in this field. Existing works focus on meta-generalization to unseen tasks at the meta-level by regularizing the meta-loss, while ignoring that adapted models may not generalize to the task domains at the adaptation level. In this paper, we propose a new regularization mechanism for meta-learning -- Minimax-Meta Regularization, which employs inverted regularization at the inner loop and ordinary regularization at the outer loop during training. In particular, the inner inverted regularization makes the adapted model more difficult to generalize to task domains; thus, optimizing the outer-loop loss forces the meta-model to learn meta-knowledge with better generalization. Theoretically, we prove that inverted regularization improves the meta-testing performance by reducing generalization errors. We conduct extensive experiments on the representative scenarios, and the results show that our method consistently improves the performance of meta-learning algorithms.
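The asymmetry between the two loops reduces to flipping the sign of the penalty. A toy L2 sketch, with λ and the function names assumed for illustration (the paper's regularizers may take a different form):

```python
def l2(w):
    """Squared L2 norm of a parameter vector."""
    return sum(x * x for x in w)

def inner_loss(task_loss, w, lam=0.1):
    """Inner loop: INVERTED regularization (the penalty is subtracted),
    making the adapted model harder to generalize within the task."""
    return task_loss - lam * l2(w)

def outer_loss(meta_loss, w, lam=0.1):
    """Outer loop: ordinary L2 regularization on the meta-parameters."""
    return meta_loss + lam * l2(w)

w = [0.5, -0.3]
```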

Data-Efficient Large Scale Place Recognition With Graded Similarity Supervision
Leyva-Vallina, María



Research question: Visual place recognition (VPR) is a fundamental computer vision task, but existing methods train on image pairs labeled only as depicting the same place or not. This binary indication ignores the continuous similarity between images of the same place taken from different positions.
Motivation: Because of camera-pose differences, two images of the same place share only part of their visual cues, so a new automatic re-annotation strategy is proposed to re-label VPR datasets.
Method: Graded similarity labels are computed for image pairs from available localization metadata, and a new Generalized Contrastive Loss (GCL) uses these graded labels to train contrastive networks.
Results: The new labels and GCL make hard-pair mining unnecessary and yield image descriptors that perform better in VPR via nearest-neighbor search, matching or surpassing methods that require expensive hard-pair mining and re-ranking techniques.

Visual place recognition (VPR) is a fundamental task of computer vision for visual localization. Existing methods are trained using image pairs that either depict the same place or not. Such a binary indication does not consider continuous relations of similarity between images of the same place taken from different positions, determined by the continuous nature of camera pose. The binary similarity induces a noisy supervision signal into the training of VPR methods, which stall in local minima and require expensive hard mining algorithms to guarantee convergence. Motivated by the fact that two images of the same place only partially share visual cues due to camera pose differences, we deploy an automatic re-annotation strategy to re-label VPR datasets. We compute graded similarity labels for image pairs based on available localization metadata. Furthermore, we propose a new Generalized Contrastive Loss (GCL) that uses graded similarity labels for training contrastive networks. We demonstrate that the use of the new labels and GCL allows us to dispense with hard-pair mining, and to train image descriptors that perform better in VPR by nearest neighbor search, obtaining results superior or comparable to methods that require expensive hard-pair mining and re-ranking techniques.
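A graded-similarity contrastive loss can be sketched by letting a similarity degree ψ ∈ [0, 1] interpolate between the positive and negative terms of the classical contrastive loss. This is an illustrative form; the paper's exact GCL may differ in weighting and margin:

```python
def generalized_contrastive_loss(d, psi, margin=1.0):
    """Graded contrastive loss on descriptor distance d.
    psi in [0, 1] replaces the binary same-place label:
    psi = 1 reduces to the pure positive (attraction) term,
    psi = 0 to the ordinary hinge (repulsion) term."""
    positive = psi * d ** 2
    negative = (1.0 - psi) * max(0.0, margin - d) ** 2
    return 0.5 * (positive + negative)
```

A perfectly matched pair at zero distance, and a fully dissimilar pair beyond the margin, both incur zero loss; a half-similar pair at intermediate distance is penalized from both sides.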

OpenMix: Exploring Outlier Samples for Misclassification Detection
Zhu, Fei and Cheng, Zhen and Zhang, Xu-Yao and Liu, Cheng-Lin



Research question: How to obtain reliable confidence estimates from deep neural classifiers, especially in high-stakes applications.
Motivation: Modern deep neural networks are often overconfident in their erroneous predictions, which is problematic in high-stakes applications.
Method: Easily available outlier samples, i.e., unlabeled samples from non-target classes, are exploited to help detect misclassification errors. Notably, the well-known Outlier Exposure, though powerful for detecting out-of-distribution (OOD) samples from unknown classes, provides no gain in identifying misclassification errors. Based on these observations, the proposed OpenMix method incorporates open-world knowledge by learning to reject uncertain pseudo-samples generated via outlier transformation.
Results: OpenMix significantly improves confidence reliability across scenarios, establishing a strong, unified framework for detecting both misclassified samples from known classes and OOD samples from unknown classes.

Reliable confidence estimation for deep neural classifiers is a challenging yet fundamental requirement in high-stakes applications. Unfortunately, modern deep neural networks are often overconfident for their erroneous predictions. In this work, we exploit the easily available outlier samples, i.e., unlabeled samples coming from non-target classes, for helping detect misclassification errors. Particularly, we find that the well-known Outlier Exposure, which is powerful in detecting out-of-distribution (OOD) samples from unknown classes, does not provide any gain in identifying misclassification errors. Based on these observations, we propose a novel method called OpenMix, which incorporates open-world knowledge by learning to reject uncertain pseudo-samples generated via outlier transformation. OpenMix significantly improves confidence reliability under various scenarios, establishing a strong and unified framework for detecting both misclassified samples from known classes and OOD samples from unknown classes.

Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection
Munir, Muhammad Akhtar and Khan, Muhammad Haris and Khan, Salman and Khan, Fahad Shahbaz



Research question: Deep neural networks achieve remarkable progress on vision problems but tend to make overconfident predictions, i.e., they are poorly calibrated.
Motivation: Most work on DNN calibration focuses on classification; the calibration of DNN-based object detectors, which are central to many safety-critical vision applications, remains largely unstudied.
Method: Inspired by train-time calibration methods, the paper proposes a new auxiliary loss that explicitly aligns the class confidence of bounding boxes with prediction accurateness (i.e., precision). Since the original loss depends on the counts of true and false positives in a minibatch, a differentiable proxy is developed that can be used during training alongside other application-specific losses.
Results: Extensive experiments on challenging in-domain and out-of-domain scenarios over six benchmark datasets, including MS-COCO, Cityscapes, Sim10k, and BDD100k, show that the train-time loss surpasses strong calibration baselines in reducing calibration error in both settings.

Deep neural networks (DNNs) have enabled astounding progress in several vision-based problems. Despite showing high predictive accuracy, recently, several works have revealed that they tend to provide overconfident predictions and thus are poorly calibrated. The majority of the works addressing the miscalibration of DNNs fall under the scope of classification and consider only in-domain predictions. However, there is little to no progress in studying the calibration of DNN-based object detection models, which are central to many vision-based safety-critical applications. In this paper, inspired by the train-time calibration methods, we propose a novel auxiliary loss formulation that explicitly aims to align the class confidence of bounding boxes with the accurateness of predictions (i.e. precision). Since the original formulation of our loss depends on the counts of true positives and false positives in a minibatch, we develop a differentiable proxy of our loss that can be used during training with other application-specific loss functions. We perform extensive experiments on challenging in-domain and out-domain scenarios with six benchmark datasets including MS-COCO, Cityscapes, Sim10k, and BDD100k. Our results reveal that our train-time loss surpasses strong calibration baselines in reducing calibration error for both in and out-domain scenarios. Our source code and pre-trained models are available at https://github.com/akhtarvision/bpc_calibration
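The intent of the loss — aligning detection confidence with precision at the minibatch level — can be sketched with a non-differentiable toy version. The paper's actual loss uses a differentiable proxy for the TP/FP counts; the names and this batch-level form are ours:

```python
def calibration_gap(confidences, is_true_positive):
    """Batch-level proxy of miscalibration for a detector:
    |mean detection confidence - precision|,
    where precision = TP / (TP + FP) over the minibatch detections."""
    tp = sum(is_true_positive)
    precision = tp / len(is_true_positive)
    mean_conf = sum(confidences) / len(confidences)
    return abs(mean_conf - precision)

# Two detections at 0.9 confidence, only one of which is correct:
# the detector is overconfident by 0.4.
gap = calibration_gap([0.9, 0.9], [1, 0])
```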

Adaptive Data-Free Quantization
Qian, Biao and Wang, Yang and Hong, Richang and Wang, Meng



Research question: How can a quantized network (Q) recover its performance without the original data, and are the generated fake samples actually beneficial to its learning process?
Motivation: Current data-free quantization methods overlook the adaptability of generated samples to Q, causing the generalization error to overflow.
Method: An Adaptive Data-Free Quantization (AdaDFQ) method is proposed that revisits sample adaptability from a zero-sum game perspective between a generator and the quantized network, optimizing the margin between the disagreement and agreement boundaries to address over- and under-fitting.
Results: Experiments show that AdaDFQ outperforms the state of the art; the generated samples should be not only informative but also related to the category and distribution information of the training data.

Data-free quantization (DFQ) recovers the performance of quantized network (Q) without the original data, but generates the fake sample via a generator (G) by learning from full-precision network (P), which, however, is totally independent of Q, overlooking the adaptability of the knowledge from generated samples, i.e., informative or not to the learning process of Q, resulting into the overflow of generalization error. Building on this, several critical questions -- how to measure the sample adaptability to Q under varied bit-width scenarios? whether the largest adaptability is the best? how to generate the samples with adaptive adaptability to improve Q's generalization? To answer the above questions, in this paper, we propose an Adaptive Data-Free Quantization (AdaDFQ) method, which revisits DFQ from a zero-sum game perspective upon the sample adaptability between two players -- a generator and a quantized network. Following this viewpoint, we further define the disagreement and agreement samples to form two boundaries, where the margin between two boundaries is optimized to adaptively regulate the adaptability of generated samples to Q, so as to address the over-and-under fitting issues. Our AdaDFQ reveals: 1) the largest adaptability is NOT the best for sample generation to benefit Q's generalization; 2) the knowledge of the generated sample should not be informative to Q only, but also related to the category and distribution information of the training data for P. The theoretical and empirical analysis validate the advantages of AdaDFQ over the state-of-the-arts. Our code is available at https://github.com/hfutqian/AdaDFQ.

Ground-Truth Free Meta-Learning for Deep Compressive Sampling
Qin, Xinran and Quan, Yuhui and Pang, Tongyao and Ji, Hui



Research question: The paper proposes a ground-truth (GT) free meta-learning method for high-quality image reconstruction in compressive sampling (CS).
Motivation: Deep learning has become central to CS image reconstruction but requires large amounts of labeled data; this work leverages both external and internal learning for unsupervised high-quality reconstruction without ground truth.
Method: The method first trains a deep model via external meta-learning using only CS measurements, then adapts the trained model to each test sample by exploiting its internal characteristics. Both meta-learning and model adaptation build on an improved Stein's unbiased risk estimator (iSURE), which provides efficient computation and effective guidance for accurate prediction in the range space of the adjoint of the measurement matrix.
Results: Experimental results show that the GT-free method performs well and can even compete with supervised learning-based methods.

Deep learning has become an important tool for reconstructing images in compressive sampling (CS). This paper proposes a ground-truth (GT) free meta-learning method for CS, which leverages both external and internal learning for unsupervised high-quality image reconstruction. The proposed method first trains a deep model via external meta-learning using only CS measurements, and then efficiently adapts the trained model to a test sample for further improvement by exploiting its internal characteristics. The meta-learning and model adaptation are built on an improved Stein's unbiased risk estimator (iSURE) that provides efficient computation and effective guidance for accurate prediction in the range space of the adjoint of the measurement matrix. To further improve the learning on the null space of the measurement matrix, a modified model-agnostic meta-learning scheme is proposed, along with a null-space-consistent loss and a bias-adaptive deep unrolling network to improve and accelerate model adaption in test time. Experimental results have demonstrated that the proposed GT-free method performs well, and can even compete with supervised learning-based methods.
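Stein's unbiased risk estimation, which the method builds on, estimates a denoiser's MSE without any ground truth. A Monte Carlo sketch for a generic denoiser f under i.i.d. Gaussian noise — this is plain SURE, not the paper's improved iSURE, and all names are illustrative:

```python
import random

def sure(f, y, sigma, delta=1e-4, seed=0):
    """Monte Carlo SURE: ||f(y) - y||^2 - n*sigma^2 + 2*sigma^2 * div f(y),
    with the divergence estimated by a Rademacher probe
    (eps . (f(y + delta*eps) - f(y))) / delta."""
    n = len(y)
    fy = f(y)
    residual = sum((a - b) ** 2 for a, b in zip(fy, y))
    rng = random.Random(seed)
    eps = [rng.choice([-1.0, 1.0]) for _ in range(n)]
    y_pert = [a + delta * e for a, e in zip(y, eps)]
    div = sum(e * (a - b) for e, a, b in zip(eps, f(y_pert), fy)) / delta
    return residual - n * sigma ** 2 + 2 * sigma ** 2 * div

# For the identity "denoiser" f(y) = y: residual = 0 and div = n,
# so SURE = n*sigma^2, the MSE of doing nothing.
identity = lambda v: list(v)
est = sure(identity, [0.3, -1.2, 0.7], sigma=0.5)
```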

DistractFlow: Improving Optical Flow Estimation via Realistic Distractions and Pseudo-Labeling
Jeong, Jisoo and Cai, Hong and Garrepalli, Risheek and Porikli, Fatih



Research question: How to train optical flow estimation models by introducing realistic distractions.
Motivation: Existing data augmentation mostly makes low-level modifications; using semantically meaningful distractors lets the model learn related variations and gain robustness against challenging deviations.
Method: A new data augmentation approach, DistractFlow, combines one frame of a pair with a distractor image depicting a similar domain, inducing visual perturbations consistent with natural objects and scenes. Two supervised losses are defined: one between the original pair's estimated flow and its ground truth, and one between the distracted pair's flow and the original pair's ground truth.
Results: Extensive evaluations on multiple benchmarks, including Sintel, KITTI, and SlowFlow, show that DistractFlow consistently improves existing models and outperforms the latest state of the art.

We propose a novel data augmentation approach, DistractFlow, for training optical flow estimation models by introducing realistic distractions to the input frames. Based on a mixing ratio, we combine one of the frames in the pair with a distractor image depicting a similar domain, which allows for inducing visual perturbations congruent with natural objects and scenes. We refer to such pairs as distracted pairs. Our intuition is that using semantically meaningful distractors enables the model to learn related variations and attain robustness against challenging deviations, compared to conventional augmentation schemes focusing only on low-level aspects and modifications. More specifically, in addition to the supervised loss computed between the estimated flow for the original pair and its ground-truth flow, we include a second supervised loss defined between the distracted pair's flow and the original pair's ground-truth flow, weighted with the same mixing ratio. Furthermore, when unlabeled data is available, we extend our augmentation approach to self-supervised settings through pseudo-labeling and cross-consistency regularization. Given an original pair and its distracted version, we enforce the estimated flow on the distracted pair to agree with the flow of the original pair. Our approach allows increasing the number of available training pairs significantly without requiring additional annotations. It is agnostic to the model architecture and can be applied to training any optical flow estimation models. Our extensive evaluations on multiple benchmarks, including Sintel, KITTI, and SlowFlow, show that DistractFlow improves existing models consistently, outperforming the latest state of the art.
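The frame mixing and the α-weighted second supervised loss can be sketched in a few lines. The mixing ratio α and function names are illustrative; images are plain 2D lists here rather than tensors:

```python
def distract(frame2, distractor, alpha):
    """Mix the second frame of a pair with a distractor image at ratio
    alpha, producing the 'distracted' frame."""
    return [[alpha * a + (1 - alpha) * b for a, b in zip(ra, rb)]
            for ra, rb in zip(frame2, distractor)]

def total_loss(loss_original, loss_distracted, alpha):
    """Supervised loss on the original pair plus the distracted pair's
    loss against the ORIGINAL ground-truth flow, weighted by alpha."""
    return loss_original + alpha * loss_distracted

mixed = distract([[1.0]], [[0.0]], 0.75)
```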

Flexible-Cm GAN: Towards Precise 3D Dose Prediction in Radiotherapy
Gao, Riqiang and Lou, Bin and Xu, Zhoubing and Comaniciu, Dorin and Kamen, Ali



Research question: How to use deep learning for knowledge-based radiotherapy planning across diverse clinical scenarios.
Motivation: Existing deep methods handle only simple scenarios, e.g., a fixed planning type or a consistent beam-angle configuration, which limits their generality and practicality.
Method: A novel conditional generative model, Flexible-C^m GAN, is proposed that exploits additional information on treatment planning types and various beam geometries, together with a miss-consistency loss to handle input data with an incomplete set of conditions.
Results: Experiments show superior performance over existing deep learning approaches in a practical heterogeneous radiotherapy planning application, letting users flexibly choose a specific planning type and beam angles to meet clinical requirements.

Deep learning has been utilized in knowledge-based radiotherapy planning in which a system trained with a set of clinically approved plans is employed to infer a three-dimensional dose map for a given new patient. However, previous deep methods are primarily limited to simple scenarios, e.g., a fixed planning type or a consistent beam angle configuration. This in fact limits the usability of such approaches and makes them not generalizable over a larger set of clinical scenarios. Herein, we propose a novel conditional generative model, Flexible-C^m GAN, utilizing additional information regarding planning types and various beam geometries. A miss-consistency loss is proposed to deal with the challenge of having a limited set of conditions on the input data, e.g., incomplete training samples. To address the challenges of including clinical preferences, we derive a differentiable shift-dose-volume loss to incorporate the well-known dose-volume histogram constraints. During inference, users can flexibly choose a specific planning type and a set of beam angles to meet the clinical requirements. We conduct experiments on an illustrative face dataset to show the motivation of Flexible-C^m GAN and further validate our model's potential clinical values with two radiotherapy datasets. The results demonstrate the superior performance of the proposed method in a practical heterogeneous radiotherapy planning application compared to existing deep learning-based approaches.

Learning To Measure the Point Cloud Reconstruction Loss in a Representation Space
Huang, Tianxin and Ding, Zhonggan and Zhang, Jiangning and Tai, Ying and Zhang, Zhenyu and Chen, Mingang and Wang, Chengjie and Liu, Yong



Research question: For point cloud reconstruction-related tasks, how to evaluate the shape difference between reconstructed results and ground truths more accurately.
Motivation: Existing methods typically measure the training loss with point-to-point Euclidean distance, which can introduce extra defects because predefined matching rules may deviate from the real shape differences.
Method: A learning-based Contrastive Adversarial Loss (CALoss) is proposed that measures the point cloud reconstruction loss dynamically in a non-linear representation space by combining a contrastive constraint with an adversarial strategy. Specifically, the contrastive constraint helps CALoss learn a representation space with shape similarity, while the adversarial strategy helps it mine differences between reconstructed results and ground truths.
Results: Experiments show that CALoss helps task networks improve reconstruction performance and learn more representative representations.

For point cloud reconstruction-related tasks, the reconstruction losses to evaluate the shape differences between reconstructed results and the ground truths are typically used to train the task networks. Most existing works measure the training loss with point-to-point distance, which may introduce extra defects as predefined matching rules may deviate from the real shape differences. Although some learning-based works have been proposed to overcome the weaknesses of manually-defined rules, they still measure the shape differences in 3D Euclidean space, which may limit their ability to capture defects in reconstructed shapes. In this work, we propose a learning-based Contrastive Adversarial Loss (CALoss) to measure the point cloud reconstruction loss dynamically in a non-linear representation space by combining the contrastive constraint with the adversarial strategy. Specifically, we use the contrastive constraint to help CALoss learn a representation space with shape similarity, while we introduce the adversarial strategy to help CALoss mine differences between reconstructed results and ground truths. According to experiments on reconstruction-related tasks, CALoss can help task networks improve reconstruction performances and learn more representative representations.

Back to the Source: Diffusion-Driven Adaptation To Test-Time Corruption
Gao, Jin and Zhang, Jialing and Liu, Xihui and Darrell, Trevor and Shelhamer, Evan and Wang, Dequan



Research question: How can test inputs be used to improve the accuracy of a model trained on source data?
Motivation: Most methods update the source model by re-training on target data, which is sensitive to the amount and order of the data and to the optimization hyperparameters.
Method: Instead, the target data are updated: all test inputs are projected toward the source domain with a generative diffusion model. The diffusion-driven adaptation (DDA) method shares its classification and generation models across all domains, training both on source and then freezing them for all targets, avoiding expensive domain-specific re-training.
Results: On the ImageNet-C benchmark, input adaptation by DDA is more robust than model adaptation across a variety of corruptions, models, and data regimes. With its input-wise updates, DDA succeeds where model adaptation degrades on too little data (small batches), dependent data (correlated orders), or mixed data (multiple corruptions).

Test-time adaptation harnesses test inputs to improve the accuracy of a model trained on source data when tested on shifted target data. Most methods update the source model by (re-)training on each target domain. While re-training can help, it is sensitive to the amount and order of the data and the hyperparameters for optimization. We update the target data instead, and project all test inputs toward the source domain with a generative diffusion model. Our diffusion-driven adaptation (DDA) method shares its models for classification and generation across all domains, training both on source then freezing them for all targets, to avoid expensive domain-wise re-training. We augment diffusion with image guidance and classifier self-ensembling to automatically decide how much to adapt. Input adaptation by DDA is more robust than model adaptation across a variety of corruptions, models, and data regimes on the ImageNet-C benchmark. With its input-wise updates, DDA succeeds where model adaptation degrades on too little data (small batches), on dependent data (correlated orders), or on mixed data (multiple corruptions).

Regularizing Second-Order Influences for Continual Learning
Sun, Zhicheng and Mu, Yadong and Hua, Gang



Research question: This paper addresses the forgetting of previous knowledge in continual learning.
Motivation: Replay-based methods rehearse seen data from a small buffer and require careful sample selection, but existing selection schemes typically maximize only the utility of the current selection, ignoring the interference between successive selection rounds.
Method: Within a framework built on influence functions, the paper analyzes the interaction between sequential selection steps and identifies a new class of second-order influences that gradually amplify incidental bias in the replay buffer; a novel selection objective is proposed to regularize these second-order effects, with clear connections to two widely adopted criteria.
Results: Experiments on multiple continual learning benchmarks show that the method outperforms state-of-the-art approaches.

Continual learning aims to learn on non-stationary data streams without catastrophically forgetting previous knowledge. Prevalent replay-based methods address this challenge by rehearsing on a small buffer holding the seen data, for which a delicate sample selection strategy is required. However, existing selection schemes typically seek only to maximize the utility of the ongoing selection, overlooking the interference between successive rounds of selection. Motivated by this, we dissect the interaction of sequential selection steps within a framework built on influence functions. We manage to identify a new class of second-order influences that will gradually amplify incidental bias in the replay buffer and compromise the selection process. To regularize the second-order effects, a novel selection objective is proposed, which also has clear connections to two widely adopted criteria. Furthermore, we present an efficient implementation for optimizing the proposed criterion. Experiments on multiple continual learning benchmarks demonstrate the advantage of our approach over state-of-the-art methods. Code is available at https://github.com/feifeiobama/InfluenceCL.

GradICON: Approximate Diffeomorphisms via Gradient Inverse Consistency
Tian, Lin and Greer, Hastings and Vialard, François



Research question: How to learn regular spatial transformations between image pairs in medical image registration.
Motivation: Unlike optimization-based registration techniques and many modern learning-based methods, transformation irregularity is not penalized directly; instead, transformation regularity is promoted via an inverse-consistency penalty.
Method: A neural network predicts the map between a source and a target image, as well as the map when source and target are swapped. Unlike existing approaches, the two resulting maps are composed and the deviation of the Jacobian of this composition from the identity matrix is regularized.
Results: This regularizer, GradICON, yields much better convergence when training registration models than directly promoting inverse consistency of the composition of maps, while retaining the latter's desirable implicit regularization. State-of-the-art registration performance is achieved on a variety of real-world medical image datasets with a single set of hyperparameters and a single non-dataset-specific training protocol.

We present an approach to learning regular spatial transformations between image pairs in the context of medical image registration. Contrary to optimization-based registration techniques and many modern learning-based methods, we do not directly penalize transformation irregularities but instead promote transformation regularity via an inverse consistency penalty. We use a neural network to predict a map between a source and a target image as well as the map when swapping the source and target images. Different from existing approaches, we compose these two resulting maps and regularize deviations of the Jacobian of this composition from the identity matrix. This regularizer -- GradICON -- results in much better convergence when training registration models compared to promoting inverse consistency of the composition of maps directly while retaining the desirable implicit regularization effects of the latter. We achieve state-of-the-art registration performance on a variety of real-world medical image datasets using a single set of hyperparameters and a single non-dataset-specific training protocol. The code is available at https://github.com/uncbiag/ICON.
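In one dimension the regularizer reduces to penalizing the derivative of the composed map φ_AB∘φ_BA for deviating from 1, the identity's derivative. A finite-difference sketch, for illustration only — the real method operates on multi-dimensional displacement fields and their full Jacobians:

```python
def gradicon_penalty(phi_ab, phi_ba, xs, h=1e-5):
    """1D gradient inverse consistency: mean squared deviation of the
    derivative of phi_ab(phi_ba(x)) from 1, via central differences."""
    comp = lambda x: phi_ab(phi_ba(x))
    penalty = 0.0
    for x in xs:
        deriv = (comp(x + h) - comp(x - h)) / (2 * h)
        penalty += (deriv - 1.0) ** 2
    return penalty / len(xs)

# A map composed with its exact inverse is the identity: zero penalty.
forward = lambda x: 2.0 * x + 1.0
inverse = lambda x: (x - 1.0) / 2.0
```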

Distribution Shift Inversion for Out-of-Distribution Prediction
Yu, RunpengandLiu, SonghuaandYang, XingyiandWang, Xinchao



Research question: How to directly mitigate the distribution shift in an unseen test set, given that the test distribution is unavailable during training and a distribution translator mapping between the training and test distributions therefore cannot be trained.
Motivation: Existing Out-of-Distribution (OoD) algorithms handle the shift between training and test distributions by searching for a unified predictor or an invariant feature representation; directly translating test samples back toward the training distribution is rarely investigated.
Method: A portable Distribution Shift Inversion (DSI) algorithm is proposed: OoD test samples are first linearly combined with additional Gaussian noise and then transferred back toward the training distribution using a diffusion model trained only on the source distribution.
Results: Theoretical analysis and experiments on both multi-domain and single-domain generalization datasets show a general performance gain when the method is plugged into a wide range of commonly used OoD algorithms.

Machine learning society has witnessed the emergence of a myriad of Out-of-Distribution (OoD) algorithms, which address the distribution shift between the training and the testing distribution by searching for a unified predictor or invariant feature representation. However, the task of directly mitigating the distribution shift in the unseen testing set is rarely investigated, due to the unavailability of the testing distribution during the training phase and thus the impossibility of training a distribution translator mapping between the training and testing distribution. In this paper, we explore how to bypass the requirement of testing distribution for distribution translator training and make the distribution translation useful for OoD prediction. We propose a portable Distribution Shift Inversion (DSI) algorithm, in which, before being fed into the prediction model, the OoD testing samples are first linearly combined with additional Gaussian noise and then transferred back towards the training distribution using a diffusion model trained only on the source distribution. Theoretical analysis reveals the feasibility of our method. Experimental results, on both multiple-domain generalization datasets and single-domain generalization datasets, show that our method provides a general performance gain when plugged into a wide range of commonly used OoD algorithms. Our code is available at https://github.com/yu-rp/Distribution-Shift-Iverson.
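The preprocessing step described here — mix the OoD sample with Gaussian noise, then denoise it back with a source-trained model — can be sketched as follows. This is a loose sketch under stated assumptions: `denoiser` is a hypothetical stand-in for the reverse process of a diffusion model trained on source data (not implemented here), and `lam` is an assumed mixing coefficient.

```python
import numpy as np

def shift_invert(x_ood, denoiser, lam=0.3, rng=None):
    """DSI-style preprocessing sketch.

    Linearly combines an OoD test sample with Gaussian noise, then hands the
    result to `denoiser`, which stands in for the reverse process of a
    diffusion model trained only on the source distribution.
    """
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal(x_ood.shape)
    # variance-preserving mix; lam controls the noise share
    x_noisy = np.sqrt(1.0 - lam) * x_ood + np.sqrt(lam) * eps
    return denoiser(x_noisy)

# With lam=0 and an identity "denoiser", the sample passes through unchanged.
x = np.ones((2, 2))
out = shift_invert(x, lambda z: z, lam=0.0, rng=0)
```

The translated sample, now closer to the training distribution, is then fed to whatever OoD predictor the method is plugged into.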

Soft Augmentation for Image Classification
Liu, YangandYan, ShenandLeal-Taix\'e, LauraandHays, JamesandRamanan, Deva



Research question: How can data augmentation reduce overfitting and improve generalization?
Motivation: Modern neural network models are over-parameterized and rely on strong regularization such as data augmentation and weight decay.
Method: Drawing inspiration from human visual classification studies, the invariant transforms used in data augmentation are generalized to soft augmentation, where the learning target softens non-linearly as a function of the degree of the transform applied to the sample; e.g., more aggressive image crop augmentations produce less confident learning targets.
Results: Experiments show that soft targets allow more aggressive data augmentation, offer more robust performance boosts, work with other augmentation policies, and produce better-calibrated models. Combined with existing aggressive augmentation strategies, soft targets double the top-1 accuracy boost on Cifar-10, Cifar-100, ImageNet-1K, and ImageNet-V2, improve model occlusion performance by up to 4x, and halve the expected calibration error (ECE). Finally, soft augmentation is shown to generalize to self-supervised classification tasks.

Modern neural networks are over-parameterized and thus rely on strong regularization such as data augmentation and weight decay to reduce overfitting and improve generalization. The dominant form of data augmentation applies invariant transforms, where the learning target of a sample is invariant to the transform applied to that sample. We draw inspiration from human visual classification studies and propose generalizing augmentation with invariant transforms to soft augmentation where the learning target softens non-linearly as a function of the degree of the transform applied to the sample: e.g., more aggressive image crop augmentations produce less confident learning targets. We demonstrate that soft targets allow for more aggressive data augmentation, offer more robust performance boosts, work with other augmentation policies, and interestingly, produce better calibrated models (since they are trained to be less confident on aggressively cropped/occluded examples). Combined with existing aggressive augmentation strategies, soft targets 1) double the top-1 accuracy boost across Cifar-10, Cifar-100, ImageNet-1K, and ImageNet-V2, 2) improve model occlusion performance by up to 4x, and 3) half the expected calibration error (ECE). Finally, we show that soft augmentation generalizes to self-supervised classification tasks.
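The core mechanism — a learning target that softens non-linearly with augmentation aggressiveness — can be sketched with a simple power schedule. The schedule, exponent `k`, and chance-level floor below are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def soft_target(num_classes, label, occluded_frac, k=2.0, floor=None):
    """Soften a one-hot target as a function of augmentation aggressiveness.

    Confidence decays non-linearly (a hypothetical power schedule) with the
    fraction of the image removed by the crop; the remaining probability mass
    is spread uniformly over the other classes.
    """
    if floor is None:
        floor = 1.0 / num_classes  # fully occluded -> chance-level target
    conf = floor + (1.0 - floor) * (1.0 - occluded_frac) ** k
    target = np.full(num_classes, (1.0 - conf) / (num_classes - 1))
    target[label] = conf
    return target
```

An uncropped image keeps its one-hot target, while a fully occluded crop collapses to the uniform (chance-level) distribution — which is why models trained this way end up better calibrated on aggressively occluded inputs.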

Probabilistic Knowledge Distillation of Face Ensembles
Xu, JianqingandLi, ShenandDeng, AilinandXiong, MiaoandWu, JiayingandWu, JiaxiangandDing, ShouhongandHooi, Bryan



Research question: How to formalize mean ensembling for open-set face recognition and distill its uncertainty-estimation capability into an efficient student model.
Motivation: Mean ensembling (averaging predictions from multiple models) improves over each individual model, but existing methods cannot evaluate the uncertainty of a face image or retain the ensemble's benefits without the cost of ensemble inference.
Method: Mean ensembling is formalized as feature alignment for ensembles in open-set face recognition and generalized into Bayesian Ensemble Averaging (BEA) through probabilistic modeling; the uncertainty of a face image can be decomposed into aleatoric and epistemic components, and a student model, BEA-KD, distills knowledge from BEA by mimicking the overall behavior of the ensemble members.
Results: BEA-KD inherits BEA's uncertainty-estimation capability without losing inference efficiency and consistently outperforms state-of-the-art knowledge distillation methods on various challenging benchmarks.

Mean ensemble (i.e. averaging predictions from multiple models) is a commonly-used technique in machine learning that improves the performance of each individual model. We formalize it as feature alignment for ensemble in open-set face recognition and generalize it into Bayesian Ensemble Averaging (BEA) through the lens of probabilistic modeling. This generalization brings up two practical benefits that existing methods could not provide: (1) the uncertainty of a face image can be evaluated and further decomposed into aleatoric uncertainty and epistemic uncertainty, the latter of which can be used as a measure for out-of-distribution detection of faceness; (2) a BEA statistic provably reflects the aleatoric uncertainty of a face image, acting as a measure for face image quality to improve recognition performance. To inherit the uncertainty estimation capability from BEA without the loss of inference efficiency, we propose BEA-KD, a student model to distill knowledge from BEA. BEA-KD mimics the overall behavior of ensemble members and consistently outperforms SOTA knowledge distillation methods on various challenging benchmarks.
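The decomposition of a face image's uncertainty into aleatoric and epistemic parts can be sketched with the law of total variance over ensemble members that each predict a Gaussian embedding. This is a simplified stand-in for the BEA decomposition described above; the shapes and reduction to scalars are assumptions for illustration.

```python
import numpy as np

def decompose_uncertainty(mus, vars_):
    """Decompose ensemble uncertainty for one image (law of total variance).

    mus, vars_: arrays of shape (n_members, dim) -- each ensemble member
    predicts a Gaussian embedding. Returns (aleatoric, epistemic) scalars:
    the expected within-member variance, and the variance of member means.
    """
    aleatoric = vars_.mean(axis=0).mean()  # mean of member variances
    epistemic = mus.var(axis=0).mean()     # variance of member means
    return aleatoric, epistemic
```

When all members agree (identical means), the epistemic term vanishes and only the data-inherent (aleatoric) uncertainty remains — the component the abstract links to face image quality.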

Twin Contrastive Learning With Noisy Labels
Huang, ZhizhongandZhang, JunpingandShan, Hongming



Research question: How to learn from noisily labeled data in order to improve model performance.
Motivation: Noisy labels significantly degrade model performance; robust representations are needed to detect and handle wrongly labeled examples during classification.
Method: TCL, a novel twin contrastive learning model, constructs a Gaussian mixture model (GMM) over the representations by injecting the supervised model predictions into the GMM, linking label-free latent variables with label-noisy annotations so as to learn robust representations and handle noisy labels for classification.
Results: Experiments show that TCL achieves superior performance on several standard benchmarks and real-world datasets; in particular, TCL achieves a 7.5% improvement on CIFAR-10 with 90% label noise, an extremely noisy scenario.

Learning from noisy data is a challenging task that significantly degenerates the model performance. In this paper, we present TCL, a novel twin contrastive learning model to learn robust representations and handle noisy labels for classification. Specifically, we construct a Gaussian mixture model (GMM) over the representations by injecting the supervised model predictions into GMM to link label-free latent variables in GMM with label-noisy annotations. Then, TCL detects the examples with wrong labels as the out-of-distribution examples by another two-component GMM, taking into account the data distribution. We further propose a cross-supervision with an entropy regularization loss that bootstraps the true targets from model predictions to handle the noisy labels. As a result, TCL can learn discriminative representations aligned with estimated labels through mixup and contrastive learning. Extensive experimental results on several standard benchmarks and real-world datasets demonstrate the superior performance of TCL. In particular, TCL achieves 7.5% improvements on CIFAR-10 with 90% noisy label---an extremely noisy scenario. The source code is available at https://github.com/Hzzone/TCL.
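The step of detecting wrongly labeled examples with a two-component GMM can be sketched on per-sample losses: clean samples cluster at low loss, noisy ones at high loss, and the posterior of the high-mean component flags likely label noise. This is a generic stand-in (a plain 1-D EM fit) for TCL's GMM over representations, with illustrative initialization.

```python
import numpy as np

def fit_two_gmm(losses, iters=50):
    """Fit a two-component 1-D GMM to per-sample losses with EM.

    Returns each sample's posterior probability of belonging to the
    high-mean (likely noisy-label) component.
    """
    x = np.asarray(losses, float)
    mu = np.array([x.min(), x.max()])          # spread initialization
    var = np.array([x.var() + 1e-6] * 2)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibilities under each Gaussian
        dens = pi / np.sqrt(2 * np.pi * var) * np.exp(
            -((x[:, None] - mu) ** 2) / (2 * var))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, variances
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / len(x)
    return resp[:, np.argmax(mu)]  # P(noisy) = posterior of high-mean comp.
```

Samples with high posterior can then be treated as out-of-distribution and re-targeted via bootstrapped predictions, as the abstract describes.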

Density-Insensitive Unsupervised Domain Adaption on 3D Object Detection
Hu, QianjiangandLiu, DaizongandHu, Wei



Research question: How to reduce the domain gap in 3D object detection caused by differing beam densities and improve model generalization.
Motivation: Existing methods handle the varying-beam-density problem poorly, suffer from expensive annotation costs, and transfer badly to unknown data.
Method: A density-insensitive domain adaptation framework is proposed: Random Beam Re-Sampling (RBRS) enhances the robustness of 3D detectors trained on the source domain to beam-density variation, and a task-specific teacher-student framework is designed to predict high-quality pseudo-labels.
Results: Experiments show that the method outperforms state-of-the-art approaches on three widely adopted 3D object detection datasets, especially on varying-density data.

3D object detection from point clouds is crucial in safety-critical autonomous driving. Although many works have made great efforts and achieved significant progress on this task, most of them suffer from expensive annotation cost and poor transferability to unknown data due to the domain gap. Recently, few works attempt to tackle the domain gap in objects, but still fail to adapt to the gap of varying beam-densities between two domains, which is critical to mitigate the characteristic differences of the LiDAR collectors. To this end, we make the attempt to propose a density-insensitive domain adaption framework to address the density-induced domain gap. In particular, we first introduce Random Beam Re-Sampling (RBRS) to enhance the robustness of 3D detectors trained on the source domain to the varying beam-density. Then, we take this pre-trained detector as the backbone model, and feed the unlabeled target domain data into our newly designed task-specific teacher-student framework for predicting its high-quality pseudo labels. To further adapt the property of density-insensitive into the target domain, we feed the teacher and student branches with the same sample of different densities, and propose an Object Graph Alignment (OGA) module to construct two object-graphs between the two branches for enforcing the consistency in both the attribute and relation of cross-density objects. Experimental results on three widely adopted 3D object detection datasets demonstrate that our proposed domain adaption method outperforms the state-of-the-art methods, especially over varying-density data. Code is available at https://github.com/WoodwindHu/DTS.
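The RBRS augmentation can be sketched as randomly dropping whole LiDAR beams (rings) from a point cloud so the detector is exposed to varying densities during source training. This is a minimal sketch under stated assumptions: per-point beam indices are given, and the keep ratio is an illustrative parameter rather than the paper's sampling scheme.

```python
import numpy as np

def random_beam_resample(points, beam_ids, keep_ratio=0.5, rng=None):
    """RBRS-style augmentation sketch: randomly drop whole LiDAR beams.

    points: (N, 3) point cloud; beam_ids: (N,) integer beam (ring) index per
    point. Keeps a random subset of beams, simulating a lower-density sensor.
    """
    rng = np.random.default_rng(rng)
    beams = np.unique(beam_ids)
    n_keep = max(1, int(round(keep_ratio * len(beams))))
    kept = rng.choice(beams, size=n_keep, replace=False)
    mask = np.isin(beam_ids, kept)
    return points[mask], beam_ids[mask]
```

Feeding the teacher and student branches the same scene at different resampled densities, as the abstract describes, then lets a consistency loss enforce density-insensitive predictions.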

On-the-Fly Category Discovery
Du, RuoyiandChang, DongliangandLiang, KongmingandHospedales, TimothyandSong, Yi-ZheandMa, Zhanyu



Research question: Although machines have surpassed humans on visual recognition problems, they are still limited to providing closed-set answers; unlike machines, humans can cognize novel categories at first observation.
Motivation: Novel category discovery (NCD) techniques, which transfer knowledge from seen categories to distinguish unseen ones, aim to bridge this gap. However, current NCD methods assume a transductive learning and offline inference paradigm, which restricts them to a pre-defined query set and renders them unable to deliver instant feedback.
Method: This paper studies on-the-fly category discovery (OCD), which makes the model instantaneously aware of novel category samples (i.e., enabling inductive learning and streaming inference). A hash coding-based expandable recognition model is first designed as a practical baseline; then, noticing the sensitivity of hash codes to intra-category variance, a Sign-Magnitude dIsentangLEment (SMILE) architecture is further proposed to alleviate the disturbance this brings.
Results: Experiments demonstrate the superiority of SMILE over the baseline model and prior art. The code will be released at https://github.com/PRIS-CV/On-the-fly-Category-Discovery.

Although machines have surpassed humans on visual recognition problems, they are still limited to providing closed-set answers. Unlike machines, humans can cognize novel categories at the first observation. Novel category discovery (NCD) techniques, transferring knowledge from seen categories to distinguish unseen categories, aim to bridge the gap. However, current NCD methods assume a transductive learning and offline inference paradigm, which restricts them to a pre-defined query set and renders them unable to deliver instant feedback. In this paper, we study on-the-fly category discovery (OCD) aimed at making the model instantaneously aware of novel category samples (i.e., enabling inductive learning and streaming inference). We first design a hash coding-based expandable recognition model as a practical baseline. Afterwards, noticing the sensitivity of hash codes to intra-category variance, we further propose a novel Sign-Magnitude dIsentangLEment (SMILE) architecture to alleviate the disturbance it brings. Our experimental results demonstrate the superiority of SMILE against our baseline model and prior art. Our code is available at https://github.com/PRIS-CV/On-the-fly-Category-Discovery.
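The sign-magnitude intuition can be sketched directly: take the sign pattern of a feature vector as the category hash code, and keep the magnitudes separately so intra-category variance does not flip the code. This is a loose illustration of the idea behind SMILE, not the actual architecture.

```python
import numpy as np

def sign_magnitude_split(features):
    """Disentangle a feature vector into a sign-based hash code and magnitudes.

    The binary code (the category descriptor) is invariant to magnitude
    changes within a category, which is the intuition the SMILE architecture
    builds on; the magnitude part absorbs the intra-category variance.
    """
    code = (features > 0).astype(np.int8)  # sign part -> hash code
    magnitude = np.abs(features)           # magnitude part -> variance
    return code, magnitude
```

Two samples of the same category whose features differ only in magnitude then map to the same code, so streaming inference can assign (or open) a category bucket per unseen hash code.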

Test Time Adaptation With Regularized Loss for Weakly Supervised Salient Object Detection
Veksler, Olga



Research question: How to deal with convolutional neural networks (CNNs) overfitting to their training data.
Motivation: CNNs tend to overfit the training data, and test-time adaptation is an extreme remedy for overfitting; the main difficulty is that ground-truth labels are unavailable.
Method: A test-time Salient Object Detection (SOD) approach is proposed based on a regularized loss function, which can train a CNN when pixel-precise ground truth is unavailable. The regularized loss tends to have lower values for more likely object segments, so it can be used to fine-tune an already trained CNN to a given test image.
Results: A regularized loss function particularly suitable for test-time adaptation is developed, and the approach is shown to significantly outperform prior work on weakly supervised SOD.

It is well known that CNNs tend to overfit to the training data. Test-time adaptation is an extreme approach to deal with overfitting: given a test image, the aim is to adapt the trained model to that image. Indeed nothing can be closer to the test data than the test image itself. The main difficulty of test-time adaptation is that the ground truth is not available. Thus test-time adaptation, while intriguing, applies to only a few scenarios where one can design an effective loss function that does not require ground truth. We propose the first approach for test-time Salient Object Detection (SOD) in the context of weak supervision. Our approach is based on a so called regularized loss function, which can be used for training CNN when pixel precise ground truth is unavailable. Regularized loss tends to have lower values for the more likely object segments, and thus it can be used to fine-tune an already trained CNN to a given test image, adapting to images unseen during training. We develop a regularized loss function particularly suitable for test-time adaptation and show that our approach significantly outperforms prior work for weakly supervised SOD.

Guiding Pseudo-Labels With Uncertainty Estimation for Source-Free Unsupervised Domain Adaptation
Litrico, MattiaandDelBue, AlessioandMorerio, Pietro



Research question: This paper addresses unsupervised domain adaptation (UDA) when source data are unavailable, i.e., Source-free Unsupervised Domain Adaptation (SF-UDA).
Motivation: In many practical applications the source data cannot be accessed, so studying domain adaptation without source data is of substantial practical importance.
Method: A novel approach based on a loss-reweighting strategy is proposed: the reliability of pseudo-labels is measured by estimating their uncertainty, and the classification loss is reweighted accordingly. A self-supervised contrastive framework is leveraged as a target-space regularizer to enhance this knowledge aggregation, and a novel negative-pair exclusion strategy identifies and excludes pairs of samples sharing the same class.
Results: The method outperforms previous approaches on three major benchmarks by a large margin: +1.8% on both VisDA-C and DomainNet, and on PACS +12.3% in the single-source setting and +6.6% in multi-target adaptation. Additional analyses show that the method is robust to noise and produces more accurate pseudo-labels than state-of-the-art approaches.

Standard Unsupervised Domain Adaptation (UDA) methods assume the availability of both source and target data during the adaptation. In this work, we investigate Source-free Unsupervised Domain Adaptation (SF-UDA), a specific case of UDA where a model is adapted to a target domain without access to source data. We propose a novel approach for the SF-UDA setting based on a loss reweighting strategy that brings robustness against the noise that inevitably affects the pseudo-labels. The classification loss is reweighted based on the reliability of the pseudo-labels that is measured by estimating their uncertainty. Guided by such reweighting strategy, the pseudo-labels are progressively refined by aggregating knowledge from neighbouring samples. Furthermore, a self-supervised contrastive framework is leveraged as a target space regulariser to enhance such knowledge aggregation. A novel negative pairs exclusion strategy is proposed to identify and exclude negative pairs made of samples sharing the same class, even in presence of some noise in the pseudo-labels. Our method outperforms previous methods on three major benchmarks by a large margin. We set the new SF-UDA state-of-the-art on VisDA-C and DomainNet with a performance gain of +1.8% on both benchmarks and on PACS with +12.3% in the single-source setting and +6.6% in multi-target adaptation. Additional analyses demonstrate that the proposed approach is robust to the noise, which results in significantly more accurate pseudo-labels compared to state-of-the-art approaches.
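The reweighting step can be sketched with a simple certainty proxy: weight each sample's classification loss by one minus the normalized entropy of its predicted distribution, so unreliable pseudo-labels contribute less. This is an assumed proxy for the paper's uncertainty estimator, shown only to illustrate the mechanism.

```python
import numpy as np

def pseudolabel_weights(probs):
    """Per-sample loss weights from pseudo-label certainty.

    probs: (N, C) softmax outputs. Weight = 1 - normalized entropy, in [0, 1]:
    confident predictions get weight near 1, near-uniform ones near 0. A
    simple stand-in for the uncertainty-based reweighting described above.
    """
    p = np.clip(probs, 1e-12, 1.0)
    ent = -(p * np.log(p)).sum(axis=1)
    return 1.0 - ent / np.log(p.shape[1])
```

These weights would multiply the per-sample classification loss; the progressive refinement and contrastive regularization in the paper then operate on top of this weighting.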

Generalizable Local Feature Pre-Training for Deformable Shape Analysis
Attaiki, SouhaibandLi, LeiandOvsjanikov, Maks



Research question: How to use pre-trained features for 3D shape analysis, especially when handling new classes such as deformable organic shapes.
Motivation: Existing transfer learning approaches typically operate on entire 3D objects or even scenes and fail to generalize to new classes such as deformable organic shapes; moreover, it is poorly understood what makes pre-trained features transferable across significantly different 3D shape categories.
Method: The link between feature locality and task transferability is analyzed while comparing different backbones and loss functions for local feature pre-training, and a differentiable method is proposed for optimizing the receptive field size within 3D transfer learning.
Results: Experiments show that the approach successfully generalizes to unseen shape classes such as humans and animals, and achieves state-of-the-art results on downstream tasks including segmentation, shape correspondence, and classification.

Transfer learning is fundamental for addressing problems in settings with little training data. While several transfer learning approaches have been proposed in 3D, unfortunately, these solutions typically operate on an entire 3D object or even scene-level and thus, as we show, fail to generalize to new classes, such as deformable organic shapes. In addition, there is currently a lack of understanding of what makes pre-trained features transferable across significantly different 3D shape categories. In this paper, we make a step toward addressing these challenges. First, we analyze the link between feature locality and transferability in tasks involving deformable 3D objects, while also comparing different backbones and losses for local feature pre-training. We observe that with proper training, learned features can be useful in such tasks, but, crucially, only with an appropriate choice of the receptive field size. We then propose a differentiable method for optimizing the receptive field within 3D transfer learning. Jointly, this leads to the first learnable features that can successfully generalize to unseen classes of 3D shapes such as humans and animals. Our extensive experiments show that this approach leads to state-of-the-art results on several downstream tasks such as segmentation, shape correspondence, and classification. Our code is available at https://github.com/pvnieo/vader.

Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval
Hao, XiaoshuaiandZhang, WanqianandWu, DayanandZhu, FeiandLi, Bo



Research question: This paper addresses the challenging task of Unsupervised Domain Adaptation Video-text Retrieval (UDAVR), where training and testing data come from different distributions.
Motivation: Previous works merely alleviate the domain shift while overlooking the pairwise misalignment issue in the target domain, i.e., there exist no semantic relationships between target videos and texts.
Method: A novel method named Dual Alignment Domain Adaptation (DADA) is proposed. Cross-modal semantic embedding is first introduced to generate discriminative source features in a joint embedding space, and video and text domain adaptations are used to smoothly balance the minimization of the domain shifts. To tackle the pairwise misalignment in the target domain, Dual Alignment Consistency (DAC) is introduced to fully exploit the semantic information of both modalities in the target domain.
Results: Compared with state-of-the-art methods, DADA achieves 20.18% and 18.61% relative improvements on R@1 under the TGIF->MSRVTT and TGIF->MSVD settings respectively, demonstrating the superiority of the method.

Video-text retrieval is an emerging stream in both computer vision and natural language processing communities, which aims to find relevant videos given text queries. In this paper, we study the notoriously challenging task, i.e., Unsupervised Domain Adaptation Video-text Retrieval (UDAVR), wherein training and testing data come from different distributions. Previous works merely alleviate the domain shift, which however overlook the pairwise misalignment issue in target domain, i.e., there exist no semantic relationships between target videos and texts. To tackle this, we propose a novel method named Dual Alignment Domain Adaptation (DADA). Specifically, we first introduce the cross-modal semantic embedding to generate discriminative source features in a joint embedding space. Besides, we utilize the video and text domain adaptations to smoothly balance the minimization of the domain shifts. To tackle the pairwise misalignment in target domain, we introduce the Dual Alignment Consistency (DAC) to fully exploit the semantic information of both modalities in target domain. The proposed DAC adaptively aligns the video-text pairs which are more likely to be relevant in target domain, so that positive pairs increase progressively and noisy ones are potentially aligned in later stages. To that end, our method can generate more truly aligned target pairs and ensure the discriminability of target features. Compared with the state-of-the-art methods, DADA achieves 20.18% and 18.61% relative improvements on R@1 under the setting of TGIF->MSRVTT and TGIF->MSVD respectively, demonstrating the superiority of our method.

Hubs and Hyperspheres: Reducing Hubness and Improving Transductive Few-Shot Learning With Hyperspherical Embeddings
Trosten, DanielJ.andChakraborty, RwiddhiandL{\o



Research question: How to eliminate the hubness problem in transductive few-shot learning (FSL), where, due to the high dimensionality of image representations, a few points (hubs) occur frequently in the nearest-neighbour lists of many other points.
Motivation: Hubness negatively impacts distance-based classification: when hubs from one class frequently appear among the nearest neighbours of points from another class, the classifier's performance degrades.
Method: It is first proved that hubness can be eliminated by distributing representations uniformly on the hypersphere. Two new approaches for embedding representations on the hypersphere are then proposed, which provably optimize a tradeoff between uniformity and local similarity preservation, reducing hubness while retaining class structure.
Results: Experiments show that the proposed methods reduce hubness and significantly improve transductive FSL accuracy for a wide range of classifiers.

Distance-based classification is frequently used in transductive few-shot learning (FSL). However, due to the high-dimensionality of image representations, FSL classifiers are prone to suffer from the hubness problem, where a few points (hubs) occur frequently in multiple nearest neighbour lists of other points. Hubness negatively impacts distance-based classification when hubs from one class appear often among the nearest neighbors of points from another class, degrading the classifier's performance. To address the hubness problem in FSL, we first prove that hubness can be eliminated by distributing representations uniformly on the hypersphere. We then propose two new approaches to embed representations on the hypersphere, which we prove optimize a tradeoff between uniformity and local similarity preservation -- reducing hubness while retaining class structure. Our experiments show that the proposed methods reduce hubness, and significantly improves transductive FSL accuracy for a wide range of classifiers.
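Hubness is commonly quantified by the skewness of the k-occurrence distribution: count how often each point appears in other points' k-nearest-neighbour lists, and measure the right skew of those counts. The sketch below uses this standard diagnostic as an assumed stand-in for the paper's analysis, not its exact measure.

```python
import numpy as np

def k_occurrence_skewness(X, k=3):
    """Measure hubness as the skewness of the k-occurrence distribution.

    N_k(i) counts how often point i appears in other points' k-nearest-
    neighbour lists; heavy right skew indicates hubs. Embeddings spread
    uniformly on the hypersphere should reduce this skew.
    """
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude self-neighbours
    knn = np.argsort(d, axis=1)[:, :k]           # each row: its k neighbours
    counts = np.bincount(knn.ravel(), minlength=len(X)).astype(float)
    m, s = counts.mean(), counts.std()
    return ((counts - m) ** 3).mean() / (s ** 3 + 1e-12)
```

Comparing this statistic before and after projecting representations onto the hypersphere with the proposed embeddings is one way to verify the claimed hubness reduction.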

Architecture, Dataset and Model-Scale Agnostic Data-Free Meta-Learning
Hu, ZixuanandShen, LiandWang, ZhenyiandLiu, TongliangandYuan, ChunandTao, Dacheng



Research question: Existing data-free meta-learning only solves the problem in parameter space, ignoring the data knowledge contained in pre-trained models, failing to scale to large pre-trained models, and only meta-learning pre-trained models that share the same network architecture.
Motivation: To address these issues, a unified framework named PURER is proposed, comprising ePisode cUrriculum inveRsion (ECI) during data-free meta-training and invErsion calibRation following inner loop (ICFIL) during meta-testing.
Method: During meta-training, ECI performs pseudo-episode training for learning to adapt fast to new unseen tasks: a sequence of pseudo-episodes is progressively synthesized by distilling the training data from each pre-trained model, and ECI adaptively increases the difficulty level of the pseudo-episodes according to real-time feedback from the meta-model; the meta-training optimization with ECI is formulated as an end-to-end adversarial form. During meta-testing, ICFIL, a simple plug-and-play supplement used only at meta-test time, narrows the gap between the meta-training and meta-testing task distributions.
Results: Extensive experiments show superior performance in various real-world scenarios.

The goal of data-free meta-learning is to learn useful prior knowledge from a collection of pre-trained models without accessing their training data. However, existing works only solve the problem in parameter space, which (i) ignore the fruitful data knowledge contained in the pre-trained models; (ii) can not scale to large-scale pre-trained models; (iii) can only meta-learn pre-trained models with the same network architecture. To address those issues, we propose a unified framework, dubbed PURER, which contains: (1) ePisode cUrriculum inveRsion (ECI) during data-free meta training; and (2) invErsion calibRation following inner loop (ICFIL) during meta testing. During meta training, we propose ECI to perform pseudo episode training for learning to adapt fast to new unseen tasks. Specifically, we progressively synthesize a sequence of pseudo episodes by distilling the training data from each pre-trained model. The ECI adaptively increases the difficulty level of pseudo episodes according to the real-time feedback of the meta model. We formulate the optimization process of meta training with ECI as an adversarial form in an end-to-end manner. During meta testing, we further propose a simple plug-and-play supplement--ICFIL--only used during meta testing to narrow the gap between meta training and meta testing task distribution. Extensive experiments in various real-world scenarios show the superior performance of ours.

On the Stability-Plasticity Dilemma of Class-Incremental Learning
Kim, DongwanandHan, Bohyung



Research question: This paper addresses the balance between stability and plasticity in class-incremental learning: a model should be stable enough to retain knowledge learned from previously seen classes, yet plastic enough to learn concepts from new classes.
Motivation: Although prior work demonstrates strong performance on class-incremental benchmarks, it is unclear whether that success comes from the models being stable, plastic, or a mixture of both.
Method: Analytical tools are established to measure the stability and plasticity of feature representations, and these tools are used to study models trained with various algorithms on large-scale class-incremental benchmarks.
Results: Surprisingly, most class-incremental learning algorithms heavily favor stability over plasticity, to the extent that the feature extractor of a model trained on the initial set of classes is no less effective than that of the final incremental model. These observations not only inspire two simple algorithms that highlight the importance of feature representation analysis, but also suggest that class-incremental learning methods should strive for better feature representation learning.

A primary goal of class-incremental learning is to strike a balance between stability and plasticity, where models should be both stable enough to retain knowledge learned from previously seen classes, and plastic enough to learn concepts from new classes. While previous works demonstrate strong performance on class-incremental benchmarks, it is not clear whether their success comes from the models being stable, plastic, or a mixture of both. This paper aims to shed light on how effectively recent class-incremental learning algorithms address the stability-plasticity trade-off. We establish analytical tools that measure the stability and plasticity of feature representations, and employ such tools to investigate models trained with various algorithms on large-scale class-incremental benchmarks. Surprisingly, we find that the majority of class-incremental learning algorithms heavily favor stability over plasticity, to the extent that the feature extractor of a model trained on the initial set of classes is no less effective than that of the final incremental model. Our observations not only inspire two simple algorithms that highlight the importance of feature representation analysis, but also suggest that class-incremental learning approaches, in general, should strive for better feature representation learning.

Generalization Matters: Loss Minima Flattening via Parameter Hybridization for Efficient Online Knowledge Distillation
Zhang, TianliandXue, MengqiandZhang, JiangtaoandZhang, HaofeiandWang, YuandCheng, LechaoandSong, JieandSong, Mingli



Research question: Existing online knowledge distillation techniques typically require sophisticated modules to produce diverse knowledge for improving students' generalization ability.
Motivation: This paper aims to fully exploit multi-model settings, rather than carefully designed modules, to achieve a distillation effect with excellent generalization performance.
Method: By linearly weighting the student models' parameters in each training batch, a Hybrid-Weight Model (HWM) is constructed to represent the parameters surrounding the involved students. The HWM loss is integrated into the students' training, and a novel online knowledge distillation framework via parameter hybridization (OKDPH) is proposed to promote flatter minima and obtain robust solutions.
Results: Compared with state-of-the-art online knowledge distillation methods and methods seeking flat minima, OKDPH achieves higher performance with fewer parameters, giving online knowledge distillation lightweight and robust characteristics.

Most existing online knowledge distillation (OKD) techniques typically require sophisticated modules to produce diverse knowledge for improving students' generalization ability. In this paper, we strive to fully utilize multi-model settings instead of well-designed modules to achieve a distillation effect with excellent generalization performance. Generally, model generalization can be reflected in the flatness of the loss landscape. Since averaging parameters of multiple models can find flatter minima, we are inspired to extend the process to the sampled convex combinations of multi-student models in OKD. Specifically, by linearly weighting students' parameters in each training batch, we construct a Hybrid-Weight Model (HWM) to represent the parameters surrounding involved students. The supervision loss of HWM can estimate the landscape's curvature of the whole region around students to measure the generalization explicitly. Hence we integrate HWM's loss into students' training and propose a novel OKD framework via parameter hybridization (OKDPH) to promote flatter minima and obtain robust solutions. Considering the redundancy of parameters could lead to the collapse of HWM, we further introduce a fusion operation to keep the high similarity of students. Compared to the state-of-the-art (SOTA) OKD methods and SOTA methods of seeking flat minima, our OKDPH achieves higher performance with fewer parameters, benefiting OKD with lightweight and robust characteristics. Our code is publicly available at https://github.com/tianlizhang/OKDPH.
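The Hybrid-Weight Model construction — a sampled convex combination of the students' parameters in each batch — can be sketched directly on flattened parameter vectors. The Dirichlet sampling of convex weights below is an illustrative choice for "sampled convex combinations", not necessarily the paper's sampler.

```python
import numpy as np

def hybrid_weight_model(student_params, rng=None):
    """Sample a Hybrid-Weight Model (HWM) from several students.

    student_params: list of flattened parameter arrays, one per student.
    Draws random convex weights (Dirichlet, so they are non-negative and
    sum to 1) and returns the corresponding parameter combination.
    """
    rng = np.random.default_rng(rng)
    alphas = rng.dirichlet(np.ones(len(student_params)))  # convex weights
    return sum(a * p for a, p in zip(alphas, student_params))
```

The supervision loss evaluated at this hybrid point probes the loss landscape around the students; adding it to their training pushes the ensemble toward flatter minima.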

Gaussian Label Distribution Learning for Spherical Image Object Detection
Xu, HangandLiu, XinyuanandZhao, QiangandMa, YikeandYan, ChenggangandDai, Feng



Research question: Existing spherical image detectors regress spherical bounding boxes with ln-norm losses, which suffer from independent optimization of parameters and inconsistency between the metric (dominated by IoU) and the loss.
Motivation: These problems also exist in planar image detection but are more severe in spherical image detection. Because the Spherical IoU (SphIoU) is non-differentiable, existing solutions based on IoU losses and their variants cannot be applied to spherical image object detection.
Method: A simple but effective regression loss based on Gaussian Label Distribution Learning (GLDL) is designed for spherical image object detection. Moreover, since object scales in spherical images vary greatly and the large differences among objects of different categories make SphIoU-based sample selection challenging, GLDL-ATSS is proposed as a better training-sample selection strategy for spherical images, alleviating the scale-sample imbalance of IoU-threshold-based strategies.
Results: Extensive experiments on two datasets with different baseline detectors show the effectiveness of the approach.

Spherical image object detection emerges in many applications from virtual reality to robotics and automatic driving, while many existing detectors use ln-norms loss for regression of spherical bounding boxes. There are two intrinsic flaws for ln-norms loss, i.e., independent optimization of parameters and inconsistency between metric (dominated by IoU) and loss. These problems are common in planar image detection but more significant in spherical image detection. Solution for these problems has been extensively discussed in planar image detection by using IoU loss and related variants. However, these solutions cannot be migrated to spherical image object detection due to the non-differentiability of the Spherical IoU (SphIoU). In this paper, we design a simple but effective regression loss based on Gaussian Label Distribution Learning (GLDL) for spherical image object detection. Besides, we observe that the scale of the object in a spherical image varies greatly. The huge differences among objects from different categories make the sample selection strategy based on SphIoU challenging. Therefore, we propose GLDL-ATSS as a better training sample selection strategy for objects of the spherical image, which can alleviate the drawback of IoU threshold-based strategy of scale-sample imbalance. Extensive results on two datasets with different baseline detectors show the effectiveness of our approach.

On the Effects of Self-Supervision and Contrastive Alignment in Deep Multi-View Clustering
Trosten, DanielJ.andL{\o



Research question: This paper addresses the uneven development of self-supervision-based methods in deep multi-view clustering (MVC).
Motivation: The authors find large variations in how self-supervision is developed for deep MVC, which may slow the progress of the field.
Method: A unified deep MVC framework, DeepMVC, is presented that includes many recent methods as instances. The framework is used to make key observations about the effect of self-supervision, in particular the drawbacks of aligning representations with contrastive learning.
Results: Experiments show that (i) in line with the theoretical findings, contrastive alignment decreases performance on datasets with many views; (ii) all methods benefit from some form of self-supervision; and (iii) the new instances outperform previous methods on several datasets. Based on these results, several promising directions for future research are suggested, and to enhance the openness of the field, an open-source DeepMVC implementation is provided, including recent models and the new instances.

Self-supervised learning is a central component in recent approaches to deep multi-view clustering (MVC). However, we find large variations in the development of self-supervision-based methods for deep MVC, potentially slowing the progress of the field. To address this, we present DeepMVC, a unified framework for deep MVC that includes many recent methods as instances. We leverage our framework to make key observations about the effect of self-supervision, and in particular, drawbacks of aligning representations with contrastive learning. Further, we prove that contrastive alignment can negatively influence cluster separability, and that this effect becomes worse when the number of views increases. Motivated by our findings, we develop several new DeepMVC instances with new forms of self-supervision. We conduct extensive experiments and find that (i) in line with our theoretical findings, contrastive alignments decreases performance on datasets with many views; (ii) all methods benefit from some form of self-supervision; and (iii) our new instances outperform previous methods on several datasets. Based on our results, we suggest several promising directions for future research. To enhance the openness of the field, we provide an open-source implementation of DeepMVC, including recent models and our new instances. Our implementation includes a consistent evaluation protocol, facilitating fair and accurate evaluation of methods and components.

DARE-GRAM: Unsupervised Domain Adaptation Regression by Aligning Inverse Gram Matrices
Nejjar, IsmailandWang, QinandFink, Olga



Research question: This paper addresses unsupervised Domain Adaptation Regression (DAR): bridging the domain gap between a labeled source dataset and an unlabeled target dataset for regression problems.
Motivation: Existing methods mostly learn a deep feature encoder by minimizing the discrepancy between source and target features. The authors instead offer a new perspective by analyzing the closed-form ordinary least squares (OLS) solution of the linear regressor in the deep domain adaptation context.
Method: Rather than aligning the original feature embedding space, the inverse Gram matrix of the features is aligned, motivated by its presence in the OLS solution and by the Gram matrix's ability to capture feature correlations. Specifically, a simple yet effective DAR method is proposed that leverages the pseudo-inverse low-rank property to align scale and angle in a selected subspace generated by the pseudo-inverse Gram matrices of the two domains.
Results: Experiments show state-of-the-art performance on three domain adaptation regression benchmarks.

Unsupervised Domain Adaptation Regression (DAR) aims to bridge the domain gap between a labeled source dataset and an unlabelled target dataset for regression problems. Recent works mostly focus on learning a deep feature encoder by minimizing the discrepancy between source and target features. In this work, we present a different perspective for the DAR problem by analyzing the closed-form ordinary least square (OLS) solution to the linear regressor in the deep domain adaptation context. Rather than aligning the original feature embedding space, we propose to align the inverse Gram matrix of the features, which is motivated by its presence in the OLS solution and the Gram matrix's ability to capture the feature correlations. Specifically, we propose a simple yet effective DAR method which leverages the pseudo-inverse low-rank property to align the scale and angle in a selected subspace generated by the pseudo-inverse Gram matrix of the two domains. We evaluate our method on three domain adaptation regression benchmarks. Experimental results demonstrate that our method achieves state-of-the-art performance. Our code is available at https://github.com/ismailnejjar/DARE-GRAM.
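The motivation comes from the OLS solution beta = (X^T X)^+ X^T y, which contains the (pseudo-)inverse Gram matrix: aligning that matrix across domains aligns the quantity the regressor actually depends on. The sketch below computes simplified scale and angle discrepancies between the two pseudo-inverse Gram matrices; the paper additionally restricts the comparison to a selected low-rank subspace, which this sketch omits.

```python
import numpy as np

def gram_alignment_terms(Xs, Xt):
    """Scale and angle gaps between pseudo-inverse Gram matrices.

    Xs, Xt: (n, d) source/target feature batches. Returns a simplified
    DARE-GRAM-style pair of alignment terms: the difference of Frobenius
    norms (scale) and one minus the matrix cosine similarity (angle).
    """
    Gs = np.linalg.pinv(Xs.T @ Xs)  # appears in the OLS solution:
    Gt = np.linalg.pinv(Xt.T @ Xt)  # beta = (X^T X)^+ X^T y
    scale_gap = abs(np.linalg.norm(Gs) - np.linalg.norm(Gt))
    cos = (Gs * Gt).sum() / (np.linalg.norm(Gs) * np.linalg.norm(Gt) + 1e-12)
    angle_gap = 1.0 - cos
    return scale_gap, angle_gap
```

In training, these two terms would be added (with weights) to the source regression loss so the encoder produces features whose inverse Gram statistics match across domains.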

Probabilistic Debiasing of Scene Graphs
Biswas, BashirulAzamandJi, Qiang



Research question: The quality of scene graphs generated by state-of-the-art models is compromised by the long-tail nature of relationships and their parent object pairs.
Motivation: Training is dominated by the majority relationships of the majority object pairs, so after convergence the object-conditional relationship distributions of the minority pairs are not preserved, leaving the model biased.
Method: Virtual evidence incorporated within a within-triplet Bayesian network is proposed to preserve the object-conditional distribution of relationship labels and to eradicate the bias created by the marginal probability of the relationships. To address the insufficient number of minority-class relationship samples, embedding-based augmentation of triplets is used, borrowing minority-triplet samples from neighbouring triplets in the semantic space.
Results: Experiments on two different datasets show a significant improvement in the mean recall of relationships, and a better balance between recall and mean recall than existing state-of-the-art scene graph de-biasing techniques.

The quality of scene graphs generated by the state-of-the-art (SOTA) models is compromised due to the long-tail nature of the relationships and their parent object pairs. Training of the scene graphs is dominated by the majority relationships of the majority pairs and, therefore, the object-conditional distributions of relationship in the minority pairs are not preserved after the training is converged. Consequently, the biased model performs well on more frequent relationships in the marginal distribution of relationships such as 'on' and 'wearing', and performs poorly on the less frequent relationships such as 'eating' or 'hanging from'. In this work, we propose virtual evidence incorporated within-triplet Bayesian Network (BN) to preserve the object-conditional distribution of the relationship label and to eradicate the bias created by the marginal probability of the relationships. The insufficient number of relationships in the minority classes poses a significant problem in learning the within-triplet Bayesian network. We address this insufficiency by embedding-based augmentation of triplets where we borrow samples of the minority triplet classes from its neighboring triplets in the semantic space. We perform experiments on two different datasets and achieve a significant improvement in the mean recall of the relationships. We also achieve a better balance between recall and mean recall performance compared to the SOTA de-biasing techniques of scene graph models.

OSAN: A One-Stage Alignment Network To Unify Multimodal Alignment and Unsupervised Domain Adaptation
Liu, Ye and Qiao, Lingfeng and Lu, Changchong and Yin, Di and Lin, Chen and Peng, Haoyuan and Ren, Bo



Research question: How to perform unsupervised multimodal domain adaptation, particularly with respect to its two central problems: domain adaptation and modality alignment.
Motivation: In most existing two-stage approaches, domains and modalities are not associated and the relationship between them goes unexploited, which limits multimodal domain adaptation.
Method: This paper unifies the two stages, aligning domains and modalities simultaneously. A tensor-based alignment module (TAL) explores the relationship between domains and modalities, and a dynamic domain generator (DDG) module builds transitional samples in a self-supervised manner by mixing the shared information of the two domains, helping the model learn a domain-invariant common representation space.
Results: Experiments demonstrate that the method achieves superior performance in two real-world applications.

Extending from unimodal to multimodal is a critical challenge for unsupervised domain adaptation (UDA). Two major problems emerge in unsupervised multimodal domain adaptation: domain adaptation and modality alignment. An intuitive way to handle these two problems is to fulfill these tasks in two separate stages: aligning modalities followed by domain adaptation, or vice versa. However, domains and modalities are not associated in most existing two-stage studies, and the relationship between them is not leveraged which can provide complementary information to each other. In this paper, we unify these two stages into one to align domains and modalities simultaneously. In our model, a tensor-based alignment module (TAL) is presented to explore the relationship between domains and modalities. By this means, domains and modalities can interact sufficiently and guide them to utilize complementary information for better results. Furthermore, to establish a bridge between domains, a dynamic domain generator (DDG) module is proposed to build transitional samples by mixing the shared information of two domains in a self-supervised manner, which helps our model learn a domain-invariant common representation space. Extensive experiments prove that our method can achieve superior performance in two real-world applications. The code will be publicly available.
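The DDG module's transitional samples can be illustrated with a minimal interpolation sketch. This is an assumption for illustration only: the paper mixes the shared information of the two domains, whereas the sketch below simply mixes two feature vectors convexly, mixup-style, and the function name is hypothetical.

```python
import random

def transitional_sample(x_src, x_tgt, lam=None):
    """Build a transitional sample between two domains (schematic).

    x_src, x_tgt: equal-length feature vectors (lists of floats).
    lam: mixing coefficient in [0, 1]; drawn uniformly if omitted.
    """
    if lam is None:
        lam = random.random()
    # Convex combination: lam=1 recovers the source, lam=0 the target.
    return [lam * s + (1.0 - lam) * t for s, t in zip(x_src, x_tgt)]

# A transition halfway between a source and a target feature vector.
mixed = transitional_sample([1.0, 0.0], [0.0, 1.0], lam=0.5)
```

Sampling `lam` per batch yields a continuum of intermediate samples that can bridge the two domains during self-supervised training.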

Solving 3D Inverse Problems Using Pre-Trained 2D Diffusion Models
Chung, Hyungjin and Ryu, Dohoon and McCann, Michael T. and Klasky, Marc L. and Ye, Jong Chul



Research question: How to effectively solve 3D medical image reconstruction problems such as sparse-view tomography, limited-angle tomography, and compressed-sensing MRI.
Motivation: Conventional model-based iterative reconstruction and modern diffusion models excel at 2D reconstruction, but diffusion models have not been extended to 3D because the generative process stays in the same high-dimensional space as the data, incurring extreme memory and computational cost.
Method: This paper combines conventional model-based iterative reconstruction with modern diffusion models: at test time, the 2D diffusion prior is augmented with a model-based prior in the remaining direction, yielding coherent reconstructions across all dimensions.
Results: The method runs on a single commodity GPU and achieves high-fidelity, accurate reconstructions even in extreme cases (e.g., 2-view 3D tomography). It also generalizes surprisingly well and can reconstruct volumes entirely different from the training dataset.

Diffusion models have emerged as the new state-of-the-art generative model with high quality samples, with intriguing properties such as mode coverage and high flexibility. They have also been shown to be effective inverse problem solvers, acting as the prior of the distribution, while the information of the forward model can be granted at the sampling stage. Nonetheless, as the generative process remains in the same high dimensional (i.e. identical to data dimension) space, the models have not been extended to 3D inverse problems due to the extremely high memory and computational cost. In this paper, we combine the ideas from the conventional model-based iterative reconstruction with the modern diffusion models, which leads to a highly effective method for solving 3D medical image reconstruction tasks such as sparse-view tomography, limited angle tomography, compressed sensing MRI from pre-trained 2D diffusion models. In essence, we propose to augment the 2D diffusion prior with a model-based prior in the remaining direction at test time, such that one can achieve coherent reconstructions across all dimensions. Our method can be run in a single commodity GPU, and establishes the new state-of-the-art, showing that the proposed method can perform reconstructions of high fidelity and accuracy even in the most extreme cases (e.g. 2-view 3D tomography). We further reveal that the generalization capacity of the proposed method is surprisingly high, and can be used to reconstruct volumes that are entirely different from the training dataset. Code available: https://github.com/HJ-harry/DiffusionMBIR

Federated Domain Generalization With Generalization Adjustment
Zhang, Ruipeng and Xu, Qinwei and Yao, Jiangchao and Zhang, Ya and Tian, Qi and Wang, Yanfeng



Research question: How to learn, in a privacy-preserving manner, a global model that generalizes well to new clients under possible domain shift.
Motivation: Existing methods mainly design unbiased training strategies within each individual domain, but without multi-domain data jointly in mini-batch training, almost none can guarantee generalization under domain shift.
Method: A novel global objective incorporating a new variance-reduction regularizer to encourage fairness, optimized by a new FL-friendly method named Generalization Adjustment (GA) that dynamically calibrates the aggregation weights.
Results: Theoretical analysis shows that explicit re-weighted aggregation achieves a tighter generalization bound, substituting for the implicit multi-domain data sharing that is only applicable in conventional DG settings. The algorithm is generic and can be combined with any local-client-training-based method; extensive experiments on several benchmark datasets show its effectiveness, with consistent improvements when combined with other FedDG algorithms.

Federated Domain Generalization (FedDG) attempts to learn a global model in a privacy-preserving manner that generalizes well to new clients possibly with domain shift. Recent exploration mainly focuses on designing an unbiased training strategy within each individual domain. However, without the support of multi-domain data jointly in the mini-batch training, almost all methods cannot guarantee the generalization under domain shift. To overcome this problem, we propose a novel global objective incorporating a new variance reduction regularizer to encourage fairness. A novel FL-friendly method named Generalization Adjustment (GA) is proposed to optimize the above objective by dynamically calibrating the aggregation weights. The theoretical analysis of GA demonstrates the possibility to achieve a tighter generalization bound with an explicit re-weighted aggregation, substituting the implicit multi-domain data sharing that is only applicable to the conventional DG settings. Besides, the proposed algorithm is generic and can be combined with any local client training-based methods. Extensive experiments on several benchmark datasets have shown the effectiveness of the proposed method, with consistent improvements over several FedDG algorithms when used in combination. The source code is released at https://github.com/MediaBrain-SJTU/FedDG-GA.
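The core of GA is calibrating aggregation weights from observed per-domain generalization gaps. The sketch below is a schematic one-step version under assumed semantics (the function name, the linear update rule, and the `step` parameter are illustrative, not the paper's exact formula): domains with an above-average gap get more aggregation weight, and the weights are renormalized to a simplex.

```python
def adjust_weights(weights, gaps, step=0.1):
    """One schematic Generalization Adjustment step.

    weights: current per-domain aggregation weights (sum to 1).
    gaps: per-domain generalization gaps observed this round.
    step: calibration rate (assumed hyperparameter).
    """
    mean_gap = sum(gaps) / len(gaps)
    # Above-average gap -> weight goes up; below-average -> down.
    raw = [max(w + step * (g - mean_gap), 0.0) for w, g in zip(weights, gaps)]
    total = sum(raw)
    return [r / total for r in raw]

# Four clients, uniform weights, increasing generalization gaps.
w = adjust_weights([0.25, 0.25, 0.25, 0.25], [0.1, 0.2, 0.3, 0.4])
```

The clamp at zero and renormalization keep the result a valid weighting over clients regardless of the gap values.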

Learning the Distribution of Errors in Stereo Matching for Joint Disparity and Uncertainty Estimation
Chen, Liyan and Wang, Weihan and Mordohai, Philippos



Research question: This paper proposes a new loss function for joint disparity and uncertainty estimation in deep stereo matching.
Motivation: Precise uncertainty estimates are needed, and multi-task learning often improves performance across all tasks.
Method: A KL-divergence term in the network's loss function requires the distribution of uncertainty to match the distribution of disparity errors, enabling joint disparity and uncertainty estimation; a differentiable soft-histogramming technique approximates the distributions so they can be used in the loss.
Results: Experimental evaluation on large datasets shows significant improvements in both disparity and uncertainty prediction. Code is available at https://github.com/lly00412/SEDNet.git.

We present a new loss function for joint disparity and uncertainty estimation in deep stereo matching. Our work is motivated by the need for precise uncertainty estimates and the observation that multi-task learning often leads to improved performance in all tasks. We show that this can be achieved by requiring the distribution of uncertainty to match the distribution of disparity errors via a KL divergence term in the network's loss function. A differentiable soft-histogramming technique is used to approximate the distributions so that they can be used in the loss. We experimentally assess the effectiveness of our approach and observe significant improvements in both disparity and uncertainty prediction on large datasets. Our code is available at https://github.com/lly00412/SEDNet.git.
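The soft-histogram-plus-KL construction can be sketched in a few lines. This is a minimal illustration under assumptions (Gaussian kernel votes, the `bandwidth` parameter, and the function names are ours; the paper's soft-histogramming may differ in kernel and binning):

```python
import math

def soft_histogram(values, centers, bandwidth=0.5):
    """Soft histogram: every value casts a Gaussian-weighted vote into
    each bin center, then the bins are normalized to sum to one. Because
    the votes are smooth in the values, the operation is differentiable."""
    bins = [sum(math.exp(-((v - c) / bandwidth) ** 2) for v in values)
            for c in centers]
    total = sum(bins)
    return [b / total for b in bins]

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two normalized histograms."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Matching error and uncertainty distributions give (near-)zero loss.
centers = [0.0, 1.0, 2.0, 3.0]
p = soft_histogram([0.1, 0.9, 1.1], centers)  # e.g. disparity errors
q = soft_histogram([0.1, 0.9, 1.1], centers)  # e.g. predicted uncertainties
loss = kl_divergence(p, q)
```

Minimizing such a term pushes the predicted uncertainty distribution toward the empirical error distribution, which is the stated training signal.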

Learning Correspondence Uncertainty via Differentiable Nonlinear Least Squares
Muhle, Dominik and Koestler, Lukas and Jatavallabhula, Krishna Murthy and Cremers, Daniel



Research question: Propose a differentiable nonlinear least squares framework to account for uncertainty in relative pose estimation from feature correspondences.
Motivation: Existing methods handle uncertainty in relative pose estimation inadequately; a new approach is needed to improve accuracy and stability.
Method: Introduce a symmetric version of the probabilistic normal epipolar constraint, and estimate the covariance of feature positions by differentiating through the camera pose estimation procedure, forming a differentiable nonlinear least squares framework.
Results: Experiments on synthetic data and the KITTI and EuRoC real-world datasets show that the learned covariances accurately approximate the true noise distribution, and the approach consistently outperforms state-of-the-art non-probabilistic and probabilistic methods in real-world experiments.

We propose a differentiable nonlinear least squares framework to account for uncertainty in relative pose estimation from feature correspondences. Specifically, we introduce a symmetric version of the probabilistic normal epipolar constraint, and an approach to estimate the covariance of feature positions by differentiating through the camera pose estimation procedure. We evaluate our approach on synthetic, as well as the KITTI and EuRoC real-world datasets. On the synthetic dataset, we confirm that our learned covariances accurately approximate the true noise distribution. In real world experiments, we find that our approach consistently outperforms state-of-the-art non-probabilistic and probabilistic approaches, regardless of the feature extraction algorithm of choice.

Samples With Low Loss Curvature Improve Data Efficiency
Garg, Isha and Roy, Kaushik



Research question: This paper studies the second-order properties of the training loss of deep neural networks to understand the curvature of the loss surface in the vicinity of the training data points.
Motivation: The study finds an unexpected concentration of samples with very low curvature in the training data. These low-curvature samples are largely consistent across completely different architectures and identifiable in the early epochs of training.
Method: The authors propose the SLo-Curves algorithm, which identifies low-curvature samples as more data-efficient and trains on them with an additional regularizer that penalizes high curvature of the loss surface in their vicinity.
Results: On CIFAR-10 and CIFAR-100, SLo-Curves outperforms state-of-the-art coreset selection methods at small coreset sizes by up to 9%. The identified coresets generalize across architectures and can therefore be pre-computed to generate condensed versions of datasets for downstream tasks.

In this paper, we study the second order properties of the loss of trained deep neural networks with respect to the training data points to understand the curvature of the loss surface in the vicinity of these points. We find that there is an unexpected concentration of samples with very low curvature. We note that these low curvature samples are largely consistent across completely different architectures, and identifiable in the early epochs of training. We show that the curvature relates to the 'cleanliness' of the data points, with low curvatures samples corresponding to clean, higher clarity samples, representative of their category. Alternatively, high curvature samples are often occluded, have conflicting features and visually atypical of their category. Armed with this insight, we introduce SLo-Curves, a novel coreset identification and training algorithm. SLo-curves identifies the samples with low curvatures as being more data-efficient and trains on them with an additional regularizer that penalizes high curvature of the loss surface in their vicinity. We demonstrate the efficacy of SLo-Curves on CIFAR-10 and CIFAR-100 datasets, where it outperforms state of the art coreset selection methods at small coreset sizes by up to 9%. The identified coresets generalize across architectures, and hence can be pre-computed to generate condensed versions of datasets for use in downstream tasks.

Re-Basin via Implicit Sinkhorn Differentiation
Pe\~na, Fidel A. Guerrero and Medeiros, Heitor Rapela and Dubail, Thomas and Aminbeidokhti, Masih and Granger, Eric and Pedersoli, Marco



Research question: How to find the model permutation that minimizes a given objective and integrate it into gradient-based optimization.
Motivation: Current optimization techniques for this problem are not differentiable, which makes them hard to integrate into gradient-based training and often leads to sub-optimal solutions.
Method: Propose a Sinkhorn re-basin network that obtains the transportation plan best suited to a given objective, together with a new cost function that enables incremental learning by exploiting the linear mode connectivity property.
Results: Compared against similar approaches from the literature under several conditions, for both optimal transport and linear mode connectivity, the method delivers results on common benchmark datasets competitive with the state of the art.

The recent emergence of new algorithms for permuting models into functionally equivalent regions of the solution space has shed some light on the complexity of error surfaces and some promising properties like mode connectivity. However, finding the permutation that minimizes some objectives is challenging, and current optimization techniques are not differentiable, which makes it difficult to integrate into a gradient-based optimization, and often leads to sub-optimal solutions. In this paper, we propose a Sinkhorn re-basin network with the ability to obtain the transportation plan that better suits a given objective. Unlike the current state-of-art, our method is differentiable and, therefore, easy to adapt to any task within the deep learning domain. Furthermore, we show the advantage of our re-basin method by proposing a new cost function that allows performing incremental learning by exploiting the linear mode connectivity property. The benefit of our method is compared against similar approaches from the literature under several conditions for both optimal transport and linear mode connectivity. The effectiveness of our continual learning method based on re-basin is also shown for several common benchmark datasets, providing experimental results that are competitive with the state-of-art. The source code is provided at https://github.com/fagp/sinkhorn-rebasin.
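The differentiable ingredient here is the Sinkhorn operator: exponentiate a negated cost matrix and alternately normalize rows and columns, producing a doubly stochastic "soft permutation" that approaches a hard permutation as the temperature shrinks. A minimal sketch (list-based for clarity; the paper's network operates on learned cost matrices, and `tau`/`iters` are assumed knobs):

```python
import math

def sinkhorn(cost, tau=0.1, iters=50):
    """Entropy-regularized soft permutation via Sinkhorn iterations.

    cost: square matrix; low cost[i][j] favors matching i -> j.
    tau: temperature; smaller values sharpen toward a hard permutation.
    """
    n = len(cost)
    p = [[math.exp(-c / tau) for c in row] for row in cost]
    for _ in range(iters):
        # Normalize rows, then columns; repeating converges to a
        # doubly stochastic matrix (each op is differentiable).
        p = [[v / sum(row) for v in row] for row in p]
        col = [sum(row[j] for row in p) for j in range(n)]
        p = [[row[j] / col[j] for j in range(n)] for row in p]
    return p

# Cost favoring the swap permutation (0 -> 1, 1 -> 0).
plan = sinkhorn([[1.0, 0.0], [0.0, 1.0]])
```

Because every step is smooth, gradients can flow through the plan to whatever objective consumes it, which is the property the re-basin network relies on.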

Layout-Based Causal Inference for Object Navigation
Zhang, Sixian and Song, Xinhang and Li, Weijie and Bai, Yubing and Yu, Xinyao and Jiang, Shuqiang



Research question: How to exploit prior knowledge (experience) acquired in training environments for navigation while countering the negative effect of layout gaps.
Motivation: Prior works learn associations (e.g., relation graphs) between visual inputs and the goal, but when the layout gap between the test environments and the training environments is large, this prior knowledge harms navigation.
Method: Propose the layout-based soft Total Direct Effect (L-sTDE) framework, grounded in causal inference, to adjust the predictions of the navigation policy. Specifically, the layout gap is computed as the KL divergence between the posterior and prior distributions of the object layout, and sTDE then appropriately controls the effect of the experience according to this gap.
Results: Experimental results on AI2THOR, RoboTHOR, and Habitat demonstrate that the method effectively improves navigation performance.

Previous works for ObjectNav task attempt to learn the association (e.g. relation graph) between the visual inputs and the goal during training. Such association contains the prior knowledge of navigating in training environments, which is denoted as the experience. The experience performs a positive effect on helping the agent infer the likely location of the goal when the layout gap between the unseen environments of the test and the prior knowledge obtained in training is minor. However, when the layout gap is significant, the experience exerts a negative effect on navigation. Motivated by keeping the positive effect and removing the negative effect of the experience, we propose the layout-based soft Total Direct Effect (L-sTDE) framework based on the causal inference to adjust the prediction of the navigation policy. In particular, we propose to calculate the layout gap which is defined as the KL divergence between the posterior and the prior distribution of the object layout. Then the sTDE is proposed to appropriately control the effect of the experience based on the layout gap. Experimental results on AI2THOR, RoboTHOR, and Habitat demonstrate the effectiveness of our method.

Source-Free Video Domain Adaptation With Spatial-Temporal-Historical Consistency Learning
Li, Kai and Patel, Deep and Kruus, Erik and Min, Martin Renqiang



Research question: How to adapt a pretrained source model using unlabeled target data, particularly in the video domain.
Motivation: Existing source-free domain adaptation methods mainly target images and handle videos poorly; the proposed method accounts for the spatial, temporal, and historical properties of videos to address this.
Method: A simple and highly flexible source-free video domain adaptation (SFVDA) method that overcomes domain shift by simulating spatial and temporal variations and encouraging the model to make consistent predictions on a video and its augmented versions.
Results: Experiments show that the method achieves state-of-the-art performance in all settings.

Source-free domain adaptation (SFDA) is an emerging research topic that studies how to adapt a pretrained source model using unlabeled target data. It is derived from unsupervised domain adaptation but has the advantage of not requiring labeled source data to learn adaptive models. This makes it particularly useful in real-world applications where access to source data is restricted. While there has been some SFDA work for images, little attention has been paid to videos. Naively extending image-based methods to videos without considering the unique properties of videos often leads to unsatisfactory results. In this paper, we propose a simple and highly flexible method for Source-Free Video Domain Adaptation (SFVDA), which extensively exploits consistency learning for videos from spatial, temporal, and historical perspectives. Our method is based on the assumption that videos of the same action category are drawn from the same low-dimensional space, regardless of the spatio-temporal variations in the high-dimensional space that cause domain shifts. To overcome domain shifts, we simulate spatio-temporal variations by applying spatial and temporal augmentations on target videos, and encourage the model to make consistent predictions from a video and its augmented versions. Due to the simple design, our method can be applied to various SFVDA settings, and experiments show that our method achieves state-of-the-art performance for all the settings.

MELTR: Meta Loss Transformer for Learning To Fine-Tune Video Foundation Models
Ko, Dohwan and Choi, Joonmyung and Choi, Hyeong Kyu and On, Kyoung-Woon and Roh, Byungseok and Kim, Hyunwoo J.



Research question: During fine-tuning, existing foundation models focus on minimizing a single task-specific loss and do not fully leverage other losses potentially beneficial to the target task.
Motivation: Propose a plug-in module that automatically and non-linearly combines various loss functions, aiding learning of the target task via auxiliary learning.
Method: Formulate auxiliary learning as a bi-level optimization problem and present an efficient optimization algorithm based on Approximate Implicit Differentiation (AID).
Results: Applying the framework to various video foundation models yields significant performance gains on four downstream tasks. Qualitative analyses show that MELTR effectively transforms and fuses individual loss functions into an effective unified loss.

Foundation models have shown outstanding performance and generalization capabilities across domains. Since most studies on foundation models mainly focus on the pretraining phase, a naive strategy to minimize a single task-specific loss is adopted for fine-tuning. However, such fine-tuning methods do not fully leverage other losses that are potentially beneficial for the target task. Therefore, we propose MEta Loss TRansformer (MELTR), a plug-in module that automatically and non-linearly combines various loss functions to aid learning the target task via auxiliary learning. We formulate the auxiliary learning as a bi-level optimization problem and present an efficient optimization algorithm based on Approximate Implicit Differentiation (AID). For evaluation, we apply our framework to various video foundation models (UniVL, Violet and All-in-one), and show significant performance gain on all four downstream tasks: text-to-video retrieval, video question answering, video captioning, and multi-modal sentiment analysis. Our qualitative analyses demonstrate that MELTR adequately 'transforms' individual loss functions and 'melts' them into an effective unified loss. Code is available at https://github.com/mlvlab/MELTR.

Ambiguous Medical Image Segmentation Using Diffusion Models
Rahman, Aimon and Valanarasu, Jeya Maria Jose and Hacihaliloglu, Ilker and Patel, Vishal M.



Research question: How to harness collective expert insight to improve diagnostic quality in medical image segmentation.
Motivation: Existing AI models mainly imitate the best individual expert, neglecting the power of expert groups.
Method: A single diffusion-model-based approach that produces multiple plausible outputs by learning a distribution over group insights. The method leverages the inherent stochastic sampling process of diffusion to generate a distribution of segmentation masks with only minimal additional learning.
Results: Tested on three medical imaging modalities (CT, ultrasound, and MRI), the model produces several possible variants while capturing the frequencies of their occurrence. It outperforms existing state-of-the-art ambiguous segmentation networks in accuracy while preserving naturally occurring variation. A new metric is also proposed to evaluate both the diversity and the accuracy of segmentation predictions, aligned with the clinical interest in collective insights.

Collective insights from a group of experts have always proven to outperform an individual's best diagnostic for clinical tasks. For the task of medical image segmentation, existing research on AI-based alternatives focuses more on developing models that can imitate the best individual rather than harnessing the power of expert groups. In this paper, we introduce a single diffusion model-based approach that produces multiple plausible outputs by learning a distribution over group insights. Our proposed model generates a distribution of segmentation masks by leveraging the inherent stochastic sampling process of diffusion using only minimal additional learning. We demonstrate on three different medical image modalities- CT, ultrasound, and MRI that our model is capable of producing several possible variants while capturing the frequencies of their occurrences. Comprehensive results show that our proposed approach outperforms existing state-of-the-art ambiguous segmentation networks in terms of accuracy while preserving naturally occurring variation. We also propose a new metric to evaluate the diversity as well as the accuracy of segmentation predictions that aligns with the interest of clinical practice of collective insights. Implementation code will be released publicly after the review process.

Make Landscape Flatter in Differentially Private Federated Learning
Shi, Yifan and Liu, Yingqi and Wei, Kang and Shen, Li and Wang, Xueqian and Tao, Dacheng



Research question: How to protect privacy and mitigate sensitive information leakage in federated learning?
Motivation: Existing differentially private federated learning methods tend to sharpen the loss landscape and weaken robustness to weight perturbation, severely degrading performance.
Method: Propose a novel algorithm, DP-FedSAM, which leverages gradient perturbation to mitigate the negative impact of DP. Specifically, DP-FedSAM integrates the Sharpness-Aware Minimization (SAM) optimizer to generate locally flat models with better stability and weight-perturbation robustness, yielding local updates of small norm that are robust to DP noise and thereby improving performance.
Results: A detailed theoretical analysis of how DP-FedSAM mitigates DP-induced performance degradation, rigorous Rényi DP privacy guarantees, and a sensitivity analysis of the local updates. Empirically, the algorithm achieves state-of-the-art performance compared with existing federated learning methods.

To defend the inference attacks and mitigate the sensitive information leakages in Federated Learning (FL), client-level Differentially Private FL (DPFL) is the de-facto standard for privacy protection by clipping local updates and adding random noise. However, existing DPFL methods tend to make a sharper loss landscape and have poorer weight perturbation robustness, resulting in severe performance degradation. To alleviate these issues, we propose a novel DPFL algorithm named DP-FedSAM, which leverages gradient perturbation to mitigate the negative impact of DP. Specifically, DP-FedSAM integrates Sharpness Aware Minimization (SAM) optimizer to generate local flatness models with better stability and weight perturbation robustness, which results in the small norm of local updates and robustness to DP noise, thereby improving the performance. From the theoretical perspective, we analyze in detail how DP-FedSAM mitigates the performance degradation induced by DP. Meanwhile, we give rigorous privacy guarantees with Renyi DP and present the sensitivity analysis of local updates. At last, we empirically confirm that our algorithm achieves state-of-the-art (SOTA) performance compared with existing SOTA baselines in DPFL.
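The client-level DP mechanism the abstract describes, clip the local update to a maximum l2 norm and add random noise, can be sketched directly; the function name and defaults are illustrative, and SAM itself is omitted (its role in DP-FedSAM is to shrink the update norm so this clipping-plus-noise step hurts less):

```python
import random

def privatize_update(update, clip_norm=1.0, noise_std=0.1):
    """Clip a local update to l2 norm <= clip_norm, then add Gaussian
    noise (schematic client-level DP step)."""
    norm = sum(u * u for u in update) ** 0.5
    scale = min(1.0, clip_norm / max(norm, 1e-12))
    return [u * scale + random.gauss(0.0, noise_std) for u in update]

# A large update gets rescaled onto the clipping ball before noising.
private = privatize_update([3.0, 4.0], clip_norm=1.0)
```

With a flatter local minimum, the pre-clipping norm is already small, so both the clipping distortion and the relative effect of the noise are reduced, which matches the intuition in the abstract.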

Towards Better Stability and Adaptability: Improve Online Self-Training for Model Adaptation in Semantic Segmentation
Zhao, Dong and Wang, Shuang and Zang, Qi and Quan, Dou and Ye, Xiutiao and Jiao, Licheng



Research question: This paper addresses unsupervised domain adaptation for semantic segmentation, particularly in adaptation scenarios involving privacy, property rights protection, and confidentiality.
Motivation: Conventional unsupervised domain adaptation (UDA) requires access to labeled source data, which is unavailable under privacy constraints; this paper therefore focuses on unsupervised model adaptation (UMA), which requires no access to source data.
Method: The paper finds that online self-training can be deployed in UMA, but the missing source-domain loss greatly weakens the method's stability and adaptability. Accordingly, a dynamic teacher-update mechanism and a training-consistency-based resampling strategy are proposed to improve the stability and adaptability of online self-training.
Results: On multiple model adaptation benchmarks, the method obtains new state-of-the-art performance, comparable to or even better than state-of-the-art UDA methods.

Unsupervised domain adaptation (UDA) in semantic segmentation transfers the knowledge of the source domain to the target one to improve the adaptability of the segmentation model in the target domain. The need to access labeled source data makes UDA unable to handle adaptation scenarios involving privacy, property rights protection, and confidentiality. In this paper, we focus on unsupervised model adaptation (UMA), also called source-free domain adaptation, which adapts a source-trained model to the target domain without accessing source data. We find that the online self-training method has the potential to be deployed in UMA, but the lack of source domain loss will greatly weaken the stability and adaptability of the method. We analyze the two possible reasons for the degradation of online self-training, i.e. inopportune updates of the teacher model and biased knowledge from source-trained model. Based on this, we propose a dynamic teacher update mechanism and a training-consistency based resampling strategy to improve the stability and adaptability of online self training. On multiple model adaptation benchmarks, our method obtains new state-of-the-art performance, which is comparable or even better than state-of-the-art UDA methods.

Ranking Regularization for Critical Rare Classes: Minimizing False Positives at a High True Positive Rate
Mohammadi, Kiarash and Zhao, He and Zhai, Mengyao and Tung, Frederick



Research question: In many real-world settings the critical class is rare and a missed detection carries a disproportionately high cost; the challenge is lowering the false positive rate while operating at a high true positive rate.
Motivation: In scenarios such as medical diagnosis and fraudulent banking transaction detection, false alarms have serious consequences, so methods for reducing false positives are needed.
Method: This paper proposes a ranking-based regularization (RankReg) approach that is easy to implement, effectively reduces false positives, and complements conventional imbalanced-learning losses.
Results: Experiments on multiple datasets (CIFAR-10&100 and Melanoma) show that the approach lifts the previous state-of-the-art performance by notable margins.

In many real-world settings, the critical class is rare and a missed detection carries a disproportionately high cost. For example, tumors are rare and a false negative diagnosis could have severe consequences on treatment outcomes; fraudulent banking transactions are rare and an undetected occurrence could result in significant losses or legal penalties. In such contexts, systems are often operated at a high true positive rate, which may require tolerating high false positives. In this paper, we present a novel approach to address the challenge of minimizing false positives for systems that need to operate at a high true positive rate. We propose a ranking-based regularization (RankReg) approach that is easy to implement, and show empirically that it not only effectively reduces false positives, but also complements conventional imbalanced learning losses. With this novel technique in hand, we conduct a series of experiments on three broadly explored datasets (CIFAR-10&100 and Melanoma) and show that our approach lifts the previous state-of-the-art performance by notable margins.
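The quantity such a ranking regularizer targets can be made concrete: at a high true positive rate, every negative that outranks a critical-class (positive) score becomes a false positive the operating threshold must admit. The sketch below counts exactly that; it is an illustrative surrogate under our own naming, not the paper's actual (differentiable) RankReg formulation.

```python
def rank_penalty(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs that are mis-ranked,
    i.e. the negative scores at least as high as the positive.
    0.0 = every positive outranks every negative (no extra FPs);
    1.0 = every negative outranks every positive."""
    bad_pairs = sum(1 for p in pos_scores
                      for n in neg_scores if n >= p)
    return bad_pairs / (len(pos_scores) * len(neg_scores))
```

A trainable version would replace the hard count with a smooth surrogate, but the quantity being driven to zero is the same.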

Rethinking Feature-Based Knowledge Distillation for Face Recognition
Li, Jingzhi and Guo, Zidong and Li, Hui and Han, Seungju and Baek, Ji-won and Yang, Min and Yang, Ran and Suh, Sungjoo



Research question: How to perform feature distillation for large-scale face recognition without identity supervision.
Motivation: As face datasets continually expand, feature-based distillation prevails in large-scale face recognition, yet naively removing identity supervision leads to inferior distillation results.
Method: Constrain the teacher's search space with reverse distillation to narrow the intrinsic-dimension gap and unleash the potential of feature-only distillation, and design a student proxy to better bridge the intrinsic gap.
Results: The method surpasses state-of-the-art feature-distillation techniques with identity supervision on various face recognition benchmarks, with improvements consistent across different teacher-student pairs.

With the continual expansion of face datasets, feature-based distillation prevails for large-scale face recognition. In this work, we attempt to remove identity supervision in student training, to spare the GPU memory from saving massive class centers. However, this naive removal leads to inferior distillation result. We carefully inspect the performance degradation from the perspective of intrinsic dimension, and argue that the gap in intrinsic dimension, namely the intrinsic gap, is intimately connected to the infamous capacity gap problem. By constraining the teacher's search space with reverse distillation, we narrow the intrinsic gap and unleash the potential of feature-only distillation. Remarkably, the proposed reverse distillation creates universally student-friendly teacher that demonstrates outstanding student improvement. We further enhance its effectiveness by designing a student proxy to better bridge the intrinsic gap. As a result, the proposed method surpasses state-of-the-art distillation techniques with identity supervision on various face recognition benchmarks, and the improvements are consistent across different teacher-student pairs.

Revisiting Reverse Distillation for Anomaly Detection
Tien, Tran Dinh and Nguyen, Anh Tuan and Tran, Nguyen Hoang and Huy, Ta Duc and Duong, Soan T.M. and Nguyen, Chanh D.Tr. and Truong, Steven Q.H.



Research question: Anomaly detection, an important application in large-scale industrial manufacturing.
Motivation: Existing methods achieve high accuracy but come with a latency trade-off.
Method: Improve the Reverse Distillation (RD) approach, establishing a new state-of-the-art benchmark for both anomaly detection and localization on the challenging MVTec dataset.
Results: The proposed method runs six times faster than PatchCore and two times faster than CFA, while introducing negligible latency compared with RD. Experiments on the BTAD and retinal OCT datasets demonstrate its generalizability, and important ablation experiments provide insights into its configurations.

Anomaly detection is an important application in large-scale industrial manufacturing. Recent methods for this task have demonstrated excellent accuracy but come with a latency trade-off. Memory based approaches with dominant performances like PatchCore or Coupled-hypersphere-based Feature Adaptation (CFA) require an external memory bank, which significantly lengthens the execution time. Another approach that employs Reversed Distillation (RD) can perform well while maintaining low latency. In this paper, we revisit this idea to improve its performance, establishing a new state-of-the-art benchmark on the challenging MVTec dataset for both anomaly detection and localization. The proposed method, called RD++, runs six times faster than PatchCore, and two times faster than CFA but introduces a negligible latency compared to RD. We also experiment on the BTAD and Retinal OCT datasets to demonstrate our method's generalizability and conduct important ablation experiments to provide insights into its configurations. Source code will be available at https://github.com/tientrandinh/Revisiting-Reverse-Distillation.

Meta-Causal Learning for Single Domain Generalization
Chen, Jin and Gao, Zhi and Wu, Xinxiao and Luo, Jiebo



Research question: How to learn a model from a single training domain (source domain) and apply it to multiple unseen target domains.
Motivation: Existing methods mainly expand the distribution of the training domain to cover the target domains, without estimating the domain shift between the source and target domains.
Method: A new learning paradigm, simulate-analyze-reduce: first simulate the domain shift by building an auxiliary domain as the target domain, then learn to analyze the causes of the domain shift, and finally learn to reduce the shift for model adaptation. Under this paradigm, a meta-causal learning method learns meta-knowledge, namely how to infer the causes of the domain shift between the auxiliary and source domains during training.
Results: Extensive experiments on several image classification benchmarks show the effectiveness of the method.

Single domain generalization aims to learn a model from a single training domain (source domain) and apply it to multiple unseen test domains (target domains). Existing methods focus on expanding the distribution of the training domain to cover the target domains, but without estimating the domain shift between the source and target domains. In this paper, we propose a new learning paradigm, namely simulate-analyze-reduce, which first simulates the domain shift by building an auxiliary domain as the target domain, then learns to analyze the causes of domain shift, and finally learns to reduce the domain shift for model adaptation. Under this paradigm, we propose a meta-causal learning method to learn meta-knowledge, that is, how to infer the causes of domain shift between the auxiliary and source domains during training. We use the meta-knowledge to analyze the shift between the target and source domains during testing. Specifically, we perform multiple transformations on source data to generate the auxiliary domain, perform counterfactual inference to learn to discover the causal factors of the shift between the auxiliary and source domains, and incorporate the inferred causality into factor-aware domain alignments. Extensive experiments on several benchmarks of image classification show the effectiveness of our method.

FEND: A Future Enhanced Distribution-Aware Contrastive Learning Framework for Long-Tail Trajectory Prediction
Wang, Yuning and Zhang, Pu and Bai, Lei and Xue, Jianru



Research question: This paper addresses future trajectory prediction for traffic agents in autonomous driving, particularly the complex and safety-critical long-tail data.
Motivation: Existing trajectory prediction methods do not account for the variety of motion patterns in long-tail data, which tends to be more complex and more safety-critical.
Method: A future-enhanced contrastive learning framework that recognizes tail trajectory patterns and forms a feature space with separate pattern clusters, plus a distribution-aware hyper predictor to better utilize the shaped feature space.
Results: Experiments show the framework outperforms the state-of-the-art long-tail prediction method on tailed samples by 9.5% on ADE and 8.5% on FDE, while maintaining or slightly improving the averaged performance. The method also surpasses many long-tail techniques on the trajectory prediction task.

Predicting the future trajectories of traffic agents is a Gordian knot in autonomous driving. However, trajectory prediction suffers from data imbalance in the prevalent datasets, and the tailed data is often more complicated and safety-critical. In this paper, we focus on dealing with the long-tail phenomenon in trajectory prediction. Previous methods dealing with long-tail data did not take into account the variety of motion patterns in the tailed data. We put forward a future enhanced contrastive learning framework to recognize tail trajectory patterns and form a feature space with separate pattern clusters. Furthermore, a distribution-aware hyper predictor is brought up to better utilize the shaped feature space. Our method is a model-agnostic framework and can be plugged into many well-known baselines. Experimental results show that our framework outperforms the state-of-the-art long-tail prediction method on tailed samples by 9.5% on ADE and 8.5% on FDE, while maintaining or slightly improving the averaged performance. Our method also surpasses many long-tail techniques on the trajectory prediction task.

Reliability in Semantic Segmentation: Are We on the Right Track?
de Jorge, Pau and Volpi, Riccardo and Torr, Philip H.S. and Rogez, Gr\'egory



Research question: This study examines the robustness and uncertainty estimation of modern semantic segmentation models in order to assess model reliability.
Motivation: While in-domain performance of recent vision architectures keeps improving, properties such as robustness and uncertainty estimation are far less explored, leaving doubts about actual advances in model reliability.
Method: A broad analysis of models spanning from older ResNet-based architectures to novel transformers, assessing their reliability on four metrics: robustness, calibration, misclassification detection, and out-of-distribution (OOD) detection.
Results: The study finds that while the latest models are significantly more robust, they are not overall more reliable in terms of uncertainty estimation. Further exploration shows that improving calibration can also help with other uncertainty metrics such as misclassification or OOD detection. This is the first study of modern segmentation models focused on both robustness and uncertainty estimation, intended to help interested practitioners and researchers.

Motivated by the increasing popularity of transformers in computer vision, in recent times there has been a rapid development of novel architectures. While in-domain performance follows a constant, upward trend, properties like robustness or uncertainty estimation are less explored -leaving doubts about advances in model reliability. Studies along these axes exist, but they are mainly limited to classification models. In contrast, we carry out a study on semantic segmentation, a relevant task for many real-world applications where model reliability is paramount. We analyze a broad variety of models, spanning from older ResNet-based architectures to novel transformers and assess their reliability based on four metrics: robustness, calibration, misclassification detection and out-of-distribution (OOD) detection. We find that while recent models are significantly more robust, they are not overall more reliable in terms of uncertainty estimation. We further explore methods that can come to the rescue and show that improving calibration can also help with other uncertainty metrics such as misclassification or OOD detection. This is the first study on modern segmentation models focused on both robustness and uncertainty estimation and we hope it will help practitioners and researchers interested in this fundamental vision task.

Video Test-Time Adaptation for Action Recognition
Lin, Wei and Mirza, Muhammad Jehanzeb and Kozinski, Mateusz and Possegger, Horst and Kuehne, Hilde and Bischof, Horst



Research question: Existing action recognition systems are brittle under unanticipated distribution shifts in test data.
Motivation: Although existing models achieve top performance on in-distribution evaluation, they are highly sensitive to unanticipated distribution shifts in the test data.
Method: An approach tailored to spatio-temporal models that can adapt on a single video sample at a step. It consists of a feature-distribution alignment technique that aligns online estimates of test-set statistics toward the training statistics, and additionally enforces prediction consistency over temporally augmented views of the same test video sample.
Results: Evaluations on three benchmark action recognition datasets show the method is architecture-agnostic and significantly boosts both the state-of-the-art convolutional architecture TANet and the Video Swin Transformer, with substantial gains over existing test-time adaptation approaches under both a single distribution shift and the challenging case of random distribution shifts.

Although action recognition systems can achieve top performance when evaluated on in-distribution test points, they are vulnerable to unanticipated distribution shifts in test data. However, test-time adaptation of video action recognition models against common distribution shifts has so far not been demonstrated. We propose to address this problem with an approach tailored to spatio-temporal models that is capable of adaptation on a single video sample at a time. It consists of a feature distribution alignment technique that aligns online estimates of test set statistics towards the training statistics. We further enforce prediction consistency over temporally augmented views of the same test video sample. Evaluations on three benchmark action recognition datasets show that our proposed technique is architecture-agnostic and able to significantly boost the performance of both the state-of-the-art convolutional architecture TANet and the Video Swin Transformer. Our proposed method demonstrates a substantial performance gain over existing test-time adaptation approaches in both evaluations of a single distribution shift and the challenging case of random distribution shifts.
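The two ingredients described above can be sketched with scalar stand-ins: (i) keep an online estimate of test statistics and penalize its distance to stored training statistics, and (ii) penalize disagreement between predictions for two augmented views. All names, the momentum value, and the L1 distances are illustrative assumptions, not the authors' implementation.

```python
# Online test-statistic tracking and the two alignment/consistency penalties.
def update_online_stats(ema_mean, ema_var, batch, momentum=0.1):
    """Exponential moving average of per-feature test statistics."""
    m = sum(batch) / len(batch)
    v = sum((x - m) ** 2 for x in batch) / len(batch)
    return ((1 - momentum) * ema_mean + momentum * m,
            (1 - momentum) * ema_var + momentum * v)

def alignment_loss(ema_mean, ema_var, train_mean, train_var):
    """Discrepancy between online test and stored training statistics."""
    return abs(ema_mean - train_mean) + abs(ema_var - train_var)

def consistency_loss(probs_view_a, probs_view_b):
    """Disagreement between predictions for two temporally augmented views."""
    return sum(abs(a - b) for a, b in zip(probs_view_a, probs_view_b)) / 2
```

Minimizing the two losses jointly nudges the model's feature statistics toward the training distribution while keeping predictions stable across augmentations of the same clip.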

Bi-Level Meta-Learning for Few-Shot Domain Generalization
Qin, Xiaorong and Song, Xinhang and Jiang, Shuqiang



Research question: This paper addresses domain generalization in few-shot learning: how to generalize from seen domains to unseen domains with only a few samples.
Motivation: Existing few-shot learning methods mostly target generalization within particular domains, whereas practical applications more often require generalization across domains.
Method: The paper tackles few-shot domain generalization by meta-learning two levels of meta-knowledge. The lower-level meta-knowledge consists of domain-specific embedding spaces, treated as subspaces of a base space, for intra-domain generalization; the upper-level meta-knowledge is the base space together with a prior over the domain-specific subspaces, for inter-domain generalization. The two levels are learned via bi-level optimization, and an optimization algorithm that requires no Hessian information is further developed.
Results: Evaluation on the widely used Meta-Dataset benchmark shows the method is significantly superior to previous work.

The goal of few-shot learning is to learn the generalizability from seen to unseen data with only a few samples. Most previous few-shot learning methods focus on learning generalizability within particular domains. However, more practical scenarios may also require generalizability across domains. In this paper, we study the problem of few-shot domain generalization (FSDG), which is a more challenging variant of few-shot classification. FSDG requires additional generalization across a larger gap from seen domains to unseen domains. We address the FSDG problem by meta-learning two levels of meta-knowledge, where the lower-level meta-knowledge consists of domain-specific embedding spaces as subspaces of a base space for intra-domain generalization, and the upper-level meta-knowledge is the base space and a prior subspace over domain-specific spaces for inter-domain generalization. We formulate the two-level meta-knowledge learning problem as a bi-level optimization, and further develop an optimization algorithm without Hessian information to solve it. We demonstrate that our method is significantly superior to previous works by evaluating it on the widely used benchmark Meta-Dataset.

Class Relationship Embedded Learning for Source-Free Unsupervised Domain Adaptation
Zhang, Yixin and Wang, Zilei and He, Weinan



Research question: This paper addresses Source-Free Unsupervised Domain Adaptation (SFUDA), a practical knowledge transfer task in which only a well-trained source model and unlabeled target data are available.
Motivation: To fully exploit source knowledge, the authors propose transferring the class relationship, which is domain-invariant yet under-explored in previous work.
Method: The classifier weights of the source model are first treated as class prototypes to compute the class relationship. A novel probability-based similarity, Class Relationship embedded Similarity (CRS), is then derived by embedding the source-domain class relationship. Finally, CRS is embedded into contrastive learning in a unified form.
Results: Extensive experiments show that, thanks to transferring the domain-invariant class relationship, the method achieves state-of-the-art performance.

This work focuses on a practical knowledge transfer task defined as Source-Free Unsupervised Domain Adaptation (SFUDA), where only a well-trained source model and unlabeled target data are available. To fully utilize source knowledge, we propose to transfer the class relationship, which is domain-invariant but still under-explored in previous works. To this end, we first regard the classifier weights of the source model as class prototypes to compute class relationship, and then propose a novel probability-based similarity between target-domain samples by embedding the source-domain class relationship, resulting in Class Relationship embedded Similarity (CRS). Here the inter-class term is particularly considered in order to more accurately represent the similarity between two samples, in which the source prior of class relationship is utilized by weighting. Finally, we propose to embed CRS into contrastive learning in a unified form. Here both class-aware and instance discrimination contrastive losses are employed, which are complementary to each other. We combine the proposed method with existing representative methods to evaluate its efficacy in multiple SFUDA settings. Extensive experimental results reveal that our method can achieve state-of-the-art performance due to the transfer of domain-invariant class relationship.

Spatio-Temporal Pixel-Level Contrastive Learning-Based Source-Free Domain Adaptation for Video Semantic Segmentation
Lo, Shao-Yuan and Oza, Poojan and Chennupati, Sumanth and Galindo, Alejandro and Patel, Vishal M.



Research question: How to transfer labeled source knowledge to an unlabeled target domain when the source data cannot be accessed.
Motivation: In real-world scenarios, access to source data is often restricted or infeasible, which makes standard Unsupervised Domain Adaptation (UDA) less practical.
Method: The paper proposes Spatio-Temporal Pixel-Level (STPL) contrastive learning, a novel method that takes full advantage of spatio-temporal information to tackle source-free adaptation in video applications.
Results: Experiments show that STPL achieves state-of-the-art performance on VSS benchmarks compared with current UDA and SFDA methods.

Unsupervised Domain Adaptation (UDA) of semantic segmentation transfers labeled source knowledge to an unlabeled target domain by relying on accessing both the source and target data. However, the access to source data is often restricted or infeasible in real-world scenarios. Under the source data restrictive circumstances, UDA is less practical. To address this, recent works have explored solutions under the Source-Free Domain Adaptation (SFDA) setup, which aims to adapt a source-trained model to the target domain without accessing source data. Still, existing SFDA approaches use only image-level information for adaptation, making them sub-optimal in video applications. This paper studies SFDA for Video Semantic Segmentation (VSS), where temporal information is leveraged to address video adaptation. Specifically, we propose Spatio-Temporal Pixel-Level (STPL) contrastive learning, a novel method that takes full advantage of spatio-temporal information to tackle the absence of source data better. STPL explicitly learns semantic correlations among pixels in the spatio-temporal space, providing strong self-supervision for adaptation to the unlabeled target domain. Extensive experiments show that STPL achieves state-of-the-art performance on VSS benchmarks compared to current UDA and SFDA approaches. Code is available at: https://github.com/shaoyuanlo/STPL

Mind the Label Shift of Augmentation-Based Graph OOD Generalization
Yu, Junchi and Liang, Jian and He, Ran



Research question: How to improve the out-of-distribution (OOD) generalization ability of Graph Neural Networks (GNNs).
Motivation: Existing methods generate augmented environments by editing the graph structure and learn an invariant GNN for generalization, but such edits alter the graph label, causing label shift in the augmentations and inconsistent predictive relationships across environments.
Method: The paper proposes LiSA, which constructs augmented environments from label-invariant subgraphs of the training graphs instead of relying on graph edits. Specifically, LiSA first designs variational subgraph generators to efficiently extract locally predictive patterns and construct multiple label-invariant subgraphs. The subgraphs produced by different generators are then collected to build different augmented environments. To promote diversity among the environments, LiSA further introduces a tractable energy-based regularization that enlarges the pairwise distances between environment distributions.
Results: Extensive experiments on node-level and graph-level OOD benchmarks show that LiSA achieves impressive generalization performance with different GNN backbones. Code is available at https://github.com/Samyu0304/LiSA.

Out-of-distribution (OOD) generalization is an important issue for Graph Neural Networks (GNNs). Recent works employ different graph editions to generate augmented environments and learn an invariant GNN for generalization. However, the graph structural edition inevitably alters the graph label. This causes the label shift in augmentations and brings inconsistent predictive relationships among augmented environments. To address this issue, we propose LiSA, which generates label-invariant augmentations to facilitate graph OOD generalization. Instead of resorting to graph editions, LiSA exploits Label-invariant Subgraphs of the training graphs to construct Augmented environments. Specifically, LiSA first designs the variational subgraph generators to efficiently extract locally predictive patterns and construct multiple label-invariant subgraphs. Then, the subgraphs produced by different generators are collected to build different augmented environments. To promote diversity among augmented environments, LiSA further introduces a tractable energy-based regularization to enlarge pair-wise distances between the distributions of environments. In this manner, LiSA generates diverse augmented environments with a consistent predictive relationship to facilitate learning an invariant GNN. Extensive experiments on node-level and graph-level OOD benchmarks show that LiSA achieves impressive generalization performance with different GNN backbones. Code is available on https://github.com/Samyu0304/LiSA.

Confidence-Aware Personalized Federated Learning via Variational Expectation Maximization
Zhu, Junyi and Ma, Xingchen and Blaschko, Matthew B.



Research question: This paper addresses federated learning with non-identically distributed client data of varying sizes.
Motivation: Personalized federated learning tackles this challenge through locally adapted models.
Method: A novel personalized federated learning framework is proposed based on hierarchical Bayesian modeling and variational inference. A global model is introduced as a latent variable; optimization follows the principle of maximizing the marginal likelihood and is carried out via variational expectation maximization.
Results: Extensive empirical studies on multiple datasets show competitive results under mildly heterogeneous conditions and significant improvements over state-of-the-art personalized federated learning frameworks in highly heterogeneous settings.

Federated Learning (FL) is a distributed learning scheme to train a shared model across clients. One common and fundamental challenge in FL is that the sets of data across clients could be non-identically distributed and have different sizes. Personalized Federated Learning (PFL) attempts to solve this challenge via locally adapted models. In this work, we present a novel framework for PFL based on hierarchical Bayesian modeling and variational inference. A global model is introduced as a latent variable to augment the joint distribution of clients' parameters and capture the common trends of different clients; optimization is derived based on the principle of maximizing the marginal likelihood and conducted using variational expectation maximization. Our algorithm gives rise to a closed-form estimation of a confidence value which comprises the uncertainty of clients' parameters and local model deviations from the global model. The confidence value is used to weigh clients' parameters in the aggregation stage and adjust the regularization effect of the global model. We evaluate our method through extensive empirical studies on multiple datasets. Experimental results show that our approach obtains competitive results under mild heterogeneous circumstances while significantly outperforming state-of-the-art PFL frameworks in highly heterogeneous settings.

AdaptiveMix: Improving GAN Training via Feature Space Shrinkage
Liu, Haozhe and Zhang, Wentian and Li, Bing and Wu, Haoqian and He, Nanjun and Huang, Yawen and Li, Yuexiang and Ghanem, Bernard and Zheng, Yefeng



Research question: Training GANs is difficult because the training distribution seen by the discriminator is dynamic, leading to unstable image representation.
Motivation: Motivated by studies on robust image representation, the paper proposes a simple yet effective module, AdaptiveMix, that shrinks the region occupied by training data in the discriminator's image representation space.
Method: Hard samples are constructed by mixing pairs of training images, and the feature distance between hard and easy samples is narrowed.
Results: Experiments show that AdaptiveMix facilitates GAN training and effectively improves the image quality of generated samples. Combined with state-of-the-art methods, it can further be applied to image classification and OOD detection tasks, significantly boosting baseline performance.

Due to the outstanding capability for data generation, Generative Adversarial Networks (GANs) have attracted considerable attention in unsupervised learning. However, training GANs is difficult, since the training distribution is dynamic for the discriminator, leading to unstable image representation. In this paper, we address the problem of training GANs from a novel perspective, i.e., robust image classification. Motivated by studies on robust image representation, we propose a simple yet effective module, namely AdaptiveMix, for GANs, which shrinks the regions of training data in the image representation space of the discriminator. Considering it is intractable to directly bound feature space, we propose to construct hard samples and narrow down the feature distance between hard and easy samples. The hard samples are constructed by mixing a pair of training images. We evaluate the effectiveness of our AdaptiveMix with widely-used and state-of-the-art GAN architectures. The evaluation results demonstrate that our AdaptiveMix can facilitate the training of GANs and effectively improve the image quality of generated samples. We also show that our AdaptiveMix can be further applied to image classification and Out-Of-Distribution (OOD) detection tasks, by equipping it with state-of-the-art methods. Extensive experiments on seven publicly available datasets show that our method effectively boosts the performance of baselines. The code is publicly available at https://github.com/WentianZhang-ML/AdaptiveMix.
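The hard-sample construction described above can be sketched in a few lines: a hard sample is a pixel-wise convex mix of two training images, and the training signal shrinks the feature distance between hard and easy samples. Everything here (flat image lists, the squared-distance loss, the mixing weight) is an illustrative stand-in, not the paper's discriminator.

```python
# Hard-sample construction via image mixing, plus the distance to shrink.
def mix_images(img_a, img_b, lam):
    """Hard sample: pixel-wise convex combination of two training images."""
    return [lam * a + (1 - lam) * b for a, b in zip(img_a, img_b)]

def shrink_loss(feat_hard, feat_easy):
    """Squared feature distance between hard and easy samples (to minimize)."""
    return sum((h - e) ** 2 for h, e in zip(feat_hard, feat_easy))

img_a, img_b = [0.0, 0.0], [1.0, 1.0]
hard = mix_images(img_a, img_b, lam=0.5)  # midpoint of the two images
```

Minimizing `shrink_loss` between the features of `hard` and those of its source images pulls the mixed (hard) samples toward the easy ones in feature space, effectively bounding the region the training data occupies.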

Angelic Patches for Improving Third-Party Object Detector Performance
Si, Wenwen and Li, Shuo and Park, Sangdon and Lee, Insup and Bastani, Osbert



Research question: Deep learning models are extremely vulnerable to simple perturbations and spatial transformations; this paper explores whether the machinery of adversarial attacks can instead be used to improve the perturbation robustness of object detection.
Motivation: In realistic object detection settings where target objects control their own appearance, the paper proposes a reversed Fast Gradient Sign Method (FGSM) to obtain "angelic" patches that significantly increase detection probability, even without prior knowledge of the perturbations.
Method: The patch is applied to each object instance simultaneously, strengthening not only classification but also bounding-box accuracy.
Results: Experiments demonstrate that partial-covering patches are effective for the complex bounding-box problem. More importantly, the performance transfers to different detection models, even under severe affine transformations and deformable shapes. To the authors' knowledge, this is the first object detection patch to achieve both cross-model and multi-patch efficacy. Real-world experiments show average accuracy improvements of 30%, which offers substantial practical value.

Deep learning models have shown extreme vulnerability to simple perturbations and spatial transformations. In this work, we explore whether we can adopt the characteristics of adversarial attack methods to help improve perturbation robustness for object detection. We study a class of realistic object detection settings wherein the target objects have control over their appearance. To this end, we propose a reversed Fast Gradient Sign Method (FGSM) to obtain these angelic patches that significantly increase the detection probability, even without pre-knowledge of the perturbations. In detail, we apply the patch to each object instance simultaneously, strengthening not only classification but also bounding box accuracy. Experiments demonstrate the efficacy of the partial-covering patch in solving the complex bounding box problem. More importantly, the performance is also transferable to different detection models even under severe affine transformations and deformable shapes. To our knowledge, ours is the first object detection patch that achieves both cross-model and multiple-patch efficacy. We observed average accuracy improvements of 30% in the real-world experiments, which offers substantial practical value. Our code is available at: https://github.com/averysi224/angelic_patches.
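The "reversed FGSM" idea can be illustrated on a toy scalar objective: where the classic attack subtracts the sign of the gradient to lower a score, the angelic update adds it to raise the detection score. The detection score below is a numerical stand-in (a quadratic peaking at pixel value 0.7), not a real detector.

```python
# Reversed FGSM on a toy detection score: gradient *ascent* via sign steps.
def detection_score(patch):
    """Stand-in for a detector's objectness score on the patched object."""
    return -sum((p - 0.7) ** 2 for p in patch)  # peaks when pixels equal 0.7

def numeric_grad(f, x, eps=1e-5):
    """Central-difference gradient of f at x."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        g.append((f(xp) - f(xm)) / (2 * eps))
    return g

def reversed_fgsm_step(patch, step=0.1):
    """x <- x + step * sign(grad f): raises the score instead of lowering it."""
    g = numeric_grad(detection_score, patch)
    return [p + step * (1 if gi > 0 else -1 if gi < 0 else 0)
            for p, gi in zip(patch, g)]

patch = [0.0, 0.0]
better = reversed_fgsm_step(patch)  # moves both pixels toward 0.7
```

One step strictly increases the toy detection score; iterating the step would continue climbing until the sign steps overshoot the peak.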

Bi3D: Bi-Domain Active Learning for Cross-Domain 3D Object Detection
Yuan, Jiakang and Zhang, Bo and Yan, Xiangchao and Chen, Tao and Shi, Botian and Li, Yikang and Qiao, Yu



Research question: How to select and annotate a partial yet important subset of target data so as to strike a good balance between high performance and low annotation cost.
Motivation: Although Unsupervised Domain Adaptation (UDA) has made preliminary progress on 3D cross-domain tasks, the performance gap between UDA-based 3D models and supervised models trained on a fully annotated target domain remains large.
Method: The paper proposes Bi3D, a bi-domain active learning approach for cross-domain 3D object detection. Bi3D first develops a domainness-aware source sampling strategy that identifies target-domain-like samples in the source domain, so the model is not interfered with by irrelevant source data. It then develops a diversity-based target sampling strategy that selects the most informative subset of the target domain, improving adaptability to the target domain with a minimal annotation budget.
Results: Experiments on typical cross-domain adaptation scenarios, including cross-LiDAR-beam, cross-country, and cross-sensor, show that Bi3D achieves promising target-domain detection accuracy (89.63% on KITTI), surpassing UDA-based work (84.29%) and even a detector trained on the fully labeled target domain (88.98%).

Unsupervised Domain Adaptation (UDA) technique has been explored in 3D cross-domain tasks recently. Though preliminary progress has been made, the performance gap between the UDA-based 3D model and the supervised one trained with fully annotated target domain is still large. This motivates us to consider selecting partial-yet-important target data and labeling them at a minimum cost, to achieve a good trade-off between high performance and low annotation cost. To this end, we propose a Bi-domain active learning approach, namely Bi3D, to solve the cross-domain 3D object detection task. The Bi3D first develops a domainness-aware source sampling strategy, which identifies target-domain-like samples from the source domain to avoid the model being interfered by irrelevant source data. Then a diversity-based target sampling strategy is developed, which selects the most informative subset of target domain to improve the model adaptability to the target domain using as little annotation budget as possible. Experiments are conducted on typical cross-domain adaptation scenarios including cross-LiDAR-beam, cross-country, and cross-sensor, where Bi3D achieves a promising target-domain detection accuracy (89.63% on KITTI) compared with UDA-based work (84.29%), even surpassing the detector trained on the full set of the labeled target domain (88.98%).

Highly Confident Local Structure Based Consensus Graph Learning for Incomplete Multi-View Clustering
Wen, Jie and Liu, Chengliang and Xu, Gehui and Wu, Zhihao and Huang, Chao and Fei, Lunke and Xu, Yong



Research question: How to perform graph-based multi-view clustering effectively, especially in the presence of a large amount of incomplete data.
Motivation: Existing methods typically use graphs constructed from raw data to aid the learning of consistent representations, whereas this method directly learns a consensus graph across views for clustering.
Method: A novel confidence graph is designed and embedded to form a confidence-structure-driven consensus graph learning model. The confidence graph is based on an intuitive similar-nearest-neighbor hypothesis, requires no additional information, and helps the model obtain a high-quality consensus graph for better clustering.
Results: Numerous experiments confirm the effectiveness of the method.

Graph-based multi-view clustering has attracted extensive attention because of the powerful clustering-structure representation ability and noise robustness. Considering the reality of a large amount of incomplete data, in this paper, we propose a simple but effective method for incomplete multi-view clustering based on consensus graph learning, termed HCLS_CGL. Unlike existing methods that utilize graph constructed from raw data to aid in the learning of consistent representation, our method directly learns a consensus graph across views for clustering. Specifically, we design a novel confidence graph and embed it to form a confidence structure driven consensus graph learning model. Our confidence graph is based on an intuitive similar-nearest-neighbor hypothesis, which does not require any additional information and can help the model to obtain a high-quality consensus graph for better clustering. Numerous experiments are performed to confirm the effectiveness of our method.

CafeBoost: Causal Feature Boost To Eliminate Task-Induced Bias for Class Incremental Learning
Qiu, Benliu and Li, Hongliang and Wen, Haitao and Qiu, Heqian and Wang, Lanxiao and Meng, Fanman and Wu, Qingbo and Pan, Lili



Research question: This paper targets catastrophic forgetting in continual learning and identifies a new type of bias arising in this setting, coined task-induced bias.
Motivation: In continual learning, a model must incrementally learn a sequence of tasks while suffering from catastrophic forgetting. The authors find two underlying mechanisms that naturally reduce task-induced bias in task and domain incremental learning, but no such mechanism exists in class incremental learning (CIL).
Method: A causal intervention operation is devised to cut off the causal path responsible for the task-induced bias, implemented as a causal debias module that transforms biased features into unbiased ones. A training pipeline is further proposed to incorporate this module into existing methods and jointly optimize the whole architecture.
Results: Extensive experiments on CIFAR-100 and ImageNet show that the approach can improve accuracy and reduce the forgetting of well-established methods by a large margin.

Continual learning requires a model to incrementally learn a sequence of tasks and aims to predict well on all the learned tasks so far, which notoriously suffers from the catastrophic forgetting problem. In this paper, we find a new type of bias appearing in continual learning, coined as task-induced bias. We place continual learning into a causal framework, based on which we find the task-induced bias is reduced naturally by two underlying mechanisms in task and domain incremental learning. However, these mechanisms do not exist in class incremental learning (CIL), in which each task contains a unique subset of classes. To eliminate the task-induced bias in CIL, we devise a causal intervention operation so as to cut off the causal path that causes the task-induced bias, and then implement it as a causal debias module that transforms biased features into unbiased ones. In addition, we propose a training pipeline to incorporate the novel module into existing methods and jointly optimize the entire architecture. Our overall approach does not rely on data replay, and is simple and convenient to plug into existing methods. Extensive empirical study on CIFAR-100 and ImageNet shows that our approach can improve accuracy and reduce forgetting of well-established methods by a large margin.

Learning With Fantasy: Semantic-Aware Virtual Contrastive Constraint for Few-Shot Class-Incremental Learning
Song, Zeyin and Zhao, Yifan and Shi, Yujun and Peng, Peixi and Yuan, Li and Tian, Yonghong



Research question: This paper addresses few-shot class-incremental learning (FSCIL): how to continually learn new classes from limited samples without forgetting old classes.
Motivation: The mainstream FSCIL framework adopts the cross-entropy loss during base-session training, but this is found to yield poor inter-class separation in the representations, which in turn degrades generalization to novel classes.
Method: The paper proposes the Semantic-Aware Virtual Contrastive model (SAVC), which facilitates separation between new and base classes by introducing virtual classes into contrastive learning. These virtual classes, generated via pre-defined transformations, not only act as placeholders for unseen classes in the representation space but also provide diverse semantic information.
Results: Experiments show that SAVC significantly boosts base-class separation and novel-class generalization, achieving new state-of-the-art performance on three widely used FSCIL benchmark datasets.

Few-shot class-incremental learning (FSCIL) aims at learning to classify new classes continually from limited samples without forgetting the old classes. The mainstream framework tackling FSCIL is first to adopt the cross-entropy (CE) loss for training at the base session, then freeze the feature extractor to adapt to new classes. However, in this work, we find that the CE loss is not ideal for the base session training as it suffers poor class separation in terms of representations, which further degrades generalization to novel classes. One tempting method to mitigate this problem is to apply an additional naive supervised contrastive learning (SCL) in the base session. Unfortunately, we find that although SCL can create a slightly better representation separation among different base classes, it still struggles to separate base classes and new classes. Inspired by the observations made, we propose Semantic-Aware Virtual Contrastive model (SAVC), a novel method that facilitates separation between new classes and base classes by introducing virtual classes to SCL. These virtual classes, which are generated via pre-defined transformations, not only act as placeholders for unseen classes in the representation space but also provide diverse semantic information. By learning to recognize and contrast in the fantasy space fostered by virtual classes, our SAVC significantly boosts base class separation and novel class generalization, achieving new state-of-the-art performance on the three widely-used FSCIL benchmark datasets. Code is available at: https://github.com/zysong0113/SAVC.

Learning Partial Correlation Based Deep Visual Representation for Image Classification
Rahman, Saimunur and Koniusz, Piotr and Wang, Lei and Zhou, Luping and Moghadam, Peyman and Sun, Changming



Research question: How to effectively integrate the estimation of partial correlation into a CNN to obtain a more accurate deep visual representation.
Motivation: Covariance-based visual representations suffer from a "confounding" effect: when a third channel correlates with both channels of interest, the pairwise correlation estimate is distorted. Partial correlation, which removes this confounding effect, should be estimated instead.
Method: Sparse inverse covariance estimation (SICE) is formulated as a novel structured CNN layer, and an iterative method is developed to solve the underlying matrix optimization during the forward and backward propagation steps, ensuring end-to-end trainability.
Results: The work obtains a partial-correlation-based deep visual representation and mitigates the small-sample problem often encountered in covariance matrix estimation within CNNs. Experiments show superior classification performance compared with covariance-based counterparts.

Visual representation based on covariance matrix has demonstrated its efficacy for image classification by characterising the pairwise correlation of different channels in convolutional feature maps. However, pairwise correlation will become misleading once there is another channel correlating with both channels of interest, resulting in the "confounding" effect. For this case, "partial correlation" which removes the confounding effect shall be estimated instead. Nevertheless, reliably estimating partial correlation requires solving a symmetric positive definite matrix optimisation, known as sparse inverse covariance estimation (SICE). How to incorporate this process into CNN remains an open issue. In this work, we formulate SICE as a novel structured layer of CNN. To ensure end-to-end trainability, we develop an iterative method to solve the above matrix optimisation during forward and backward propagation steps. Our work obtains a partial correlation based deep visual representation and mitigates the small sample problem often encountered by covariance matrix estimation in CNN. Computationally, our model can be effectively trained with GPU and works well with a large number of channels of advanced CNNs. Experiments show the efficacy and superior classification performance of our deep visual representation compared to covariance matrix based counterparts.
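The quantity the proposed layer estimates has a simple closed form: the partial correlation between channels i and j can be read off the inverse covariance ("precision") matrix P as -P_ij / sqrt(P_ii * P_jj). The sketch below uses plain matrix inversion as a stand-in for SICE, with synthetic data where a confounder Z drives both X and Y:

```python
import numpy as np

def partial_correlation(cov):
    """Partial correlation matrix from a covariance matrix via its inverse."""
    p = np.linalg.inv(cov)
    d = np.sqrt(np.diag(p))
    pcorr = -p / np.outer(d, d)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr

# X and Y correlate only through the confounder Z, so their pairwise
# correlation is high while their partial correlation (given Z) vanishes.
rng = np.random.default_rng(0)
z = rng.normal(size=20000)
x = z + 0.1 * rng.normal(size=20000)
y = z + 0.1 * rng.normal(size=20000)
cov = np.cov(np.stack([x, y, z]))  # rows are variables
pc = partial_correlation(cov)       # pc[0, 1] is near zero
```

This is exactly the confounding effect the paper describes: the raw correlation between X and Y is close to 1, yet once Z is accounted for, no direct dependence remains.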

IterativePFN: True Iterative Point Cloud Filtering
de Silva Edirimuni, Dasith and Lu, Xuequan and Shao, Zhiwen and Li, Gang and Robles-Kelly, Antonio and He, Ying



Research question: How to improve point cloud quality by removing the noise introduced during capture.
Motivation: Existing learning-based methods train neural networks to infer filtered displacements and directly shift noisy points onto the underlying clean surfaces, but this is less effective under high noise.
Method: The paper proposes IterativePFN, an iterative point cloud filtering network containing multiple IterationModules that model the true iterative filtering process internally, within a single network. It is trained with a novel loss function that uses an adaptive ground-truth target at each iteration to capture the relationship between intermediate filtering results during training.
Results: Experimental results show better performance than state-of-the-art methods.

The quality of point clouds is often limited by noise introduced during their capture process. Consequently, a fundamental 3D vision task is the removal of noise, known as point cloud filtering or denoising. State-of-the-art learning based methods focus on training neural networks to infer filtered displacements and directly shift noisy points onto the underlying clean surfaces. In high noise conditions, they iterate the filtering process. However, this iterative filtering is only done at test time and is less effective at ensuring points converge quickly onto the clean surfaces. We propose IterativePFN (iterative point cloud filtering network), which consists of multiple IterationModules that model the true iterative filtering process internally, within a single network. We train our IterativePFN network using a novel loss function that utilizes an adaptive ground truth target at each iteration to capture the relationship between intermediate filtering results during training. This ensures that the filtered results converge faster to the clean surfaces. Our method is able to obtain better performance compared to state-of-the-art methods. The source code can be found at: https://github.com/ddsediri/IterativePFN.

On the Convergence of IRLS and Its Variants in Outlier-Robust Estimation
Peng, Liangzu and Kümmerle, Christian and Vidal, René



Research question: How to estimate parameters (e.g., 3D rotations) from data samples in the presence of outliers, a problem that is non-convex and non-smooth.
Motivation: The classical iteratively reweighted least-squares (IRLS) method and its variants perform impressively on this problem, but why they work so well has remained unclear.
Method: Majorization and graduated non-convexity (GNC) are incorporated into the IRLS framework, and the resulting variant is proved to be a convergent method for outlier-robust estimation.
Results: Experiments corroborate the theory and show that the proposed method converges within 5-10 iterations on typical outlier-robust estimation instances, whereas state-of-the-art methods need at least 30 iterations.

Outlier-robust estimation involves estimating some parameters (e.g., 3D rotations) from data samples in the presence of outliers, and is typically formulated as a non-convex and non-smooth problem. For this problem, the classical method called iteratively reweighted least-squares (IRLS) and its variants have shown impressive performance. This paper makes several contributions towards understanding why these algorithms work so well. First, we incorporate majorization and graduated non-convexity (GNC) into the IRLS framework and prove that the resulting IRLS variant is a convergent method for outlier-robust estimation. Moreover, in the robust regression context with a constant fraction of outliers, we prove this IRLS variant converges to the ground truth at a global linear and local quadratic rate for a random Gaussian feature matrix with high probability. Experiments corroborate our theory and show that the proposed IRLS variant converges within 5-10 iterations for typical problem instances of outlier-robust estimation, while state-of-the-art methods need at least 30 iterations. A basic implementation of our method is provided: https://github.com/liangzu/IRLS-CVPR2023
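The basic IRLS template the paper builds on is short: each step solves a weighted least-squares problem, with weights recomputed from the current residuals so that large residuals (outliers) are downweighted. The sketch below uses the classic w_i = 1 / max(|r_i|, delta) weighting for an L1-type objective on a toy regression with one gross outlier; it is a plain IRLS baseline, not the GNC-augmented variant the paper analyzes.

```python
import numpy as np

def irls(A, b, n_iter=20, delta=1e-6):
    """IRLS for robust regression: reweighted least squares per iteration."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]           # plain LS init
    for _ in range(n_iter):
        r = A @ x - b
        w = 1.0 / np.maximum(np.abs(r), delta)         # downweight outliers
        W = np.diag(w)
        x = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)  # weighted LS step
    return x

# Line y = 2t with one gross outlier; plain LS is pulled off, IRLS is not.
t = np.arange(10.0)
y = 2.0 * t
y[3] = 100.0                 # outlier
A = t.reshape(-1, 1)
x_hat = irls(A, y)           # slope estimate, close to 2
```

On this instance the plain least-squares initialization has slope near 3 because of the outlier, and a handful of reweighting steps pulls the estimate back to the true slope, which mirrors the fast convergence behavior reported above.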

Deep Incomplete Multi-View Clustering With Cross-View Partial Sample and Prototype Alignment
Jin, Jiaqi and Wang, Siwei and Dong, Zhibin and Liu, Xinwang and Zhu, En



Research question: Existing multi-view clustering methods perform poorly when samples are incomplete, and current approaches to incomplete multi-view clustering (IMVC) tend to ignore view discrepancy and representation flexibility, and may fuse clusters incorrectly.
Motivation: In practice, multi-view samples are often only partially available due to data corruption or sensor failure, which existing multi-view clustering methods handle poorly.
Method: The paper proposes the Cross-view Partial Sample and Prototype Alignment Network (CPSPAN) for deep incomplete multi-view clustering. It uses pair-observed data alignment as a "proxy supervised signal" to guide the construction of instance-to-instance correspondences across views, and designs a prototype alignment module to calibrate the shifted, incomplete distributions across views.
Results: Extensive experiments show the effectiveness of the proposed modules, with notable performance improvements over existing IMVC competitors on benchmark datasets.

The success of existing multi-view clustering relies on the assumption of sample integrity across multiple views. However, in real-world scenarios, samples of multi-view are partially available due to data corruption or sensor failure, which leads to incomplete multi-view clustering study (IMVC). Although several attempts have been proposed to address IMVC, they suffer from the following drawbacks: i) Existing methods mainly adopt cross-view contrastive learning forcing the representations of each sample across views to be exactly the same, which might ignore view discrepancy and flexibility in representations; ii) Due to the absence of non-observed samples across multiple views, the obtained prototypes of clusters might be unaligned and biased, leading to incorrect fusion. To address the above issues, we propose a Cross-view Partial Sample and Prototype Alignment Network (CPSPAN) for Deep Incomplete Multi-view Clustering. Firstly, unlike existing contrastive-based methods, we adopt pair-observed data alignment as 'proxy supervised signals' to guide instance-to-instance correspondence construction among views. Then, to handle the shifted prototypes in IMVC, we further propose a prototype alignment module to achieve incomplete distribution calibration across views. Extensive experimental results showcase the effectiveness of our proposed modules, attaining noteworthy performance improvements when compared to existing IMVC competitors on benchmark datasets.

Object Pose Estimation With Statistical Guarantees: Conformal Keypoint Detection and Geometric Uncertainty Propagation
Yang, Heng and Pavone, Marco



Research question: Existing two-stage object pose estimation methods perform well on standard benchmarks but offer no provable guarantees on the quality and uncertainty of the estimation.
Motivation: To address this, the paper injects two fundamental changes into the two-stage paradigm, namely conformal keypoint detection and geometric uncertainty propagation, and proposes the first pose estimator with provable and computable worst-case error bounds.
Method: Conformal keypoint detection applies the statistical machinery of inductive conformal prediction to convert heuristic keypoint detections into circular or elliptical prediction sets that cover the ground-truth keypoints with a user-specified marginal probability (e.g., 90%). Geometric uncertainty propagation then propagates the geometric constraints on the keypoints to the 6D object pose, yielding a Pose UnceRtainty SEt (PURSE) guaranteed to cover the ground-truth pose with the same probability. An average pose is computed via RANdom SAmple averaGing (RANSAG), and the worst-case error is upper-bounded via semidefinite relaxation.
Results: On the LineMOD Occlusion dataset, the paper demonstrates that (i) the PURSE covers the ground truth with valid probabilities; (ii) the worst-case error bounds provide correct uncertainty quantification; and (iii) the average pose achieves accuracy comparable to or better than representative sparse-keypoint methods.

The two-stage object pose estimation paradigm first detects semantic keypoints on the image and then estimates the 6D pose by minimizing reprojection errors. Despite performing well on standard benchmarks, existing techniques offer no provable guarantees on the quality and uncertainty of the estimation. In this paper, we inject two fundamental changes, namely conformal keypoint detection and geometric uncertainty propagation, into the two-stage paradigm and propose the first pose estimator that endows an estimation with provable and computable worst-case error bounds. On one hand, conformal keypoint detection applies the statistical machinery of inductive conformal prediction to convert heuristic keypoint detections into circular or elliptical prediction sets that cover the groundtruth keypoints with a user-specified marginal probability (e.g., 90%). Geometric uncertainty propagation, on the other, propagates the geometric constraints on the keypoints to the 6D object pose, leading to a Pose UnceRtainty SEt (PURSE) that guarantees coverage of the groundtruth pose with the same probability. The PURSE, however, is a nonconvex set that does not directly lead to estimated poses and uncertainties. Therefore, we develop RANdom SAmple averaGing (RANSAG) to compute an average pose and apply semidefinite relaxation to upper bound the worst-case errors between the average pose and the groundtruth. On the LineMOD Occlusion dataset we demonstrate: (i) the PURSE covers the groundtruth with valid probabilities; (ii) the worst-case error bounds provide correct uncertainty quantification; and (iii) the average pose achieves better or similar accuracy as representative methods based on sparse keypoints.
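The inductive conformal step behind the circular prediction sets reduces to a quantile computation: score a held-out calibration set with a nonconformity measure (here, keypoint localization error), take the finite-sample-adjusted (1 - alpha) quantile, and use it as the set radius. The calibration errors and alpha below are illustrative, not from the paper.

```python
import math

def conformal_radius(calib_errors, alpha=0.1):
    """Radius covering a (1 - alpha) fraction, with the finite-sample rank
    ceil((n + 1) * (1 - alpha)) used by inductive conformal prediction."""
    n = len(calib_errors)
    rank = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(calib_errors)[rank - 1]

# 19 calibration errors 1..19: rank = ceil(20 * 0.9) = 18, so r = 18.0,
# and a circle of radius r around a new detection covers the ground-truth
# keypoint with probability at least 90% (marginally, under exchangeability).
errors = [float(i) for i in range(1, 20)]
r = conformal_radius(errors, alpha=0.1)
```

At test time, each detected keypoint is replaced by the disc of radius `r` around it; those discs are the prediction sets that the geometric propagation step then turns into a pose uncertainty set.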

Learning Joint Latent Space EBM Prior Model for Multi-Layer Generator
Cui, Jiali and Wu, Ying Nian and Han, Tian



Research question: The fundamental problem of learning multi-layer generator models.
Motivation: Multi-layer generator models build multiple layers of latent variables as a prior on top of the generator, which benefits learning complex data distributions and hierarchical representations. However, such priors usually model the relations between latent layers with non-informative (conditional) Gaussian assumptions, which limits model expressivity.
Method: An energy-based model (EBM) is proposed on the joint latent space over all layers of latent variables of the multi-layer generator. This joint latent space EBM prior captures intra-layer contextual relations at each layer through layer-wise energy terms, and latent variables across different layers are jointly corrected. A joint training scheme via maximum likelihood estimation (MLE) is developed, involving Markov chain Monte Carlo (MCMC) sampling of the prior and posterior distributions of the latent variables at different layers. For efficient inference and learning, a variational training scheme is further proposed in which an inference model amortizes the costly posterior MCMC sampling.
Results: Experiments show that the learned model is expressive in generating high-quality images and capturing hierarchical features for better outlier detection.

This paper studies the fundamental problem of learning multi-layer generator models. The multi-layer generator model builds multiple layers of latent variables as a prior model on top of the generator, which benefits learning complex data distribution and hierarchical representations. However, such a prior model usually focuses on modeling inter-layer relations between latent variables by assuming non-informative (conditional) Gaussian distributions, which can be limited in model expressivity. To tackle this issue and learn more expressive prior models, we propose an energy-based model (EBM) on the joint latent space over all layers of latent variables with the multi-layer generator as its backbone. Such joint latent space EBM prior model captures the intra-layer contextual relations at each layer through layer-wise energy terms, and latent variables across different layers are jointly corrected. We develop a joint training scheme via maximum likelihood estimation (MLE), which involves Markov Chain Monte Carlo (MCMC) sampling for both prior and posterior distributions of the latent variables from different layers. To ensure efficient inference and learning, we further propose a variational training scheme where an inference model is used to amortize the costly posterior MCMC sampling. Our experiments demonstrate that the learned model can be expressive in generating high-quality images and capturing hierarchical features for better outlier detection.

Unsupervised Visible-Infrared Person Re-Identification via Progressive Graph Matching and Alternate Learning
Wu, Zesen and Ye, Mang



Research problem: Unsupervised visible-infrared cross-modality person re-identification, which is difficult due to the large modality gap and the lack of cross-modality correspondences.
Motivation: Cross-modality correspondences are crucial for bridging the modality gap. Some existing works try to mine them but focus only on local information, without fully exploiting global identity relations, which limits the quality of the mined correspondences.
Method: A Progressive Graph Matching (PGM) method is designed to globally mine cross-modality correspondences under cluster imbalance. PGM formulates correspondence mining as a graph-matching process and incorporates global information by minimizing a global matching cost that measures the dissimilarity between clusters. It further adopts a progressive strategy with multiple dynamic matching processes to handle the imbalance issue. On top of PGM, an Alternate Cross Contrastive Learning (ACCL) module exploits the mined correspondences to reduce the modality gap while mitigating correspondence noise through an alternating scheme.
Results: Extensive experiments demonstrate the reliability of the generated correspondences and the effectiveness of the method.

Unsupervised visible-infrared person re-identification is a challenging task due to the large modality gap and the unavailability of cross-modality correspondences. Cross-modality correspondences are very crucial to bridge the modality gap. Some existing works try to mine cross-modality correspondences, but they focus only on local information. They do not fully exploit the global relationship across identities, thus limiting the quality of the mined correspondences. Worse still, the number of clusters of the two modalities is often inconsistent, exacerbating the unreliability of the generated correspondences. In response, we devise a Progressive Graph Matching (PGM) method to globally mine cross-modality correspondences under cluster imbalance scenarios. PGM formulates correspondence mining as a graph matching process and considers the global information by minimizing the global matching cost, where the matching cost measures the dissimilarity of clusters. Besides, PGM adopts a progressive strategy to address the imbalance issue with multiple dynamic matching processes. Based on PGM, we design an Alternate Cross Contrastive Learning (ACCL) module to reduce the modality gap with the mined cross-modality correspondences, while mitigating the effect of noise in correspondences through an alternate scheme. Extensive experiments demonstrate the reliability of the generated correspondences and the effectiveness of our method.

Training Debiased Subnetworks With Contrastive Weight Pruning
Park, GeonYeong and Lee, Sangmin and Lee, SangWan and Ye, JongChul



Research problem: Does an optimal unbiased functional subnetwork exist in a severely biased network? If so, how can such a subnetwork be extracted?
Motivation: Neural networks are often biased toward spuriously correlated features, which raises the question of whether unbiased subnetworks can be found and extracted within biased networks.
Method: We propose a Debiased Contrastive Weight Pruning (DCWP) algorithm that probes unbiased subnetworks without expensive group annotations.
Results: Experimental results show that our method significantly outperforms state-of-the-art debiasing methods, despite a considerable reduction in the number of parameters.

Neural networks are often biased to spuriously correlated features that provide misleading statistical evidence that does not generalize. This raises an interesting question: "Does an optimal unbiased functional subnetwork exist in a severely biased network? If so, how to extract such subnetwork?" While empirical evidence has been accumulated about the existence of such unbiased subnetworks, these observations are mainly based on the guidance of ground-truth unbiased samples. Thus, it is unexplored how to discover the optimal subnetworks with biased training datasets in practice. To address this, here we first present our theoretical insight that alerts potential limitations of existing algorithms in exploring unbiased subnetworks in the presence of strong spurious correlations. We then further elucidate the importance of bias-conflicting samples on structure learning. Motivated by these observations, we propose a Debiased Contrastive Weight Pruning (DCWP) algorithm, which probes unbiased subnetworks without expensive group annotations. Experimental results demonstrate that our approach significantly outperforms state-of-the-art debiasing methods despite its considerable reduction in the number of parameters.

STAR Loss: Reducing Semantic Ambiguity in Facial Landmark Detection
Zhou, Zhenglin and Li, Huaxia and Liu, Hong and Wang, Nanyang and Yu, Gang and Ji, Rongrong



Research problem: Deep learning has substantially advanced facial landmark detection, but the semantic ambiguity problem degrades detection performance.
Motivation: Semantic ambiguity causes inconsistent annotations, hampers model convergence, and reduces prediction accuracy and stability.
Method: Propose a Self-adapTive Ambiguity Reduction (STAR) loss that exploits the properties of semantic ambiguity, measuring the anisotropy of the predicted distribution to address the problem.
Results: Experiments show that STAR loss outperforms existing methods on three benchmarks with negligible computational overhead.

Recently, deep learning-based facial landmark detection has achieved significant improvement. However, the semantic ambiguity problem degrades detection performance. Specifically, the semantic ambiguity causes inconsistent annotation and negatively affects the model's convergence, leading to worse accuracy and unstable predictions. To solve this problem, we propose a Self-adapTive Ambiguity Reduction (STAR) loss by exploiting the properties of semantic ambiguity. We find that semantic ambiguity results in the anisotropic predicted distribution, which inspires us to use predicted distribution to represent semantic ambiguity. Based on this, we design the STAR loss that measures the anisotropism of the predicted distribution. Compared with the standard regression loss, STAR loss is encouraged to be small when the predicted distribution is anisotropic and thus adaptively mitigates the impact of semantic ambiguity. Moreover, we propose two kinds of eigenvalue restriction methods that could avoid both distribution's abnormal change and the model's premature convergence. Finally, the comprehensive experiments demonstrate that STAR loss outperforms the state-of-the-art methods on three benchmarks, i.e., COFW, 300W, and WFLW, with negligible computation overhead. Code is at https://github.com/ZhenglinZhou/STAR
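The anisotropy of a predicted landmark distribution, which STAR loss is designed to penalise, can be illustrated with a covariance-eigenvalue proxy. The `anisotropy` helper below is an assumed sketch for intuition, not the paper's exact loss.

```python
import numpy as np

def anisotropy(points, eps=1e-12):
    """Anisotropy of a 2D point cloud via covariance eigenvalues.

    Returns a value in [0, 1]: ~0 for an isotropic (circular) spread,
    approaching 1 as the points concentrate along a single direction.
    """
    cov = np.cov(np.asarray(points, dtype=float).T)
    evals = np.linalg.eigvalsh(cov)  # ascending: [lambda_min, lambda_max]
    return 1.0 - evals[0] / (evals[-1] + eps)
```

An isotropic spread (points on a circle) scores near 0, while points along a line score near 1, mirroring the distinction the abstract draws between unambiguous and semantically ambiguous landmarks.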

A Meta-Learning Approach to Predicting Performance and Data Requirements
Jain, Achin and Swaminathan, Gurumurthy and Favaro, Paolo and Yang, Hao and Ravichandran, Avinash and Harutyunyan, Hrayr and Achille, Alessandro and Dabeer, Onkar and Schiele, Bernt and Swaminathan, Ashwin and Soatto, Stefano



Research problem: How to accurately estimate the number of samples a model needs to reach a target performance.
Motivation: The power law, the de facto principle for estimating model performance, leads to large errors when extrapolating from small datasets (e.g., 5 samples per class).
Method: Propose a novel piecewise power law (PPL) that handles the two data regimes differently, and estimate the PPL's parameters with a random forest regressor trained via meta learning, which generalizes across classification/detection tasks, network architectures, and initializations.
Results: Compared with the power law, the PPL improves performance-estimation accuracy by 37% on average across 16 classification datasets and by 33% across 10 detection datasets; extending the PPL with a confidence bound and using it to limit the prediction horizon reduces data over-estimation by 76% on classification and 91% on detection datasets.

We propose an approach to estimate the number of samples required for a model to reach a target performance. We find that the power law, the de facto principle to estimate model performance, leads to large error when using a small dataset (e.g., 5 samples per class) for extrapolation. This is because the log-performance error against the log-dataset size follows a nonlinear progression in the few-shot regime followed by a linear progression in the high-shot regime. We introduce a novel piecewise power law (PPL) that handles the two data regimes differently. To estimate the parameters of the PPL, we introduce a random forest regressor trained via meta learning that generalizes across classification/detection tasks, ResNet/ViT based architectures, and random/pre-trained initializations. The PPL improves the performance estimation on average by 37% across 16 classification datasets and 33% across 10 detection datasets, compared to the power law. We further extend the PPL to provide a confidence bound and use it to limit the prediction horizon that reduces over-estimation of data by 76% on classification and 91% on detection datasets.
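The log-log-linear fit underlying power-law extrapolation, and a two-regime piecewise variant, can be sketched as follows. The exact PPL functional form and the meta-learned parameter estimation in the paper differ from this minimal illustration; `fit_power_law` and `piecewise_power_law` are assumed names.

```python
import numpy as np

def fit_power_law(n, err):
    """Fit err ~= a * n**b by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(n), np.log(err), 1)
    return np.exp(intercept), slope  # (a, b)

def piecewise_power_law(n, a1, b1, a2, b2, n_break):
    """Illustrative two-regime power law: one exponent per data regime,
    switching at the breakpoint n_break (few-shot vs high-shot)."""
    n = np.asarray(n, dtype=float)
    return np.where(n < n_break, a1 * n**b1, a2 * n**b2)
```

On synthetic data that exactly follows a single power law, the fit recovers the generating parameters; the piecewise form lets the few-shot and high-shot regimes carry different exponents, which is the nonlinearity the abstract identifies.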

Semi-Supervised Domain Adaptation With Source Label Adaptation
Yu, Yu-Chu and Lin, Hsuan-Tien



Research problem: Existing semi-supervised domain adaptation (SSDA) methods usually align target data to labeled source data via feature-space mapping and pseudo-label assignment, but this can align target data to the wrong classes and degrade classification performance.
Motivation: This paper proposes a new source-adaptation paradigm that views the source data as a noisily labeled version of the ideal target data, and designs a robust cleaner component, from the target perspective, to dynamically clean the label noise.
Method: Applying this cleaner component to the source data makes it better match the target data. Since this paradigm differs fundamentally from the core ideas of existing SSDA methods, it can easily be combined with them to improve their performance.
Results: Empirical tests on two state-of-the-art SSDA methods show that the approach effectively cleans the noise in the source labels and outperforms those methods on benchmark datasets.

Semi-Supervised Domain Adaptation (SSDA) involves learning to classify unseen target data with a few labeled and lots of unlabeled target data, along with many labeled source data from a related domain. Current SSDA approaches usually aim at aligning the target data to the labeled source data with feature space mapping and pseudo-label assignments. Nevertheless, such a source-oriented model can sometimes align the target data to source data of the wrong classes, degrading the classification performance. This paper presents a novel source-adaptive paradigm that adapts the source data to match the target data. Our key idea is to view the source data as a noisily-labeled version of the ideal target data. Then, we propose an SSDA model that cleans up the label noise dynamically with the help of a robust cleaner component designed from the target perspective. Since the paradigm is very different from the core ideas behind existing SSDA approaches, our proposed model can be easily coupled with them to improve their performance. Empirical results on two state-of-the-art SSDA approaches demonstrate that the proposed model effectively cleans up the noise within the source labels and exhibits superior performance over those approaches across benchmark datasets. Our code is available at https://github.com/chu0802/SLA.

Conflict-Based Cross-View Consistency for Semi-Supervised Semantic Segmentation
Wang, Zicheng and Zhao, Zhen and Xing, Xiaoxia and Xu, Dong and Kong, Xiangyu and Zhou, Luping



Research problem: Semi-supervised semantic segmentation (SSS) reduces the need for large-scale fully annotated training data, but existing methods often suffer from confirmation bias in the pseudo-labeling process.
Motivation: To reduce reliance on large amounts of fully annotated training data and to address the confirmation bias of pseudo-labeling, this paper proposes a conflict-based cross-view consistency (CCVC) method built on a co-training framework.
Method: First, a new cross-view consistency (CVC) strategy introduces a feature discrepancy loss to encourage the two sub-networks to learn distinct features from the same input, while expecting these distinct features to yield consistent prediction scores for that input. Second, a conflict-based pseudo-labeling (CPL) method ensures the model learns more useful information from conflicting predictions, yielding a stable training process.
Results: The new CCVC method is validated on SSS benchmark datasets, where it achieves new state-of-the-art performance.

Semi-supervised semantic segmentation (SSS) has recently gained increasing research interest as it can reduce the requirement for large-scale fully-annotated training data. The current methods often suffer from the confirmation bias from the pseudo-labelling process, which can be alleviated by the co-training framework. The current co-training-based SSS methods rely on hand-crafted perturbations to prevent the different sub-nets from collapsing into each other, but these artificial perturbations cannot lead to the optimal solution. In this work, we propose a new conflict-based cross-view consistency (CCVC) method based on a two-branch co-training framework which aims at enforcing the two sub-nets to learn informative features from irrelevant views. In particular, we first propose a new cross-view consistency (CVC) strategy that encourages the two sub-nets to learn distinct features from the same input by introducing a feature discrepancy loss, while these distinct features are expected to generate consistent prediction scores of the input. The CVC strategy helps to prevent the two sub-nets from stepping into the collapse. In addition, we further propose a conflict-based pseudo-labelling (CPL) method to guarantee the model will learn more useful information from conflicting predictions, which will lead to a stable training process. We validate our new CCVC approach on the SSS benchmark datasets where our method achieves new state-of-the-art performance. Our code is available at https://github.com/xiaoyao3302/CCVC.

Boosting Transductive Few-Shot Fine-Tuning With Margin-Based Uncertainty Weighting and Probability Regularization
Tao, Ran and Chen, Hao and Savvides, Marios



Research problem: How to fine-tune models efficiently with little data, especially under class imbalance and out-of-distribution data.
Motivation: Few-shot learning (FSL) has developed rapidly in recent years, potentially eliminating the need for large-scale data acquisition. We observe that few-shot fine-tuned methods learn an imbalanced class marginal distribution, which further motivates our approach.
Method: We propose Transductive Fine-tuning with Margin-based uncertainty weighting and Probability regularization (TF-MP). It first weights the test samples with margin-based uncertainty scores and then regularizes each test sample's categorical probability.
Results: TF-MP achieves state-of-the-art performance on the in-/out-of-distribution evaluations of Meta-Dataset and surpasses previous transductive methods by a large margin.

Few-Shot Learning (FSL) has been rapidly developed in recent years, potentially eliminating the requirement for significant data acquisition. Few-shot fine-tuning has been demonstrated to be practically efficient and helpful, especially for out-of-distribution data. In this work, we first observe that the few-shot fine-tuned methods are learned with the imbalanced class marginal distribution. This observation further motivates us to propose the Transductive Fine-tuning with Margin-based uncertainty weighting and Probability regularization (TF-MP), which learns a more balanced class marginal distribution. We first conduct sample weighting on the testing data with margin-based uncertainty scores and further regularize each test sample's categorical probability. TF-MP achieves state-of-the-art performance on in- / out-of-distribution evaluations of Meta-Dataset and surpasses previous transductive methods by a large margin.

PRISE: Demystifying Deep Lucas-Kanade With Strongly Star-Convex Constraints for Multimodel Image Alignment
Zhang, Yiqing and Huang, Xinming and Zhang, Ziming



Research problem: The classic Lucas-Kanade (LK) method for image alignment is prone to poor local optima when image pairs have large distortions.
Motivation: Propose a novel Deep Star-Convexified Lucas-Kanade (PRISE) method that introduces strongly star-convex constraints into the optimization problem to address multimodel image alignment.
Method: A neural network is used to approximately learn a star-convex loss landscape around the ground truth, facilitating the convergence of the LK method to the ground truth through the high-dimensional space defined by the network. For training, contrastive (hinge) losses arising from the definition of strong star-convexity are appended to the original loss.
Results: Evaluations on benchmark datasets such as MSCOCO, GoogleEarth, and GoogleMap show that PRISE achieves excellent performance, especially for small pixel errors.

The Lucas-Kanade (LK) method is a classic iterative homography estimation algorithm for image alignment, but often suffers from poor local optimality especially when image pairs have large distortions. To address this challenge, in this paper we propose a novel Deep Star-Convexified Lucas-Kanade (PRISE) method for multimodel image alignment by introducing strongly star-convex constraints into the optimization problem. Our basic idea is to enforce the neural network to approximately learn a star-convex loss landscape around the ground truth given any data to facilitate the convergence of the LK method to the ground truth through the high dimensional space defined by the network. This leads to a minimax learning problem, with contrastive (hinge) losses due to the definition of strong star-convexity that are appended to the original loss for training. We also provide an efficient sampling-based algorithm to reduce the training cost, as well as some analysis on the quality of the solutions from PRISE. We further evaluate our approach on benchmark datasets such as MSCOCO, GoogleEarth, and GoogleMap, and demonstrate state-of-the-art results, especially for small pixel errors. Demo code is attached.

BiasAdv: Bias-Adversarial Augmentation for Model Debiasing
Lim, Jongin and Kim, Youngdong and Kim, Byungjai and Ahn, Chanho and Shin, Jinwoo and Yang, Eunho and Han, Seungju



Research problem: Neural networks are vulnerable to spurious correlations inherent in datasets and fail to generalize in an unbiased manner.
Motivation: A key challenge in resolving this issue is the severe lack of bias-conflicting training data (i.e., samples without spurious correlations).
Method: This paper proposes a novel data augmentation method, Bias-Adversarial augmentation (BiasAdv), which supplements bias-conflicting samples with adversarial images.
Results: Experiments show that BiasAdv generates surprisingly useful synthetic bias-conflicting samples, allowing the debiased model to learn generalizable representations. Moreover, BiasAdv requires no bias annotations or prior knowledge of the bias type, so it can be broadly applied to existing debiasing methods to improve their performance. Extensive experiments on four popular benchmark datasets demonstrate the superiority of BiasAdv, which achieves state-of-the-art performance across various bias domains.

Neural networks are often prone to bias toward spurious correlations inherent in a dataset, thus failing to generalize unbiased test criteria. A key challenge to resolving the issue is the significant lack of bias-conflicting training data (i.e., samples without spurious correlations). In this paper, we propose a novel data augmentation approach termed Bias-Adversarial augmentation (BiasAdv) that supplements bias-conflicting samples with adversarial images. Our key idea is that an adversarial attack on a biased model that makes decisions based on spurious correlations may generate synthetic bias-conflicting samples, which can then be used as augmented training data for learning a debiased model. Specifically, we formulate an optimization problem for generating adversarial images that attack the predictions of an auxiliary biased model without ruining the predictions of the desired debiased model. Despite its simplicity, we find that BiasAdv can generate surprisingly useful synthetic bias-conflicting samples, allowing the debiased model to learn generalizable representations. Furthermore, BiasAdv does not require any bias annotations or prior knowledge of the bias type, which enables its broad applicability to existing debiasing methods to improve their performances. Our extensive experimental results demonstrate the superiority of BiasAdv, achieving state-of-the-art performance on four popular benchmark datasets across various bias domains.

Learning To Retain While Acquiring: Combating Distribution-Shift in Adversarial Data-Free Knowledge Distillation
Patel, Gaurav and Mopuri, KondaReddy and Qiu, Qiang



Research problem: How to transfer knowledge from a teacher neural network to a student neural network in the absence of training data.
Motivation: In the adversarial DFKD framework, the student network's accuracy suffers from the non-stationary distribution of the pseudo-samples under multiple generator updates.
Method: Propose a meta-learning-inspired framework that treats knowledge acquisition (learning from newly generated samples) and knowledge retention (retaining knowledge of previously encountered examples) as meta-train and meta-test, respectively.
Results: Extensive evaluation and comparison with prior arts on multiple datasets demonstrate the effectiveness of the method.

Data-free Knowledge Distillation (DFKD) has gained popularity recently, with the fundamental idea of carrying out knowledge transfer from a Teacher neural network to a Student neural network in the absence of training data. However, in the Adversarial DFKD framework, the student network's accuracy suffers due to the non-stationary distribution of the pseudo-samples under multiple generator updates. To this end, at every generator update, we aim to maintain the student's performance on previously encountered examples while acquiring knowledge from samples of the current distribution. Thus, we propose a meta-learning inspired framework by treating the task of Knowledge-Acquisition (learning from newly generated samples) and Knowledge-Retention (retaining knowledge on previously met samples) as meta-train and meta-test, respectively. Hence, we dub our method as Learning to Retain while Acquiring. Moreover, we identify an implicit aligning factor between the Knowledge-Retention and Knowledge-Acquisition tasks indicating that the proposed student update strategy enforces a common gradient direction for both tasks, alleviating interference between the two objectives. Finally, we support our hypothesis by exhibiting extensive evaluation and comparison of our method with prior arts on multiple datasets.

Why Is the Winner the Best?
Eisenmann, Matthias and Reinke, Annika and Weru, Vivienn and Tizabi, MinuD. and Isensee, Fabian and Adler, TimJ. and Ali, Sharib and Andrearczyk, Vincent and Aubreville, Marc and Baid, Ujjwal and Bakas, Spyridon and Balu, Niranjan and Bano, Sophia and Bernal, Jorge and Bodenstedt, Sebastian and Casella, Alessandro and Cheplygina, Veronika and Daum, Marie and deBruijne, Marleen and Depeursinge, Adrien and Dorent, Reuben and Egger, Jan and Ellis, DavidG. and Engelhardt, Sandy and Ganz, Melanie and Ghatwary, Noha and Girard, Gabriel and Godau, Patrick and Gupta, Anubha and Hansen, Lasse and Harada, Kanako and Heinrich, MattiasP. and Heller, Nicholas and Hering, Alessa and Huaulm\'e, Arnaud and Jannin, Pierre and Kavur, AliEmre and Kodym, Old\v{r



Research problem: International benchmarking competitions are fundamental to the comparative performance assessment of image analysis methods, yet little attention has been paid to what can be learned from them.
Motivation: To fill this gap, this paper conducts a multi-center study of all 80 competitions held in the scope of IEEE ISBI 2021 and MICCAI 2021, investigating why winning solutions outperform competing methods.
Method: Statistical analyses of comprehensive descriptions of the submitted algorithms, linked to their ranks, reveal common characteristics of winning solutions.
Results: Winning solutions typically use multi-task learning (63%) and/or multi-stage pipelines (61%), and focus on augmentation (100%), image preprocessing (97%), data curation (79%), and postprocessing (66%). The typical winning-team lead is a computer scientist with a doctoral degree, five years of experience in biomedical image analysis, and four years of experience in deep learning. Two core development strategies stood out for highly ranked teams: reflecting the metrics in the method design, and focusing on analyzing and handling failure cases. According to the organizers, 43% of the winning algorithms exceeded the state of the art, but only 11% completely solved the respective domain problem. These insights can help researchers (1) improve algorithm-development strategies when approaching new problems, and (2) focus on the open research questions revealed by this work.

International benchmarking competitions have become fundamental for the comparative performance assessment of image analysis methods. However, little attention has been given to investigating what can be learnt from these competitions. Do they really generate scientific progress? What are common and successful participation strategies? What makes a solution superior to a competing method? To address this gap in the literature, we performed a multi-center study with all 80 competitions that were conducted in the scope of IEEE ISBI 2021 and MICCAI 2021. Statistical analyses performed based on comprehensive descriptions of the submitted algorithms linked to their rank as well as the underlying participation strategies revealed common characteristics of winning solutions. These typically include the use of multi-task learning (63%) and/or multi-stage pipelines (61%), and a focus on augmentation (100%), image preprocessing (97%), data curation (79%), and postprocessing (66%). The "typical" lead of a winning team is a computer scientist with a doctoral degree, five years of experience in biomedical image analysis, and four years of experience in deep learning. Two core general development strategies stood out for highly-ranked teams: the reflection of the metrics in the method design and the focus on analyzing and handling failure cases. According to the organizers, 43% of the winning algorithms exceeded the state of the art but only 11% completely solved the respective domain problem. The insights of our study could help researchers (1) improve algorithm development strategies when approaching new problems, and (2) focus on open research questions revealed by this work.

Good Is Bad: Causality Inspired Cloth-Debiasing for Cloth-Changing Person Re-Identification
Yang, Zhengwei and Lin, Meng and Zhong, Xian and Wu, Yu and Wang, Zheng



Research problem: How to eliminate the negative impact of clothing on identity in conventional person re-identification (ReID) to achieve robust cloth-changing person re-identification (CC-ReID).
Motivation: Eliminating the negative impact of clothing on identity remains challenging due to the lack of theory and the difficulty of isolating its exact effects.
Method: A causality-based Auto-Intervention Model (AIM) is proposed: it analyzes the effect of clothing on model inference and adopts a dual-branch model to simulate causal intervention, progressively and automatically eliminating clothing bias.
Results: Extensive experiments on two standard CC-ReID datasets show that the proposed AIM outperforms other state-of-the-art methods.

Entangled representation of clothing and identity (ID)-intrinsic clues are potentially concomitant in conventional person Re-IDentification (ReID). Nevertheless, eliminating the negative impact of clothing on ID remains challenging due to the lack of theory and the difficulty of isolating the exact implications. In this paper, a causality-based Auto-Intervention Model, referred to as AIM, is first proposed to mitigate clothing bias for robust cloth-changing person ReID (CC-ReID). Specifically, we analyze the effect of clothing on the model inference and adopt a dual-branch model to simulate causal intervention. Progressively, clothing bias is eliminated automatically with model training. AIM is encouraged to learn more discriminative ID clues that are free from clothing bias. Extensive experiments on two standard CC-ReID datasets demonstrate the superiority of the proposed AIM over other state-of-the-art methods.

Use Your Head: Improving Long-Tail Video Recognition
Perrett, Toby and Sinha, Saptarshi and Burghardt, Tilo and Mirmehdi, Majid and Damen, Dima



Research problem: Current video benchmarks fall short on multiple long-tailed properties; most critically, they lack few-shot classes in their tails.
Motivation: To address this, we propose new video benchmarks and develop a method called Long-Tail Mixed Reconstruction (LMR).
Method: Subsets are sampled from the SSv2 and VideoLT datasets to better assess long-tail recognition. LMR then reduces overfitting to instances of few-shot classes by reconstructing them as weighted combinations of head-class samples, and employs label mixing to learn robust decision boundaries.
Results: Experiments show that LMR achieves state-of-the-art average class accuracy on EPIC-KITCHENS and the proposed SSv2-LT and VideoLT-LT.

This paper presents an investigation into long-tail video recognition. We demonstrate that, unlike naturally-collected video datasets and existing long-tail image benchmarks, current video benchmarks fall short on multiple long-tailed properties. Most critically, they lack few-shot classes in their tails. In response, we propose new video benchmarks that better assess long-tail recognition, by sampling subsets from two datasets: SSv2 and VideoLT. We then propose a method, Long-Tail Mixed Reconstruction (LMR), which reduces overfitting to instances from few-shot classes by reconstructing them as weighted combinations of samples from head classes. LMR then employs label mixing to learn robust decision boundaries. It achieves state-of-the-art average class accuracy on EPIC-KITCHENS and the proposed SSv2-LT and VideoLT-LT. Benchmarks and code at: github.com/tobyperrett/lmr

Beyond mAP: Towards Better Evaluation of Instance Segmentation
Jena, Rohit and Zhornyak, Lukas and Doiphode, Nehal and Chaudhari, Pratik and Buch, Vivek and Gee, James and Shi, Jianbo



Research problem: Current accuracy evaluation for instance segmentation is flawed: it cannot distinguish instances that are correctly localized but incorrectly classified.
Motivation: To address this, we propose new metrics and design a module based on a pixel occupancy matching scheme to remove duplicate predictions.
Method: We review alternative metrics in the literature and propose two new measures that explicitly quantify the amount of spatial and categorical duplicate predictions. We also propose a Semantic Sorting and NMS module to remove these duplicates.
Results: Experiments show that modern segmentation networks achieve significant gains in AP but also contain a considerable number of duplicate predictions. Our Semantic Sorting and NMS can be added as a plug-and-play module to mitigate over-prediction while preserving AP.

Correctness of instance segmentation constitutes counting the number of objects, correctly localizing all predictions and classifying each localized prediction. Average Precision is the de-facto metric used to measure all these constituents of segmentation. However, this metric does not penalize duplicate predictions in the high-recall range, and cannot distinguish instances that are localized correctly but categorized incorrectly. This weakness has inadvertently led to network designs that achieve significant gains in AP but also introduce a large number of false positives. We therefore cannot rely on AP to choose a model that provides an optimal tradeoff between false positives and high recall. To resolve this dilemma, we review alternative metrics in the literature and propose two new measures to explicitly measure the amount of both spatial and categorical duplicate predictions. We also propose a Semantic Sorting and NMS module to remove these duplicates based on a pixel occupancy matching scheme. Experiments show that modern segmentation networks have significant gains in AP, but also contain a considerable amount of duplicates. Our Semantic Sorting and NMS can be added as a plug-and-play module to mitigate hedged predictions and preserve AP.
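Duplicate removal at the mask level can be illustrated with a greedy NMS over mask overlap. This is a simplified stand-in for the paper's Semantic Sorting and NMS with pixel occupancy matching; `duplicate_removal` and `mask_iou` are assumed helper names.

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def duplicate_removal(masks, scores, iou_thresh=0.5):
    """Greedy mask-level NMS: keep the highest-scoring prediction and
    drop any later prediction whose mask overlaps a kept mask above
    the threshold. Returns indices of kept predictions."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(mask_iou(masks[i], masks[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```

With two identical masks and one disjoint mask, the lower-scoring duplicate is removed while both distinct instances survive, which is exactly the hedged-prediction behavior the abstract targets.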

Diversity-Measurable Anomaly Detection
Liu, Wenrui and Chang, Hong and Ma, Bingpeng and Shan, Shiguang and Chen, Xilin



Research problem: This paper addresses two issues in reconstruction-based anomaly detection models: insufficient diversity of the reconstructed normal patterns and the undesired transmission of abnormal information.
Motivation: Current reconstruction-based anomaly detection models struggle with normal-pattern diversity and abnormal-information leakage, which limits their effectiveness.
Method: A Diversity-Measurable Anomaly Detection (DMAD) framework is proposed, with a Pyramid Deformation Module (PDM) that models diverse normal patterns and measures anomaly severity by estimating multi-scale deformation fields from the reconstructed reference to the original input.
Results: Experimental results show that the method performs well on both surveillance videos and industrial images, and remains effective on contaminated data and anomaly-like normal samples.

Reconstruction-based anomaly detection models achieve their purpose by suppressing the generalization ability for anomaly. However, diverse normal patterns are consequently not well reconstructed as well. Although some efforts have been made to alleviate this problem by modeling sample diversity, they suffer from shortcut learning due to undesired transmission of abnormal information. In this paper, to better solve the tradeoff problem, we propose Diversity-Measurable Anomaly Detection (DMAD) framework to enhance reconstruction diversity while avoid the undesired generalization on anomalies. To this end, we design Pyramid Deformation Module (PDM), which models diverse normals and measures the severity of anomaly by estimating multi-scale deformation fields from reconstructed reference to original input. Integrated with an information compression module, PDM essentially decouples deformation from prototypical embedding and makes the final anomaly score more reliable. Experimental results on both surveillance videos and industrial images demonstrate the effectiveness of our method. In addition, DMAD works equally well on contaminated data and anomaly-like normal samples.

FreeNeRF: Improving Few-Shot Neural Rendering With Free Frequency Regularization
Yang, Jiawei and Pavone, Marco and Wang, Yue



Research problem: Novel view synthesis from sparse inputs is a challenge for neural radiance fields (NeRF).
Motivation: Recent efforts alleviate this problem by introducing external supervision, such as pre-trained models and extra depth signals, or by using non-trivial patch-based rendering.
Method: We propose Frequency regularized NeRF (FreeNeRF), a surprisingly simple baseline that outperforms previous methods with minimal modifications to plain NeRF. We analyze the key challenges of few-shot neural rendering and find that frequency plays an important role in NeRF training. Based on this analysis, we propose two regularization terms: one regularizes the frequency range of NeRF's inputs, and the other penalizes near-camera density fields. Both techniques are "free lunches" with no additional computational cost.
Results: With even a one-line code change, the original NeRF matches the performance of other, more complicated methods in the few-shot setting. FreeNeRF achieves state-of-the-art performance across diverse datasets, including Blender, DTU, and LLFF. We hope this simple baseline prompts a rethinking of the fundamental role of frequency in NeRF training, in the low-data regime and beyond.

Novel view synthesis with sparse inputs is a challenging problem for neural radiance fields (NeRF). Recent efforts alleviate this challenge by introducing external supervision, such as pre-trained models and extra depth signals, or by using non-trivial patch-based rendering. In this paper, we present Frequency regularized NeRF (FreeNeRF), a surprisingly simple baseline that outperforms previous methods with minimal modifications to plain NeRF. We analyze the key challenges in few-shot neural rendering and find that frequency plays an important role in NeRF's training. Based on this analysis, we propose two regularization terms: one to regularize the frequency range of NeRF's inputs, and the other to penalize the near-camera density fields. Both techniques are "free lunches" that come at no additional computational cost. We demonstrate that even with just one line of code change, the original NeRF can achieve similar performance to other complicated methods in the few-shot setting. FreeNeRF achieves state-of-the-art performance across diverse datasets, including Blender, DTU, and LLFF. We hope that this simple baseline will motivate a rethinking of the fundamental role of frequency in NeRF's training, under both the low-data regime and beyond. This project is released at https://jiawei-yang.github.io/FreeNeRF/.
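The first regularization term, restricting the frequency range of NeRF's positional-encoding inputs during training, can be sketched as a per-band mask that reveals low frequencies first. The linear schedule below is an assumed simplification, not necessarily FreeNeRF's exact rule.

```python
import numpy as np

def frequency_mask(num_bands, step, total_steps):
    """Per-band mask for NeRF's positional encoding.

    Returns values in [0, 1] per frequency band: low-frequency bands
    are unmasked first, and higher bands are linearly revealed as
    training progresses from step 0 to total_steps.
    """
    t = step / total_steps
    return np.clip(t * num_bands - np.arange(num_bands), 0.0, 1.0)
```

Multiplying the encoded inputs by this mask starts training with a coarse, low-frequency scene representation and gradually admits high-frequency detail, which is the behavior the abstract attributes to the frequency regularizer.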

VNE: An Effective Method for Improving Deep Representation by Manipulating Eigenvalue Distribution
Kim, Jaeill and Kang, Suhyun and Hwang, Duhun and Shin, Jungwook and Rhee, Wonjong



Research problem: How to improve the quality of deep representations via properties such as decorrelation, whitening, disentanglement, rank, isotropy, and mutual information.
Motivation: Manipulating such representation properties is challenging in terms of implementational effectiveness and general applicability.
Method: Propose regularizing the von Neumann entropy (VNE) of the representation. First, VNE is shown to be superior at effectively manipulating the eigenvalues of the representation autocorrelation matrix. Then, its broad applicability in improving state-of-the-art or popular benchmark algorithms is demonstrated on domain generalization, meta-learning, self-supervised learning, and generative models. Theoretical connections to the rank, disentanglement, and isotropy of representations are also established. Finally, the dimension control of VNE and its relationship to Shannon entropy are discussed.
Results: Experimental results show that VNE brings significant improvements across a variety of tasks.

Since the introduction of deep learning, a wide scope of representation properties, such as decorrelation, whitening, disentanglement, rank, isotropy, and mutual information, have been studied to improve the quality of representation. However, manipulating such properties can be challenging in terms of implementational effectiveness and general applicability. To address these limitations, we propose to regularize von Neumann entropy (VNE) of representation. First, we demonstrate that the mathematical formulation of VNE is superior in effectively manipulating the eigenvalues of the representation autocorrelation matrix. Then, we demonstrate that it is widely applicable in improving state-of-the-art algorithms or popular benchmark algorithms by investigating domain-generalization, meta-learning, self-supervised learning, and generative models. In addition, we formally establish theoretical connections with rank, disentanglement, and isotropy of representation. Finally, we provide discussions on the dimension control of VNE and the relationship with Shannon entropy. Code is available at: https://github.com/jaeill/CVPR23-VNE.
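The quantity being regularized can be sketched directly: the von Neumann entropy of the representation autocorrelation matrix is the Shannon entropy of its normalized eigenvalues. The NumPy helper below is a minimal illustration of that definition (interface assumed), not the paper's implementation.

```python
import numpy as np

def von_neumann_entropy(z, eps=1e-12):
    """VNE of the autocorrelation matrix of a batch of representations.

    z: (N, d) array of representations. The eigenvalues of
    C = z^T z / N are normalized into a distribution p, and the
    entropy H = -sum_i p_i log p_i is returned.
    """
    c = z.T @ z / z.shape[0]                      # autocorrelation (d, d)
    evals = np.clip(np.linalg.eigvalsh(c), eps, None)
    p = evals / evals.sum()                       # eigenvalues as a distribution
    return float(-(p * np.log(p)).sum())
```

An isotropic batch (all eigenvalues equal) attains the maximum entropy log d, while a collapsed batch (rank one) scores near zero, which is why maximizing VNE counteracts dimensional collapse.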

Divide and Adapt: Active Domain Adaptation via Customized Learning
Huang, Duojun and Li, Jichang and Chen, Weikai and Huang, Junshi and Chai, Zhenhua and Li, Guanbin



Research problem: Conventional active learning methods do not account for domain shift and thus fail to identify the truly valuable samples in the context of domain adaptation.
Motivation: Combine active learning with domain adaptation techniques to improve model adaptation performance.
Method: A new ADA framework named Divide-and-Adapt (DiaNA) is proposed, which partitions target instances into four categories and adopts a novel data-subdivision protocol based on uncertainty and domainness to accurately identify the most beneficial samples.
Results: Experiments show that DiaNA can handle data with large domain gaps and generalizes to different domain adaptation settings, such as unsupervised domain adaptation (UDA), semi-supervised domain adaptation (SSDA), and source-free domain adaptation (SFDA).

Active domain adaptation (ADA) aims to improve the model adaptation performance by incorporating the active learning (AL) techniques to label a maximally-informative subset of target samples. Conventional AL methods do not consider the existence of domain shift, and hence, fail to identify the truly valuable samples in the context of domain adaptation. To accommodate active learning and domain adaption, the two naturally different tasks, in a collaborative framework, we advocate that a customized learning strategy for the target data is the key to the success of ADA solutions. We present Divide-and-Adapt (DiaNA), a new ADA framework that partitions the target instances into four categories with stratified transferable properties. With a novel data subdivision protocol based on uncertainty and domainness, DiaNA can accurately recognize the most gainful samples. While sending the informative instances for annotation, DiaNA employs tailored learning strategies for the remaining categories. Furthermore, we propose an informativeness score that unifies the data partitioning criteria. This enables the use of a Gaussian mixture model (GMM) to automatically sample unlabeled data into the proposed four categories. Thanks to the "divide-and-adapt" spirit, DiaNA can handle data with large variations of domain gap. In addition, we show that DiaNA can generalize to different domain adaptation settings, such as unsupervised domain adaptation (UDA), semi-supervised domain adaptation (SSDA), source-free domain adaptation (SFDA), etc.

Towards Bridging the Performance Gaps of Joint Energy-Based Models
Yang, Xiulong and Su, Qing and Ji, Shihao



Research problem: Can we train a hybrid discriminative-generative model with a single network?
Motivation: Despite recent advances, the Joint Energy-based Model (JEM) still exhibits two performance gaps, in classification accuracy and image generation quality.
Method: We introduce several training techniques to close JEM's accuracy and generation-quality gaps: 1) incorporating the recently proposed sharpness-aware minimization (SAM) framework into JEM training, which promotes the smoothness of the energy landscape and the generalization of JEM; 2) excluding data augmentation from JEM's maximum likelihood estimation pipeline, mitigating its negative impact on image generation quality.
Results: Extensive experiments on multiple datasets show that our SADA-JEM achieves state-of-the-art performance and outperforms JEM by a notable margin in image classification, image generation, calibration, out-of-distribution detection, and adversarial robustness.

Can we train a hybrid discriminative-generative model with a single network? This question has recently been answered in the affirmative, introducing the field of Joint Energy-based Model (JEM), which achieves high classification accuracy and image generation quality simultaneously. Despite recent advances, there remain two performance gaps: the accuracy gap to the standard softmax classifier, and the generation quality gap to state-of-the-art generative models. In this paper, we introduce a variety of training techniques to bridge the accuracy gap and the generation quality gap of JEM. 1) We incorporate a recently proposed sharpness-aware minimization (SAM) framework to train JEM, which promotes the energy landscape smoothness and the generalization of JEM. 2) We exclude data augmentation from the maximum likelihood estimate pipeline of JEM, and mitigate the negative impact of data augmentation to image generation quality. Extensive experiments on multiple datasets demonstrate our SADA-JEM achieves state-of-the-art performances and outperforms JEM in image classification, image generation, calibration, out-of-distribution detection and adversarial robustness by a notable margin. Our code is available at https://github.com/sndnyang/SADAJEM.
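The SAM component can be illustrated in isolation: perturb the weights toward the approximate worst case within a small L2 ball, then descend using the gradient computed at the perturbed point. This is a minimal sketch with assumed names (`sam_step`), not the authors' JEM training code.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware minimization (SAM) update.

    grad_fn(w) returns the loss gradient at w. SAM first ascends to
    the approximate worst case within an L2 ball of radius rho, then
    applies the gradient computed there to the original weights.
    """
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent direction
    g_adv = grad_fn(w + eps)                      # gradient at perturbed point
    return w - lr * g_adv
```

Because the update uses the gradient from the perturbed neighborhood rather than the current point, it penalizes sharp minima, which is the smoothness property the abstract credits for the improved energy landscape.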

Both Style and Distortion Matter: Dual-Path Unsupervised Domain Adaptation for Panoramic Semantic Segmentation
Zheng, Xu and Zhu, Jinjing and Liu, Yexin and Cao, Zidong and Fu, Chong and Wang, Lin



Research question: The performance of panoramic image semantic segmentation is hampered by equirectangular projection (ERP) distortion and a lack of pixel-wise annotations.
Motivation: Existing methods treat ERP and pinhole images equally and transfer knowledge from pinhole to ERP images via unsupervised domain adaptation (UDA), but they fail to handle the domain gaps caused by 1) the inherent differences between camera sensors and captured scenes, and 2) the distinct image formats (e.g., ERP and pinhole images).
Method: This paper proposes DPPASS, a novel and flexible dual-path UDA framework that takes ERP and tangent projection (TP) images as inputs. To reduce the domain gaps, we propose cross-projection and intra-projection training; the cross-projection training includes tangent-wise feature contrastive training and prediction consistency training.
Results: Experimental results show that our DPPASS achieves a +1.06% mIoU improvement over state-of-the-art methods on two benchmarks.

The ability of scene understanding has sparked active research for panoramic image semantic segmentation. However, the performance is hampered by distortion of the equirectangular projection (ERP) and a lack of pixel-wise annotations. For this reason, some works treat the ERP and pinhole images equally and transfer knowledge from the pinhole to ERP images via unsupervised domain adaptation (UDA). However, they fail to handle the domain gaps caused by: 1) the inherent differences between camera sensors and captured scenes; 2) the distinct image formats (e.g., ERP and pinhole images). In this paper, we propose a novel yet flexible dual-path UDA framework, DPPASS, taking ERP and tangent projection (TP) images as inputs. To reduce the domain gaps, we propose cross-projection and intra-projection training. The cross-projection training includes tangent-wise feature contrastive training and prediction consistency training. That is, the former formulates the features with the same projection locations as positive examples and vice versa, for the models' awareness of distortion, while the latter ensures the consistency of cross-model predictions between the ERP and TP. Moreover, adversarial intra-projection training is proposed to reduce the inherent gap between the features of the pinhole images and those of the ERP and TP images, respectively. Importantly, the TP path can be freely removed after training, leading to no additional inference cost. Extensive experiments on two benchmarks show that our DPPASS achieves a +1.06% mIoU improvement over the state-of-the-art approaches.

Learning Debiased Representations via Conditional Attribute Interpolation
Zhang, Yi-Kai and Wang, Qi-Wei and Zhan, De-Chuan and Ye, Han-Jia



Research question: How to improve the performance of deep neural networks trained on biased datasets and prevent them from making predictions via attributes irrelevant to the target label.
Motivation: When a dataset is biased, i.e., the attributes of most samples are spuriously correlated with the target label, a deep neural network is prone to making predictions via "unintended" attributes, especially when those attributes are easier to learn.
Method: We propose a chi^2-model to learn debiased representations. We first design a chi-shape pattern to match the training dynamics of a deep neural network and find Intermediate Attribute Samples (IASs), i.e., samples near the attribute decision boundaries, which indicate how the value of an attribute changes from one extreme to another. We then rectify the representation with a chi-structured metric learning objective; conditional interpolation among IASs eliminates the negative effect of peripheral attributes and helps retain intra-class compactness.
Results: Experiments show that the chi^2-model learns debiased representations effectively and achieves remarkable improvements on various datasets.

An image is usually described by more than one attribute like "shape" and "color". When a dataset is biased, i.e., most samples have attributes spuriously correlated with the target label, a Deep Neural Network (DNN) is prone to make predictions by the "unintended" attribute, especially if it is easier to learn. To improve the generalization ability when training on such a biased dataset, we propose a chi^2-model to learn debiased representations. First, we design a chi-shape pattern to match the training dynamics of a DNN and find Intermediate Attribute Samples (IASs) --- samples near the attribute decision boundaries, which indicate how the value of an attribute changes from one extreme to another. Then we rectify the representation with a chi-structured metric learning objective. Conditional interpolation among IASs eliminates the negative effect of peripheral attributes and facilitates retaining the intra-class compactness. Experiments show that chi^2-model learns debiased representation effectively and achieves remarkable improvements on various datasets.

Modeling Inter-Class and Intra-Class Constraints in Novel Class Discovery
Li, Wenbin and Fan, Zhichen and Huo, Jing and Gao, Yang



Research question: Existing class discovery methods do not sufficiently exploit the essence of the novel class discovery setting.
Motivation: Build a model that transfers the common knowledge of a class-disjoint labelled dataset to another unlabelled dataset by discovering new classes (clusters) within it.
Method: Model inter-class and intra-class constraints in class discovery based on the symmetric Kullback-Leibler divergence (sKLD). We propose an inter-class sKLD constraint that effectively exploits the disjoint relationship between labelled and unlabelled classes and enforces the separability of different classes in the embedding space, and an intra-class sKLD constraint that explicitly constrains the relationship between a sample and its augmentations while ensuring the stability of the training process.
Results: Extensive experiments on the popular CIFAR10, CIFAR100, and ImageNet benchmarks successfully demonstrate that the method establishes a new state of the art and achieves significant performance improvements, e.g., 3.5%/3.7% clustering accuracy gains over the previous state of the art on the CIFAR100-50 dataset split under the task-aware/task-agnostic evaluation protocols. Code is available at https://github.com/FanZhichen/NCD-IIC.

Novel class discovery (NCD) aims at learning a model that transfers the common knowledge from a class-disjoint labelled dataset to another unlabelled dataset and discovers new classes (clusters) within it. Many methods, as well as elaborate training pipelines and appropriate objectives, have been proposed and considerably boosted performance on NCD tasks. Despite all this, we find that the existing methods do not sufficiently take advantage of the essence of the NCD setting. To this end, in this paper, we propose to model both inter-class and intra-class constraints in NCD based on the symmetric Kullback-Leibler divergence (sKLD). Specifically, we propose an inter-class sKLD constraint to effectively exploit the disjoint relationship between labelled and unlabelled classes, enforcing the separability for different classes in the embedding space. In addition, we present an intra-class sKLD constraint to explicitly constrain the intra-relationship between a sample and its augmentations and ensure the stability of the training process at the same time. We conduct extensive experiments on the popular CIFAR10, CIFAR100 and ImageNet benchmarks and successfully demonstrate that our method can establish a new state of the art and can achieve significant performance improvements, e.g., 3.5%/3.7% clustering accuracy improvements on CIFAR100-50 dataset split under the task-aware/-agnostic evaluation protocol, over previous state-of-the-art methods. Code is available at https://github.com/FanZhichen/NCD-IIC.
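The building block of both constraints is the symmetric KL divergence, which can be sketched directly:

```python
import numpy as np

def skld(p, q, eps=1e-12):
    """Symmetric Kullback-Leibler divergence between two discrete
    distributions: KL(p||q) + KL(q||p). In the paper, p and q are
    predicted class-posterior vectors, e.g. of a sample and its
    augmentation for the intra-class constraint."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```

The inter-class constraint would push this quantity up between labelled and unlabelled classes, while the intra-class constraint pulls it down between a sample and its augmentations.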

Bayesian Posterior Approximation With Stochastic Ensembles
Balabanov, Oleksandr and Mehlig, Bernhard and Linander, Hampus



Research question: How to approximate the Bayesian posterior with stochastic neural networks, combining stochastic methods such as dropout with deep ensembles.
Motivation: Current Bayesian inference methods require substantial computational resources, whereas stochastic neural networks can perform the approximation more efficiently.
Method: Propose stochastic ensembles based on Monte Carlo dropout, DropConnect, and a non-parametric version of dropout, trained with variational inference to approximate the Bayesian posterior.
Results: On a toy problem and CIFAR image classification, stochastic ensembles provide more accurate posterior estimates than other popular baselines.

We introduce ensembles of stochastic neural networks to approximate the Bayesian posterior, combining stochastic methods such as dropout with deep ensembles. The stochastic ensembles are formulated as families of distributions and trained to approximate the Bayesian posterior with variational inference. We implement stochastic ensembles based on Monte Carlo dropout, DropConnect and a novel non-parametric version of dropout and evaluate them on a toy problem and CIFAR image classification. For both tasks, we test the quality of the posteriors directly against Hamiltonian Monte Carlo simulations. Our results show that stochastic ensembles provide more accurate posterior estimates than other popular baselines for Bayesian inference.
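A stochastic ensemble's predictive distribution can be illustrated by pooling Monte Carlo dropout samples across ensemble members. The toy one-layer network and weights below are placeholders; the paper trains the families of distributions with variational inference:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_forward(x, W, p=0.5):
    """One stochastic forward pass of a toy one-layer net with Monte
    Carlo dropout kept active at prediction time."""
    keep = rng.random(W.shape[1]) >= p
    h = np.maximum(x @ W * keep / (1.0 - p), 0.0)
    return float(h.mean())

def stochastic_ensemble_predict(x, Ws, n_passes=50):
    """Pool dropout samples across ensemble members; the mean and
    spread of the pooled samples act as a crude posterior-predictive
    summary."""
    samples = [mc_dropout_forward(x, W) for W in Ws for _ in range(n_passes)]
    return float(np.mean(samples)), float(np.std(samples))
```

Each ensemble member contributes its own dropout distribution, so the pooled samples mix both sources of stochasticity.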

Modality-Agnostic Debiasing for Single Domain Generalization
Qu, Sanqing and Pan, Yingwei and Chen, Guang and Yao, Ting and Jiang, Changjun and Mei, Tao



Research question: Deep neural networks usually generalize poorly to out-of-distribution (OOD) data, especially in the extreme case of single domain generalization (single-DG), which transfers from a single domain to multiple unseen domains.
Motivation: Existing single-DG techniques commonly devise various data-augmentation algorithms and remould multi-source domain generalization methods to learn domain-generalized (semantic) features. However, these methods are typically modality-specific and thus applicable to only a single modality (e.g., image).
Method: We target a versatile Modality-Agnostic Debiasing (MAD) framework for single-DG that enables generalization across different modalities. Technically, MAD introduces a novel two-branch classifier: a biased branch encourages the classifier to identify domain-specific (superficial) features, while a general branch captures domain-generalized features based on the knowledge from the biased branch.
Results: We validate the superiority of MAD in a variety of single-DG scenarios with different modalities, including recognition on 1D texts, 2D images, and 3D point clouds, as well as semantic segmentation on 2D images. Notably, for 3D point cloud recognition and 2D image semantic segmentation, MAD improves accuracy and mIOU by 2.82% and 1.5%, respectively.

Deep neural networks (DNNs) usually fail to generalize well to out-of-distribution (OOD) data, especially in the extreme case of single domain generalization (single-DG), which transfers DNNs from a single domain to multiple unseen domains. Existing single-DG techniques commonly devise various data-augmentation algorithms, and remould the multi-source domain generalization methodology to learn domain-generalized (semantic) features. Nevertheless, these methods are typically modality-specific, thereby being only applicable to one single modality (e.g., image). In contrast, we target a versatile Modality-Agnostic Debiasing (MAD) framework for single-DG that enables generalization for different modalities. Technically, MAD introduces a novel two-branch classifier: a biased branch encourages the classifier to identify the domain-specific (superficial) features, and a general branch captures domain-generalized features based on the knowledge from the biased branch. Our MAD is appealing in that it is pluggable into most single-DG models. We validate the superiority of our MAD in a variety of single-DG scenarios with different modalities, including recognition on 1D texts, 2D images, 3D point clouds, and semantic segmentation on 2D images. More remarkably, for recognition on 3D point clouds and semantic segmentation on 2D images, MAD improves DSU by 2.82% and 1.5% in accuracy and mIOU.

Difficulty-Based Sampling for Debiased Contrastive Representation Learning
Jang, Taeuk and Wang, Xiaoqian



Research question: Contrastive learning is a self-supervised representation learning method that achieves milestone performance in various classification tasks, but due to its unsupervised nature it suffers from the false negative sample problem.
Motivation: The false negative sample problem degrades contrastive learning because it contradicts the motivation of contrasting semantically similar and dissimilar pairs. This raises the importance of finding legitimate negative samples and of distinguishing true vs. false negatives as well as easy vs. hard negatives.
Method: This paper proposes a debiased contrastive learning method that explores hard negatives by relative difficulty, referencing a bias-amplifying counterpart. We propose a triplet loss for training a biased encoder that focuses more on easy negative samples.
Results: We theoretically show that the triplet loss amplifies the bias in self-supervised representation learning, and empirically show that the proposed method improves downstream classification performance.

Contrastive learning is a self-supervised representation learning method that achieves milestone performance in various classification tasks. However, due to its unsupervised fashion, it suffers from the false negative sample problem: randomly drawn negative samples that are assumed to have a different label but actually have the same label as the anchor. This deteriorates the performance of contrastive learning as it contradicts the motivation of contrasting semantically similar and dissimilar pairs. This has drawn attention to the importance of finding legitimate negative samples, which should be addressed by distinguishing between 1) true vs. false negatives; 2) easy vs. hard negatives. However, previous works were limited to a statistical approach that handles false negative and hard negative samples with hyperparameter tuning. In this paper, we go beyond the statistical approach and explore the connection between hard negative samples and data bias. We introduce a novel debiased contrastive learning method to explore hard negatives by relative difficulty, referencing the bias-amplifying counterpart. We propose a triplet loss for training a biased encoder that focuses more on easy negative samples. We theoretically show that the triplet loss amplifies the bias in self-supervised representation learning. Finally, we empirically show the proposed method improves downstream classification performance.
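The biased encoder is trained with a standard triplet loss, which is easy to state in isolation; the encoder itself and the relative-difficulty referencing are omitted here:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Squared-distance triplet loss: push the negative at least
    `margin` farther from the anchor than the positive. In the paper
    this objective trains the bias-amplifying encoder that favors
    easy negatives."""
    d_pos = float(np.sum((anchor - positive) ** 2))
    d_neg = float(np.sum((anchor - negative) ** 2))
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero whenever the negative is already `margin` farther away than the positive, so gradients concentrate on easy-to-separate (bias-aligned) pairs.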

Zero-Shot Model Diagnosis
Luo, Jinqi and Wang, Zhaoning and Wu, Chen Henry and Huang, Dong and De la Torre, Fernando



Research question: How to evaluate the sensitivity of deep vision models to arbitrary visual attributes without an annotated test set?
Motivation: Creating a balanced test set is time-consuming, expensive, and prone to mistakes, so we need a method that evaluates model sensitivity without a test set or labels.
Method: This paper proposes Zero-shot Model Diagnosis (ZOOM), which requires neither a test set nor labels. Relying on a generative model and CLIP, the user selects a set of prompts relevant to the problem, and the system automatically searches for semantic counterfactual images (i.e., synthesized images that flip the prediction in the case of a binary classifier).
Results: Evaluations on several visual tasks (classification, key-point detection, and segmentation) across multiple visual domains show that the method produces counterfactual images and offers sensitivity analysis for model diagnosis without a test set.

When it comes to deploying deep vision models, the behavior of these systems must be explicable to ensure confidence in their reliability and fairness. A common approach to evaluate deep learning models is to build a labeled test set with attributes of interest and assess how well it performs. However, creating a balanced test set (i.e., one that is uniformly sampled over all the important traits) is often time-consuming, expensive, and prone to mistakes. The question we try to address is: can we evaluate the sensitivity of deep learning models to arbitrary visual attributes without an annotated test set? This paper argues the case that Zero-shot Model Diagnosis (ZOOM) is possible without the need for a test set or labeling. To avoid the need for test sets, our system relies on a generative model and CLIP. The key idea is enabling the user to select a set of prompts (relevant to the problem) and our system will automatically search for semantic counterfactual images (i.e., synthesized images that flip the prediction in the case of a binary classifier) using the generative model. We evaluate several visual tasks (classification, key-point detection, and segmentation) in multiple visual domains to demonstrate the viability of our methodology. Extensive experiments demonstrate that our method is capable of producing counterfactual images and offering sensitivity analysis for model diagnosis without the need for a test set.

Re-Thinking Federated Active Learning Based on Inter-Class Diversity
Kim, SangMook and Bae, Sangmin and Song, Hwanjun and Yun, Se-Young



Research question: In federated learning, how can a large amount of unlabeled data be effectively utilized?
Motivation: Although federated learning has made remarkable advances, most studies assume that clients' data are fully labeled; in the real world, every client may hold a significant amount of unlabeled instances.
Method: We study a federated active learning framework and analyze when the global and local-only query selector models dominate in performance and why. Based on these findings, we propose LoGo, an FAL sampling strategy robust to varying local heterogeneity levels and global imbalance ratios, which integrates both models via a two-step active selection scheme.
Results: Experimental results show that LoGo consistently outperforms six other active learning strategies across 38 experimental settings.

Although federated learning has made awe-inspiring advances, most studies have assumed that the client's data are fully labeled. However, in a real-world scenario, every client may have a significant amount of unlabeled instances. Among the various approaches to utilizing unlabeled data, a federated active learning framework has emerged as a promising solution. In the decentralized setting, there are two types of available query selector models, namely 'global' and 'local-only' models, but little literature discusses their performance dominance and its causes. In this work, we first demonstrate that the superiority of the two selector models depends on the global and local inter-class diversity. Furthermore, we observe that the global and local-only models are the keys to resolving the imbalance of each side. Based on our findings, we propose LoGo, a FAL sampling strategy robust to varying local heterogeneity levels and global imbalance ratios, that integrates both models via a two-step active selection scheme. LoGo consistently outperforms six active learning strategies across a total of 38 experimental settings.

Out-of-Distributed Semantic Pruning for Robust Semi-Supervised Learning
Wang, Yu and Qiao, Pengchong and Liu, Chang and Song, Guoli and Zheng, Xiawu and Chen, Jie



Research question: This paper addresses the problem that existing robust semi-supervised learning methods remain contaminated by OOD information at the semantic level.
Motivation: Current methods mainly filter out OOD information at the sample level, while OOD contamination at the semantic level has received little attention, limiting the development of the field.
Method: We propose a unified framework termed OOD Semantic Pruning (OSP), which pairs each ID feature with an OOD sample with semantic overlap and designs a soft orthogonality regularization to suppress the semantic components of ID features that are collinear with the paired OOD samples, thereby pruning OOD semantics.
Results: On the TinyImageNet dataset, OSP surpasses the previous state of the art by 13.7% accuracy on ID classification and 5.9% AUROC on OOD detection.

Recent advances in robust semi-supervised learning (SSL) typically filter out-of-distribution (OOD) information at the sample level. We argue that an overlooked problem of robust SSL is the corrupted information at the semantic level, which practically limits the development of the field. In this paper, we take an initial step to explore this problem and propose a unified framework termed OOD Semantic Pruning (OSP), which aims at pruning OOD semantics out from the in-distribution (ID) features. Specifically, (i) we propose an aliasing OOD matching module to pair each ID sample with an OOD sample with semantic overlap. (ii) We design a soft orthogonality regularization, which first transforms each ID feature by suppressing its semantic component that is collinear with the paired OOD sample. It then forces the predictions before and after the soft orthogonality transformation to be consistent. Being practically simple, our method shows strong performance in OOD detection and ID classification on challenging benchmarks. In particular, OSP surpasses the previous state-of-the-art by 13.7% on accuracy for ID classification and 5.9% on AUROC for OOD detection on the TinyImageNet dataset. Codes are available in the supplementary material.
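The heart of the soft orthogonality regularization is removing the component of an ID feature that is collinear with its paired OOD feature. A hard-projection sketch of that step (the paper instead enforces it softly, via a consistency loss between predictions before and after the transform):

```python
import numpy as np

def prune_ood_semantics(id_feat, ood_feat):
    """Remove the component of an ID feature that is collinear with
    its paired OOD feature, leaving the part orthogonal to the OOD
    direction."""
    d = ood_feat / (np.linalg.norm(ood_feat) + 1e-12)
    return id_feat - np.dot(id_feat, d) * d
```

After the projection, the ID feature carries no signal along the OOD direction, which is the "pruned semantics" the framework's name refers to.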

Understanding and Improving Visual Prompting: A Label-Mapping Perspective
Chen, Aochuan and Yao, Yuguang and Chen, Pin-Yu and Zhang, Yihua and Liu, Sijia



Research question: This paper investigates the relationship between visual prompting (VP) and label mapping (LM), and how to exploit that relationship to improve VP accuracy on target tasks.
Motivation: Although VP is an effective technique for reprogramming a pre-trained source model, its effectiveness on target tasks still hinges on a ruleless label mapping between source and target classes. We therefore seek to understand how LM affects VP and how to optimize LM to improve VP.
Method: We propose a new VP framework, ILM-VP (iterative label mapping-based visual prompting), which automatically re-maps source labels to target labels and progressively improves VP's target task accuracy. Additionally, when using a contrastive language-image pretrained (CLIP) model, we integrate an LM process to assist CLIP's text prompt selection and improve target task accuracy.
Results: Extensive experiments show that the method significantly outperforms existing VP methods. For example, when reprogramming an ImageNet-pretrained ResNet-18 to 13 target tasks, it improves transfer learning accuracy on the target Flowers102 and CIFAR100 datasets by 7.9% and 6.7%, respectively; the CLIP-based VP proposal yields 13.7% and 7.1% accuracy improvements on Flowers102 and DTD.

We revisit and advance visual prompting (VP), an input prompting technique for vision tasks. VP can reprogram a fixed, pre-trained source model to accomplish downstream tasks in the target domain by simply incorporating universal prompts (in terms of input perturbation patterns) into downstream data points. Yet, it remains elusive why VP stays effective even given a ruleless label mapping (LM) between the source classes and the target classes. Inspired by the above, we ask: How is LM interrelated with VP? And how to exploit such a relationship to improve its accuracy on target tasks? We peer into the influence of LM on VP and provide an affirmative answer that a better 'quality' of LM (assessed by mapping precision and explanation) can consistently improve the effectiveness of VP. This is in contrast to the prior art where the factor of LM was missing. To optimize LM, we propose a new VP framework, termed ILM-VP (iterative label mapping-based visual prompting), which automatically re-maps the source labels to the target labels and progressively improves the target task accuracy of VP. Further, when using a contrastive language-image pretrained (CLIP) model, we propose to integrate an LM process to assist the text prompt selection of CLIP and to improve the target task accuracy. Extensive experiments demonstrate that our proposal significantly outperforms state-of-the-art VP methods. As highlighted below, we show that when reprogramming an ImageNet-pretrained ResNet-18 to 13 target tasks, our method outperforms baselines by a substantial margin, e.g., 7.9% and 6.7% accuracy improvements in transfer learning to the target Flowers102 and CIFAR100 datasets. Besides, our proposal on CLIP-based VP provides 13.7% and 7.1% accuracy improvements on Flowers102 and DTD respectively.
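The label re-mapping step can be sketched as a frequency-based assignment; the greedy per-class argmax below is an illustration and ignores ILM-VP's alternation with visual prompt optimization:

```python
import numpy as np

def remap_labels(source_logits, target_labels, n_target):
    """Assign to each target class the source class that the
    (prompted) source model predicts most often on that class's
    samples. ILM-VP alternates such a re-mapping with prompt
    training; the greedy argmax here is a simplification."""
    source_pred = source_logits.argmax(axis=1)
    mapping = {}
    for t in range(n_target):
        counts = np.bincount(source_pred[target_labels == t],
                             minlength=source_logits.shape[1])
        mapping[t] = int(counts.argmax())
    return mapping
```

Re-running this after each prompt update lets the mapping and the prompt improve each other iteratively.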

Understanding Deep Generative Models With Generalized Empirical Likelihoods
Ravuri, Suman and Rey, Mélanie and Mohamed, Shakir and Deisenroth, Marc Peter



Research question: How to accurately evaluate how well deep generative models capture high-dimensional data distributions, especially for models such as GANs and diffusion models that do not admit exact likelihoods.
Motivation: Current evaluation methods for deep generative models (DGMs) have many shortcomings; a tool is needed that can comprehensively diagnose model deficiencies.
Method: Propose generalized empirical likelihood (GEL) methods; with appropriate specification of moment conditions, they can identify which modes have been dropped, the degree to which a model is mode-imbalanced, and whether it sufficiently captures intra-class diversity.
Results: Combining techniques from Maximum Mean Discrepancy and generalized empirical likelihood yields distribution tests that retain per-sample interpretability, as well as metrics that include label information. Experiments show that these tests predict the degree of mode dropping and mode imbalance up to 60% better than metrics such as improved precision/recall.

Understanding how well a deep generative model captures a distribution of high-dimensional data remains an important open challenge. It is especially difficult for certain model classes, such as Generative Adversarial Networks and Diffusion Models, whose models do not admit exact likelihoods. In this work, we demonstrate that generalized empirical likelihood (GEL) methods offer a family of diagnostic tools that can identify many deficiencies of deep generative models (DGMs). We show, with appropriate specification of moment conditions, that the proposed method can identify which modes have been dropped, the degree to which DGMs are mode imbalanced, and whether DGMs sufficiently capture intra-class diversity. We show how to combine techniques from Maximum Mean Discrepancy and Generalized Empirical Likelihood to create not only distribution tests that retain per-sample interpretability, but also metrics that include label information. We find that such tests predict the degree of mode dropping and mode imbalance up to 60% better than metrics such as improved precision/recall.

Weakly-Supervised Domain Adaptive Semantic Segmentation With Prototypical Contrastive Learning
Das, Anurag and Xian, Yongqin and Dai, Dengxin and Schiele, Bernt



Research question: How to reduce the performance gap of unsupervised domain adaptation for semantic segmentation by using different weak labels (e.g., image, point, and coarse labels) from the target domain.
Motivation: Despite much effort on improving unsupervised domain adaptation for semantic segmentation, there is still a huge performance gap compared with supervised learning.
Method: We propose a common framework that exploits these weak labels to learn prototypes that better represent class features, and uses them for contrastive alignment of class features. Specifically, we perform two different feature alignments: first, we align pixel features with prototypes within each domain; second, we align pixel features from the source domain to prototypes of the target domain in an asymmetric way. This asymmetric alignment preserves target features during training, which is essential when the weak labels come from the target domain.
Results: Experiments on standard benchmarks show that our framework achieves significant improvements over existing works and reduces the performance gap with supervised learning.

There has been a lot of effort in improving the performance of unsupervised domain adaptation for the semantic segmentation task; however, there is still a huge gap in performance when compared with supervised learning. In this work, we propose a common framework to use different weak labels, e.g., image, point and coarse labels from the target domain to reduce this performance gap. Specifically, we propose to learn better prototypes, which are representative class features, by exploiting these weak labels. We use these improved prototypes for contrastive alignment of class features. In particular, we perform two different feature alignments: first, we align pixel features with prototypes within each domain, and second, we align pixel features from the source to prototypes of the target domain in an asymmetric way. This asymmetric alignment is beneficial as it preserves the target features during training, which is essential when weak labels are available from the target domain. Our experiments on standard benchmarks show that our framework achieves significant improvement compared to existing works and is able to reduce the performance gap with supervised learning.

Neuro-Modulated Hebbian Learning for Fully Test-Time Adaptation
Tang, Yushun and Zhang, Ce and Xu, Heng and Chen, Shuoshuo and Cheng, Jie and Leng, Luziwei and Guo, Qinghai and He, Zhihai



Research question: This paper addresses the cross-domain performance degradation of deep neural networks.
Motivation: Drawing on theories of biological learning, we design a soft Hebbian learning process based on feed-forward learning rules to achieve fully test-time adaptation.
Method: We improve feed-forward Hebbian learning by incorporating a feedback neuro-modulation layer, yielding the neuro-modulated Hebbian learning method.
Results: Experimental results show that the method significantly improves the adaptation performance of network models and outperforms existing state-of-the-art methods.

Fully test-time adaptation aims to adapt the network model based on sequential analysis of input samples during the inference stage to address the cross-domain performance degradation problem of deep neural networks. We take inspiration from biologically plausible learning, where the neuron responses are tuned based on a local synapse-change procedure and activated by competitive lateral inhibition rules. Based on these feed-forward learning rules, we design a soft Hebbian learning process which provides an unsupervised and effective mechanism for online adaptation. We observe that the performance of this feed-forward Hebbian learning for fully test-time adaptation can be significantly improved by incorporating a feedback neuro-modulation layer. It is able to fine-tune the neuron responses based on the external feedback generated by the error back-propagation from the top inference layers. This leads to our proposed neuro-modulated Hebbian learning (NHL) method for fully test-time adaptation. With the unsupervised feed-forward soft Hebbian learning combined with a learned neuro-modulator to capture feedback from external responses, the source model can be effectively adapted during the testing process. Experimental results on benchmark datasets demonstrate that our proposed method can significantly improve the adaptation performance of network models and outperforms existing state-of-the-art methods.
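As a rough illustration of unsupervised Hebbian adaptation, the sketch below uses an Oja-style normalized Hebbian rule as a stand-in; the paper's soft Hebbian rule with competitive lateral inhibition and its learned neuro-modulator are not reproduced here:

```python
import numpy as np

def soft_hebbian_update(w, x, lr=0.1):
    """One unsupervised Hebbian step with an Oja-style decay term
    that keeps the weight vector bounded. This is a stand-in for the
    paper's rule, which adds lateral inhibition and neuro-modulated
    feedback."""
    y = float(np.dot(w, x))           # neuron response
    return w + lr * y * (x - y * w)   # Hebbian term minus normalization
```

Repeatedly presenting the same input drives the weight toward a unit vector along that input, i.e., the neuron tunes itself to the dominant input direction without any labels.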

Label Information Bottleneck for Label Enhancement
Zheng, Qinghai and Zhu, Jihua and Tang, Haoyu



Research question: This paper addresses the Label Enhancement (LE) problem, i.e., how to exactly recover label distributions from logical labels.
Motivation: During the recovery of label distributions, label-irrelevant information in the dataset may lead to unsatisfactory recovery performance. To address this, we seek to excavate the essential label-relevant information to improve recovery.
Method: We propose a novel Label Information Bottleneck (LIB) method, which formulates LE as two joint processes: 1) learning a representation that carries the essential label-relevant information, and 2) recovering label distributions based on the learned representation. The label-relevant information can be excavated via the "bottleneck" formed by the learned representation.
Results: Evaluation experiments on several benchmark label distribution learning datasets verify the effectiveness and competitiveness of LIB.

In this work, we focus on the challenging problem of Label Enhancement (LE), which aims to exactly recover label distributions from logical labels, and present a novel Label Information Bottleneck (LIB) method for LE. For the recovery process of label distributions, the label irrelevant information contained in the dataset may lead to unsatisfactory recovery performance. To address this limitation, we make efforts to excavate the essential label relevant information to improve the recovery performance. Our method formulates the LE problem as the following two joint processes: 1) learning the representation with the essential label relevant information, 2) recovering label distributions based on the learned representation. The label relevant information can be excavated based on the "bottleneck" formed by the learned representation. Significantly, both the label relevant information about the label assignments and the label relevant information about the label gaps can be explored in our method. Evaluation experiments conducted on several benchmark label distribution learning datasets verify the effectiveness and competitiveness of LIB.

Cloud-Device Collaborative Adaptation to Continual Changing Environments in the Real-World
Gan, Yulu and Pan, Mingjie and Zhang, Rongyu and Ling, Zijian and Zhao, Lingran and Liu, Jiaming and Zhang, Shanghang



Research question: In the real world, facing continually changing environments, lightweight models on client devices suffer severe performance drops under distribution shifts.
Motivation: The main limitations of existing device models are that (1) they cannot be updated due to the computation limits of the device, and (2) lightweight models have limited generalization ability. Meanwhile, recent large models show strong generalization on the cloud but cannot be deployed on client devices due to computation constraints.
Method: We propose a new learning paradigm, Cloud-Device Collaborative Continual Adaptation, to enable device models to cope with changing environments. Within this paradigm, we propose an Uncertainty-based Visual Prompt Adapted (U-VPA) teacher-student model to encourage cloud-device collaboration and improve the generalization of the device model.
Results: Extensive experiments on two object detection datasets with continually changing environments show that our U-VPA teacher-student framework outperforms previous state-of-the-art test-time adaptation and device-cloud collaboration methods.

When facing changing environments in the real world, lightweight models on client devices suffer from severe performance drops under distribution shifts. The main limitations of existing device models are: (1) they cannot be updated due to the computation limits of the device, and (2) the lightweight models have limited generalization ability. Meanwhile, recent large models have shown strong generalization capability on the cloud, but they cannot be deployed on client devices due to their computation constraints. To enable the device model to deal with changing environments, we propose a new learning paradigm of Cloud-Device Collaborative Continual Adaptation. To encourage collaboration between cloud and device and improve the generalization of the device model, we propose an Uncertainty-based Visual Prompt Adapted (U-VPA) teacher-student model in this paradigm. Specifically, we first design Uncertainty Guided Sampling (UGS) to continuously screen out challenging data and transmit the most out-of-distribution samples from the device to the cloud. To further transfer the generalization capability of the large model on the cloud to the device model, we propose a Visual Prompt Learning Strategy with Uncertainty-guided updating (VPLU) to specifically deal with the selected samples with larger distribution shifts. Then, we transmit the visual prompts to the device and concatenate them with the incoming data to pull the device testing distribution closer to the cloud training distribution. We conduct extensive experiments on two object detection datasets with continually changing environments. Our proposed U-VPA teacher-student framework outperforms previous state-of-the-art test-time adaptation and device-cloud collaboration methods. The code and datasets will be released.

Ingredient-Oriented Multi-Degradation Learning for Image Restoration
Zhang, Jinghao and Huang, Jie and Yao, Mingde and Yang, Zizheng and Yu, Hu and Zhou, Man and Zhao, Feng



Research question: How to leverage the relationships among image restoration tasks to excavate the intrinsic ingredients behind degradations.
Motivation: Existing all-in-one methods for handling multiple image degradations often ignore the correlations among tasks, resulting in poor scalability.
Method: We propose a novel Ingredients-oriented Degradation Reformulation framework (IDR) with two stages: task-oriented knowledge collection and ingredients-oriented knowledge integration. The first stage conducts ad hoc operations on different degradations according to the underlying physics principles and establishes a prior hub for each type of degradation. The second stage progressively reformulates the preceding task-oriented hubs into a single ingredients-oriented hub via learnable Principal Component Analysis and employs a dynamic routing mechanism for probabilistic unknown degradation removal.
Results: Experiments demonstrate the effectiveness and scalability of the method on various image restoration tasks, along with favorable generalization to unknown downstream tasks.

Learning to leverage the relationship among diverse image restoration tasks is quite beneficial for unraveling the intrinsic ingredients behind the degradation. Recent years have witnessed a flourish of various all-in-one methods, which handle multiple image degradations within a single model. In practice, however, few attempts have been made to excavate task correlations by exploring the underlying fundamental ingredients of various image degradations, resulting in poor scalability as more tasks are involved. In this paper, we propose a novel perspective to delve into the degradation via an ingredients-oriented rather than the previous task-oriented manner for scalable learning. Specifically, our method, named the Ingredients-oriented Degradation Reformulation framework (IDR), consists of two stages, namely task-oriented knowledge collection and ingredients-oriented knowledge integration. In the first stage, we conduct ad hoc operations on different degradations according to the underlying physics principles, and establish the corresponding prior hubs for each type of degradation. The second stage progressively reformulates the preceding task-oriented hubs into a single ingredients-oriented hub via learnable Principal Component Analysis (PCA), and employs a dynamic routing mechanism for probabilistic unknown degradation removal. Extensive experiments on various image restoration tasks demonstrate the effectiveness and scalability of our method. More importantly, our IDR exhibits favorable generalization ability to unknown downstream tasks.

How To Prevent the Continuous Damage of Noises To Model Training?
Yu, Xiaotian and Jiang, Yang and Shi, Tianqi and Feng, Zunlei and Wang, Yuexuan and Song, Mingli and Sun, Li



Research question: Deep learning with noisy labels is challenging and inevitable in many circumstances.
Motivation: Existing methods reduce the impact of noise samples by lowering the loss weights of uncertain samples or by filtering out potential noise samples, which relies heavily on the model's discriminative power for identifying noise samples. However, the model during training is imperfect and will miss many noise samples, causing continuous damage to model training.
Method: This paper proposes a Gradient Switching Strategy (GSS) that prevents the continuous damage of noise samples to the classifier by switching the current gradient direction of each sample to a new direction selected from a gradient direction pool containing all-class gradient directions with different probabilities.
Results: Experiments show that a model trained with GSS achieves performance comparable to a model trained with clean data. Moreover, the proposed GSS is pluggable into existing noisy-label learning frameworks, providing a new perspective for future noisy-label learning.

Deep learning with noisy labels is challenging and inevitable in many circumstances. Existing methods reduce the impact of noise samples by reducing the loss weights of uncertain samples or by filtering out potential noise samples, which highly relies on the model's superior discriminative power for identifying noise samples. However, in the training stage, the trainee model is imperfect and will miss many noise samples, which causes continuous damage to the model training. Consequently, there is a large performance gap between existing anti-noise models trained with noisy samples and models trained with clean samples. In this paper, we put forward a Gradient Switching Strategy (GSS) to prevent the continuous damage of noise samples to the classifier. Theoretical analysis shows that the damage comes from the misleading gradient direction computed from the noise samples. The trainee model will deviate from the correct optimization direction under the influence of the accumulated misleading gradient of noise samples. To address this problem, the proposed GSS alleviates the damage by switching the current gradient direction of each sample to a new direction selected from a gradient direction pool, which contains all-class gradient directions with different probabilities. During training, the trainee model is optimized along the switched gradient directions generated by GSS, which assigns higher probabilities to potential principal directions for high-confidence samples. Conversely, uncertain samples have a relatively uniform probability distribution over all gradient directions, which can cancel out the misleading gradient directions. Extensive experiments show that a model trained with GSS can achieve comparable performance to a model trained with clean data. Moreover, the proposed GSS is pluggable into existing frameworks for noisy-label learning. This work can provide a new perspective for future noisy-label learning.

ActMAD: Activation Matching To Align Distributions for Test-Time-Training
Mirza, Muhammad Jehanzeb and Soneira, Pol Jané and Lin, Wei and Kozinski, Mateusz and Possegger, Horst and Bischof, Horst



Research question: How to cope with out-of-distribution (OOD) data via test-time training (TTT) and adapt to distribution shifts occurring at test time.
Motivation: Existing methods model the distribution of entire channels in the last layer of the feature extractor, whereas our method models the distribution of each feature across multiple layers of the network at a finer granularity.
Method: We propose Activation Matching (ActMAD), which adapts the model by analyzing its activations and aligning the activation statistics of the OOD test data to those of the training data.
Results: Experimental results show that ActMAD attains state-of-the-art performance on CIFAR-100C and ImageNet-C, and scores a 15.4% improvement over previous approaches when evaluating a KITTI-trained object detector on KITTI-Fog. Moreover, ActMAD can be applied to online adaptation in realistic scenarios and requires little data to attain its full performance.

Test-Time-Training (TTT) is an approach to cope with out-of-distribution (OOD) data by adapting a trained model to distribution shifts occurring at test-time. We propose to perform this adaptation via Activation Matching (ActMAD): We analyze activations of the model and align activation statistics of the OOD test data to those of the training data. In contrast to existing methods, which model the distribution of entire channels in the ultimate layer of the feature extractor, we model the distribution of each feature in multiple layers across the network. This results in a more fine-grained supervision and makes ActMAD attain state-of-the-art performance on CIFAR-100C and ImageNet-C. ActMAD is also architecture- and task-agnostic, which lets us go beyond image classification, and score 15.4% improvement over previous approaches when evaluating a KITTI-trained object detector on KITTI-Fog. Our experiments highlight that ActMAD can be applied to online adaptation in realistic scenarios, requiring little data to attain its full performance.
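The alignment objective itself is simple to sketch: match per-feature activation statistics of the test batch against statistics stored from training. The single-layer L1 form below is an illustration; the paper sums such terms over many layers across the network:

```python
import numpy as np

def actmad_loss(test_acts, train_mu, train_var):
    """Align per-feature activation statistics of a test batch with
    the statistics stored from training (L1 on means and variances
    for one layer; the full method sums this over multiple layers)."""
    mu = test_acts.mean(axis=0)
    var = test_acts.var(axis=0)
    return float(np.abs(mu - train_mu).sum() + np.abs(var - train_var).sum())
```

Minimizing this loss over the model parameters at test time pulls the OOD activations back toward the training-time statistics.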

Guided Recommendation for Model Fine-Tuning
Li, Hao and Fowlkes, Charless and Yang, Hao and Dabeer, Onkar and Tu, Zhuowen and Soatto, Stefano



Research question: How to efficiently select the best-suited pre-trained model from a large-scale model zoo for a downstream task.
Motivation: Existing hand-designed model selection criteria can fail due to invalid assumptions and intrinsic limitations, and prior knowledge about model capacity and datasets is hard to integrate into them.
Method: Cast model selection as a recommendation problem and learn from past training history. Specifically, we characterize the meta information of datasets and models as features and use their transfer learning performance as the guidance score. With thousands of historical training jobs, a recommendation system can be learned to predict the model selection score given dataset and model features.
Results: With extensive evaluations over 22 pre-trained models and 40 downstream tasks, our method significantly outperforms prior hand-designed model selection methods when relevant training history is available.

Model selection is essential for reducing the search cost of the best pre-trained model over a large-scale model zoo for a downstream task. After analyzing recent hand-designed model selection criteria with 400+ ImageNet pre-trained models and 40 downstream tasks, we find that they can fail due to invalid assumptions and intrinsic limitations. Prior knowledge on model capacity and datasets also cannot be easily integrated into the existing criteria. To address these issues, we propose to cast model selection as a recommendation problem and to learn from the past training history. Specifically, we characterize the meta information of datasets and models as features, and use their transfer learning performance as the guided score. With thousands of historical training jobs, a recommendation system can be learned to predict the model selection score given the features of the dataset and the model as input. Our approach enables integrating existing model selection scores as additional features and scales with more historical data. We evaluate the prediction accuracy with 22 pre-trained models over 40 downstream tasks. With extensive evaluations, we show that the learned approach can outperform prior hand-designed model selection methods significantly when relevant training history is available.

Masked Image Training for Generalizable Deep Image Denoising
Chen, Haoyu and Gu, Jinjin and Liu, Yihao and Magid, Salma Abdel and Dong, Chao and Wang, Qiong and Pfister, Hanspeter and Zhu, Lei



Research question: How to improve the generalization ability of deep learning models for image denoising.
Motivation: Existing deep learning methods often perform poorly when denoising images whose noise distribution differs from that seen in training.
Method: Propose a novel "masked training" approach: randomly mask pixels of the input image and reconstruct the missing information during training, and also mask features in the self-attention layers to avoid the impact of training-testing inconsistency.
Results: Experiments show that the method generalizes better than other deep learning models and is directly applicable to real-world scenarios; our interpretability analysis also demonstrates its superiority.

When capturing and storing images, devices inevitably introduce noise. Reducing this noise is a critical task called image denoising. Deep learning has become the de facto method for image denoising, especially with the emergence of Transformer-based models that have achieved notable state-of-the-art results on various image tasks. However, deep learning-based methods often suffer from a lack of generalization ability. For example, deep models trained on Gaussian noise may perform poorly when tested on other noise distributions. To address this issue, we present a novel approach to enhance the generalization performance of denoising networks, known as masked training. Our method involves masking random pixels of the input image and reconstructing the missing information during training. We also mask out the features in the self-attention layers to avoid the impact of training-testing inconsistency. Our approach exhibits better generalization ability than other deep learning models and is directly applicable to real-world scenarios. Additionally, our interpretability analysis demonstrates the superiority of our method.
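The input-masking step can be sketched in a few lines. This is a minimal illustrative sketch under the assumption of an (H, W, C) image array; the function name `mask_pixels` and the zero-fill of masked pixels are assumptions for illustration, not the paper's exact scheme.

```python
import numpy as np

def mask_pixels(img, mask_ratio=0.8, rng=None):
    """Zero out a random fraction of pixels; return the masked image and mask.

    img:        (H, W, C) input image.
    mask_ratio: fraction of pixels to hide (True in the returned mask = kept).
    During training the denoiser must reconstruct the hidden pixels, which
    forces it to model image content rather than one specific noise
    distribution.
    """
    rng = np.random.default_rng(rng)
    mask = rng.random(img.shape[:2]) >= mask_ratio  # True = kept
    return img * mask[..., None], mask
```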

OT-Filter: An Optimal Transport Filter for Learning With Noisy Labels
Feng, Chuanwen and Ren, Yilong and Xie, Xike



Research question: How to use optimal transport theory to improve deep models trained on data with noisy labels.
Motivation: Deep models degrade drastically on noisily labeled data because the network memorizes noisy labels, causing confirmation bias.
Method: Propose a sample selection method based on optimal transport theory, called OT-Filter, which measures data discrepancy with geometrically meaningful distances while preserving distribution patterns, thereby alleviating confirmation bias.
Results: On benchmarks such as Clothing1M and ANIMAL-10N, OT-Filter outperforms its counterparts; on benchmarks with synthetic labels, such as CIFAR-10/100, it shows superiority in handling highly noisy labels.

The success of deep learning is largely attributed to the training over clean data. However, data is often coupled with noisy labels in practice. Learning with noisy labels is challenging because the performance of the deep neural networks (DNN) drastically degenerates, due to confirmation bias caused by the network memorization over noisy labels. To alleviate that, a recent prominent direction is on sample selection, which retrieves clean data samples from noisy samples, so as to enhance the model's robustness and tolerance to noisy labels. In this paper, we revamp the sample selection from the perspective of optimal transport theory and propose a novel method, called the OT-Filter. The OT-Filter provides geometrically meaningful distances and preserves distribution patterns to measure the data discrepancy, thus alleviating the confirmation bias. Extensive experiments on benchmarks, such as Clothing1M and ANIMAL-10N, show that the performance of the OT-Filter outperforms its counterparts. Meanwhile, results on benchmarks with synthetic labels, such as CIFAR-10/100, show the superiority of the OT-Filter in handling data labels of high noise.

Rebalancing Batch Normalization for Exemplar-Based Class-Incremental Learning
Cha, Sungmin and Cho, Sungjun and Hwang, Dasol and Hong, Sunwon and Lee, Moontae and Moon, Taesup



Research question: This work addresses the problems of Batch Normalization (BN) in continual learning.
Motivation: Although BN has been extensively studied for various computer vision tasks, its application in continual learning has received relatively little attention. In exemplar-based class-incremental learning (CIL) in particular, the main issue is the imbalance between current-task and past-task training data within a mini-batch, which biases BN's empirical mean and variance, as well as its learnable affine transformation parameters, heavily toward the current task and thus causes forgetting of past tasks.
Method: We develop a new BN update patch tailored to exemplar-based CIL. We propose a hyperparameter-free variant, Task-Balanced BN (TBBN), which forms a horizontally concatenated task-balanced batch using reshape and repeat operations during training to resolve the data imbalance more correctly.
Results: Experiments on CIFAR-100, ImageNet-100, and five dissimilar task datasets show that TBBN, which behaves exactly like vanilla BN at inference time, is easily applicable to most existing exemplar-based offline CIL algorithms and consistently outperforms other BN variants.

Batch Normalization (BN) and its variants have been extensively studied for neural nets in various computer vision tasks, but relatively little work has been dedicated to studying the effect of BN in continual learning. To that end, we develop a new update patch for BN, particularly tailored for the exemplar-based class-incremental learning (CIL). The main issue of BN in CIL is the imbalance of training data between current and past tasks in a mini-batch, which makes the empirical mean and variance as well as the learnable affine transformation parameters of BN heavily biased toward the current task --- contributing to the forgetting of past tasks. While one of the recent BN variants has been developed for "online" CIL, in which the training is done with a single epoch, we show that their method does not necessarily bring gains for "offline" CIL, in which a model is trained with multiple epochs on the imbalanced training data. The main reason for the ineffectiveness of their method lies in not fully addressing the data imbalance issue, especially in computing the gradients for learning the affine transformation parameters of BN. Accordingly, our new hyperparameter-free variant, dubbed Task-Balanced BN (TBBN), is proposed to more correctly resolve the imbalance issue by making a horizontally-concatenated task-balanced batch using both reshape and repeat operations during training. Based on our experiments on class incremental learning of CIFAR-100, ImageNet-100, and five dissimilar task datasets, we demonstrate that our TBBN, which works exactly the same as the vanilla BN in the inference time, is easily applicable to most existing exemplar-based offline CIL algorithms and consistently outperforms other BN variants.
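The reshape-and-repeat construction of a task-balanced batch can be sketched as follows. This is an illustrative NumPy sketch, assuming per-task mini-batches as 2-D arrays; the function name `task_balanced_batch` and tile-then-crop balancing are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def task_balanced_batch(task_batches):
    """Concatenate per-task mini-batches after repeating the smaller ones.

    task_batches: list of (n_t, D) arrays, one per task.  Each batch is
    tiled (repeat) up to the size of the largest one, so that BN statistics
    computed on the concatenation weight every task equally instead of
    being dominated by the current task.
    """
    target = max(b.shape[0] for b in task_batches)
    balanced = []
    for b in task_batches:
        reps = int(np.ceil(target / b.shape[0]))
        balanced.append(np.tile(b, (reps, 1))[:target])  # repeat, then crop
    return np.concatenate(balanced, axis=0)
```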

The Treasure Beneath Multiple Annotations: An Uncertainty-Aware Edge Detector
Zhou, Caixia and Huang, Yaping and Pu, Mengyang and Guan, Qingji and Huang, Li and Ling, Haibin



Research question: Deep learning-based edge detectors rely heavily on pixel-wise labels, which are often provided by multiple annotators.
Motivation: Existing methods fuse multiple annotations through a simple voting process, ignoring the inherent ambiguity of edges and the labeling bias of annotators.
Method: This paper proposes a novel uncertainty-aware edge detector (UAED) that employs uncertainty to investigate the subjectivity and ambiguity of diverse annotations. Specifically, we first convert the deterministic label space into a learnable Gaussian distribution whose variance measures the degree of ambiguity among different annotations. We then treat the learned variance as the estimated uncertainty of the predicted edge maps; pixels with higher uncertainty are likely hard samples for edge detection. We therefore design an adaptive weighting loss that emphasizes learning from high-uncertainty pixels, helping the network gradually concentrate on the important pixels.
Results: UAED can be combined with various encoder-decoder backbones, and extensive experiments show that it consistently achieves superior performance across multiple edge detection benchmarks. The source code is available at https://github.com/ZhouCX117/UAED.

Deep learning-based edge detectors heavily rely on pixel-wise labels which are often provided by multiple annotators. Existing methods fuse multiple annotations using a simple voting process, ignoring the inherent ambiguity of edges and labeling bias of annotators. In this paper, we propose a novel uncertainty-aware edge detector (UAED), which employs uncertainty to investigate the subjectivity and ambiguity of diverse annotations. Specifically, we first convert the deterministic label space into a learnable Gaussian distribution, whose variance measures the degree of ambiguity among different annotations. Then we regard the learned variance as the estimated uncertainty of the predicted edge maps, and pixels with higher uncertainty are likely to be hard samples for edge detection. Therefore we design an adaptive weighting loss to emphasize the learning from those pixels with high uncertainty, which helps the network to gradually concentrate on the important pixels. UAED can be combined with various encoder-decoder backbones, and the extensive experiments demonstrate that UAED achieves superior performance consistently across multiple edge detection benchmarks. The source code is available at https://github.com/ZhouCX117/UAED.
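The adaptive weighting idea can be sketched as an uncertainty-weighted binary cross-entropy. This is a hedged illustrative sketch: the function name `uncertainty_weighted_bce` and the exact exponential weighting of the learned standard deviation are assumptions, not the paper's formula.

```python
import numpy as np

def uncertainty_weighted_bce(pred, label, log_var, alpha=1.0):
    """BCE with per-pixel weights that grow with estimated uncertainty.

    pred:    (H, W) predicted edge probabilities.
    label:   (H, W) fused edge labels in [0, 1].
    log_var: (H, W) learned log-variance of the Gaussian label model;
             larger variance marks ambiguous (hard) pixels.
    Pixels with higher uncertainty receive larger weights, so the network
    gradually concentrates on the hard, ambiguous pixels.
    """
    eps = 1e-7
    bce = -(label * np.log(pred + eps) + (1 - label) * np.log(1 - pred + eps))
    weight = np.exp(alpha * np.sqrt(np.exp(log_var)))  # emphasize uncertain pixels
    return (weight * bce).mean()
```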

Fair Federated Medical Image Segmentation via Client Contribution Estimation
Jiang, Meirui and Roth, Holger R. and Li, Wenqi and Yang, Dong and Zhao, Can and Nath, Vishwesh and Xu, Daguang and Dou, Qi and Xu, Ziyue



Research question: How to ensure fairness in federated learning, covering both collaboration (contribution) fairness and performance fairness.
Motivation: Although prior work has made progress on each kind of fairness separately, we argue that considering them together is critical for engaging and motivating more diverse clients to join and for deriving a high-quality global model.
Method: We propose a method that optimizes both types of fairness: federated training via contribution estimation (FedCE). Specifically, we estimate client contributions in both gradient space and data space. In gradient space, we monitor each client's gradient direction differences with respect to the others; in data space, we measure the prediction error on client data using an auxiliary model. Based on this contribution estimation, the estimates are used as aggregation weights for the global model.
Results: Our theoretical analysis and empirical evaluation on two real-world medical datasets show significant performance improvements, better collaboration fairness and performance fairness, together with comprehensive analytical studies.

How to ensure fairness is an important topic in federated learning (FL). Recent studies have investigated how to reward clients based on their contribution (collaboration fairness), and how to achieve uniformity of performance across clients (performance fairness). Despite achieving progress on either one, we argue that it is critical to consider them together, in order to engage and motivate more diverse clients joining FL to derive a high-quality global model. In this work, we propose a novel method to optimize both types of fairness simultaneously. Specifically, we propose to estimate client contribution in gradient and data space. In gradient space, we monitor the gradient direction differences of each client with respect to others. And in data space, we measure the prediction error on client data using an auxiliary model. Based on this contribution estimation, we propose an FL method, federated training via contribution estimation (FedCE), i.e., using estimation as global model aggregation weights. We have theoretically analyzed our method and empirically evaluated it on two real-world medical datasets. The effectiveness of our approach has been validated with significant performance improvements, better collaboration fairness, better performance fairness, and comprehensive analytical studies.
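Using contribution estimates as aggregation weights can be sketched in a few lines. This is a minimal illustrative sketch, assuming flat parameter vectors per client; the function name `fedce_aggregate` is a hypothetical label, and how the gradient-space and data-space estimates are combined into a single score is left outside the sketch.

```python
import numpy as np

def fedce_aggregate(client_weights, contributions):
    """Weighted global-model aggregation using contribution estimates.

    client_weights: list of flat parameter vectors, one per client.
    contributions:  per-client contribution scores (e.g. combining the
                    gradient-space and data-space estimates), normalized
                    to sum to 1 and used as aggregation weights.
    """
    c = np.asarray(contributions, dtype=float)
    c = c / c.sum()
    return sum(w * ci for w, ci in zip(client_weights, c))
```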

AsyFOD: An Asymmetric Adaptation Paradigm for Few-Shot Domain Adaptive Object Detection
Gao, Yipeng and Lin, Kun-Yu and Yan, Junkai and Wang, Yaowei and Zheng, Wei-Shi



Research question: This paper addresses domain adaptive object detection when only a few labeled target images are available.
Motivation: With scarce labeled target images, the extreme data imbalance between the source and target domains can cause over-adaptation, which traditional feature alignment cannot handle effectively.
Method: We propose an asymmetric adaptation paradigm, AsyFOD, which leverages source and target instances from different perspectives to address the imbalance. Specifically, AsyFOD first identifies target-similar source instances via target distribution estimation, which serve to augment the limited target instances. We then perform asynchronous alignment between target-dissimilar source instances and the augmented target instances, a simple yet effective way to alleviate over-adaptation.
Results: Experiments show that AsyFOD outperforms all state-of-the-art methods on four FSDAOD benchmarks, e.g., a 3.1% mAP improvement on Cityscapes-to-FoggyCityscapes and a 2.9% mAP increase on Sim10k-to-Cityscapes. The code is available at https://github.com/Hlings/AsyFOD.

In this work, we study few-shot domain adaptive object detection (FSDAOD), where only a few target labeled images are available for training in addition to sufficient source labeled images. Critically, in FSDAOD, the data-scarcity in the target domain leads to an extreme data imbalance between the source and target domains, which potentially causes over-adaptation in traditional feature alignment. To address the data imbalance problem, we propose an asymmetric adaptation paradigm, namely AsyFOD, which leverages the source and target instances from different perspectives. Specifically, by using target distribution estimation, the AsyFOD first identifies the target-similar source instances, which serves for augmenting the limited target instances. Then, we conduct asynchronous alignment between target-dissimilar source instances and augmented target instances, which is simple yet effective for alleviating the over-adaptation. Extensive experiments demonstrate that the proposed AsyFOD outperforms all state-of-the-art methods on four FSDAOD benchmarks with various environmental variances, e.g., 3.1% mAP improvement on Cityscapes-to-FoggyCityscapes and 2.9% mAP increase on Sim10k-to-Cityscapes. The code is available at https://github.com/Hlings/AsyFOD.

Block Selection Method for Using Feature Norm in Out-of-Distribution Detection
Yu, Yeonguk and Shin, Sungho and Lee, Seongju and Jun, Changhyun and Lee, Kyoobin



Research question: How to effectively detect out-of-distribution (OOD) inputs during the inference stage of a neural network?
Motivation: Current methods mainly rely on the network output derived from the highly activated feature map, which has limitations.
Method: This paper proposes a simple framework consisting of FeatureNorm, the norm of a feature map, and NormRatio, the ratio of ID to OOD FeatureNorm, to measure each block's OOD detection performance. Jigsaw puzzles created from ID training samples serve as pseudo OOD, and NormRatio is computed to select the block with the largest gap between ID and OOD feature-map norms.
Results: Experiments show that OOD detection with FeatureNorm is more effective than other methods, reducing FPR95 by up to 52.77% on the CIFAR10 benchmark and up to 48.53% on the ImageNet benchmark. The framework also generalizes to various architectures, and the block selection can improve previous OOD detection methods as well.

Detecting out-of-distribution (OOD) inputs during the inference stage is crucial for deploying neural networks in the real world. Previous methods commonly relied on the output of a network derived from the highly activated feature map. In this study, we first revealed that a norm of the feature map obtained from the other block than the last block can be a better indicator of OOD detection. Motivated by this, we propose a simple framework consisting of FeatureNorm: a norm of the feature map and NormRatio: a ratio of FeatureNorm for ID and OOD to measure the OOD detection performance of each block. In particular, to select the block that provides the largest difference between FeatureNorm of ID and FeatureNorm of OOD, we create jigsaw puzzles as pseudo OOD from ID training samples and calculate NormRatio, and the block with the largest value is selected. After the suitable block is selected, OOD detection with the FeatureNorm outperforms other OOD detection methods by reducing FPR95 by up to 52.77% on CIFAR10 benchmark and by up to 48.53% on ImageNet benchmark. We demonstrate that our framework can generalize to various architectures and the importance of block selection, which can improve previous OOD detection methods as well.
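The FeatureNorm/NormRatio selection can be sketched as follows. This is an illustrative NumPy sketch under the assumption of (N, C, H, W) feature maps; the helper names `feature_norm` and `select_block` are hypothetical, and the exact norm definition may differ from the paper's.

```python
import numpy as np

def feature_norm(fmap):
    """Mean channel-wise L2 norm of a feature-map batch (N, C, H, W)."""
    return np.sqrt((fmap ** 2).sum(axis=(2, 3))).mean(axis=1)

def select_block(id_maps, pseudo_ood_maps):
    """Pick the block whose NormRatio (ID norm / pseudo-OOD norm) is largest.

    id_maps / pseudo_ood_maps: per-block feature maps for ID samples and
    for jigsaw-shuffled pseudo-OOD samples.  The selected block is the one
    whose feature norm best separates ID from OOD.
    """
    ratios = [feature_norm(i).mean() / feature_norm(o).mean()
              for i, o in zip(id_maps, pseudo_ood_maps)]
    return int(np.argmax(ratios)), ratios
```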

Frustratingly Easy Regularization on Representation Can Boost Deep Reinforcement Learning
He, Qiang and Su, Huangyuan and Zhang, Jieyu and Hou, Xinwen



Research question: Whether the representations of the Q-network and its target network in deep reinforcement learning should satisfy a favorable distinguishable representation property.
Motivation: Current deep RL agents may violate this property, leading to sub-optimal policies.
Method: We propose a simple yet effective regularizer, Policy Evaluation with Easy Regularization on Representation (PEER), which maintains the distinguishable representation property via explicit regularization on internal representations.
Results: Experiments show that incorporating PEER into deep RL significantly improves performance and sample efficiency. PEER achieves state-of-the-art performance on all 4 PyBullet environments, 9 of 12 DMControl tasks, and 19 of 26 Atari games.

Deep reinforcement learning (DRL) gives the promise that an agent learns good policy from high-dimensional information, whereas representation learning removes irrelevant and redundant information and retains pertinent information. In this work, we demonstrate that the learned representation of the Q-network and its target Q-network should, in theory, satisfy a favorable distinguishable representation property. Specifically, there exists an upper bound on the representation similarity of the value functions of two adjacent time steps in a typical DRL setting. However, through illustrative experiments, we show that the learned DRL agent may violate this property and lead to a sub-optimal policy. Therefore, we propose a simple yet effective regularizer called Policy Evaluation with Easy Regularization on Representation (PEER), which aims to maintain the distinguishable representation property via explicit regularization on internal representations. And we provide the convergence rate guarantee of PEER. Implementing PEER requires only one line of code. Our experiments demonstrate that incorporating PEER into DRL can significantly improve performance and sample efficiency. Comprehensive experiments show that PEER achieves state-of-the-art performance on all 4 environments on PyBullet, 9 out of 12 tasks on DMControl, and 19 out of 26 games on Atari. To the best of our knowledge, PEER is the first work to study the inherent representation property of Q-network and its target. Our code is available at https://sites.google.com/view/peer-cvpr2023/.
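A regularizer of this "one line of code" flavor can be sketched as a penalty on the similarity between the two networks' internal representations. This is a hedged illustrative sketch: the function name `peer_penalty`, the inner-product similarity, and the default coefficient are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def peer_penalty(phi, phi_target, beta=5e-4):
    """Regularizer discouraging overly similar Q and target-Q representations.

    phi, phi_target: (N, D) internal representations of the Q-network and
    its target network for a batch of states.  Penalizing their average
    inner product keeps the two representations distinguishable, which
    the paper argues value functions of adjacent time steps should be.
    """
    return beta * np.einsum('nd,nd->n', phi, phi_target).mean()
```

The penalty would simply be added to the usual TD loss during policy evaluation.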

StyleAdv: Meta Style Adversarial Training for Cross-Domain Few-Shot Learning
Fu, Yuqian and Xie, Yu and Fu, Yanwei and Jiang, Yu-Gang



Research question: Cross-Domain Few-Shot Learning (CD-FSL) is a newly emerging task that transfers prior knowledge learned on a source dataset to novel target tasks.
Motivation: The main challenge of CD-FSL is the huge domain gap between datasets, which largely stems from changes in visual style. Existing methods address this by swapping the styles of two images, but the generated styles still belong to the original source style set, so the effect is limited.
Method: Inspired by vanilla adversarial learning, we propose a model-agnostic meta Style Adversarial training (StyleAdv) method together with a novel style adversarial attack method. In particular, our style attack perturbs the original style with signed style gradients, synthesizing "virtual" and "hard" adversarial styles for model training.
Results: Extensive experiments on eight different target datasets, built upon both ResNet and ViT, achieve a new state of the art for CD-FSL and demonstrate the effectiveness of our method.

Cross-Domain Few-Shot Learning (CD-FSL) is a recently emerging task that tackles few-shot learning across different domains. It aims at transferring prior knowledge learned on the source dataset to novel target datasets. The CD-FSL task is especially challenged by the huge domain gap between different datasets. Critically, such a domain gap actually comes from the changes of visual styles, and wave-SAN empirically shows that spanning the style distribution of the source data helps alleviate this issue. However, wave-SAN simply swaps styles of two images. Such a vanilla operation makes the generated styles "real" and "easy", which still fall into the original set of the source styles. Thus, inspired by vanilla adversarial learning, a novel model-agnostic meta Style Adversarial training (StyleAdv) method together with a novel style adversarial attack method is proposed for CD-FSL. Particularly, our style attack method synthesizes both "virtual" and "hard" adversarial styles for model training. This is achieved by perturbing the original style with the signed style gradients. By continually attacking styles and forcing the model to recognize these challenging adversarial styles, our model is gradually robust to the visual styles, thus boosting the generalization ability for novel target datasets. Besides the typical CNN-based backbone, we also employ our StyleAdv method on large-scale pretrained vision transformer. Extensive experiments conducted on eight various target datasets show the effectiveness of our method. Whether built upon ResNet or ViT, we achieve the new state of the art for CD-FSL. Code is available at https://github.com/lovelyqian/StyleAdv-CDFSL.
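The signed-gradient style perturbation resembles an FGSM step applied to feature statistics rather than pixels. This is an illustrative sketch under that reading; the function name `attack_style` and the positivity clamp on the perturbed standard deviation are assumptions, not the paper's exact procedure.

```python
import numpy as np

def attack_style(mu, sigma, grad_mu, grad_sigma, eps=0.01):
    """FGSM-style perturbation of feature statistics (the "style").

    mu, sigma:            channel-wise mean and std of a feature map.
    grad_mu, grad_sigma:  gradients of the task loss w.r.t. the style.
    Moving the style along the signed gradients yields a "virtual" and
    "hard" adversarial style that the model is then trained to recognize.
    """
    adv_mu = mu + eps * np.sign(grad_mu)
    adv_sigma = sigma + eps * np.sign(grad_sigma)
    return adv_mu, np.maximum(adv_sigma, 1e-6)  # std must stay positive
```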

Long-Tailed Visual Recognition via Self-Heterogeneous Integration With Knowledge Excavation
Jin, Yan and Li, Mengke and Lu, Yang and Cheung, Yiu-ming and Wang, Hanzi



Research question: Existing deep models trained on real-world data with long-tailed distributions tend to be heavily biased toward the majority classes.
Motivation: To address this, the paper builds on mixture-of-experts (MoE) methods whose experts focus on different parts of the long-tailed distribution.
Method: We first propose Depth-wise Knowledge Fusion (DKF) to fuse features between shallow and deep layers within each expert network, making the experts more diverse in representation. We then propose Dynamic Knowledge Transfer (DKT) to reduce the influence of the hardest negative class on the tail classes within our MoE framework.
Results: Experiments show that SHIKE significantly improves classification accuracy, especially on the tail classes, reaching state-of-the-art accuracies of 56.3%, 60.3%, 75.4%, and 41.9% on CIFAR100-LT, ImageNet-LT, iNaturalist 2018, and Places-LT, respectively.

Deep neural networks have made huge progress in the last few decades. However, as the real-world data often exhibits a long-tailed distribution, vanilla deep models tend to be heavily biased toward the majority classes. To address this problem, state-of-the-art methods usually adopt a mixture of experts (MoE) to focus on different parts of the long-tailed distribution. Experts in these methods have the same model depth, which neglects the fact that different classes may have different preferences to be fit by models with different depths. To this end, we propose a novel MoE-based method called Self-Heterogeneous Integration with Knowledge Excavation (SHIKE). We first propose Depth-wise Knowledge Fusion (DKF) to fuse features between different shallow parts and the deep part in one network for each expert, which makes experts more diverse in terms of representation. Based on DKF, we further propose Dynamic Knowledge Transfer (DKT) to reduce the influence of the hardest negative class that has a non-negligible impact on the tail classes in our MoE framework. As a result, the classification accuracy of long-tailed data can be significantly improved, especially for the tail classes. SHIKE achieves the state-of-the-art performance of 56.3%, 60.3%, 75.4%, and 41.9% on CIFAR100-LT (IF100), ImageNet-LT, iNaturalist 2018, and Places-LT, respectively. The source code is available at https://github.com/jinyan-06/SHIKE.

GeoNet: Benchmarking Unsupervised Adaptation Across Geographies
Kalluri, Tarun and Xu, Wangdong and Chandraker, Manmohan



Research question: How to improve the robustness of vision models to domains unseen during training, particularly for models deployed in new geographic regions.
Motivation: Address the direct challenge posed by deploying models in new geographies that are under-represented in the training data, toward fair and inclusive computer vision.
Method: Introduce GeoNet, a large-scale dataset for geographic adaptation with benchmarks for scene recognition (GeoPlaces), image classification (GeoImNet), and universal adaptation (GeoUniDA). Investigate the nature of distribution shifts typical of geographic adaptation, hypothesizing that the major domain shifts across geographies arise from significant variations in scene context (context shift), object design (design shift), and label distribution (prior shift).
Results: An extensive evaluation of several state-of-the-art unsupervised domain adaptation algorithms and architectures shows that they do not suffice for geographic adaptation, and that large-scale pre-training of large vision models also does not confer geographic robustness.

In recent years, several efforts have been aimed at improving the robustness of vision models to domains and environments unseen during training. An important practical problem pertains to models deployed in a new geography that is under-represented in the training dataset, posing a direct challenge to fair and inclusive computer vision. In this paper, we study the problem of geographic robustness and make three main contributions. First, we introduce a large-scale dataset GeoNet for geographic adaptation containing benchmarks across diverse tasks like scene recognition (GeoPlaces), image classification (GeoImNet) and universal adaptation (GeoUniDA). Second, we investigate the nature of distribution shifts typical to the problem of geographic adaptation and hypothesize that the major source of domain shifts arise from significant variations in scene context (context shift), object design (design shift) and label distribution (prior shift) across geographies. Third, we conduct an extensive evaluation of several state-of-the-art unsupervised domain adaptation algorithms and architectures on GeoNet, showing that they do not suffice for geographical adaptation, and that large-scale pre-training using large vision models also does not lead to geographic robustness. Our dataset is publicly available at https://tarun005.github.io/GeoNet.

Learning Transformation-Predictive Representations for Detection and Description of Local Features
Wang, Zihao and Wu, Chunxu and Yang, Yifei and Li, Zhen



Research question: Key-point detection and description estimates stable locations and discriminative representations of local features, which is essential for image matching. However, the rough hard positive or negative labels generated from one-to-one correspondences among images introduce indistinguishable samples, called pseudo positives or negatives, which act as inconsistent supervision when learning key-points for matching and prevent deep networks from learning discriminative descriptions for accurate matching.
Motivation: To address this, we learn transformation-predictive representations with self-supervised contrastive learning, maximizing the similarity between corresponding views of the same 3D point (landmark) without using any negative pairs (true or pseudo) while avoiding collapsed solutions.
Method: We design a learnable label prediction mechanism that softens hard positive labels into soft continuous targets. The aggressively updated soft labels largely resolve the training bottleneck caused by the label noise of pseudo positives and allow the model to be trained under a stronger augmentation paradigm.
Results: Our self-supervised method outperforms the state of the art on standard image matching benchmarks and shows excellent generalization on multiple downstream tasks.

The task of key-points detection and description is to estimate the stable location and discriminative representation of local features, which is essential for image matching. However, either the rough hard positive or negative labels generated from one-to-one correspondences among images bring indistinguishable samples, called pseudo positives or negatives, which act as inconsistent supervision while learning key-points used for matching. Such pseudo-labeled samples prevent deep neural networks from learning discriminative descriptions for accurate matching. To tackle this challenge, we propose to learn transformation-predictive representations with self-supervised contrastive learning. We maximize the similarity between corresponded views of the same 3D point (landmark) by using none of the negative sample pairs (including true and pseudo negatives) and avoiding collapsing solutions. Then we design a learnable label prediction mechanism to soften the hard positive labels into soft continuous targets. The aggressively updated soft labels extensively deal with the training bottleneck (derived from the label noise of pseudo positives) and allow the model to be trained under a stronger augmentation paradigm. Our self-supervised method outperforms the state-of-the-art on the standard image matching benchmarks by noticeable margins and shows excellent generalization capability on multiple downstream tasks.

Two-Way Multi-Label Loss
Kobayashi, Takumi



Research question: How to effectively handle multi-label classification of natural images.
Motivation: Single-label classification methods cannot effectively handle the multi-label setting, and existing multi-label approaches suffer from issues such as class imbalance.
Method: Propose a multi-label loss based on relative comparison among classes, applicable in two ways to discriminate both classes and samples, which enhances the discriminative power of features.
Results: Experiments show competitive performance on multi-label classification, and the loss also yields transferable features in single-label ImageNet training.

A natural image frequently contains multiple classification targets, accordingly providing multiple class labels rather than a single label per image. While the single-label classification is effectively addressed by applying a softmax cross-entropy loss, the multi-label task is tackled mainly in a binary cross-entropy (BCE) framework. In contrast to the softmax loss, the BCE loss involves issues regarding imbalance as multiple classes are decomposed into a bunch of binary classifications; recent works improve the BCE loss to cope with the issue by means of weighting. In this paper, we propose a multi-label loss by bridging a gap between the softmax loss and the multi-label scenario. The proposed loss function is formulated on the basis of relative comparison among classes which also enables us to further improve discriminative power of features by enhancing classification margin. The loss function is so flexible as to be applicable to a multi-label setting in two ways for discriminating classes as well as samples. In the experiments on multi-label classification, the proposed method exhibits competitive performance to the other multi-label losses, and it also provides transferrable features on single-label ImageNet training. Codes are available at https://github.com/tk1980/TwowayMultiLabelLoss.
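A relative-comparison multi-label loss of the kind described, where every negative score is compared against every positive score inside one softmax-style term, can be sketched as follows. This is a hedged illustrative formulation (function name, `gamma`, and `margin` are assumptions), not the paper's exact loss.

```python
import numpy as np

def relative_multilabel_loss(scores, labels, gamma=1.0, margin=0.0):
    """Softmax-style multi-label loss via relative class comparison.

    scores: (C,) class scores for one sample.
    labels: (C,) binary multi-label ground truth.
    Every negative score is pushed below every positive score, avoiding
    the per-class imbalance of independent binary cross-entropy terms.
    """
    pos = scores[labels.astype(bool)]
    neg = scores[~labels.astype(bool)]
    # log(1 + sum over all (neg, pos) pairs of exp(gamma * (s_n - s_p + m)))
    return np.log1p(np.exp(gamma * (neg[None, :] - pos[:, None] + margin)).sum())
```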

Dionysus: Recovering Scene Structures by Dividing Into Semantic Pieces
Wang, Likang and Chen, Lei



Research question: Most existing 3D reconstruction methods suffer from either detail loss or unsatisfying efficiency.
Motivation: In real-world applications such as autonomous driving and augmented reality, effectiveness and efficiency are equally important, yet existing methods waste resources on valueless depth samples.
Method: Propose a novel learning-based 3D reconstruction framework named Dionysus, which finds the most promising depth candidates from estimated semantic maps.
Results: Unreliable depth candidates are distinguished by checking cross-view semantic consistency, and adaptive sampling is enabled by redistributing depth nominators among pixels. Experimental results confirm the effectiveness of the proposed framework.

Most existing 3D reconstruction methods result in either detail loss or unsatisfying efficiency. However, effectiveness and efficiency are equally crucial in real-world applications, e.g., autonomous driving and augmented reality. We argue that this dilemma comes from wasted resources on valueless depth samples. This paper tackles the problem by proposing a novel learning-based 3D reconstruction framework named Dionysus. Our main contribution is to find out the most promising depth candidates from estimated semantic maps. This strategy simultaneously enables high effectiveness and efficiency by attending to the most reliable nominators. Specifically, we distinguish unreliable depth candidates by checking the cross-view semantic consistency and allow adaptive sampling by redistributing depth nominators among pixels. Experiments on the most popular datasets confirm our proposed framework's effectiveness.

Noisy Correspondence Learning With Meta Similarity Correction
Han, Haochen and Miao, Kaiyao and Zheng, Qinghua and Luo, Minnan



Research question: Although multimodal learning has succeeded in cross-modal retrieval, it relies on correct correspondences among multimedia data, and collecting such ideal data is expensive and time-consuming.
Motivation: In practice, most widely used datasets are harvested from the Internet and inevitably contain mismatched pairs. Training on such noisy-correspondence data degrades performance because cross-modal retrieval methods may wrongly force mismatched data to be similar.
Method: We propose a Meta Similarity Correction Network (MSCN) to provide reliable similarity scores. We view a binary classification task as the meta-process, encouraging MSCN to learn discrimination from positive and negative meta-data. To further alleviate the influence of noise, we design an effective data purification strategy that uses meta-data as prior knowledge to remove noisy samples.
Results: Extensive experiments on both synthetic and real-world noise, including Flickr30K, MS-COCO, and Conceptual Captions, demonstrate the strengths of our method.

Despite the success of multimodal learning in cross-modal retrieval task, the remarkable progress relies on the correct correspondence among multimedia data. However, collecting such ideal data is expensive and time-consuming. In practice, most widely used datasets are harvested from the Internet and inevitably contain mismatched pairs. Training on such noisy correspondence datasets causes performance degradation because the cross-modal retrieval methods can wrongly enforce the mismatched data to be similar. To tackle this problem, we propose a Meta Similarity Correction Network (MSCN) to provide reliable similarity scores. We view a binary classification task as the meta-process that encourages the MSCN to learn discrimination from positive and negative meta-data. To further alleviate the influence of noise, we design an effective data purification strategy using meta-data as prior knowledge to remove the noisy samples. Extensive experiments are conducted to demonstrate the strengths of our method in both synthetic and real-world noises, including Flickr30K, MS-COCO, and Conceptual Captions.

PCR: Proxy-Based Contrastive Replay for Online Class-Incremental Continual Learning
Lin, Huiwei and Zhang, Baoquan and Feng, Shanshan and Li, Xutao and Ye, Yunming



Research question: Online class-incremental continual learning is a specific continual learning task that continuously learns new classes from a data stream whose samples are seen only once, and it suffers from catastrophic forgetting.
Motivation: Existing replay methods alleviate this by saving and replaying part of the old data in a proxy-based or contrastive-based manner, but both have limitations: the former inclines toward new classes due to class imbalance, while the latter is unstable and hard to converge because of the limited number of samples.
Method: This paper conducts a comprehensive analysis of the two replay manners and finds they can be complementary. Inspired by this finding, we propose a novel replay-based method called proxy-based contrastive replay (PCR), whose key operation is to replace the contrastive samples of anchors with the corresponding proxies in the contrastive manner.
Results: Extensive experiments on three real-world benchmark datasets consistently demonstrate the superiority of PCR over various state-of-the-art methods.

Online class-incremental continual learning is a specific task of continual learning. It aims to continuously learn new classes from a data stream whose samples are seen only once, and it suffers from the catastrophic forgetting issue, i.e., forgetting historical knowledge of old classes. Existing replay-based methods effectively alleviate this issue by saving and replaying part of old data in a proxy-based or contrastive-based replay manner. Although these two replay manners are effective, the former would incline to new classes due to class imbalance issues, and the latter is unstable and hard to converge because of the limited number of samples. In this paper, we conduct a comprehensive analysis of these two replay manners and find that they can be complementary. Inspired by this finding, we propose a novel replay-based method called proxy-based contrastive replay (PCR). The key operation is to replace the contrastive samples of anchors with corresponding proxies in the contrastive-based way. It alleviates the phenomenon of catastrophic forgetting by effectively addressing the imbalance issue, while maintaining faster convergence of the model. We conduct extensive experiments on three real-world benchmark datasets, and empirical results consistently demonstrate the superiority of PCR over various state-of-the-art methods.
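Replacing contrastive sample pairs with class proxies can be sketched as a softmax over anchor-proxy similarities. This is a minimal illustrative sketch: the function name `pcr_loss` and the use of classifier weights as proxies are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def pcr_loss(anchor, proxies, label, temperature=0.1):
    """Proxy-based contrastive loss for one anchor feature.

    anchor:  (D,) normalized feature of a replayed or new sample.
    proxies: (C, D) normalized class proxies (e.g. classifier weights).
    label:   index of the anchor's class.
    Contrasting against proxies instead of stored samples keeps the
    contrast stable when only a few exemplars per class can be saved.
    """
    logits = proxies @ anchor / temperature
    logits -= logits.max()                       # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[label])
```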

Multi-View Adversarial Discriminator: Mine the Non-Causal Factors for Object Detection in Unseen Domains
Xu, Mingjun and Qin, Lingyun and Chen, Weijie and Pu, Shiliang and Zhang, Lei



Research question: Domain shift degrades the performance of object detection models in practical applications.
Motivation: Previous domain adversarial learning (DAL) methods ignore the insignificant non-causal factors hidden in the common features, mainly due to the single-view nature of DAL.
Method: We propose a domain generalization model based on multi-view adversarial training on source domains, which removes the non-causal factors from the common features.
Results: Extensive experiments on six benchmarks show that our MAD model achieves state-of-the-art performance.

Domain shift degrades the performance of object detection models in practical applications. To alleviate the influence of domain shift, much previous work tries to decouple and learn the domain-invariant (common) features from source domains via domain adversarial learning (DAL). However, inspired by causal mechanisms, we find that previous methods ignore the implicit insignificant non-causal factors hidden in the common features. This is mainly due to the single-view nature of DAL. In this work, we present an idea to remove non-causal factors from common features by multi-view adversarial training on source domains, because we observe that such insignificant non-causal factors may still be significant in other latent spaces (views) due to the multi-mode structure of data. To summarize, we propose a Multi-view Adversarial Discriminator (MAD) based domain generalization model, consisting of a Spurious Correlations Generator (SCG) that increases the diversity of source domain by random augmentation and a Multi-View Domain Classifier (MVDC) that maps features to multiple latent spaces, such that the non-causal factors are removed and the domain-invariant features are purified. Extensive experiments on six benchmarks show our MAD obtains state-of-the-art performance.

MaskCon: Masked Contrastive Learning for Coarse-Labelled Dataset
Feng, Chen and Patras, Ioannis



Research question: How to train models with coarse labels in order to solve a finer-grained labelling problem.
Motivation: Annotating large-scale datasets is costly and difficult, especially in specialized domains that require fine-grained labels, whereas coarse labels are much easier to acquire and need no expert knowledge.
Method: We propose a contrastive learning method called masked contrastive learning (MaskCon): within a contrastive framework, a soft label is generated for each sample based on its coarse label, the other samples, and another augmented view of the sample itself.
Results: Experiments show significant improvements over the state of the art on the CIFAR10, CIFAR100, ImageNet-1K, Stanford Online Products, and Stanford Cars196 datasets.

Deep learning has achieved great success in recent years with the aid of advanced neural network structures and large-scale human-annotated datasets. However, it is often costly and difficult to accurately and efficiently annotate large-scale datasets, especially for some specialized domains where fine-grained labels are required. In this setting, coarse labels are much easier to acquire as they do not require expert knowledge. In this work, we propose a contrastive learning method, called masked contrastive learning (MaskCon), to address the under-explored problem setting where we learn with a coarse-labelled dataset in order to address a finer labelling problem. More specifically, within the contrastive learning framework, for each sample our method generates soft labels with the aid of coarse labels against other samples and another augmented view of the sample in question. In contrast to self-supervised contrastive learning, where only the sample's augmentations are considered hard positives, and to supervised contrastive learning, where only samples with the same coarse labels are considered hard positives, we propose soft labels based on sample distances that are masked by the coarse labels. This allows us to utilize both inter-sample relations and coarse labels. We demonstrate that our method can obtain as special cases many existing state-of-the-art works and that it provides tighter bounds on the generalization error. Experimentally, our method achieves significant improvement over the current state-of-the-art on various datasets, including the CIFAR10, CIFAR100, ImageNet-1K, Stanford Online Products and Stanford Cars196 datasets. Code and annotations are available at https://github.com/MrChenFeng/MaskCon_CVPR2023.

Instance Relation Graph Guided Source-Free Domain Adaptive Object Detection
VS, Vibashan and Oza, Poojan and Patel, Vishal M.



Research question: How to adapt a source-trained object detector to a target domain without access to the source data (source-free domain adaptive object detection).
Motivation: UDA methods assume the source data is accessible during adaptation, but in real-world scenarios labelled source data is often restricted due to privacy regulations, data transmission constraints, or proprietary concerns.
Method: We design a novel contrastive loss that enhances the target representations by exploiting object relations, which are modelled with an Instance Relation Graph (IRG) network and used to guide contrastive representation learning; a student-teacher framework further distills knowledge from the source-trained model to the target domain.
Results: Extensive experiments on multiple object detection benchmarks show that the approach efficiently adapts source-trained detectors to the target domain, outperforming state-of-the-art domain adaptive detection methods.

Unsupervised Domain Adaptation (UDA) is an effective approach to tackle the issue of domain shift. Specifically, UDA methods try to align the source and target representations to improve generalization on the target domain. Further, UDA methods work under the assumption that the source data is accessible during the adaptation process. However, in real-world scenarios, the labelled source data is often restricted due to privacy regulations, data transmission constraints, or proprietary data concerns. The Source-Free Domain Adaptation (SFDA) setting aims to alleviate these concerns by adapting a source-trained model for the target domain without requiring access to the source data. In this paper, we explore the SFDA setting for the task of adaptive object detection. To this end, we propose a novel training strategy for adapting a source-trained object detector to the target domain without source data. More precisely, we design a novel contrastive loss to enhance the target representations by exploiting the object relations for a given target domain input. These object instance relations are modelled using an Instance Relation Graph (IRG) network, which is then used to guide the contrastive representation learning. In addition, we utilize a student-teacher framework to effectively distill knowledge from the source-trained model to the target domain. Extensive experiments on multiple object detection benchmark datasets show that the proposed approach is able to efficiently adapt source-trained object detectors to the target domain, outperforming state-of-the-art domain adaptive detection methods. Code and models are provided at https://viudomain.github.io/irg-sfda-web/

DiGA: Distil To Generalize and Then Adapt for Domain Adaptive Semantic Segmentation
Shen, Fengyi and Gurram, Akhil and Liu, Ziyuan and Wang, He and Knoll, Alois



Research question: This paper addresses the challenges in each stage of domain adaptive semantic segmentation, including the limited performance gains of the warm-up stage and the threshold selection problem of the self-training stage.
Motivation: The adversarial training commonly used in the warm-up stage yields limited gains due to blind feature alignment, while finding proper categorical thresholds in the self-training stage is very tricky.
Method: A novel symmetric knowledge distillation module replaces adversarial training in the warm-up stage, making the model domain generalizable; a threshold-free dynamic pseudo-label selection mechanism then resolves the threshold problem in self-training and better adapts the model to the target domain.
Results: Experiments show remarkable and consistent improvements over prior arts on popular benchmarks.

Domain adaptive semantic segmentation methods commonly utilize stage-wise training, consisting of a warm-up and a self-training stage. However, this popular approach still faces several challenges in each stage: for warm-up, the widely adopted adversarial training often results in limited performance gain, due to blind feature alignment; for self-training, finding proper categorical thresholds is very tricky. To alleviate these issues, we first propose to replace the adversarial training in the warm-up stage by a novel symmetric knowledge distillation module that only accesses the source domain data and makes the model domain generalizable. Surprisingly, this domain generalizable warm-up model brings substantial performance improvement, which can be further amplified via our proposed cross-domain mixture data augmentation technique. Then, for the self-training stage, we propose a threshold-free dynamic pseudo-label selection mechanism to ease the aforementioned threshold problem and make the model better adapted to the target domain. Extensive experiments demonstrate that our framework achieves remarkable and consistent improvements compared to the prior arts on popular benchmarks. Codes and models are available at https://github.com/fy-vision/DiGA

Crossing the Gap: Domain Generalization for Image Captioning
Ren, Yuchen and Mao, Zhendong and Fang, Shancheng and Lu, Yan and He, Tong and Du, Hao and Zhang, Yongdong and Ouyang, Wanli



Research question: Existing image captioning methods assume that training and testing data come from the same domain, or that data from the target domain (the domain of the testing data) is accessible. This assumption is invalid in real-world applications where target-domain data is unseen during learning.
Motivation: To address this problem, we introduce a new setting called Domain Generalization for Image Captioning (DGIC), in which target-domain data is unseen during training.
Method: We first construct a benchmark dataset for DGIC, which helps us investigate models' domain generalization ability on unseen domains. With the support of this new benchmark, we further propose a framework called language-guided semantic metric learning (LSML) for the DGIC setting.
Results: Experiments on multiple datasets demonstrate the challenge of the task and the effectiveness of the newly proposed benchmark and LSML framework.

Existing image captioning methods are under the assumption that the training and testing data are from the same domain or that the data from the target domain (i.e., the domain that testing data lie in) are accessible. However, this assumption is invalid in real-world applications where the data from the target domain is inaccessible. In this paper, we introduce a new setting called Domain Generalization for Image Captioning (DGIC), where the data from the target domain is unseen in the learning process. We first construct a benchmark dataset for DGIC, which helps us to investigate models' domain generalization (DG) ability on unseen domains. With the support of the new benchmark, we further propose a new framework called language-guided semantic metric learning (LSML) for the DGIC setting. Experiments on multiple datasets demonstrate the challenge of the task and the effectiveness of our newly proposed benchmark and LSML framework.

Quantum Multi-Model Fitting
Farina, Matteo and Magri, Luca and Menapace, Willi and Ricci, Elisa and Golyanik, Vladislav and Arrigoni, Federica



Research question: This paper tackles geometric model fitting, a fundamental challenge in computer vision, with a focus on multi-model fitting.
Motivation: Quantum optimization has been shown to enhance robust fitting for a single model, but the multi-model fitting problem has remained open.
Method: We propose the first quantum multi-model fitting (MMF) approach, formulating MMF as a problem that can be efficiently sampled by modern adiabatic quantum computers, together with an iterative and decomposed version of the method that supports real-world-sized problems.
Results: Experiments demonstrate promising results on a variety of datasets.

Geometric model fitting is a challenging but fundamental computer vision problem. Recently, quantum optimization has been shown to enhance robust fitting for the case of a single model, while leaving the question of multi-model fitting open. In response to this challenge, this paper shows that the latter case can significantly benefit from quantum hardware and proposes the first quantum approach to multi-model fitting (MMF). We formulate MMF as a problem that can be efficiently sampled by modern adiabatic quantum computers without the relaxation of the objective function. We also propose an iterative and decomposed version of our method, which supports real-world-sized problems. The experimental evaluation demonstrates promising results on a variety of datasets. The source code is available at https://github.com/FarinaMatteo/qmmf.
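Adiabatic quantum computers sample low-energy states of QUBO problems, i.e., minimizers of x^T Q x over binary vectors x. A tiny classical sketch of casting a selection problem in that form and brute-forcing it (the paper's actual MMF-to-QUBO encoding is more elaborate and is not reproduced here; the toy Q matrix below is an assumption):

```python
import itertools
import numpy as np

def solve_qubo_bruteforce(Q):
    """Minimize x^T Q x over binary vectors x by exhaustive search.

    An adiabatic quantum computer samples low-energy states of the
    same objective; brute force stands in for the sampler here.
    """
    n = Q.shape[0]
    best_x, best_e = None, np.inf
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits)
        e = x @ Q @ x
        if e < best_e:
            best_x, best_e = x, e
    return best_x, float(best_e)

# Toy model selection: negative diagonal rewards picking a model,
# the positive off-diagonal penalizes the redundant pair (0, 1).
Q = np.array([[-3.0, 3.0, 0.0],
              [3.0, -2.0, 0.0],
              [0.0, 0.0, -2.0]])
x, energy = solve_qubo_bruteforce(Q)
```

The minimizer selects models 0 and 2 and skips the model that overlaps with a stronger one, which is the flavor of trade-off a multi-model fitting objective encodes.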

Learning a Deep Color Difference Metric for Photographic Images
Chen, Haoyu and Wang, Zhihua and Yang, Yang and Sun, Qilin and Ma, Kede



Research question: How to build a deep color difference (CD) metric for photographic images that computes accurate CDs between images, aligns with the vision-science observation that color and form are inextricably linked, is a proper metric in the mathematical sense, and is robust to mild geometric distortions.
Motivation: Most existing CD metrics are handcrafted and calibrated against uniformly colored patches, which do not generalize well to photographic images with natural scene complexities; constructing a deep CD metric for photographic images is therefore an active research topic.
Method: Learn a multi-scale autoregressive normalizing flow for feature transform, followed by a Euclidean distance that is linearly proportional to the human perceptual CD, satisfying all the desired properties at once.
Results: Quantitative and qualitative experiments on the large-scale SPCD dataset demonstrate the promise of the learned CD metric.

Most well-established and widely used color difference (CD) metrics are handcrafted and subject-calibrated against uniformly colored patches, which do not generalize well to photographic images characterized by natural scene complexities. Constructing CD formulae for photographic images is still an active research topic in imaging/illumination, vision science, and color science communities. In this paper, we aim to learn a deep CD metric for photographic images with four desirable properties. First, it well aligns with the observations in vision science that color and form are linked inextricably in visual cortical processing. Second, it is a proper metric in the mathematical sense. Third, it computes accurate CDs between photographic images, differing mainly in color appearances. Fourth, it is robust to mild geometric distortions (e.g., translation or due to parallax), which are often present in photographic images of the same scene captured by different digital cameras. We show that all these properties can be satisfied at once by learning a multi-scale autoregressive normalizing flow for feature transform, followed by the Euclidean distance which is linearly proportional to the human perceptual CD. Quantitative and qualitative experiments on the large-scale SPCD dataset demonstrate the promise of the learned CD metric.

Self-Correctable and Adaptable Inference for Generalizable Human Pose Estimation
Kan, Zhehan and Chen, Shuoshuo and Zhang, Ce and Tang, Yushun and He, Zhihai



Research question: The generalization problem that pervades human pose estimation as well as many other machine learning and prediction tasks.
Motivation: Current prediction networks cannot characterize the prediction error, generate feedback from the test sample, or correct the prediction on the fly for each individual test sample, which degrades generalization performance.
Method: We introduce a self-correctable and adaptable inference (SCAI) method, demonstrated on human pose estimation. A correction network corrects the prediction conditioned on a fitness feedback error, which is produced by a learned fitness feedback network that maps the prediction back to the original input domain and compares it against the original input.
Results: Extensive experiments on human pose estimation show that the proposed SCAI method significantly improves generalization capability and performance.

A central challenge in human pose estimation, as well as in many other machine learning and prediction tasks, is the generalization problem. The learned network does not have the capability to characterize the prediction error, generate feedback information from the test sample, and correct the prediction error on the fly for each individual test sample, which results in degraded performance in generalization. In this work, we introduce a self-correctable and adaptable inference (SCAI) method to address the generalization challenge of network prediction and use human pose estimation as an example to demonstrate its effectiveness and performance. We learn a correction network to correct the prediction result conditioned by a fitness feedback error. This feedback error is generated by a learned fitness feedback network which maps the prediction result to the original input domain and compares it against the original input. Interestingly, we find that this self-referential feedback error is highly correlated with the actual prediction error. This strong correlation suggests that we can use this error as feedback to guide the correction process. It can be also used as a loss function to quickly adapt and optimize the correction network during the inference process. Our extensive experimental results on human pose estimation demonstrate that the proposed SCAI method is able to significantly improve the generalization capability and performance of human pose estimation.

Few-Shot Learning With Visual Distribution Calibration and Cross-Modal Distribution Alignment
Wang, Runqi and Zheng, Hao and Duan, Xiaoyue and Liu, Jianzhuang and Lu, Yuning and Wang, Tian and Xu, Songcen and Zhang, Baochang



Research question: Pre-trained vision-language models face two crucial problems in few-shot learning: the visual feature distributions of images are easily distracted by class-irrelevant information, and aligning the visual and language feature distributions is difficult.
Motivation: To handle the distraction, we propose a Selective Attack module whose trainable adapters generate spatial attention maps that guide attacks on class-irrelevant image regions; perturbing these regions captures the critical features and calibrates the visual distribution of image features. To better align the visual and language feature distributions describing the same object class, we propose a cross-modal distribution alignment module that introduces a vision-language prototype for each class and optimizes the prototypes with the Earth Mover's Distance (EMD).
Method: Our approach consists of the Selective Attack module, the cross-modal distribution alignment module, and an augmentation strategy for images and text prompts.
Results: Extensive experiments on 11 datasets show that our method consistently outperforms prior arts in few-shot learning.

Pre-trained vision-language models have inspired much research on few-shot learning. However, with only a few training images, there exist two crucial problems: (1) the visual feature distributions are easily distracted by class-irrelevant information in images, and (2) the alignment between the visual and language feature distributions is difficult. To deal with the distraction problem, we propose a Selective Attack module, which consists of trainable adapters that generate spatial attention maps of images to guide the attacks on class-irrelevant image areas. By messing up these areas, the critical features are captured and the visual distributions of image features are calibrated. To better align the visual and language feature distributions that describe the same object class, we propose a cross-modal distribution alignment module, in which we introduce a vision-language prototype for each class to align the distributions, and adopt the Earth Mover's Distance (EMD) to optimize the prototypes. For efficient computation, the upper bound of EMD is derived. In addition, we propose an augmentation strategy to increase the diversity of the images and the text prompts, which can reduce overfitting to the few-shot training images. Extensive experiments on 11 datasets demonstrate that our method consistently outperforms prior arts in few-shot learning.
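The Earth Mover's Distance used here to optimize the prototypes has a simple closed form in one dimension, which illustrates the quantity being minimized (this sketch shows only the 1-D case; the paper's high-dimensional prototype transport and its derived upper bound are different):

```python
import numpy as np

def emd_1d(p, q):
    """EMD between two histograms over the same equally spaced 1-D bins.

    In 1-D, the optimal transport cost equals the L1 distance between
    the cumulative distributions of the two (normalized) histograms.
    """
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    return float(np.abs(np.cumsum(p - q)).sum())

near = emd_1d([1, 0, 0], [0, 1, 0])   # mass moves one bin
far = emd_1d([1, 0, 0], [0, 0, 1])    # mass moves two bins
```

Unlike a bin-wise distance, EMD grows with how far the probability mass must travel, which is why it suits comparing feature distributions rather than aligned coordinates.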

A Strong Baseline for Generalized Few-Shot Semantic Segmentation
Hajimiri, Sina and Boudiaf, Malik and Ben Ayed, Ismail and Dolz, Jose



Research question: This paper aims to propose a generalized few-shot segmentation framework with a straightforward training process and an easy-to-optimize inference phase.
Motivation: Existing few-shot segmentation models have complex training processes and inference phases that are hard to optimize.
Method: A simple yet effective model based on the InfoMax principle, which maximizes the mutual information (MI) between the learned feature representations and their corresponding predictions, coupled with a knowledge distillation term to retain base-class knowledge.
Results: On the popular few-shot segmentation benchmarks PASCAL-5^i and COCO-20^i, the proposed inference yields substantial improvements; for novel classes in particular, the gains range from 7% to 26% (PASCAL-5^i) and from 3% to 12% (COCO-20^i) in the 1-shot and 5-shot scenarios, respectively.

This paper introduces a generalized few-shot segmentation framework with a straightforward training process and an easy-to-optimize inference phase. In particular, we propose a simple yet effective model based on the well-known InfoMax principle, where the Mutual Information (MI) between the learned feature representations and their corresponding predictions is maximized. In addition, the terms derived from our MI-based formulation are coupled with a knowledge distillation term to retain the knowledge on base classes. With a simple training process, our inference model can be applied on top of any segmentation network trained on base classes. The proposed inference yields substantial improvements on the popular few-shot segmentation benchmarks, PASCAL-5^i and COCO-20^i. Particularly, for novel classes, the improvement gains range from 7% to 26% (PASCAL-5^i) and from 3% to 12% (COCO-20^i) in the 1-shot and 5-shot scenarios, respectively. Furthermore, we propose a more challenging setting, where performance gaps are further exacerbated. Our code is publicly available at https://github.com/sinahmr/DIaM.
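The InfoMax quantity maximized at inference decomposes, for a batch of softmax predictions, into a marginal-entropy term minus a conditional-entropy term. A rough numpy sketch of that decomposition (the paper's exact weighting and its knowledge distillation term are omitted):

```python
import numpy as np

def mutual_information(probs, eps=1e-12):
    """MI between inputs and predictions for a batch of softmax outputs.

    MI = H(marginal prediction) - mean per-sample entropy: it is high
    when individual predictions are confident (low conditional entropy)
    while the batch as a whole covers many classes (high marginal entropy).
    """
    marginal = probs.mean(axis=0)
    h_marginal = -(marginal * np.log(marginal + eps)).sum()
    h_cond = -(probs * np.log(probs + eps)).sum(axis=1).mean()
    return h_marginal - h_cond

confident_diverse = np.array([[0.98, 0.01, 0.01],
                              [0.01, 0.98, 0.01],
                              [0.01, 0.01, 0.98]])
uniform = np.full((3, 3), 1.0 / 3.0)
```

Confident, class-diverse predictions score high; uniform predictions carry no information about which input produced them and score near zero.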

Bias-Eliminating Augmentation Learning for Debiased Federated Learning
Xu, Yuan-Yi and Lin, Ci-Siang and Wang, Yu-Chiang Frank



Research question: Models trained on biased datasets tend to observe correlations between categories and undesirable features, which degrades performance.
Motivation: Existing debiased learning models are designed for centralized machine learning and cannot be directly applied to distributed settings such as federated learning.
Method: We propose a novel federated learning framework, Bias-Eliminating Augmentation Learning (FedBEAL), which produces client-specific bias-conflicting samples at each client.
Results: Image classification experiments on datasets with various bias types confirm the effectiveness and applicability of FedBEAL, which outperforms state-of-the-art debiasing and federated learning methods.

Learning models trained on biased datasets tend to observe correlations between categorical and undesirable features, which result in degraded performances. Most existing debiased learning models are designed for centralized machine learning, which cannot be directly applied to distributed settings like federated learning (FL), which collects data at distinct clients with privacy preserved. To tackle the challenging task of debiased federated learning, we present a novel FL framework of Bias-Eliminating Augmentation Learning (FedBEAL), which learns to deploy Bias-Eliminating Augmenters (BEA) for producing client-specific bias-conflicting samples at each client. Since the bias types or attributes are not known in advance, a unique learning strategy is presented to jointly train BEA with the proposed FL framework. Extensive image classification experiments on datasets with various bias types confirm the effectiveness and applicability of our FedBEAL, which performs favorably against state-of-the-art debiasing and FL methods for debiased FL.

Generalist: Decoupling Natural and Robust Generalization
Wang, Hongjun and Wang, Yisen



Research question: How to improve a model's natural generalization and adversarial generalization simultaneously, avoiding the drop in natural generalization that comes with defending against adversarial examples.
Motivation: Existing adversarial training methods improve adversarial generalization but cause natural generalization to degrade.
Method: We propose a bi-expert framework, Generalist, that decouples natural and robust generalization and trains each separately. The parameters of the base learners are collected and combined into a global learner at intervals, which is then distributed back to the base learners as initialization for continued training.
Results: We prove that the risk of Generalist decreases once the base learners are well trained, and extensive experiments verify that Generalist achieves high natural accuracy while maintaining considerable adversarial robustness.

Deep neural networks obtained by standard training have been constantly plagued by adversarial examples. Although adversarial training demonstrates its capability to defend against adversarial examples, unfortunately, it leads to an inevitable drop in the natural generalization. To address the issue, we decouple the natural generalization and the robust generalization from joint training and formulate different training strategies for each one. Specifically, instead of minimizing a global loss on the expectation over these two generalization errors, we propose a bi-expert framework called Generalist where we simultaneously train base learners with task-aware strategies so that they can specialize in their own fields. The parameters of base learners are collected and combined to form a global learner at intervals during the training process. The global learner is then distributed to the base learners as initialized parameters for continued training. Theoretically, we prove that the risks of Generalist will get lower once the base learners are well trained. Extensive experiments verify the applicability of Generalist to achieve high accuracy on natural examples while maintaining considerable robustness to adversarial ones. Code is available at https://github.com/PKU-ML/Generalist.

Learning Decorrelated Representations Efficiently Using Fast Fourier Transform
Shigeto, Yutaro and Shimbo, Masashi and Yoshikawa, Yuya and Takeuchi, Akikazu



Research question: How to reduce the training complexity of self-supervised representation learning models.
Motivation: Self-supervised models such as Barlow Twins and VICReg are effective, but decorrelating high-dimensional embeddings demands substantial computation.
Method: We propose a relaxed decorrelating regularizer computable in O(n d log d) time via the Fast Fourier Transform, together with an inexpensive technique to mitigate the undesirable local minima introduced by the relaxation.
Results: The proposed regularizer matches existing regularizers in downstream accuracy while requiring less memory and training faster for large d.

Barlow Twins and VICReg are self-supervised representation learning models that use regularizers to decorrelate features. Although these models are as effective as conventional representation learning models, their training can be computationally demanding if the dimension d of the projected embeddings is high. As the regularizers are defined in terms of individual elements of a cross-correlation or covariance matrix, computing the loss for n samples takes O(n d^2) time. In this paper, we propose a relaxed decorrelating regularizer that can be computed in O(n d log d) time by Fast Fourier Transform. We also propose an inexpensive technique to mitigate undesirable local minima that develop with the relaxation. The proposed regularizer exhibits accuracy comparable to that of existing regularizers in downstream tasks, whereas their training requires less memory and is faster for large d. The source code is available.
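The complexity argument can be made concrete: element-wise penalties on a d x d cross-correlation matrix cost O(d^2) per sample, whereas all d circular cross-correlations can be obtained in O(d log d) via the FFT convolution theorem. A minimal numpy sketch of this idea (the paper's actual relaxed regularizer and its local-minima mitigation differ in detail):

```python
import numpy as np

def circular_decorrelation_penalty(z1, z2):
    """Penalize circular cross-correlations at all nonzero shifts.

    For each sample, the full set of d circular cross-correlations is
    computed in O(d log d) via the FFT convolution theorem, instead of
    forming a d x d cross-correlation matrix in O(d^2).
    """
    f1 = np.fft.fft(z1, axis=1)
    f2 = np.fft.fft(z2, axis=1)
    # Inverse FFT of the conjugate product = circular cross-correlation.
    corr = np.fft.ifft(f1 * np.conj(f2), axis=1).real  # (n, d)
    mean_corr = corr.mean(axis=0)
    # Shift 0 is the alignment term; penalize the nonzero shifts only.
    return float((mean_corr[1:] ** 2).sum())

# Energy concentrated in one coordinate: zero off-shift correlation.
sharp = np.tile([1.0, 0.0, 0.0, 0.0], (4, 1))
# Constant features: maximal correlation at every shift.
flat = np.full((4, 4), 0.5)
```

Features whose energy is spread uniformly across coordinates are penalized, while features that are already "decorrelated" across circular shifts are not.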

Cross-Image-Attention for Conditional Embeddings in Deep Metric Learning
Kotovenko, Dmytro and Ma, Pingchuan and Milbich, Timo and Ommer, Björn



Research question: How to learn compact image embeddings that yield semantic similarities between images and generalize to unseen test classes.
Motivation: Current deep metric learning (DML) methods face a challenge when mapping rich, localized feature maps onto compact embedding vectors: information in one image is marginalized out before the similarity to another image is computed.
Method: We propose, during training, to condition the embedding of an image on the image it is to be compared with. Cross-attention lets one image identify relevant features in the other, establishing a hierarchy of conditional embeddings that gradually incorporates information about the tuple to steer the representation of an individual image.
Results: Experiments show that this approach significantly improves the underlying standard DML pipeline and outperforms the state of the art on established DML benchmarks.

Learning compact image embeddings that yield semantic similarities between images and that generalize to unseen test classes, is at the core of deep metric learning (DML). Finding a mapping from a rich, localized image feature map onto a compact embedding vector is challenging: Although similarity emerges between tuples of images, DML approaches marginalize out information in an individual image before considering another image to which similarity is to be computed. Instead, we propose during training to condition the embedding of an image on the image we want to compare it to. Rather than embedding by a simple pooling as in standard DML, we use cross-attention so that one image can identify relevant features in the other image. Consequently, the attention mechanism establishes a hierarchy of conditional embeddings that gradually incorporates information about the tuple to steer the representation of an individual image. The cross-attention layers bridge the gap between the original unconditional embedding and the final similarity and allow backpropagation to update encodings more directly than through a lossy pooling layer. At test time we use the resulting improved unconditional embeddings, thus requiring no additional parameters or computational overhead. Experiments on established DML benchmarks show that our cross-attention conditional embedding during training improves the underlying standard DML pipeline significantly so that it outperforms the state-of-the-art.

Class-Balancing Diffusion Models
Qin, Yiming and Zheng, Huangjie and Yao, Jiangchao and Zhou, Mingyuan and Zhang, Ya



Research question: Diffusion models suffer significant drops in diversity and fidelity when trained on long-tailed data distributions.
Motivation: On class-imbalanced data, generations for tail classes largely lose diversity, and existing diffusion models exhibit severe mode-collapse issues.
Method: We propose Class-Balancing Diffusion Models (CBDM), trained with a distribution adjustment regularizer.
Results: Experiments show that images generated by CBDM exhibit higher diversity and quality both quantitatively and qualitatively; the method is benchmarked on the CIFAR100/CIFAR100LT datasets and shows outstanding performance on the downstream recognition task.

Diffusion-based models have shown the merits of generating high-quality visual data while preserving better diversity in recent studies. However, such observation is only justified with curated data distribution, where the data samples are nicely pre-processed to be uniformly distributed in terms of their labels. In practice, a long-tailed data distribution appears more common and how diffusion models perform on such class-imbalanced data remains unknown. In this work, we first investigate this problem and observe significant degradation in both diversity and fidelity when the diffusion model is trained on datasets with class-imbalanced distributions. Especially in tail classes, the generations largely lose diversity and we observe severe mode-collapse issues. To tackle this problem, we set from the hypothesis that the data distribution is not class-balanced, and propose Class-Balancing Diffusion Models (CBDM) that are trained with a distribution adjustment regularizer as a solution. Experiments show that images generated by CBDM exhibit higher diversity and quality in both quantitative and qualitative ways. Our method benchmarked the generation results on CIFAR100/CIFAR100LT dataset and shows outstanding performance on the downstream recognition task.

Feature Alignment and Uniformity for Test Time Adaptation
Wang, Shuai and Zhang, Daoan and Yan, Zipei and Zhang, Jianguo and Li, Rui



Research question: This paper addresses adapting deep neural networks when they receive out-of-distribution test samples.
Motivation: Because of the domain gap between source and target domains, we are the first to treat test-time adaptation (TTA) as a feature revision problem.
Method: We propose a test-time self-distillation strategy to guarantee consistency between the representations of the current batch and all previous batches, and a memorized spatial local clustering strategy to align the representations of neighborhood samples for the upcoming batch. To handle the common noisy-label problem, entropy and consistency filters select and drop possible noisy labels.
Results: Experiments show that our method not only yields stable improvements over the baseline but also outperforms existing state-of-the-art test-time adaptation methods.

Test time adaptation (TTA) aims to adapt deep neural networks when receiving out of distribution test domain samples. In this setting, the model can only access online unlabeled test samples and pre-trained models on the training domains. We first address TTA as a feature revision problem due to the domain gap between source domains and target domains. After that, we follow the two measurements alignment and uniformity to discuss the test time feature revision. For test time feature uniformity, we propose a test time self-distillation strategy to guarantee the consistency of uniformity between representations of the current batch and all the previous batches. For test time feature alignment, we propose a memorized spatial local clustering strategy to align the representations among the neighborhood samples for the upcoming batch. To deal with the common noisy label problem, we propose entropy and consistency filters to select and drop the possible noisy labels. To prove the scalability and efficacy of our method, we conduct experiments on four domain generalization benchmarks and four medical image segmentation tasks with various backbones. Experiment results show that our method not only yields stable improvements over the baseline but also outperforms existing state-of-the-art test time adaptation methods.

Balanced Product of Calibrated Experts for Long-Tailed Recognition
Aimar, Emanuel Sanchez and Jonnarth, Arvid and Felsberg, Michael and Kuhlmann, Marco



Research question: Real-world recognition problems often have long-tailed label distributions, which pose a challenge for representation learning.
Motivation: When the test distribution differs from the training distribution, e.g., uniform versus long-tailed, the distribution shift needs to be addressed.
Method: We propose the Balanced Product of Experts (BalPoE), which encourages diversity by adjusting the experts' logits and combines a family of experts with different test-time target distributions.
Results: The method achieves new state-of-the-art results on three long-tailed datasets: CIFAR-100-LT, ImageNet-LT, and iNaturalist-2018.

Many real-world recognition problems are characterized by long-tailed label distributions. These distributions make representation learning highly challenging due to limited generalization over the tail classes. If the test distribution differs from the training distribution, e.g. uniform versus long-tailed, the problem of the distribution shift needs to be addressed. A recent line of work proposes learning multiple diverse experts to tackle this issue. Ensemble diversity is encouraged by various techniques, e.g. by specializing different experts in the head and the tail classes. In this work, we take an analytical approach and extend the notion of logit adjustment to ensembles to form a Balanced Product of Experts (BalPoE). BalPoE combines a family of experts with different test-time target distributions, generalizing several previous approaches. We show how to properly define these distributions and combine the experts in order to achieve unbiased predictions, by proving that the ensemble is Fisher-consistent for minimizing the balanced error. Our theoretical analysis shows that our balanced ensemble requires calibrated experts, which we achieve in practice using mixup. We conduct extensive experiments and our method obtains new state-of-the-art results on three long-tailed datasets: CIFAR-100-LT, ImageNet-LT, and iNaturalist-2018. Our code is available at https://github.com/emasa/BalPoE-CalibratedLT.
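The logit-adjustment idea that BalPoE extends to ensembles can be sketched as follows: each expert adds a scaled log-prior to its logits, and the adjusted logits are averaged. This is a hedged toy illustration only; the scaling scheme, expert parameterization, and the Fisher-consistency conditions in the paper are more involved:

```python
import numpy as np

def balanced_ensemble(logits_per_expert, class_prior, lambdas):
    """Average expert logits, each adjusted by a scaled log-prior.

    A negative lambda pushes an expert toward tail classes, a positive
    one toward head classes; mixing experts with different lambdas
    targets different test-time class distributions.
    """
    log_prior = np.log(class_prior)
    adjusted = [z + lam * log_prior
                for z, lam in zip(logits_per_expert, lambdas)]
    return np.mean(adjusted, axis=0)

# A long-tailed prior: class 0 is the head, class 2 the tail.
prior = np.array([0.7, 0.2, 0.1])
# One expert whose raw logits tie the head and tail classes.
logits = [np.array([2.0, 0.0, 2.0])]
head_biased = balanced_ensemble(logits, prior, lambdas=[1.0])
tail_biased = balanced_ensemble(logits, prior, lambdas=[-1.0])
```

Flipping the sign of lambda flips which class wins the tie, showing how the adjustment retargets the same expert to a different test-time distribution.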

RMLVQA: A Margin Loss Approach for Visual Question Answering With Language Biases
Basu, Abhipsa and Addepalli, Sravanti and Babu, R. Venkatesh



Research question: Visual question answering models suffer from language biases: the model learns a correlation between the question and the answer while ignoring the image.
Motivation: Early works attempted to reduce this bias with question-only models or data augmentation, with limited success; we therefore propose an adaptive margin loss with two components.
Method: The first component considers the frequency of answers within a question type in the training data, addressing the language bias caused by class imbalance. The second component learns instance-specific margins, allowing the model to distinguish samples of varying complexity. We further introduce a bias-injecting component into the model and compute the instance-specific margins from its confidence; these are combined with the estimated margins so that the training loss accounts for both answer frequency and task complexity.
Results: Experiments show that while the margin loss is effective on out-of-distribution (ood) data, the bias-injecting component is essential for generalizing to in-distribution (id) data. Our RMLVQA method outperforms augmentation-free methods on benchmark VQA datasets with language biases while remaining competitive on id data, making it the most robust among all comparable methods.

Visual Question Answering models have been shown to suffer from language biases, where the model learns a correlation between the question and the answer, ignoring the image. While early works attempted to use question-only models or data augmentations to reduce this bias, we propose an adaptive margin loss approach having two components. The first component considers the frequency of answers within a question type in the training data, which addresses the concern of the class-imbalance causing the language biases. However, it does not take into account the answering difficulty of the samples, which impacts their learning. We address this through the second component, where instance-specific margins are learnt, allowing the model to distinguish between samples of varying complexity. We introduce a bias-injecting component to our model, and compute the instance-specific margins from the confidence of this component. We combine these with the estimated margins to consider both answer-frequency and task-complexity in the training loss. We show that, while the margin loss is effective for out-of-distribution (ood) data, the bias-injecting component is essential for generalising to in-distribution (id) data. Our proposed approach, Robust Margin Loss for Visual Question Answering (RMLVQA) improves upon the existing state-of-the-art results when compared to augmentation-free methods on benchmark VQA datasets suffering from language biases, while maintaining competitive performance on id data, making our method the most robust one among all comparable methods.
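A frequency-aware margin of the kind described in the first component can be sketched as follows: rarer answers within a question type receive a larger margin on their logit, in the spirit of LDAM-style losses. The exact scaling rule is an assumption, and the instance-specific margins and bias-injecting component are omitted:

```python
import numpy as np

def frequency_margin_loss(logits, target, answer_freq, scale=0.5):
    """Cross-entropy with a per-answer margin inversely tied to frequency.

    The target logit is reduced by a margin that grows as the answer
    becomes rarer, forcing a larger decision gap for tail answers.
    """
    margins = scale / np.power(answer_freq, 0.25)
    z = logits.copy()
    z[target] -= margins[target]          # enlarge the required gap
    z -= z.max()                          # numerical stability
    return float(-(z[target] - np.log(np.exp(z).sum())))

freqs = np.array([1000.0, 10.0])          # frequent vs. rare answer
logits = np.array([1.0, 1.0])
loss_frequent = frequency_margin_loss(logits, 0, freqs)
loss_rare = frequency_margin_loss(logits, 1, freqs)
```

At equal logits, the rare answer incurs the higher loss, so the model is pushed to separate rare answers by a wider margin.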

Gradient-Based Uncertainty Attribution for Explainable Bayesian Deep Learning
Wang, Hanjing and Joshi, Dhiraj and Wang, Shiqiang and Ji, Qiang



Research question: Predictions of deep learning models are prone to data perturbations, adversarial attacks, and out-of-distribution inputs, so prediction uncertainty must be accurately quantified.
Motivation: Building trusted AI systems requires identifying the sources of uncertainty and mitigating their effects on predictions.
Method: We develop explainable and actionable Bayesian deep learning methods that not only quantify uncertainty accurately but also explain it, identify its sources, and propose strategies to mitigate its impact. Specifically, we introduce a gradient-based uncertainty attribution method that identifies the most problematic input regions contributing to prediction uncertainty.
Results: Compared with existing methods, the proposed approach offers competitive accuracy, relaxed assumptions, and high efficiency. We further propose an uncertainty mitigation strategy that uses the attribution results as attention to improve model performance; both qualitative and quantitative evaluations demonstrate its effectiveness.

Predictions made by deep learning models are prone to data perturbations, adversarial attacks, and out-of-distribution inputs. To build a trusted AI system, it is therefore critical to accurately quantify the prediction uncertainties. While current efforts focus on improving uncertainty quantification accuracy and efficiency, there is a need to identify uncertainty sources and take actions to mitigate their effects on predictions. Therefore, we propose to develop explainable and actionable Bayesian deep learning methods to not only perform accurate uncertainty quantification but also explain the uncertainties, identify their sources, and propose strategies to mitigate the uncertainty impacts. Specifically, we introduce a gradient-based uncertainty attribution method to identify the most problematic regions of the input that contribute to the prediction uncertainty. Compared to existing methods, the proposed UA-Backprop has competitive accuracy, relaxed assumptions, and high efficiency. Moreover, we propose an uncertainty mitigation strategy that leverages the attribution results as attention to further improve the model performance. Both qualitative and quantitative evaluations are conducted to demonstrate the effectiveness of our proposed methods.

Manipulating Transfer Learning for Property Inference
Tian, Yulong and Suya, Fnu and Suri, Anshuman and Xu, Fengyuan and Evans, David



Research question: This paper studies how an adversary controlling an upstream model in transfer learning can conduct property inference attacks on a victim's tuned downstream model.
Motivation: Transfer learning is a popular way to tune pretrained (upstream) models for downstream tasks with limited data and compute, yet a manipulated upstream model may leak properties of the downstream training set, e.g., the presence of images of a specific individual.
Method: Manipulate the upstream model so that it generates activations (intermediate features) with different distributions for samples with and without a target property, enabling the adversary to easily distinguish downstream models trained with and without examples that have the property.
Results: The attacks are highly effective and specific (AUC score > 0.9) without incurring significant performance loss on the main task.

Transfer learning is a popular method for tuning pretrained (upstream) models for different downstream tasks using limited data and computational resources. We study how an adversary with control over an upstream model used in transfer learning can conduct property inference attacks on a victim's tuned downstream model, for example to infer the presence of images of a specific individual in the downstream training set. We demonstrate attacks in which an adversary can manipulate the upstream model to conduct highly effective and specific property inference attacks (AUC score > 0.9), without incurring significant performance loss on the main task. The main idea of the manipulation is to make the upstream model generate activations (intermediate features) with different distributions for samples with and without a target property, thus enabling the adversary to distinguish easily between downstream models trained with and without training examples that have the target property. Our code is available at https://github.com/yulongt23/Transfer-Inference.

Class Adaptive Network Calibration
Liu, Bingyuan and Rony, Jérôme and Galdran, Adrian and Dolz, Jose and Ben Ayed, Ismail



Research question: Beyond conventional accuracy, calibration should also be considered when training modern deep neural networks.
Motivation: Existing methods for addressing miscalibration during learning share two major drawbacks: 1) a single scalar balancing weight for all classes, which hinders handling different intrinsic class difficulties or class imbalance; 2) the balancing weight is usually fixed without an adaptive strategy, which may prevent the best accuracy-calibration trade-off and requires a hyper-parameter search for each application.
Method: We propose Class Adaptive Label Smoothing (CALS) for calibrating deep networks, which learns class-wise multipliers during training, yielding a powerful alternative to common label smoothing penalties.
Results: Comprehensive evaluation and multiple comparisons on a variety of benchmarks, including standard and long-tailed image classification, semantic segmentation, and text classification, demonstrate the superiority of the proposed method.

Recent studies have revealed that, beyond conventional accuracy, calibration should also be considered for training modern deep neural networks. To address miscalibration during learning, some methods have explored different penalty functions as part of the learning objective, alongside a standard classification loss, with a hyper-parameter controlling the relative contribution of each term. Nevertheless, these methods share two major drawbacks: 1) the scalar balancing weight is the same for all classes, hindering the ability to address different intrinsic difficulties or imbalance among classes; and 2) the balancing weight is usually fixed without an adaptive strategy, which may prevent reaching the best compromise between accuracy and calibration, and requires hyper-parameter search for each application. We propose Class Adaptive Label Smoothing (CALS) for calibrating deep networks, which allows learning class-wise multipliers during training, yielding a powerful alternative to common label smoothing penalties. Our method builds on a general Augmented Lagrangian approach, a well-established technique in constrained optimization, but we introduce several modifications to tailor it for large-scale, class-adaptive training. Comprehensive evaluation and multiple comparisons on a variety of benchmarks, including standard and long-tailed image classification, semantic segmentation, and text classification, demonstrate the superiority of the proposed method. The code is available at https://github.com/by-liu/CALS.
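To make the class-wise idea concrete, here is a minimal sketch of a calibration penalty with per-class multipliers. It is not CALS itself: the paper learns the multipliers with an Augmented Lagrangian scheme, whereas this hypothetical `cals_like_loss` takes them as fixed inputs and uses a simple quadratic pull toward the uniform prior.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cals_like_loss(logits, target, lam):
    """Cross-entropy plus a class-wise smoothing penalty: lam[c] controls how
    strongly class c's probability is pushed toward the uniform prior 1/K.
    Sketch only; CALS learns lam via an Augmented Lagrangian method."""
    p = softmax(logits)
    ce = -math.log(p[target])
    K = len(logits)
    penalty = sum(lam[c] * (p[c] - 1.0 / K) ** 2 for c in range(K))
    return ce + penalty

# With lam = 0 the loss reduces to plain cross-entropy; raising lam on a
# class penalizes overconfident predictions for that class more strongly.
plain = cals_like_loss([2.0, 0.0, 0.0], 0, [0.0, 0.0, 0.0])
smoothed = cals_like_loss([2.0, 0.0, 0.0], 0, [1.0, 1.0, 1.0])
```

The point of the class-wise design is visible here: a hard or imbalanced class can receive its own multiplier instead of the single scalar weight used by prior penalties.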

TeSLA: Test-Time Self-Learning With Automatic Adversarial Augmentation
Tomar, Devavrat and Vray, Guillaume and Bozorgtabar, Behzad and Thiran, Jean-Philippe



Research question: How to adapt a pre-trained source model to unlabeled streaming test data.
Motivation: Existing test-time adaptation methods focus mainly on classification tasks, use specialized network architectures, destroy model calibration, or rely on lightweight information from the source domain.
Method: This paper proposes TeSLA, a novel test-time self-learning method with automatic adversarial augmentation for adapting a pre-trained source model to unlabeled streaming test data. Unlike conventional cross-entropy-based self-learning, it introduces a new test-time loss function with an implicitly tight connection to mutual information and online knowledge distillation, plus a learnable, efficient adversarial augmentation module that further strengthens online knowledge distillation by simulating high-entropy augmented images.
Results: The method achieves state-of-the-art classification and segmentation results on several benchmarks and types of domain shift, particularly on challenging measurement shifts of medical images. Compared with competing methods, TeSLA also offers desirable properties in calibration, uncertainty metrics, and insensitivity to model architecture and source training strategy, all supported by extensive ablations.

Most recent test-time adaptation methods focus on only classification tasks, use specialized network architectures, destroy model calibration or rely on lightweight information from the source domain. To tackle these issues, this paper proposes a novel Test-time Self-Learning method with automatic Adversarial augmentation dubbed TeSLA for adapting a pre-trained source model to the unlabeled streaming test data. In contrast to conventional self-learning methods based on cross-entropy, we introduce a new test-time loss function through an implicitly tight connection with the mutual information and online knowledge distillation. Furthermore, we propose a learnable efficient adversarial augmentation module that further enhances online knowledge distillation by simulating high entropy augmented images. Our method achieves state-of-the-art classification and segmentation results on several benchmarks and types of domain shifts, particularly on challenging measurement shifts of medical images. TeSLA also benefits from several desirable properties compared to competing methods in terms of calibration, uncertainty metrics, insensitivity to model architectures, and source training strategies, all supported by extensive ablations. Our code and models are available at https://github.com/devavratTomar/TeSLA.

Promoting Semantic Connectivity: Dual Nearest Neighbors Contrastive Learning for Unsupervised Domain Generalization
Liu, Yuchen and Wang, Yaoming and Chen, Yabo and Dai, Wenrui and Li, Chenglin and Zou, Junni and Xiong, Hongkai



Research question: Current domain generalization (DG) methods rely heavily on costly labeled source data, while unlabeled data are far more accessible; this work therefore studies the more practical unsupervised domain generalization (UDG) problem.
Motivation: Contrastive learning, which learns invariant visual representations from different views, yields good semantic features for in-domain unsupervised learning but fails in cross-domain scenarios; this work aims to resolve that failure.
Method: We first delve into why vanilla contrastive learning fails and identify semantic connectivity as the key to UDG: suppressing intra-domain connectivity while encouraging intra-class connectivity helps learn domain-invariant semantic information. We then propose a novel UDG approach, Dual Nearest Neighbors contrastive learning with strong Augmentation (DN^2A).
Results: Experiments show that DN^2A outperforms the state of the art by a large margin, e.g., 12.01% and 13.11% accuracy gains with only 1% labels for linear evaluation on PACS and DomainNet, respectively.

Domain Generalization (DG) has achieved great success in generalizing knowledge from source domains to unseen target domains. However, current DG methods rely heavily on labeled source data, which are usually costly and unavailable. Since unlabeled data are far more accessible, we study a more practical unsupervised domain generalization (UDG) problem. Learning invariant visual representation from different views, i.e., contrastive learning, yields good semantic features for in-domain unsupervised learning. However, it fails in cross-domain scenarios. In this paper, we first delve into the failure of vanilla contrastive learning and point out that semantic connectivity is the key to UDG. Specifically, suppressing the intra-domain connectivity and encouraging the intra-class connectivity help to learn the domain-invariant semantic information. Then, we propose a novel unsupervised domain generalization approach, namely Dual Nearest Neighbors contrastive learning with strong Augmentation (DN^2A). Our DN^2A leverages strong augmentations to suppress the intra-domain connectivity and proposes a novel dual nearest neighbors search strategy to find trustworthy cross-domain neighbors along with in-domain neighbors to encourage the intra-class connectivity. Experimental results demonstrate that our DN^2A outperforms the state-of-the-art by a large margin, e.g., 12.01% and 13.11% accuracy gain with only 1% labels for linear evaluation on PACS and DomainNet, respectively.

Exploring and Utilizing Pattern Imbalance
Mei, Shibin and Zhao, Chenglong and Yuan, Shengchao and Ni, Bingbing



Research question: This paper aims to address pattern imbalance and to develop a new training scheme that averts pattern preference and spurious correlation.
Motivation: Existing methods are mostly concerned with category or domain granularity, ignoring the potential finer structure that exists in datasets.
Method: We give a new definition of seed category as an appropriate optimization unit to distinguish different patterns within the same category or domain.
Results: Extensive experiments on domain generalization datasets of diverse scales demonstrate the effectiveness of the proposed method.

In this paper, we identify pattern imbalance from several aspects, and further develop a new training scheme to avert pattern preference as well as spurious correlation. In contrast to prior methods which are mostly concerned with category or domain granularity, ignoring the potential finer structure that existed in datasets, we give a new definition of seed category as an appropriate optimization unit to distinguish different patterns in the same category or domain. Extensive experiments on domain generalization datasets of diverse scales demonstrate the effectiveness of the proposed method.

Are Data-Driven Explanations Robust Against Out-of-Distribution Data?
Li, Tang and Qiao, Fengchun and Ma, Mengmeng and Peng, Xi



Research question: Are existing data-driven explanation methods robust against distributional shifts?
Motivation: As black-box models increasingly power high-stakes applications, a variety of data-driven explanation methods have been introduced; meanwhile, machine learning models are constantly challenged by distributional shifts.
Method: We propose an end-to-end, model-agnostic learning framework, Distributionally Robust Explanations (DRE). The key idea, inspired by self-supervised learning, is to fully exploit inter-distribution information to provide supervisory signals for learning explanations without human annotation.
Results: Extensive experiments on a wide range of tasks and data types, including classification and regression on image and scientific tabular data, show that the method significantly improves the model's explanation and prediction robustness against distributional shifts.

As black-box models increasingly power high-stakes applications, a variety of data-driven explanation methods have been introduced. Meanwhile, machine learning models are constantly challenged by distributional shifts. A question naturally arises: Are data-driven explanations robust against out-of-distribution data? Our empirical results show that even when the model predicts correctly, it might still yield unreliable explanations under distributional shifts. How to develop robust explanations against out-of-distribution data? To address this problem, we propose an end-to-end model-agnostic learning framework, Distributionally Robust Explanations (DRE). The key idea is, inspired by self-supervised learning, to fully utilize the inter-distribution information to provide supervisory signals for the learning of explanations without human annotation. Can robust explanations benefit the model's generalization capability? We conduct extensive experiments on a wide range of tasks and data types, including classification and regression on image and scientific tabular data. Our results demonstrate that the proposed method significantly improves the model's performance in terms of explanation and prediction robustness against distributional shifts.

Curvature-Balanced Feature Manifold Learning for Long-Tailed Classification
Ma, Yanbiao and Jiao, Licheng and Liu, Fang and Yang, Shuyuan and Liu, Xu and Li, Lingling



Research question: This paper aims to address the model bias that deep learning models exhibit on long-tailed classification problems.
Motivation: Although prior work has proposed methods to reduce model bias, recent studies show that long-tailed classes are not always hard to learn and that model bias also appears on sample-balanced datasets, suggesting that other factors influence model bias.
Method: This paper systematically proposes a series of geometric measures for the perceptual manifolds of deep neural networks, and explores how the geometric characteristics of perceptual manifolds affect classification difficulty and how learning shapes those characteristics.
Results: The correlation between class accuracy and the separation degree of perceptual manifolds gradually decreases during training, while the negative correlation with curvature gradually increases, indicating that curvature imbalance leads to model bias. The paper therefore proposes curvature regularization to help the model learn curvature-balanced, flatter perceptual manifolds. Evaluations on multiple long-tailed and non-long-tailed datasets show excellent performance and exciting generality, especially in achieving significant improvements on top of current state-of-the-art techniques.

To address the challenges of long-tailed classification, researchers have proposed several approaches to reduce model bias, most of which assume that classes with few samples are weak classes. However, recent studies have shown that tail classes are not always hard to learn, and model bias has been observed on sample-balanced datasets, suggesting the existence of other factors that affect model bias. In this work, we systematically propose a series of geometric measures for perceptual manifolds in deep neural networks, and then explore the effect of the geometric characteristics of perceptual manifolds on classification difficulty and how learning shapes the geometric characteristics of perceptual manifolds. An unanticipated finding is that the correlation between the class accuracy and the separation degree of perceptual manifolds gradually decreases during training, while the negative correlation with the curvature gradually increases, implying that curvature imbalance leads to model bias. Therefore, we propose curvature regularization to facilitate the model to learn curvature-balanced and flatter perceptual manifolds. Evaluations on multiple long-tailed and non-long-tailed datasets show the excellent performance and exciting generality of our approach, especially in achieving significant performance improvements based on current state-of-the-art techniques. Our work reminds researchers to pay attention to model bias not only on long-tailed datasets but also on non-long-tailed and even data-balanced datasets, which can improve model performance from another perspective.

topic-4

Topic words :  point,  features,  feature,  attention,  local,  transformer,  network,  propose

CXTrack: Improving 3D Point Cloud Tracking With Contextual Information
Xu, Tian-Xing and Guo, Yuan-Chen and Lai, Yu-Kun and Zhang, Song-Hai



Research question: How to exploit contextual information effectively for 3D object tracking.
Motivation: Large appearance variation and point sparsity caused by occlusion and limited sensor capabilities lead existing methods to overlook and crop out points containing useful information, resulting in insufficient use of important contextual knowledge.
Method: We propose CXTrack, a transformer-based network for 3D object tracking that improves tracking results by taking contextual information directly from the point features of two consecutive frames and the previous bounding box. A target-centric transformer network is designed to explore contextual information and implicitly propagate target cues.
Results: Extensive experiments on three large-scale datasets, KITTI, nuScenes, and the Waymo Open Dataset, show that CXTrack achieves state-of-the-art tracking performance while running at 34 FPS.

3D single object tracking plays an essential role in many applications, such as autonomous driving. It remains a challenging problem due to the large appearance variation and the sparsity of points caused by occlusion and limited sensor capabilities. Therefore, contextual information across two consecutive frames is crucial for effective object tracking. However, points containing such useful information are often overlooked and cropped out in existing methods, leading to insufficient use of important contextual knowledge. To address this issue, we propose CXTrack, a novel transformer-based network for 3D object tracking, which exploits ConteXtual information to improve the tracking results. Specifically, we design a target-centric transformer network that directly takes point features from two consecutive frames and the previous bounding box as input to explore contextual information and implicitly propagate target cues. To achieve accurate localization for objects of all sizes, we propose a transformer-based localization head with a novel center embedding module to distinguish the target from distractors. Extensive experiments on three large-scale datasets, KITTI, nuScenes and Waymo Open Dataset, show that CXTrack achieves state-of-the-art tracking performance while running at 34 FPS.

Revisiting Self-Similarity: Structural Embedding for Image Retrieval
Lee, Seongwon and Lee, Suhyeon and Seong, Hongje and Kim, Euntai



Research question: Despite advances in global image representation, existing image retrieval methods rarely consider geometric structure during the global retrieval stage.
Motivation: We revisit the conventional self-similarity descriptor from a convolutional perspective to encode both the visual and structural cues of an image into its global representation.
Method: We propose the Structural Embedding Network (SENet), which captures the internal structure of images and gradually compresses it into dense self-similarity descriptors while learning diverse structures from various images. These self-similarity descriptors are fused with the original image features and pooled into a global embedding, so that the global embedding represents both the geometric and visual cues of the image.
Results: With this novel structural embedding, our network sets new state-of-the-art performance on several image retrieval benchmarks, demonstrating its robustness to look-alike distractors.

Despite advances in global image representation, existing image retrieval approaches rarely consider geometric structure during the global retrieval stage. In this work, we revisit the conventional self-similarity descriptor from a convolutional perspective, to encode both the visual and structural cues of the image into the global image representation. Our proposed network, named Structural Embedding Network (SENet), captures the internal structure of the images and gradually compresses them into dense self-similarity descriptors while learning diverse structures from various images. These self-similarity descriptors and original image features are fused and then pooled into a global embedding, so that the global embedding can represent both geometric and visual cues of the image. Along with this novel structural embedding, our proposed network sets new state-of-the-art performances on several image retrieval benchmarks, demonstrating its robustness to look-alike distractors. The code and models are available: https://github.com/sungonce/SENet.
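The self-similarity descriptor the abstract revisits can be sketched in a few lines: instead of describing a position by its appearance, describe it by how its feature correlates with features at surrounding offsets. This hypothetical 1-D version only illustrates the raw idea; SENet learns and compresses dense 2-D versions of such descriptors.

```python
def self_similarity(feats, i, offsets):
    """Self-similarity descriptor sketch: correlate the feature at position i
    with features at the given offsets, encoding local structure rather than
    appearance. `feats` is a 1-D list of feature vectors (toy stand-in for a
    2-D feature map)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    n = len(feats)
    return [dot(feats[i], feats[(i + o) % n]) for o in offsets]

# Toy feature map of four 2-D features; descriptor for position 1 records how
# similar it is to its left and right neighbors.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
desc = self_similarity(feats, 1, [-1, 1])
```

Two images with different textures but the same internal layout produce similar descriptors, which is why this encoding adds geometric cues to the global embedding.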

Decoupling-and-Aggregating for Image Exposure Correction
Wang, Yang and Peng, Long and Li, Liang and Cao, Yang and Zha, Zheng-Jun



Research question: How to improve the contrast and details of images captured under improper exposure conditions.
Motivation: Improper exposure mixes the low- and high-frequency components of an image, limiting the capacity for statistical and structural modeling.
Method: The paper proposes to decouple contrast enhancement and detail restoration within each convolution process. Contrast Aware (CA) and Detail Aware (DA) units, built on addition/difference operations, can be plugged into existing CNN-based exposure correction networks to improve performance.
Results: Experiments show that the method comprehensively improves the performance of existing methods without adding extra computational cost.

The images captured under improper exposure conditions often suffer from contrast degradation and detail distortion. Contrast degradation will destroy the statistical properties of low-frequency components, while detail distortion will disturb the structural properties of high-frequency components, leading to the low-frequency and high-frequency components being mixed and inseparable. This will limit the statistical and structural modeling capacity for exposure correction. To address this issue, this paper proposes to decouple the contrast enhancement and detail restoration within each convolution process. It is based on the observation that, in the local regions covered by convolution kernels, the feature response of low-/high-frequency can be decoupled by addition/difference operation. To this end, we inject the addition/difference operation into the convolution process and devise a Contrast Aware (CA) unit and a Detail Aware (DA) unit to facilitate the statistical and structural regularities modeling. The proposed CA and DA can be plugged into existing CNN-based exposure correction networks to substitute the Traditional Convolution (TConv) to improve the performance. Furthermore, to maintain the computational costs of the network without changing, we aggregate two units into a single TConv kernel using structural re-parameterization. Evaluations of nine methods and five benchmark datasets demonstrate that our proposed method can comprehensively improve the performance of existing methods without introducing extra computational costs compared with the original networks. The codes will be publicly available.
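The key observation above, that addition and difference separate low- and high-frequency responses within a local region, can be shown with a minimal sketch. This is only the arithmetic behind the CA/DA units, on toy pairs of responses, not the units themselves.

```python
def decouple(a, b):
    """Within a region covered by a kernel, the average of two responses
    carries the low-frequency (contrast/statistical) content and their halved
    difference carries the high-frequency (detail/structural) content."""
    low = [(x + y) / 2 for x, y in zip(a, b)]   # contrast branch
    high = [(x - y) / 2 for x, y in zip(a, b)]  # detail branch
    return low, high

low, high = decouple([4.0, 6.0], [2.0, 2.0])
# The decomposition is lossless: a = low + high and b = low - high, which is
# what lets the two aware units be re-merged into one TConv kernel later.
```

The structural re-parameterization mentioned in the abstract relies on exactly this losslessness: the two branches can be folded back into a single convolution at inference.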

MarS3D: A Plug-and-Play Motion-Aware Model for Semantic Segmentation on Multi-Scan 3D Point Clouds
Liu, Jiahui and Chang, Chirui and Liu, Jianhui and Wu, Xiaoyang and Ma, Lan and Qi, Xiaojuan



Research question: How to perform semantic segmentation on multi-scan large-scale 3D point clouds, which requires distinguishing the motion states of points in addition to their semantic categories.
Motivation: Methods designed for single-scan segmentation perform poorly on the multi-scan task because they lack an effective way to integrate temporal information.
Method: We propose MarS3D, a plug-and-play motion-aware model for semantic segmentation on multi-scan 3D point clouds that can be flexibly combined with single-scan models to give them multi-scan perception. It has two key designs: a Cross-Frame Feature Embedding module to enrich representation learning and a Motion-Aware Feature Learning module to enhance motion awareness.
Results: Extensive experiments show that MarS3D improves the performance of the baseline model by a large margin.

3D semantic segmentation on multi-scan large-scale point clouds plays an important role in autonomous systems. Unlike the single-scan-based semantic segmentation task, this task requires distinguishing the motion states of points in addition to their semantic categories. However, methods designed for single-scan-based segmentation tasks perform poorly on the multi-scan task due to the lacking of an effective way to integrate temporal information. We propose MarS3D, a plug-and-play motion-aware model for semantic segmentation on multi-scan 3D point clouds. This module can be flexibly combined with single-scan models to allow them to have multi-scan perception abilities. The model encompasses two key designs: the Cross-Frame Feature Embedding module for enriching representation learning and the Motion-Aware Feature Learning module for enhancing motion awareness. Extensive experiments show that MarS3D can improve the performance of the baseline model by a large margin. The code is available at https://github.com/CVMI-Lab/MarS3D.

MSeg3D: Multi-Modal 3D Semantic Segmentation for Autonomous Driving
Li, Jiale and Dai, Hang and Han, Hao and Ding, Yong



Research question: This paper addresses the key difficulties of combining the two modalities for 3D semantic segmentation in autonomous driving, LiDAR and camera: modality heterogeneity, limited field-of-view intersection, and multi-modal data augmentation.
Motivation: LiDAR-only methods segment small and distant objects poorly due to insufficient laser points, while robust multi-modal solutions remain under-explored.
Method: We propose MSeg3D, a multi-modal 3D semantic segmentation model with joint intra-modal feature extraction and inter-modal feature fusion to mitigate modality heterogeneity. Its multi-modal fusion consists of geometry-based feature fusion (GF-Phase), cross-modal feature completion, and semantic-based feature fusion (SF-Phase) on all visible points. Multi-modal data augmentation is strengthened by applying asymmetric transformations to the LiDAR point cloud and the multi-camera images individually.
Results: MSeg3D achieves state-of-the-art results on the nuScenes, Waymo, and SemanticKITTI datasets. It remains robust and outperforms the LiDAR-only baseline even under malfunctioning multi-camera input and multi-frame point cloud input.

LiDAR and camera are two modalities available for 3D semantic segmentation in autonomous driving. The popular LiDAR-only methods severely suffer from inferior segmentation on small and distant objects due to insufficient laser points, while the robust multi-modal solution is under-explored, where we investigate three crucial inherent difficulties: modality heterogeneity, limited sensor field of view intersection, and multi-modal data augmentation. We propose a multi-modal 3D semantic segmentation model (MSeg3D) with joint intra-modal feature extraction and inter-modal feature fusion to mitigate the modality heterogeneity. The multi-modal fusion in MSeg3D consists of geometry-based feature fusion GF-Phase, cross-modal feature completion, and semantic-based feature fusion SF-Phase on all visible points. The multi-modal data augmentation is reinvigorated by applying asymmetric transformations on LiDAR point cloud and multi-camera images individually, which benefits the model training with diversified augmentation transformations. MSeg3D achieves state-of-the-art results on nuScenes, Waymo, and SemanticKITTI datasets. Under the malfunctioning multi-camera input and the multi-frame point clouds input, MSeg3D still shows robustness and improves the LiDAR-only baseline. Our code is publicly available at https://github.com/jialeli1/lidarseg3d.

Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction
Huang, Yuanhui and Zheng, Wenzhao and Zhang, Yunpeng and Zhou, Jie and Lu, Jiwen



Research question: Existing vision-centric autonomous driving perception methods struggle to describe the fine-grained 3D structure of a scene.
Motivation: To address this, we propose a tri-perspective view (TPV) representation that complements the bird's-eye view (BEV) with two additional perpendicular planes.
Method: We model each point in 3D space by summing its projected features on the three planes, and use a transformer-based TPV encoder (TPVFormer) to lift image features into the 3D TPV space.
Results: Experiments show that our model, trained with sparse supervision, effectively predicts the semantic occupancy of all voxels, demonstrating for the first time that camera-only input can match LiDAR-based methods on the nuScenes LiDAR segmentation task.

Modern methods for vision-centric autonomous driving perception widely adopt the bird's-eye-view (BEV) representation to describe a 3D scene. Despite its better efficiency than voxel representation, it has difficulty describing the fine-grained 3D structure of a scene with a single plane. To address this, we propose a tri-perspective view (TPV) representation which accompanies BEV with two additional perpendicular planes. We model each point in the 3D space by summing its projected features on the three planes. To lift image features to the 3D TPV space, we further propose a transformer-based TPV encoder (TPVFormer) to obtain the TPV features effectively. We employ the attention mechanism to aggregate the image features corresponding to each query in each TPV plane. Experiments show that our model trained with sparse supervision effectively predicts the semantic occupancy for all voxels. We demonstrate for the first time that using only camera inputs can achieve comparable performance with LiDAR-based methods on the LiDAR segmentation task on nuScenes. Code: https://github.com/wzzheng/TPVFormer.
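The core TPV operation, summing a 3D point's projected features over three perpendicular planes, is easy to sketch. This toy version uses scalar features on hypothetical 2x2 grids with integer coordinates; the axis ordering is an illustrative assumption, and the real model uses learned feature vectors and a transformer encoder.

```python
def tpv_feature(point, hw_plane, dw_plane, dh_plane):
    """Sketch of the tri-perspective view idea: a 3D point's feature is the
    sum of the features at its projections onto the HW, DW, and DH planes.
    Grids, coordinates, and axis naming are simplified assumptions."""
    x, y, z = point
    return hw_plane[x][y] + dw_plane[z][y] + dh_plane[z][x]

# Toy 2x2 planes holding scalar "features".
hw = [[1, 2], [3, 4]]
dw = [[10, 20], [30, 40]]
dh = [[100, 200], [300, 400]]
f = tpv_feature((1, 0, 1), hw, dw, dh)
```

Note the contrast with BEV: a single plane collapses the vertical axis, whereas three planes let points that share two coordinates but differ in the third receive different features.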

Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference
You, Haoran and Xiong, Yunyang and Dai, Xiaoliang and Wu, Bichen and Zhang, Peizhao and Fan, Haoqi and Vajda, Peter and Lin, Yingyan (Celine)



Research question: Can Vision Transformers (ViTs) learn both global and local context while being more efficient during inference?
Motivation: Existing efficient ViTs adopt local or linear attention, sacrificing the ability to capture global or local context.
Method: We propose a framework called Castling-ViT, which trains ViTs with both linear-angular attention and masked softmax-based quadratic attention, but switches to linear-angular attention alone at inference.
Results: Castling-ViT uses angular kernels to measure query-key similarity via spectral angles. Experiments validate its effectiveness, e.g., up to 1.8% higher accuracy or a 40% MACs reduction on classification, and 1.2 higher mAP on detection under comparable FLOPs, compared with ViTs using vanilla softmax attention.

Vision Transformers (ViTs) have shown impressive performance but still require a high computation cost as compared to convolutional neural networks (CNNs); one reason is that ViTs' attention measures global similarities and thus has a quadratic complexity in the number of input tokens. Existing efficient ViTs adopt local attention or linear attention, which sacrifice ViTs' capabilities of capturing either global or local context. In this work, we ask an important research question: Can ViTs learn both global and local context while being more efficient during inference? To this end, we propose a framework called Castling-ViT, which trains ViTs using both linear-angular attention and masked softmax-based quadratic attention, but then switches to having only linear-angular attention during inference. Our Castling-ViT leverages angular kernels to measure the similarities between queries and keys via spectral angles. And we further simplify it with two techniques: (1) a novel linear-angular attention mechanism: we decompose the angular kernels into linear terms and high-order residuals, and only keep the linear terms; and (2) we adopt two parameterized modules to approximate high-order residuals: a depthwise convolution and an auxiliary masked softmax attention to help learn global and local information, where the masks for softmax attention are regularized to gradually become zeros and thus incur no overhead during inference. Extensive experiments validate the effectiveness of our Castling-ViT, e.g., achieving up to a 1.8% higher accuracy or 40% MACs reduction on classification and 1.2 higher mAP on detection under comparable FLOPs, as compared to ViTs with vanilla softmax-based attentions. Project page is available at https://www.haoranyou.com/castling-vit.
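The angular kernel at the heart of the method can be sketched directly: similarity is one minus the normalized angle between query and key. This shows only the kernel itself (the quantity the paper then decomposes into linear terms plus high-order residuals), not the attention mechanism.

```python
import math

def angular_similarity(q, k):
    """Spectral-angle similarity between a query and a key: 1 minus the angle
    between them normalized by pi, so parallel vectors score 1, orthogonal
    vectors 0.5, and opposite vectors 0. Sketch of the kernel only."""
    dot = sum(a * b for a, b in zip(q, k))
    nq = math.sqrt(sum(a * a for a in q))
    nk = math.sqrt(sum(b * b for b in k))
    cos = max(-1.0, min(1.0, dot / (nq * nk)))  # clamp for float safety
    return 1.0 - math.acos(cos) / math.pi
```

Because the score depends only on the angle, it is invariant to the magnitudes of queries and keys, which is what makes the subsequent linearization into dot-product terms attractive.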

Robust 3D Shape Classification via Non-Local Graph Attention Network
Qin, Shengwei and Li, Zhong and Liu, Ligang



Research question: How to design a new network architecture for robust classification of 3D shapes.
Motivation: Existing methods struggle with sparse point clouds and rotation invariance, calling for new designs that improve classification.
Method: We propose a non-local graph attention network (NLGAT) with two sub-networks. The first captures global relationships between points via a global relationship network (GRN); the second enhances local features with a geometric shape attention map generated by a global structure network (GSN). All sub-networks take Gram matrices of different dimensions as input, extracting more information while preserving rotation invariance.
Results: Experiments show that NLGAT outperforms other state-of-the-art models across datasets. In particular, on sparse point clouds of 64 points with arbitrary SO(3) rotation noise, its classification result (85.4%) improves on the best of the other methods by 39.4%.

We introduce a non-local graph attention network (NLGAT), which generates a novel global descriptor through two sub-networks for robust 3D shape classification. In the first sub-network, we capture the global relationships between points (i.e., point-point features) by designing a global relationship network (GRN). In the second sub-network, we enhance the local features with a geometric shape attention map obtained from a global structure network (GSN). To keep rotation invariance and extract more information from sparse point clouds, all sub-networks use Gram matrices with different dimensions as input for robust classification. Additionally, GRN effectively preserves the low-frequency features and improves the classification results. Experimental results on various datasets exhibit that the classification effect of the NLGAT model is better than other state-of-the-art models. Especially, in the case of sparse point clouds (64 points) with noise under arbitrary SO(3) rotation, the classification result (85.4%) of NLGAT is 39.4% higher than the best result of the other methods.
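The reason Gram matrices give rotation invariance, the property the abstract relies on, can be verified directly: the matrix of pairwise inner products is unchanged when every point is rotated. A minimal 2-D demonstration of the property (not the network):

```python
import math

def gram(points):
    """Gram matrix G[i][j] = <p_i, p_j>. Inner products are preserved by any
    rotation applied to all points, so G is a rotation-invariant encoding of
    the point set, which is why NLGAT consumes Gram matrices rather than raw
    coordinates."""
    return [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]

def rotate2d(points, theta):
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

pts = [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
g_before = gram(pts)
g_after = gram(rotate2d(pts, 0.7))  # same Gram matrix up to float error
```

The same argument extends to SO(3): for a point matrix X and rotation R, (XR)(XR)^T = X R R^T X^T = X X^T.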

Bitstream-Corrupted JPEG Images Are Restorable: Two-Stage Compensation and Alignment Framework for Image Restoration
Liu, Wenyang and Wang, Yi and Yap, Kim-Hui and Chau, Lap-Pui



Research question: This paper studies the restoration of JPEG images whose encrypted bitstreams contain bit errors.
Motivation: Bit errors cause unpredictable color casts and block shifts in the decoded image content, which existing restoration methods, mostly relying on pre-defined pixel-domain degradation models, cannot trivially resolve.
Method: We propose a robust JPEG decoder followed by a two-stage compensation and alignment framework for restoring bitstream-corrupted JPEG images. The decoder adopts an error-resilient mechanism to decode corrupted JPEG bitstreams. The framework comprises a self-compensation and alignment (SCA) stage and a guided-compensation and alignment (GCA) stage. SCA adaptively performs block-wise color compensation and alignment based on color and block offsets estimated via image content similarity. GCA leverages the low-resolution thumbnail extracted from the JPEG header to guide full-resolution pixel-wise restoration in a coarse-to-fine manner, realized by a coarse-guided pix2pix network and a refine-guided bi-directional Laplacian pyramid fusion network.
Results: Experiments on three benchmarks with varying bit error rates, together with ablation studies, demonstrate the superiority of the proposed method. The code will be released at https://github.com/wenyang001/Two-ACIR.

In this paper, we study a real-world JPEG image restoration problem with bit errors on the encrypted bitstream. The bit errors bring unpredictable color casts and block shifts on decoded image contents, which cannot be trivially resolved by existing image restoration methods mainly relying on pre-defined degradation models in the pixel domain. To address these challenges, we propose a robust JPEG decoder, followed by a two-stage compensation and alignment framework to restore bitstream-corrupted JPEG images. Specifically, the robust JPEG decoder adopts an error-resilient mechanism to decode the corrupted JPEG bitstream. The two-stage framework is composed of the self-compensation and alignment (SCA) stage and the guided-compensation and alignment (GCA) stage. The SCA adaptively performs block-wise image color compensation and alignment based on the estimated color and block offsets via image content similarity. The GCA leverages the extracted low-resolution thumbnail from the JPEG header to guide full-resolution pixel-wise image restoration in a coarse-to-fine manner. It is achieved by a coarse-guided pix2pix network and a refine-guided bi-directional Laplacian pyramid fusion network. We conduct experiments on three benchmarks with varying degrees of bit error rates. Experimental results and ablation studies demonstrate the superiority of our proposed method. The code will be released at https://github.com/wenyang001/Two-ACIR.

Histopathology Whole Slide Image Analysis With Heterogeneous Graph Representation Learning
Chan, Tsai Hor and Cendra, Fernando Julio and Ma, Lan and Yin, Guosheng and Yu, Lequan



Research question: How to use graph models to mine the complex structural relations among different cell types in whole slide histopathology images (WSIs).
Motivation: Existing methods mostly model WSIs with homogeneous graphs, which cannot fully mine the complex interactions among biological entities.
Method: We propose a novel heterogeneous graph-based framework that formulates a WSI as a heterogeneous graph with a "nucleus-type" attribute on each node and a semantic-similarity attribute on each edge, design a new heterogeneous-graph edge attribute transformer (HEAT) for message aggregation, and design a new pseudo-label-based semantic-consistent pooling mechanism to obtain graph-level features.
Results: Extensive experiments on three public TCGA benchmark datasets show that the framework significantly outperforms existing methods on various tasks.

Graph-based methods have been extensively applied to whole slide histopathology image (WSI) analysis due to the advantage of modeling the spatial relationships among different entities. However, most of the existing methods focus on modeling WSIs with homogeneous graphs (e.g., with homogeneous node type). Despite their successes, these works are incapable of mining the complex structural relations between biological entities (e.g., the diverse interaction among different cell types) in the WSI. We propose a novel heterogeneous graph-based framework to leverage the inter-relationships among different types of nuclei for WSI analysis. Specifically, we formulate the WSI as a heterogeneous graph with a "nucleus-type" attribute for each node and a semantic similarity attribute for each edge. We then present a new heterogeneous-graph edge attribute transformer (HEAT) to take advantage of the edge and node heterogeneity during message aggregation. Further, we design a new pseudo-label-based semantic-consistent pooling mechanism to obtain graph-level features, which can mitigate the over-parameterization issue of conventional cluster-based pooling. Additionally, observing the limitations of existing association-based localization methods, we propose a causal-driven approach attributing the contribution of each node to improve the interpretability of our framework. Extensive experiments on three public TCGA benchmark datasets demonstrate that our framework outperforms the state-of-the-art methods with considerable margins on various tasks. Our codes are available at https://github.com/HKU-MedAI/WSI-HGNN.

Masked Scene Contrast: A Scalable Framework for Unsupervised 3D Representation Learning
Wu, Xiaoyang and Wen, Xin and Liu, Xihui and Zhao, Hengshuang



Research question: How to perform unsupervised 3D representation learning via contrastive learning while overcoming the inefficiency of RGB-D frame matching and the mode collapse problem.
Motivation: Although the pioneering PointContrast performs well on various downstream tasks, large-scale unsupervised 3D learning has yet to emerge, held back mainly by inefficient RGB-D frame matching and mode collapse.
Method: We propose an efficient and effective contrastive learning framework that generates contrastive views directly on scene-level point clouds through a well-curated data augmentation pipeline and a practical view mixing strategy. We also introduce reconstructive learning into the contrastive framework, with carefully designed contrastive cross masks targeting the reconstruction of point color and surfel normal.
Results: Our Masked Scene Contrast (MSC) framework extracts comprehensive 3D representations more efficiently and effectively, accelerating pre-training by at least 3x without compromising performance relative to previous work. MSC also enables large-scale 3D pre-training across multiple datasets, further boosting performance and achieving state-of-the-art fine-tuning results on several downstream tasks, e.g., 75.5% mIoU on the ScanNet semantic segmentation validation set.

As a pioneering work, PointContrast conducts unsupervised 3D representation learning via leveraging contrastive learning over raw RGB-D frames and proves its effectiveness on various downstream tasks. However, the trend of large-scale unsupervised learning in 3D has yet to emerge due to two stumbling blocks: the inefficiency of matching RGB-D frames as contrastive views and the annoying mode collapse phenomenon mentioned in previous works. Turning the two stumbling blocks into empirical stepping stones, we first propose an efficient and effective contrastive learning framework, which generates contrastive views directly on scene-level point clouds by a well-curated data augmentation pipeline and a practical view mixing strategy. Second, we introduce reconstructive learning on the contrastive learning framework with an exquisite design of contrastive cross masks, which targets the reconstruction of point color and surfel normal. Our Masked Scene Contrast (MSC) framework is capable of extracting comprehensive 3D representations more efficiently and effectively. It accelerates the pre-training procedure by at least 3x and still achieves an uncompromised performance compared with previous work. Besides, MSC also enables large-scale 3D pre-training across multiple datasets, which further boosts the performance and achieves state-of-the-art fine-tuning results on several downstream tasks, e.g., 75.5% mIoU on ScanNet semantic segmentation validation set.

A Simple Baseline for Video Restoration With Grouped Spatial-Temporal Shift
Li, Dasong and Shi, Xiaoyu and Zhang, Yi and Cheung, Ka Chun and See, Simon and Wang, Xiaogang and Qin, Hongwei and Li, Hongsheng



Research question: This paper aims to propose a simple yet effective video restoration framework that reduces the high computational cost of complicated network architectures.
Motivation: Existing deep learning methods for video restoration rely on complicated architectures such as optical flow estimation, deformable convolution, and cross-frame self-attention layers, resulting in high computational costs.
Method: We propose a simple, effective framework based on grouped spatial-temporal shift, a lightweight and straightforward technique that implicitly captures inter-frame correspondences for multi-frame aggregation. Introducing grouped spatial shift yields expansive effective receptive fields; combined with basic 2D convolution, this simple framework effectively aggregates inter-frame information.
Results: Extensive experiments show that our framework outperforms the previous state-of-the-art method on both video deblurring and video denoising while using less than a quarter of its computational cost, indicating the potential to significantly reduce computational overhead while maintaining high-quality results. Code is available at https://github.com/dasongli1/Shift-Net.

Video restoration, which aims to restore clear frames from degraded videos, has numerous important applications. The key to video restoration depends on utilizing inter-frame information. However, existing deep learning methods often rely on complicated network architectures, such as optical flow estimation, deformable convolution, and cross-frame self-attention layers, resulting in high computational costs. In this study, we propose a simple yet effective framework for video restoration. Our approach is based on grouped spatial-temporal shift, which is a lightweight and straightforward technique that can implicitly capture inter-frame correspondences for multi-frame aggregation. By introducing grouped spatial shift, we attain expansive effective receptive fields. Combined with basic 2D convolution, this simple framework can effectively aggregate inter-frame information. Extensive experiments demonstrate that our framework outperforms the previous state-of-the-art method, while using less than a quarter of its computational cost, on both video deblurring and video denoising tasks. These results indicate the potential for our approach to significantly reduce computational overhead while maintaining high-quality results. Code is available at https://github.com/dasongli1/Shift-Net.
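The temporal-shift trick the abstract builds on is simple enough to sketch: copy one channel group from the previous frame and one from the next, so that an ordinary per-frame convolution afterwards mixes inter-frame information for free. This hypothetical list-based version only illustrates the shift itself, not Shift-Net's grouped spatial-temporal design.

```python
def grouped_temporal_shift(frames, group):
    """Shift channel `group` forward in time (each frame receives it from its
    past neighbor) and channel `group + 1` backward (from its future
    neighbor); boundary frames receive zeros. `frames` is a T x C list."""
    T = len(frames)
    out = [list(f) for f in frames]
    for t in range(T):
        out[t][group] = frames[t - 1][group] if t > 0 else 0
        out[t][group + 1] = frames[t + 1][group + 1] if t < T - 1 else 0
    return out

# 3 frames x 4 channels: after the shift, channels 0 and 1 of each frame carry
# the neighboring frames' data, ready for a plain 2D convolution to aggregate.
shifted = grouped_temporal_shift([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], 0)
```

Because the shift is pure data movement, it adds essentially no FLOPs, which is how the framework keeps its cost under a quarter of prior methods'.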

SliceMatch: Geometry-Guided Aggregation for Cross-View Pose Estimation
Lentsch, Ted and Xia, Zimin and Caesar, Holger and Kooij, Julian F. P.



Research question: This paper addresses cross-view camera pose estimation, i.e., determining the 3-DoF camera pose of a given ground-level image with respect to an aerial image of the local area.
Motivation: Existing approaches to cross-view camera pose estimation fall short in accuracy and efficiency, calling for a more effective algorithm.
Method: We propose SliceMatch, which consists of ground and aerial feature extractors, feature aggregators, and a pose predictor. The feature extractors extract dense features from the ground and aerial images; given a set of candidate camera poses, the feature aggregators construct a single ground descriptor and a set of pose-dependent aerial descriptors. Notably, our novel aerial feature aggregator has a cross-view attention module for ground-guided aerial feature selection and pools features using the geometric projection of the ground camera's viewing frustum onto the aerial image. Precomputed masks enable efficient construction of the aerial descriptors. SliceMatch is trained with contrastive learning, and pose estimation is formulated as a similarity comparison between the ground descriptor and the aerial descriptors.
Results: Compared with the state-of-the-art, SliceMatch achieves a 19% lower median localization error on the VIGOR benchmark with the same VGG16 backbone at 150 frames per second, and a 50% lower error with a ResNet50 backbone.

This work addresses cross-view camera pose estimation, i.e., determining the 3-Degrees-of-Freedom camera pose of a given ground-level image w.r.t. an aerial image of the local area. We propose SliceMatch, which consists of ground and aerial feature extractors, feature aggregators, and a pose predictor. The feature extractors extract dense features from the ground and aerial images. Given a set of candidate camera poses, the feature aggregators construct a single ground descriptor and a set of pose-dependent aerial descriptors. Notably, our novel aerial feature aggregator has a cross-view attention module for ground-view guided aerial feature selection and utilizes the geometric projection of the ground camera's viewing frustum on the aerial image to pool features. The efficient construction of aerial descriptors is achieved using precomputed masks. SliceMatch is trained using contrastive learning and pose estimation is formulated as a similarity comparison between the ground descriptor and the aerial descriptors. Compared to the state-of-the-art, SliceMatch achieves a 19% lower median localization error on the VIGOR benchmark using the same VGG16 backbone at 150 frames per second, and a 50% lower error when using a ResNet50 backbone.
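
The similarity-comparison formulation of pose estimation can be sketched as follows; this is a hedged illustration (function name and cosine similarity are assumptions, not the paper's exact scoring), showing how the candidate whose aerial descriptor best matches the ground descriptor is selected.

```python
import numpy as np

def select_pose(ground_desc, aerial_descs):
    """Pick the candidate pose whose aerial descriptor is most similar
    (cosine similarity) to the ground descriptor.
    ground_desc: (D,); aerial_descs: (P, D) for P candidate poses."""
    g = ground_desc / np.linalg.norm(ground_desc)
    a = aerial_descs / np.linalg.norm(aerial_descs, axis=1, keepdims=True)
    sims = a @ g                      # one similarity score per candidate pose
    return int(np.argmax(sims)), sims
```

Since the aerial descriptors depend only on the candidate poses and the aerial image, they can be batched, which is consistent with the reported 150 fps.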

Learning Rotation-Equivariant Features for Visual Correspondence
Lee, Jongmin and Kim, Byungjin and Kim, Seungwook and Cho, Minsu



Research question: How to extract discriminative, rotation-invariant local features for establishing correspondences between images.
Motivation: Existing methods require sophisticated data augmentation to learn rotation-equivariant features and their orientations, whereas our method learns both effectively by employing group-equivariant CNNs.
Method: We propose a self-supervised learning framework that uses group-equivariant CNNs to extract discriminative rotation-invariant descriptors. The resulting features and their orientations are further processed by group aligning, a novel invariant mapping technique that shifts group-equivariant features along the group dimension by their orientations, achieving rotation invariance while avoiding both collapse of the group dimension and loss of discriminability.
Results: Trained end-to-end in a self-supervised manner, our method achieves state-of-the-art matching accuracy under varying rotations and shows competitive results on keypoint matching and camera pose estimation.

Extracting discriminative local features that are invariant to imaging variations is an integral part of establishing correspondences between images. In this work, we introduce a self-supervised learning framework to extract discriminative rotation-invariant descriptors using group-equivariant CNNs. Thanks to employing group-equivariant CNNs, our method effectively learns to obtain rotation-equivariant features and their orientations explicitly, without having to perform sophisticated data augmentations. The resultant features and their orientations are further processed by group aligning, a novel invariant mapping technique that shifts the group-equivariant features by their orientations along the group dimension. Our group aligning technique achieves rotation-invariance without any collapse of the group dimension and thus eschews loss of discriminability. The proposed method is trained end-to-end in a self-supervised manner, where we use an orientation alignment loss for the orientation estimation and a contrastive descriptor loss for robust local descriptors to geometric/photometric variations. Our method demonstrates state-of-the-art matching accuracy among existing rotation-invariant descriptors under varying rotation and also shows competitive results when transferred to the task of keypoint matching and camera pose estimation.
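
The group-aligning idea admits a one-line numpy sketch. This is an illustration under simplifying assumptions (a cyclic rotation group of order G, features already group-equivariant, orientation given as a group index): rotating the input cyclically shifts the feature along the group axis, so shifting it back by the estimated orientation yields an identical descriptor without pooling over G.

```python
import numpy as np

def group_align(feat, orientation_idx):
    """Cyclically shift a group-equivariant feature (G, C) along its
    group dimension by the estimated orientation index, producing a
    rotation-invariant descriptor without collapsing the G axis."""
    return np.roll(feat, -orientation_idx, axis=0)
```

Unlike max- or average-pooling over the group dimension, the shifted feature keeps all G x C entries, which is why discriminability is preserved.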

Dynamic Focus-Aware Positional Queries for Semantic Segmentation
He, Haoyu and Cai, Jianfei and Pan, Zizheng and Liu, Jing and Zhang, Jing and Tao, Dacheng and Zhuang, Bohan



Research question: How to improve the accuracy and precision of semantic segmentation.
Motivation: End-to-end trained query sets have driven recent breakthroughs in semantic segmentation, but they rely on learnable parameterized positional queries that tend to encode dataset statistics, leading to inaccurate localization.
Method: We propose a simple yet effective query design termed Dynamic Focus-aware Positional Queries (DFPQ), which dynamically generates positional queries conditioned on the cross-attention scores of the preceding decoder block and the positional encodings of the corresponding image features. We also handle high-resolution cross-attention efficiently by aggregating contextual tokens based on low-resolution cross-attention scores to perform local relation aggregation.
Results: Extensive experiments on ADE20K and Cityscapes show that, with these two modifications to Mask2former, our framework achieves state-of-the-art performance, outperforming Mask2former by clear margins of 1.1%, 1.9%, and 1.1% single-scale mIoU on the ADE20K validation set with ResNet-50, Swin-T, and Swin-B backbones, respectively.

The DETR-like segmentors have underpinned the most recent breakthroughs in semantic segmentation, which end-to-end train a set of queries representing the class prototypes or target segments. Recently, masked attention is proposed to restrict each query to only attend to the foreground regions predicted by the preceding decoder block for easier optimization. Although promising, it relies on the learnable parameterized positional queries which tend to encode the dataset statistics, leading to inaccurate localization for distinct individual queries. In this paper, we propose a simple yet effective query design for semantic segmentation termed Dynamic Focus-aware Positional Queries (DFPQ), which dynamically generates positional queries conditioned on the cross-attention scores from the preceding decoder block and the positional encodings for the corresponding image features, simultaneously. Therefore, our DFPQ preserves rich localization information for the target segments and provides accurate and fine-grained positional priors. In addition, we propose to efficiently deal with high-resolution cross-attention by only aggregating the contextual tokens based on the low-resolution cross-attention scores to perform local relation aggregation. Extensive experiments on ADE20K and Cityscapes show that with the two modifications on Mask2former, our framework achieves SOTA performance and outperforms Mask2former by clear margins of 1.1%, 1.9%, and 1.1% single-scale mIoU with ResNet-50, Swin-T, and Swin-B backbones on the ADE20K validation set, respectively. Source code is available at https://github.com/ziplab/FASeg.

PointConvFormer: Revenge of the Point-Based Convolution
Wu, Wenxuan and Fuxin, Li and Shan, Qi



Research question: This paper introduces PointConvFormer, a novel building block for point cloud deep network architectures.
Motivation: Inspired by generalization theory, PointConvFormer combines point convolution, where filter weights depend only on relative position, with the feature-based attention of Transformers.
Method: In PointConvFormer, attention computed from feature differences between points in a neighborhood modifies the convolutional weights at each point. This preserves the invariances of point convolution while attention selects the relevant information in the neighborhood for convolution.
Results: We experiment on semantic segmentation and scene flow estimation across multiple datasets, including ScanNet, SemanticKITTI, FlyingThings3D, and KITTI. PointConvFormer substantially outperforms classic convolutions, regular Transformers, and voxelized sparse convolution approaches, with smaller and faster networks. Visualizations show that it behaves like convolution on flat regions while exhibiting a stronger neighborhood selection effect on object boundaries, combining the best of both worlds.

We introduce PointConvFormer, a novel building block for point cloud based deep network architectures. Inspired by generalization theory, PointConvFormer combines ideas from point convolution, where filter weights are only based on relative position, and Transformers which utilize feature-based attention. In PointConvFormer, attention computed from feature difference between points in the neighborhood is used to modify the convolutional weights at each point. Hence, we preserved the invariances from point convolution, whereas attention helps to select relevant points in the neighborhood for convolution. We experiment on both semantic segmentation and scene flow estimation tasks on point clouds with multiple datasets including ScanNet, SemanticKitti, FlyingThings3D and KITTI. Our results show that PointConvFormer substantially outperforms classic convolutions, regular transformers, and voxelized sparse convolution approaches with much smaller and faster networks. Visualizations show that PointConvFormer performs similarly to convolution on flat areas, whereas the neighborhood selection effect is stronger on object boundaries, showing that it has got the best of both worlds. The code will be available with the final version.
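
A toy numpy sketch of the core mechanism follows. The attention function here is an illustrative stand-in (a fixed similarity kernel on feature differences) for the learned attention in the paper; the point is only that per-neighbor attention scales the convolution weights, suppressing irrelevant neighbors.

```python
import numpy as np

def pointconvformer_aggregate(center_feat, neighbor_feats, conv_weights):
    """Aggregate K neighbors of one point: attention derived from
    feature differences modulates the per-neighbor conv weights.
    center_feat: (C,); neighbor_feats, conv_weights: (K, C)."""
    diffs = neighbor_feats - center_feat           # (K, C) feature differences
    # Toy attention: neighbors with similar features get weight near 1.
    attn = np.exp(-np.abs(diffs).sum(axis=1))      # (K,) scalar per neighbor
    modulated = conv_weights * attn[:, None]       # attention scales conv weights
    return (modulated * neighbor_feats).sum(axis=0)
```

Because the weights still depend on relative arrangement rather than absolute position, the invariances of point convolution are retained; the attention only reweights neighbors.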

BiFormer: Vision Transformer With Bi-Level Routing Attention
Zhu, Lei and Wang, Xinjiang and Ke, Zhanghan and Zhang, Wayne and Lau, Rynson W. H.



Research question: How to reduce the computational burden and memory footprint of the attention mechanism in vision Transformers.
Motivation: Existing methods address this problem by introducing handcrafted, content-agnostic sparsity, restricting attention to local windows, axial stripes, or dilated windows.
Method: We propose a novel dynamic sparse attention via bi-level routing, enabling more flexible, content-aware allocation of computation. Specifically, for each query, irrelevant key-value pairs are first filtered out at a coarse region level, and fine-grained token-to-token attention is then applied within the remaining candidate (i.e., routed) regions.
Results: The proposed bi-level routing attention saves both computation and memory while involving only GPU-friendly dense matrix multiplications. Built on this attention mechanism, we present a new general-purpose vision Transformer named BiFormer. Experiments show strong performance and high computational efficiency on computer vision tasks such as image classification, object detection, and semantic segmentation.

As the core building block of vision transformers, attention is a powerful tool to capture long-range dependency. However, such power comes at a cost: it incurs a huge computation burden and heavy memory footprint as pairwise token interaction across all spatial locations is computed. A series of works attempt to alleviate this problem by introducing handcrafted and content-agnostic sparsity into attention, such as restricting the attention operation to be inside local windows, axial stripes, or dilated windows. In contrast to these approaches, we propose a novel dynamic sparse attention via bi-level routing to enable a more flexible allocation of computations with content awareness. Specifically, for a query, irrelevant key-value pairs are first filtered out at a coarse region level, and then fine-grained token-to-token attention is applied in the union of remaining candidate regions (i.e., routed regions). We provide a simple yet effective implementation of the proposed bi-level routing attention, which utilizes the sparsity to save both computation and memory while involving only GPU-friendly dense matrix multiplications. Built with the proposed bi-level routing attention, a new general vision transformer, named BiFormer, is then presented. As BiFormer attends to a small subset of relevant tokens in a query-adaptive manner without distraction from other irrelevant ones, it enjoys both good performance and high computational efficiency, especially in dense prediction tasks. Empirical results across several computer vision tasks such as image classification, object detection, and semantic segmentation verify the effectiveness of our design. Code is available at https://github.com/rayleizhu/BiFormer.
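
The two-stage routing can be sketched in numpy. This is a simplified single-head illustration (1D token layout, mean-pooled region descriptors, no learned projections, all names are assumptions): each query region routes to its top-k most similar key regions at the coarse level, and dense token-to-token attention then runs only over the gathered tokens of those routed regions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def bi_level_routing_attention(q, k, v, n_regions, topk):
    """q, k, v: (N, d) token features; N divisible by n_regions."""
    N, d = q.shape
    rs = N // n_regions                        # tokens per region
    q_r = q.reshape(n_regions, rs, d).mean(1)  # region-level queries
    k_r = k.reshape(n_regions, rs, d).mean(1)  # region-level keys
    affinity = q_r @ k_r.T                     # coarse region-to-region scores
    routed = np.argsort(-affinity, axis=1)[:, :topk]  # top-k key regions per query region
    out = np.empty_like(v)
    for r in range(n_regions):
        # Gather only the tokens of the routed key regions.
        idx = np.concatenate([np.arange(j * rs, (j + 1) * rs) for j in routed[r]])
        attn = softmax(q[r * rs:(r + 1) * rs] @ k[idx].T / np.sqrt(d))
        out[r * rs:(r + 1) * rs] = attn @ v[idx]  # fine token-to-token attention
    return out
```

With topk equal to the number of regions this reduces to full attention; smaller topk trades a dense N x N product for an N x (topk * rs) one, which stays a dense matrix multiplication and is therefore GPU-friendly.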

RIAV-MVS: Recurrent-Indexing an Asymmetric Volume for Multi-View Stereo
Cai, Changjiang and Ji, Pan and Yan, Qingan and Xu, Yi



Research question: How to perform multi-view depth estimation from posed images.
Motivation: Existing methods encode multi-view geometry in the cost volume inadequately, leaving room for improvement.
Method: We propose a learning-to-optimize approach that iteratively indexes a plane-sweeping cost volume and regresses the depth map via a convolutional Gated Recurrent Unit (GRU). We further introduce a Transformer block on the reference image to break the symmetry of the Siamese network and extract global features, and a residual pose network to correct pose errors between the reference and source images.
Results: Extensive experiments on real-world MVS datasets show that our method achieves state-of-the-art performance in both within-dataset evaluation and cross-dataset generalization.

This paper presents a learning-based method for multi-view depth estimation from posed images. Our core idea is a "learning-to-optimize" paradigm that iteratively indexes a plane-sweeping cost volume and regresses the depth map via a convolutional Gated Recurrent Unit (GRU). Since the cost volume plays a paramount role in encoding the multi-view geometry, we aim to improve its construction both at pixel- and frame- levels. At the pixel level, we propose to break the symmetry of the Siamese network (which is typically used in MVS to extract image features) by introducing a transformer block to the reference image (but not to the source images). Such an asymmetric volume allows the network to extract global features from the reference image to predict its depth map. Given potential inaccuracies in the poses between reference and source images, we propose to incorporate a residual pose network to correct the relative poses. This essentially rectifies the cost volume at the frame level. We conduct extensive experiments on real-world MVS datasets and show that our method achieves state-of-the-art performance in terms of both within-dataset evaluation and cross-dataset generalization.

VectorFloorSeg: Two-Stream Graph Attention Network for Vectorized Roughcast Floorplan Segmentation
Yang, Bingchen and Jiang, Haiyong and Pan, Hao and Xiao, Jun



Research question: This paper addresses semantic segmentation of a typical vector graphics (VG) scenario, roughcast floorplans, whose output can be directly used in downstream applications such as interior furnishing and room space modeling.
Motivation: Because pixel-level segmentation ignores the regular elements (e.g., line segments) of vector floorplans, previous semantic segmentation works often produce aliased boundaries and outlier fragments in segmented rooms when processing roughcast floorplans with bare wall structures.
Method: Our method fully exploits the regular elements in vector floorplans for more integral segmentation. The pipeline predicts room segmentation from a vector floorplan by classifying line segments as room boundaries and the regions they partition as room segments. To fully exploit the structural relationships between lines and regions, we use a two-stream graph neural network to process line segments and partitioned regions respectively, and devise a novel modulated graph attention layer to fuse heterogeneous information from one stream into the other.
Results: Extensive experiments show that, by operating directly on vector floorplans, we outperform image-based methods in both mIoU and mAcc. We also propose a new metric capturing room integrity and boundary regularity, which confirms that our method produces much more regular segmentations.

Vector graphics (VG) are ubiquitous in industrial designs. In this paper, we address semantic segmentation of a typical VG, i.e., roughcast floorplans with bare wall structures, whose output can be directly used for further applications like interior furnishing and room space modeling. Previous semantic segmentation works mostly process well-decorated floorplans in raster images and usually yield aliased boundaries and outlier fragments in segmented rooms, due to pixel-level segmentation that ignores the regular elements (e.g. line segments) in vector floorplans. To overcome these issues, we propose to fully utilize the regular elements in vector floorplans for more integral segmentation. Our pipeline predicts room segmentation from vector floorplans by dually classifying line segments as room boundaries, and regions partitioned by line segments as room segments. To fully exploit the structural relationships between lines and regions, we use two-stream graph neural networks to process the line segments and partitioned regions respectively, and devise a novel modulated graph attention layer to fuse the heterogeneous information from one stream to the other. Extensive experiments show that by directly operating on vector floorplans, we outperform image-based methods in both mIoU and mAcc. In addition, we propose a new metric that captures room integrity and boundary regularity, which confirms that our method produces much more regular segmentations. Source code is available at https://github.com/DrZiji/VecFloorSeg

Dynamic Aggregated Network for Gait Recognition
Ma, Kang and Fu, Ying and Zheng, Dezhi and Cao, Chunshui and Hu, Xuecai and Huang, Yongzhen



Research question: This paper addresses gait recognition under the multiple exterior factors of real-world scenes, such as carrying conditions, wearing overcoats, and diverse viewing angles.
Motivation: Existing deep learning gait recognition methods tend to extract only a single salient feature, insufficiently modeling the relationships among gait features in key regions and ignoring the aggregation of complete motion patterns.
Method: We propose a new perspective: actual gait features comprise global motion patterns in multiple key regions, each composed of a series of local motion patterns. To this end, we propose a Dynamic Aggregation Network (DANet) to learn more discriminative gait features. Specifically, we create a dynamic attention mechanism between features of neighboring pixels that not only adaptively focuses on key regions but also generates more expressive local motion patterns. We further develop a self-attention mechanism to select representative local motion patterns and learn robust global motion patterns.
Results: Extensive experiments on three popular public gait datasets (CASIA-B, OUMVLP, and Gait3D) demonstrate substantial improvements over current state-of-the-art methods.

Gait recognition is beneficial for a variety of applications, including video surveillance, crime scene investigation, and social security, to mention a few. However, gait recognition often suffers from multiple exterior factors in real scenes, such as carrying conditions, wearing overcoats, and diverse viewing angles. Recently, various deep learning-based gait recognition methods have achieved promising results, but they tend to extract one of the salient features using fixed-weighted convolutional networks, do not well consider the relationship within gait features in key regions, and ignore the aggregation of complete motion patterns. In this paper, we propose a new perspective that actual gait features include global motion patterns in multiple key regions, and each global motion pattern is composed of a series of local motion patterns. To this end, we propose a Dynamic Aggregation Network (DANet) to learn more discriminative gait features. Specifically, we create a dynamic attention mechanism between the features of neighboring pixels that not only adaptively focuses on key regions but also generates more expressive local motion patterns. In addition, we develop a self-attention mechanism to select representative local motion patterns and further learn robust global motion patterns. Extensive experiments on three popular public gait datasets, i.e., CASIA-B, OUMVLP, and Gait3D, demonstrate that the proposed method can provide substantial improvements over the current state-of-the-art methods.

3D Spatial Multimodal Knowledge Accumulation for Scene Graph Prediction in Point Cloud
Feng, Mingtao and Hou, Haoran and Zhang, Liang and Wu, Zijie and Guo, Yulan and Mian, Ajmal



Research question: How to understand and predict object relationships and interactions in 3D scenes more accurately.
Motivation: Because 3D scenes contain partially scanned objects with dense physical connections, varying sizes, and a wide variety of complex relationships, existing methods perform poorly with limited training samples.
Method: We exploit the inherently hierarchical structure of physical space in 3D scenes, combining contextualized visual content and textual facts into a 3D spatial multimodal knowledge graph, with an external knowledge base as the foundation. We further propose a knowledge-enabled scene graph prediction module that leverages 3D spatial knowledge to effectively regularize the semantic space of relationships.
Results: Experiments demonstrate the superiority of the proposed method over current state-of-the-art competitors.

In-depth understanding of a 3D scene not only involves locating/recognizing individual objects, but also requires to infer the relationships and interactions among them. However, since 3D scenes contain partially scanned objects with physical connections, dense placement, changing sizes, and a wide variety of challenging relationships, existing methods perform quite poorly with limited training samples. In this work, we find that the inherently hierarchical structures of physical space in 3D scenes aid in the automatic association of semantic and spatial arrangements, specifying clear patterns and leading to less ambiguous predictions. Thus, they well meet the challenges due to the rich variations within scene categories. To achieve this, we explicitly unify these structural cues of 3D physical spaces into deep neural networks to facilitate scene graph prediction. Specifically, we exploit an external knowledge base as a baseline to accumulate both contextualized visual content and textual facts to form a 3D spatial multimodal knowledge graph. Moreover, we propose a knowledge-enabled scene graph prediction module benefiting from the 3D spatial knowledge to effectively regularize semantic space of relationships. Extensive experiments demonstrate the superiority of the proposed method over current state-of-the-art competitors. Our code is available at https://github.com/HHrEtvP/SMKA.

Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation
Zhang, Guozhen and Zhu, Yuhan and Wang, Haonan and Chen, Youxin and Wu, Gangshan and Wang, Limin



Research question: How to effectively extract inter-frame motion and appearance information for video frame interpolation.
Motivation: Previous works either extract the two types of information in a mixed way or devise separate modules for each, leading to representation ambiguity and low efficiency.
Method: We propose a new module that explicitly extracts motion and appearance information through a unified operation. Specifically, we rethink the information flow in inter-frame attention and reuse its attention map for both appearance feature enhancement and motion information extraction.
Results: Experiments show state-of-the-art performance on various datasets for both fixed- and arbitrary-timestep interpolation, with a lighter computational overhead than models of comparable performance.

Effectively extracting inter-frame motion and appearance information is important for video frame interpolation (VFI). Previous works either extract both types of information in a mixed way or devise separate modules for each type of information, which lead to representation ambiguity and low efficiency. In this paper, we propose a new module to explicitly extract motion and appearance information via a unified operation. Specifically, we rethink the information process in inter-frame attention and reuse its attention map for both appearance feature enhancement and motion information extraction. Furthermore, for efficient VFI, our proposed module could be seamlessly integrated into a hybrid CNN and Transformer architecture. This hybrid pipeline can alleviate the computational complexity of inter-frame attention as well as preserve detailed low-level structure information. Experimental results demonstrate that, for both fixed- and arbitrary-timestep interpolation, our method achieves state-of-the-art performance on various datasets. Meanwhile, our approach enjoys a lighter computation overhead over models with close performance. The source code and models are available at https://github.com/MCG-NJU/EMA-VFI.

ViTs for SITS: Vision Transformers for Satellite Image Time Series
Tarasiou, Michail and Chavez, Erik and Zafeiriou, Stefanos



Research question: This paper introduces a fully attentional model based on the Vision Transformer for processing Satellite Image Time Series (SITS).
Motivation: In contrast to natural images, a temporal-then-spatial factorization is more intuitive for SITS processing.
Method: A SITS record is split into non-overlapping spatial and temporal patches, tokenized, and processed by a factorized temporo-spatial encoder. Two novel mechanisms, acquisition-time-specific temporal positional encodings and multiple learnable class tokens, further enhance the model's discriminative power.
Results: All novel design choices are evaluated through an extensive ablation study. The proposed architecture achieves state-of-the-art performance on three publicly available SITS semantic segmentation and classification datasets, surpassing previous approaches by a significant margin.

In this paper we introduce the Temporo-Spatial Vision Transformer (TSViT), a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT). TSViT splits a SITS record into non-overlapping patches in space and time which are tokenized and subsequently processed by a factorized temporo-spatial encoder. We argue, that in contrast to natural images, a temporal-then-spatial factorization is more intuitive for SITS processing and present experimental evidence for this claim. Additionally, we enhance the model's discriminative power by introducing two novel mechanisms for acquisition-time-specific temporal positional encodings and multiple learnable class tokens. The effect of all novel design choices is evaluated through an extensive ablation study. Our proposed architecture achieves state-of-the-art performance, surpassing previous approaches by a significant margin in three publicly available SITS semantic segmentation and classification datasets. All model, training and evaluation codes can be found at https://github.com/michaeltrs/DeepSatModels.

Graph Transformer GANs for Graph-Constrained House Generation
Tang, Hao and Zhang, Zhenyu and Shi, Humphrey and Li, Bo and Shao, Ling and Sebe, Nicu and Timofte, Radu and Van Gool, Luc



Research question: How to effectively learn graph node relations for the challenging graph-constrained house generation task.
Motivation: Existing methods fail to effectively capture both global and local interaction information among graph nodes for graph-constrained house generation.
Method: We propose a novel graph Transformer generative adversarial network (GTGAN), whose generator includes a graph Transformer encoder combining graph convolutions and self-attention to model local and global interactions across connected and non-connected graph nodes. We also propose a new node classification-based discriminator to preserve high-level semantic and discriminative node features for different house components.
Results: Experiments on two challenging graph-constrained house generation tasks (house layout and roof generation) show that GTGAN performs well in both objective quantitative scores and subjective visual realism, establishing new state-of-the-art results on both tasks.

We present a novel graph Transformer generative adversarial network (GTGAN) to learn effective graph node relations in an end-to-end fashion for the challenging graph-constrained house generation task. The proposed graph-Transformer-based generator includes a novel graph Transformer encoder that combines graph convolutions and self-attentions in a Transformer to model both local and global interactions across connected and non-connected graph nodes. Specifically, the proposed connected node attention (CNA) and non-connected node attention (NNA) aim to capture the global relations across connected nodes and non-connected nodes in the input graph, respectively. The proposed graph modeling block (GMB) aims to exploit local vertex interactions based on a house layout topology. Moreover, we propose a new node classification-based discriminator to preserve the high-level semantic and discriminative node features for different house components. Finally, we propose a novel graph-based cycle-consistency loss that aims at maintaining the relative spatial relationships between ground truth and predicted graphs. Experiments on two challenging graph-constrained house generation tasks (i.e., house layout and roof generation) with two public datasets demonstrate the effectiveness of GTGAN in terms of objective quantitative scores and subjective visual realism. New state-of-the-art results are established by large margins on both tasks.

LG-BPN: Local and Global Blind-Patch Network for Self-Supervised Real-World Denoising
Wang, Zichun and Fu, Ying and Liu, Ji and Zhang, Yulun



Research question: Most self-supervised denoising methods fail under real noise because of strong spatial noise correlation.
Motivation: Existing methods targeting real-world denoising either ignore this spatial correlation or destroy fine textures by under-considering it.
Method: We propose LG-BPN, a novel self-supervised real-world denoising method that incorporates the spatial correlation statistic into the network design for local detail restoration, and brings long-range dependency modeling to previously CNN-based BSN methods. It comprises a densely-sampled patch-masked convolution module based on the correlation statistic and a dilated Transformer block that allows distant context exploitation within the BSN.
Results: Extensive experiments demonstrate that LG-BPN fully exploits both detailed structure and global interaction, achieving superior performance on real-world datasets.

Despite the significant results on synthetic noise under simplified assumptions, most self-supervised denoising methods fail under real noise due to the strong spatial noise correlation, including the advanced self-supervised blind-spot networks (BSNs). For recent methods targeting real-world denoising, they either suffer from ignoring this spatial correlation, or are limited by the destruction of fine textures for under-considering the correlation. In this paper, we present a novel method called LG-BPN for self-supervised real-world denoising, which takes the spatial correlation statistic into our network design for local detail restoration, and also brings the long-range dependencies modeling ability to previously CNN-based BSN methods. First, based on the correlation statistic, we propose a densely-sampled patch-masked convolution module. By taking more neighbor pixels with low noise correlation into account, we enable a denser local receptive field, preserving more useful information for enhanced fine structure recovery. Second, we propose a dilated Transformer block to allow distant context exploitation in BSN. This global perception addresses the intrinsic deficiency of BSN, whose receptive field is constrained by the blind spot requirement, which can not be fully resolved by the previous CNN-based BSNs. These two designs enable LG-BPN to fully exploit both the detailed structure and the global interaction in a blind manner. Extensive results on real-world datasets demonstrate the superior performance of our method. https://github.com/Wang-XIaoDingdd/LGBPN

Self-Positioning Point-Based Transformer for Point Cloud Understanding
Park, Jinyoung and Lee, Sanghyeok and Kim, Sihyeon and Xiong, Yunyang and Kim, Hyunwoo J.



Research question: How to apply Transformers directly to point cloud data despite their quadratic cost in the number of points.
Motivation: Although Transformers excel at computer vision tasks, applying them directly to point clouds is prohibitively expensive.
Method: We propose a Self-Positioning point-based Transformer (SPoTr) architecture composed of local self-attention and global cross-attention over self-positioning points that are adaptively located based on the input shape.
Results: Experiments show that SPoTr is effective on three point cloud tasks (shape classification, part segmentation, and scene segmentation), improving accuracy by 2.6% over the previous best model on shape classification.

Transformers have shown superior performance on various computer vision tasks with their capabilities to capture long-range dependencies. Despite the success, it is challenging to directly apply Transformers on point clouds due to their quadratic cost in the number of points. In this paper, we present a Self-Positioning point-based Transformer (SPoTr), which is designed to capture both local and global shape contexts with reduced complexity. Specifically, this architecture consists of local self-attention and self-positioning point-based global cross-attention. The self-positioning points, adaptively located based on the input shape, consider both spatial and semantic information with disentangled attention to improve expressive power. With the self-positioning points, we propose a novel global cross-attention mechanism for point clouds, which improves the scalability of global self-attention by allowing the attention module to compute attention weights with only a small set of self-positioning points. Experiments show the effectiveness of SPoTr on three point cloud tasks such as shape classification, part segmentation, and scene segmentation. In particular, our proposed model achieves an accuracy gain of 2.6% over the previous best models on shape classification with ScanObjectNN. We also provide qualitative analyses to demonstrate the interpretability of self-positioning points. The code of SPoTr is available at https://github.com/mlvlab/SPoTr.

Learning Dynamic Style Kernels for Artistic Style Transfer
Xu, Wenju and Long, Chengjiang and Nie, Yongwei



Research question: How to perform effective arbitrary style transfer for artistic image generation.
Motivation: Existing methods either globally modulate content features while ignoring local details, or over-focus on local structure details, causing style leakage.
Method: We propose a new "style kernel" scheme that learns spatially adaptive kernels for per-pixel stylization: convolutional kernels are dynamically generated from globally style-content aligned features, and the learned kernels then modulate the content feature at each spatial position.
Results: The method outperforms existing approaches and exhibits superior performance in both visual quality and efficiency.

Arbitrary style transfer has been demonstrated to be efficient in artistic image generation. Previous methods either globally modulate the content feature ignoring local details, or overly focus on the local structure details leading to style leakage. In contrast to the literature, we propose a new scheme "style kernel" that learns spatially adaptive kernel for per-pixel stylization, where the convolutional kernels are dynamically generated from the global style-content aligned feature and then the learned kernels are applied to modulate the content feature at each spatial position. This new scheme allows flexible both global and local interactions between the content and style features such that the wanted styles can be easily transferred to the content image while at the same time the content structure can be easily preserved. To further enhance the flexibility of our style transfer method, we propose a Style Alignment Encoding (SAE) module complemented with a Content-based Gating Modulation (CGM) module for learning the dynamic style kernels in focusing regions. Extensive experiments strongly demonstrate that our proposed method outperforms state-of-the-art methods and exhibits superior performance in terms of visual quality and efficiency.

OcTr: Octree-Based Transformer for 3D Object Detection
Zhou, Chao and Zhang, Yanan and Chen, Jiaxin and Huang, Di



Research question: How to capture sufficient features from large-scale 3D scenes, especially for distant or occluded objects.
Motivation: Despite their long-sequence modeling ability, Transformers fail to properly balance accuracy and efficiency, suffering from inadequate receptive fields or coarse-grained global correlations.
Method: We propose OcTr, an octree-based Transformer. It first constructs a dynamic octree by performing self-attention at the top level, then recursively propagates to lower levels restricted by the octants, capturing rich global context in a coarse-to-fine manner while keeping computational complexity under control. For enhanced foreground perception, we further propose a hybrid positional embedding, composed of a semantic-aware positional embedding and an attention mask, to fully exploit semantic and geometric cues.
Results: Extensive experiments on the Waymo Open Dataset and the KITTI Dataset show that OcTr achieves new state-of-the-art results.

A key challenge for LiDAR-based 3D object detection is to capture sufficient features from large scale 3D scenes, especially for distant and/or occluded objects. Albeit recent efforts made by Transformers with the long sequence modeling capability, they fail to properly balance the accuracy and efficiency, suffering from inadequate receptive fields or coarse-grained holistic correlations. In this paper, we propose an Octree-based Transformer, named OcTr, to address this issue. It first constructs a dynamic octree on the hierarchical feature pyramid through conducting self-attention on the top level and then recursively propagates to the level below restricted by the octants, which captures rich global context in a coarse-to-fine manner while maintaining the computational complexity under control. Furthermore, for enhanced foreground perception, we propose a hybrid positional embedding, composed of the semantic-aware positional embedding and attention mask, to fully exploit semantic and geometry clues. Extensive experiments are conducted on the Waymo Open Dataset and KITTI Dataset, and OcTr reaches new state-of-the-art results.

GeoMAE: Masked Geometric Target Prediction for Self-Supervised Point Cloud Pre-Training
Tian, Xiaoyu and Ran, Haoxi and Wang, Yue and Zhao, Hang



Research question: This paper addresses a fundamental question in point cloud self-supervised learning: without annotations, what signal should we leverage to learn features from point clouds?
Motivation: Existing methods that directly adopt masked autoencoders (MAE) and merely predict original coordinates or occupancy from masked point clouds ignore the differences between images and point clouds.
Method: We propose a point cloud representation learning framework based on geometric feature reconstruction. Revisiting the differences between images and point clouds, we identify three self-supervised learning objectives peculiar to point clouds: centroid prediction, normal estimation, and curvature prediction. Together, these three objectives form a nontrivial self-supervised task and mutually help the model better reason about the fine-grained geometry of point clouds.
Results: On the nuScenes dataset, the method achieves gains of 3.38 mAP for object detection, 2.1 mIoU for segmentation, and 1.7 AMOTA for multi-object tracking, with significant performance improvements on the Waymo Open Dataset as well.

This paper tries to address a fundamental question in point cloud self-supervised learning: what is a good signal we should leverage to learn features from point clouds without annotations? To answer that, we introduce a point cloud representation learning framework, based on geometric feature reconstruction. In contrast to recent papers that directly adopt masked autoencoder (MAE) and only predict original coordinates or occupancy from masked point clouds, our method revisits differences between images and point clouds and identifies three self-supervised learning objectives peculiar to point clouds, namely centroid prediction, normal estimation, and curvature prediction. Combined, these three objectives yield a nontrivial self-supervised learning task and mutually facilitate models to better reason fine-grained geometry of point clouds. Our pipeline is conceptually simple and it consists of two major steps: first, it randomly masks out groups of points, followed by a Transformer-based point cloud encoder; second, a lightweight Transformer decoder predicts centroid, normal, and curvature for points in each voxel. We transfer the pre-trained Transformer encoder to a downstream perception model. On the nuScenes Dataset, our model achieves 3.38 mAP improvement for object detection, 2.1 mIoU gain for segmentation, and 1.7 AMOTA gain for multi-object tracking. We also conduct experiments on the Waymo Open Dataset and achieve significant performance improvements over baselines as well.
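
The three geometric targets can be computed classically from a group of points; a sketch using PCA on the local covariance follows (the paper's exact target definitions may differ, and the surface-variation ratio is one common curvature proxy among several).

```python
import numpy as np

def geometric_targets(points):
    """Compute the three self-supervised targets for one point group:
    centroid, surface normal (smallest-eigenvalue direction of the local
    covariance), and a curvature proxy (surface variation).
    points: (K, 3)."""
    centroid = points.mean(axis=0)
    cov = np.cov((points - centroid).T)     # (3, 3) local covariance
    evals, evecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    normal = evecs[:, 0]                    # direction of least variance
    curvature = evals[0] / evals.sum()      # 0 for a perfect plane
    return centroid, normal, curvature
```

Unlike raw coordinates, these targets are invariant to point ordering and (for the normal, up to sign) describe local surface shape, which is why they make a harder, more geometric pretext task.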

PVT-SSD: Single-Stage 3D Object Detector With Point-Voxel Transformer
Yang, Honghui and Wang, Wenxiao and Chen, Minghao and Lin, Binbin and He, Tong and Chen, Hua and He, Xiaofei and Ouyang, Wanli



Research question: Existing Transformer-based 3D object detectors learn point cloud features either with time-consuming point sampling or with quantization errors introduced by voxelization.
Motivation: We propose a novel single-stage 3D detection method that combines the advantages of point- and voxel-based representations.
Method: We first use voxel-based sparse convolutions for efficient feature encoding, then propose a Point-Voxel Transformer (PVT) module that obtains long-range context cheaply from voxels while attaining accurate positions from points. An input-dependent Query Initialization module associates the two representations.
Results: Experiments on autonomous driving benchmarks verify the efficiency and effectiveness of the proposed method.

Recent Transformer-based 3D object detectors learn point cloud features either from point- or voxel-based representations. However, the former requires time-consuming sampling while the latter introduces quantization errors. In this paper, we present a novel Point-Voxel Transformer for single-stage 3D detection (PVT-SSD) that takes advantage of these two representations. Specifically, we first use voxel-based sparse convolutions for efficient feature encoding. Then, we propose a Point-Voxel Transformer (PVT) module that obtains long-range contexts in a cheap manner from voxels while attaining accurate positions from points. The key to associating the two different representations is our introduced input-dependent Query Initialization module, which could efficiently generate reference points and content queries. Then, PVT adaptively fuses long-range contextual and local geometric information around reference points into content queries. Further, to quickly find the neighboring points of reference points, we design the Virtual Range Image module, which generalizes the native range image to multi-sensor and multi-frame. The experiments on several autonomous driving benchmarks verify the effectiveness and efficiency of the proposed method. Code will be available.

Harmonious Feature Learning for Interactive Hand-Object Pose Estimation
Lin, Zhifeng and Ding, Changxing and Yao, Huan and Kuang, Zengsheng and Huang, Shaoli



Research question: Hand and object pose estimation from a single image is extremely challenging due to severe occlusion.
Motivation: Existing methods typically first extract coarse hand and object features from a single backbone and then refine them with reference to each other via interaction modules, but they usually overlook that the hand and the object compete with each other in feature learning.
Method: This paper proposes a novel Harmonious Feature Learning Network (HFL-Net). HFL-Net introduces a framework that combines the advantages of single- and double-stream backbones: the hand and object share the parameters of the low- and high-level convolutional layers of a common ResNet-50 model, while the middle-level layers are left unshared. This strategy lets the middle layers extract the hand and the object each as a sole target, avoiding their competition in feature learning; the shared high-level layers also force their features to stay harmonious, facilitating mutual feature enhancement.
Results: Experiments show the method consistently outperforms state-of-the-art approaches on the popular HO3D and Dex-YCB databases. Notably, its hand pose estimation even surpasses existing works dedicated solely to single-hand pose estimation. Code is available at https://github.com/lzfff12/HFL-Net.

Joint hand and object pose estimation from a single image is extremely challenging as serious occlusion often occurs when the hand and object interact. Existing approaches typically first extract coarse hand and object features from a single backbone, then further enhance them with reference to each other via interaction modules. However, these works usually ignore that the hand and object are competitive in feature learning, since the backbone takes both of them as foreground and they are usually mutually occluded. In this paper, we propose a novel Harmonious Feature Learning Network (HFL-Net). HFL-Net introduces a new framework that combines the advantages of single- and double-stream backbones: it shares the parameters of the low- and high-level convolutional layers of a common ResNet-50 model for the hand and object, leaving the middle-level layers unshared. This strategy enables the hand and the object to be extracted as the sole targets by the middle-level layers, avoiding their competition in feature learning. The shared high-level layers also force their features to be harmonious, thereby facilitating their mutual feature enhancement. In particular, we propose to enhance the feature of the hand via concatenation with the feature in the same location from the object stream. A subsequent self-attention layer is adopted to deeply fuse the concatenated feature. Experimental results show that our proposed approach consistently outperforms state-of-the-art methods on the popular HO3D and Dex-YCB databases. Notably, the performance of our model on hand pose estimation even surpasses that of existing works that only perform the single-hand pose estimation task. Code is available at https://github.com/lzfff12/HFL-Net.

Distilling Cross-Temporal Contexts for Continuous Sign Language Recognition
Guo, Leming and Xue, Wanli and Guo, Qing and Liu, Bo and Zhang, Kaihua and Yuan, Tiantian and Chen, Shengyong



Research question: This paper addresses the insufficient training of the spatial perception module in continuous sign language recognition.
Motivation: Existing sign language recognition methods usually contain a spatial perception module and a temporal aggregation module, but the spatial perception module tends to be under-trained.
Method: A cross-temporal context aggregation (CTCA) model is proposed: a dual-path network captures local-temporal and global-temporal context separately, and a cross-context knowledge distillation objective is designed to aggregate the two types of context together with linguistic priors.
Results: Experiments show the method outperforms all state-of-the-art approaches on challenging sign language recognition benchmarks.

Continuous sign language recognition (CSLR) aims to recognize glosses in a sign language video. State-of-the-art methods typically have two modules, a spatial perception module and a temporal aggregation module, which are jointly learned end-to-end. Existing results in [9,20,25,36] have indicated that, as the frontal component of the overall model, the spatial perception module used for spatial feature extraction tends to be insufficiently trained. In this paper, we first conduct empirical studies and show that a shallow temporal aggregation module allows more thorough training of the spatial perception module. However, a shallow temporal aggregation module cannot well capture both local and global temporal context information in sign language. To address this dilemma, we propose a cross-temporal context aggregation (CTCA) model. Specifically, we build a dual-path network that contains two branches for perceptions of local temporal context and global temporal context. We further design a cross-context knowledge distillation learning objective to aggregate the two types of context and the linguistic prior. The knowledge distillation enables the resultant one-branch temporal aggregation module to perceive local-global temporal and semantic context. This shallow temporal perception module structure facilitates spatial perception module learning. Extensive experiments on challenging CSLR benchmarks demonstrate that our method outperforms all state-of-the-art methods.

ProxyFormer: Proxy Alignment Assisted Point Cloud Completion With Missing Part Sensitive Transformer
Li, Shanshan and Gao, Pan and Tan, Xiaoyang and Wei, Mingqiang



Research question: How can a complete point cloud be recovered from a partial one?
Motivation: Captured point clouds can be incomplete due to device defects or limited viewpoints, so recovering the complete point cloud from a partial one plays a key role in many practical tasks.
Method: This paper proposes ProxyFormer, a new point cloud completion method that divides the point cloud into an existing (input) part and a missing (to-be-predicted) part that exchange information through proxies. Specifically, a feature and position extractor fuses information into point proxies, and the features of missing point proxies are generated from those of existing point proxies. To better perceive the positions of missing points, a missing-part-sensitive transformer converts a random normal distribution into plausible position information and refines the missing proxies via proxy alignment. This makes the predicted point proxies more sensitive to the features and positions of the missing part, and thus better suited to the subsequent coarse-to-fine process.
Results: Experiments show the method outperforms state-of-the-art completion networks on several benchmark datasets while having the fastest inference speed.

Problems such as equipment defects or limited viewpoints will lead the captured point clouds to be incomplete. Therefore, recovering the complete point clouds from the partial ones plays a vital role in many practical tasks, and one of the keys lies in the prediction of the missing part. In this paper, we propose a novel point cloud completion approach, named ProxyFormer, which divides point clouds into existing (input) and missing (to be predicted) parts, where each part communicates information through its proxies. Specifically, we fuse information into point proxies via a feature and position extractor, and generate features for missing point proxies from the features of existing point proxies. Then, in order to better perceive the positions of missing points, we design a missing-part-sensitive transformer, which converts a random normal distribution into reasonable position information, and uses proxy alignment to refine the missing proxies. This makes the predicted point proxies more sensitive to the features and positions of the missing part, and thus makes these proxies more suitable for subsequent coarse-to-fine processes. Experimental results show that our method outperforms state-of-the-art completion networks on several benchmark datasets and has the fastest inference speed.

FrustumFormer: Adaptive Instance-Aware Resampling for Multi-View 3D Detection
Wang, Yuqi and Chen, Yuntao and Zhang, Zhaoxiang



Research question: How to effectively transform features from the 2D perspective space into 3D space for multi-view 3D object detection.
Motivation: Current methods focus mainly on the design of the view transformation, while what content should be transformed is rarely discussed.
Method: This paper proposes a new framework named FrustumFormer, which pays more attention to features in instance regions via adaptive instance-aware resampling. The model obtains instance frustums on the bird's-eye view by leveraging image-view object proposals, and learns adaptive occupancy masks within the instance frustums to refine instance locations. In addition, temporal frustum intersection further reduces the localization uncertainty of objects.
Results: Comprehensive experiments on the nuScenes dataset demonstrate the effectiveness of FrustumFormer, which achieves new state-of-the-art performance on the benchmark.

The transformation of features from 2D perspective space to 3D space is essential to multi-view 3D object detection. Recent approaches mainly focus on the design of view transformation, either pixel-wisely lifting perspective view features into 3D space with estimated depth or grid-wisely constructing BEV features via 3D projection, treating all pixels or grids equally. However, choosing what to transform is also important but has rarely been discussed before. The pixels of a moving car are more informative than the pixels of the sky. To fully utilize the information contained in images, the view transformation should be able to adapt to different image regions according to their contents. In this paper, we propose a novel framework named FrustumFormer, which pays more attention to the features in instance regions via adaptive instance-aware resampling. Specifically, the model obtains instance frustums on the bird's eye view by leveraging image view object proposals. An adaptive occupancy mask within the instance frustum is learned to refine the instance location. Moreover, the temporal frustum intersection could further reduce the localization uncertainty of objects. Comprehensive experiments on the nuScenes dataset demonstrate the effectiveness of FrustumFormer, and we achieve a new state-of-the-art performance on the benchmark. Codes and models will be made available at https://github.com/Robertwyq/Frustum.

Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation
Zhang, Ning and Nex, Francesco and Vosselman, George and Kerle, Norman



Research question: How to design a lightweight yet effective self-supervised monocular depth estimation model that can be deployed on edge devices.
Motivation: Most existing architectures use heavier backbones at the expense of model size, so designing lightweight models is of high research value.
Method: This paper proposes Lite-Mono, a hybrid architecture combining CNNs and Transformers: a Consecutive Dilated Convolutions (CDC) module extracts rich multi-scale local features, while a Local-Global Features Interaction (LGFI) module uses the self-attention mechanism to encode long-range global information into the features.
Results: Experiments show that Lite-Mono surpasses Monodepth2 by a large margin in accuracy while using about 80% fewer parameters.

Self-supervised monocular depth estimation that does not require ground truth for training has attracted attention in recent years. It is of high interest to design lightweight but effective models so that they can be deployed on edge devices. Many existing architectures benefit from using heavier backbones at the expense of model sizes. This paper achieves comparable results with a lightweight architecture. Specifically, the efficient combination of CNNs and Transformers is investigated, and a hybrid architecture called Lite-Mono is presented. A Consecutive Dilated Convolutions (CDC) module and a Local-Global Features Interaction (LGFI) module are proposed. The former is used to extract rich multi-scale local features, and the latter takes advantage of the self-attention mechanism to encode long-range global information into the features. Experiments demonstrate that Lite-Mono outperforms Monodepth2 by a large margin in accuracy, with about 80% fewer trainable parameters. Our codes and models are available at https://github.com/noahzn/Lite-Mono.
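How dilation buys receptive field without extra parameters, the property the CDC module builds on, can be illustrated with a minimal 1D convolution. This is a sketch of the general technique, not Lite-Mono's actual module:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Valid-mode 1D convolution whose taps are spaced `dilation` samples
    apart: the receptive field grows with the dilation factor while the
    number of kernel weights stays the same."""
    k = len(kernel)
    span = (k - 1) * dilation           # receptive field minus one
    out = np.zeros(len(x) - span)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out

x = np.arange(16, dtype=float)
# A two-tap difference kernel with dilation 2 compares samples two steps
# apart, so each output sees a wider context than a dense kernel would.
y = dilated_conv1d(x, kernel=[1.0, -1.0], dilation=2)
```

Stacking such layers with increasing dilation, as consecutive-dilated-convolution designs typically do, grows the receptive field geometrically.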

Starting From Non-Parametric Networks for 3D Point Cloud Analysis
Zhang, Renrui and Wang, Liuhui and Wang, Yali and Gao, Peng and Li, Hongsheng and Shi, Jianbo



Research question: This paper develops Point-NN, a non-parametric network for 3D point cloud analysis that requires no parameters or training.
Motivation: Existing 3D models require large numbers of parameters and extensive training, whereas Point-NN, built purely from non-learnable components such as farthest point sampling, k-nearest neighbors, and pooling operations together with trigonometric functions, performs well on various 3D tasks without any parameters or training, and can even surpass fully trained models.
Method: Point-NN consists of non-learnable components: farthest point sampling, k-nearest neighbors, pooling operations, and trigonometric functions, with no parameters or training. Two extensions are further proposed: first, using Point-NN as a base architectural framework and inserting linear layers on top to construct parametric networks; second, treating Point-NN as a plug-and-play module for already trained 3D models to enhance existing methods during inference.
Results: Experiments show that Point-NN performs well on various 3D tasks, that the derived Point-PN achieves an excellent performance-efficiency trade-off with only a few learnable parameters, and that Point-NN can boost existing methods as a plug-in module without retraining.

We present a Non-parametric Network for 3D point cloud analysis, Point-NN, which consists of purely non-learnable components: farthest point sampling (FPS), k-nearest neighbors (k-NN), and pooling operations, with trigonometric functions. Surprisingly, it performs well on various 3D tasks, requiring no parameters or training, and even surpasses existing fully trained models. Starting from this basic non-parametric model, we propose two extensions. First, Point-NN can serve as a base architectural framework to construct Parametric Networks by simply inserting linear layers on top. Given the superior non-parametric foundation, the derived Point-PN exhibits a high performance-efficiency trade-off with only a few learnable parameters. Second, Point-NN can be regarded as a plug-and-play module for the already trained 3D models during inference. Point-NN captures the complementary geometric knowledge and enhances existing methods for different 3D benchmarks without re-training. We hope our work may cast a light on the community for understanding 3D point clouds with non-parametric methods. Code is available at https://github.com/ZrrSkywalker/Point-NN.
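Point-NN's building blocks are all classical, parameter-free operations. A rough NumPy sketch of their flavor; the frequency schedule, sample counts, and function names are illustrative, not the paper's exact recipe:

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Pick m well-spread points with no learned parameters."""
    dist = np.full(len(points), np.inf)
    chosen = [0]
    for _ in range(m - 1):
        # Distance of every point to its nearest already-chosen point.
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(dist.argmax()))
    return np.array(chosen)

def trig_embedding(points, n_freqs=2):
    """Encode coordinates with sines/cosines at doubling frequencies."""
    freqs = 2.0 ** np.arange(n_freqs)
    ang = points[:, :, None] * freqs               # (N, 3, n_freqs)
    feat = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    return feat.reshape(len(points), -1)           # (N, 3 * 2 * n_freqs)

pts = np.random.RandomState(0).rand(128, 3)
idx = farthest_point_sampling(pts, 16)             # downsample
feats = trig_embedding(pts[idx])                   # non-learnable features
global_feat = feats.max(axis=0)                    # parameter-free max pooling
```

Everything above is deterministic given the input cloud; "training" never enters, which is the point of the non-parametric baseline.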

3D Human Pose Estimation With Spatio-Temporal Criss-Cross Attention
Tang, Zhenhua and Qiu, Zhaofan and Hao, Yanbin and Hong, Richang and Yao, Ting



Research question: In existing transformer-based 3D human pose estimation methods, the cost of computing the joint-to-joint affinity matrix grows quadratically with the number of joints, and the problem is even worse for pose estimation in video sequences.
Motivation: To address this, the paper proposes a new Spatio-Temporal Criss-cross attention (STC) block that decomposes correlation learning into its spatial and temporal parts.
Method: STC first slices its input features evenly into two partitions along the channel dimension, then performs spatial and temporal attention on each partition respectively. By concatenating the outputs of the attention layers, STC simultaneously models interactions between joints in the same frame and joints along the same trajectory. On this basis, STCFormer is built by stacking multiple STC blocks, and a new Structure-enhanced Positional Embedding (SPE) is integrated into STCFormer to take the structure of the human body into account.
Results: Extensive experiments on the Human3.6M and MPI-INF-3DHP benchmarks show superior results compared to state-of-the-art methods. Notably, STCFormer achieves the best published performance to date on the challenging Human3.6M dataset: 40.5 mm P1 error.

Recent transformer-based solutions have shown great success in 3D human pose estimation. Nevertheless, to calculate the joint-to-joint affinity matrix, the computational cost has a quadratic growth with the increasing number of joints. Such a drawback becomes even worse for pose estimation in a video sequence, which necessitates spatio-temporal correlation spanning the entire video. In this paper, we alleviate this issue by decomposing correlation learning into space and time, and present a novel Spatio-Temporal Criss-cross attention (STC) block. Technically, STC first slices its input feature into two partitions evenly along the channel dimension, followed by performing spatial and temporal attention respectively on each partition. STC then models the interactions between joints in an identical frame and joints in an identical trajectory simultaneously by concatenating the outputs from the attention layers. On this basis, we devise STCFormer by stacking multiple STC blocks and further integrate a new Structure-enhanced Positional Embedding (SPE) into STCFormer to take the structure of the human body into consideration. The embedding function consists of two components: spatio-temporal convolution around neighboring joints to capture local structure, and part-aware embedding to indicate which part each joint belongs to. Extensive experiments are conducted on the Human3.6M and MPI-INF-3DHP benchmarks, and superior results are reported when comparing to the state-of-the-art approaches. More remarkably, STCFormer achieves the best published performance to date: 40.5 mm P1 error on the challenging Human3.6M dataset.
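The channel-split criss-cross idea can be sketched without the learned query/key/value projections, which are omitted here. A minimal NumPy version with illustrative shapes: attention over joints within each frame on one channel half, attention over frames along each joint's trajectory on the other.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    """Plain dot-product self-attention over the second-to-last axis
    (projections omitted for clarity)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def stc_block(x):
    """x: (T frames, J joints, C channels). One channel half gets spatial
    attention (joints within a frame), the other temporal attention
    (frames along one joint's trajectory); outputs are concatenated."""
    c = x.shape[-1] // 2
    spatial = self_attention(x[..., :c])                               # (T, J, c)
    temporal = np.swapaxes(self_attention(np.swapaxes(x[..., c:], 0, 1)), 0, 1)
    return np.concatenate([spatial, temporal], axis=-1)                # (T, J, C)

out = stc_block(np.random.RandomState(0).rand(8, 17, 32))
```

Each affinity matrix is now J x J or T x T rather than (T*J) x (T*J), which is the cost reduction the decomposition is after.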

LoGoNet: Towards Accurate 3D Object Detection With Local-to-Global Cross-Modal Fusion
Li, Xin and Ma, Tao and Hou, Yuenan and Shi, Botian and Yang, Yuchen and Liu, Youquan and Wu, Xingjiao and Chen, Qin and Li, Yikang and Qiao, Yu and He, Liang



Research question: How to fuse LiDAR and camera data for 3D object detection.
Motivation: Existing multi-modal methods mainly perform global fusion and lack fine-grained region-level information, yielding suboptimal fusion performance.
Method: The paper proposes the Local-to-Global fusion network (LoGoNet), which performs LiDAR-camera fusion at both local and global levels. Specifically, the global fusion builds on previous work but uses point centroids to more precisely represent the positions of voxel features, achieving better cross-modal alignment. For local fusion, each proposal is first divided into uniform grids and the grid centers are projected onto the images; image features around the projected grid points are sampled and fused with position-decorated point cloud features, maximally exploiting the rich contextual information around proposals. A Feature Dynamic Aggregation (FDA) module is further proposed to enable information interaction between the locally and globally fused features, producing more informative multi-modal features.
Results: Extensive experiments on the Waymo Open Dataset (WOD) and KITTI show that LoGoNet outperforms all state-of-the-art 3D detection methods. In particular, it ranks 1st on the Waymo 3D object detection leaderboard with 81.02 mAPH (L2), and for the first time the detection performance on three classes surpasses 80 APH (L2) simultaneously. Code will be available at https://github.com/sankin97/LoGoNet.

LiDAR-camera fusion methods have shown impressive performance in 3D object detection. Recent advanced multi-modal methods mainly perform global fusion, where image features and point cloud features are fused across the whole scene. Such practice lacks fine-grained region-level information, yielding suboptimal fusion performance. In this paper, we present the novel Local-to-Global fusion network (LoGoNet), which performs LiDAR-camera fusion at both local and global levels. Concretely, the Global Fusion (GoF) of LoGoNet is built upon previous literature, while we exclusively use point centroids to more precisely represent the position of voxel features, thus achieving better cross-modal alignment. As to the Local Fusion (LoF), we first divide each proposal into uniform grids and then project these grid centers to the images. The image features around the projected grid points are sampled to be fused with position-decorated point cloud features, maximally utilizing the rich contextual information around the proposals. The Feature Dynamic Aggregation (FDA) module is further proposed to achieve information interaction between these locally and globally fused features, thus producing more informative multi-modal features. Extensive experiments on both Waymo Open Dataset (WOD) and KITTI datasets show that LoGoNet outperforms all state-of-the-art 3D detection methods. Notably, LoGoNet ranks 1st on Waymo 3D object detection leaderboard and obtains 81.02 mAPH (L2) detection performance. It is noteworthy that, for the first time, the detection performance on three classes surpasses 80 APH (L2) simultaneously. Code will be available at https://github.com/sankin97/LoGoNet.

Representation Learning for Visual Object Tracking by Masked Appearance Transfer
Zhao, Haojie and Wang, Dong and Lu, Huchuan



Research question: This paper studies representation learning for visual object tracking.
Motivation: Most current trackers directly use ImageNet pre-trained representations, and few works study tracking-specific representation learning methods.
Method: A simple yet effective tracking-specific representation learning method, masked appearance transfer, is proposed based on an encoder-decoder architecture. The visual appearances of the template and the search region are first encoded jointly and then decoded separately. During decoding, the original search region image is reconstructed, while for the template the decoder is made to reconstruct the target appearance within the search region; through this target appearance transfer, tracking-specific representations are learned. The inputs are randomly masked, making the learned representations more discriminative.
Results: A simple, lightweight tracker is designed to evaluate the representations for both target localization and box regression. Extensive experiments show the method is effective, and the learned representations enable the simple tracker to achieve state-of-the-art performance on six datasets.

Visual representation plays an important role in visual object tracking. However, few works study the tracking-specified representation learning method. Most trackers directly use ImageNet pre-trained representations. In this paper, we propose masked appearance transfer, a simple but effective representation learning method for tracking, based on an encoder-decoder architecture. First, we encode the visual appearances of the template and search region jointly, and then we decode them separately. During decoding, the original search region image is reconstructed. However, for the template, we make the decoder reconstruct the target appearance within the search region. By this target appearance transfer, the tracking-specified representations are learned. We randomly mask out the inputs, thereby making the learned representations more discriminative. For sufficient evaluation, we design a simple and lightweight tracker that can evaluate the representation for both target localization and box regression. Extensive experiments show that the proposed method is effective, and the learned representations can enable the simple tracker to obtain state-of-the-art performance on six datasets.

Neural Fourier Filter Bank
Wu, Zhijie and Jin, Yuhe and Yi, Kwang Moo



Research question: Propose a novel method that provides efficient and highly detailed reconstructions.
Motivation: Inspired by wavelets, learn a neural field that decomposes the signal both spatially and frequency-wise.
Method: The method follows the recent grid-based paradigm for spatial decomposition, but encourages each grid to store specific frequencies via Fourier feature encodings. A multi-layer perceptron with sine activations then takes in these Fourier-encoded features at the appropriate layers, so that higher-frequency components are accumulated on top of lower-frequency components sequentially, and the results are summed to form the final output.
Results: On multiple tasks (2D image fitting, 3D shape reconstruction, and neural radiance fields), the method outperforms the state of the art in model compactness and convergence speed.

We present a novel method to provide efficient and highly detailed reconstructions. Inspired by wavelets, we learn a neural field that decomposes the signal both spatially and frequency-wise. We follow the recent grid-based paradigm for spatial decomposition, but unlike existing work, encourage specific frequencies to be stored in each grid via Fourier feature encodings. We then apply a multi-layer perceptron with sine activations, taking in these Fourier-encoded features at the appropriate layers so that higher-frequency components are accumulated on top of lower-frequency components sequentially, which we sum up to form the final output. We demonstrate that our method outperforms the state of the art regarding model compactness and convergence speed on multiple tasks: 2D image fitting, 3D shape reconstruction, and neural radiance fields. Our code is available at https://github.com/ubc-vision/NFFB.
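How Fourier feature encodings give each level its own frequency band can be sketched structurally. The per-level sine-activated MLP layers of the actual method are replaced below with fixed random projections, so this only illustrates the band-per-level accumulation, not the learned model:

```python
import numpy as np

def fourier_features(x, freq):
    """Sine/cosine encoding of coordinates at one frequency band."""
    ang = 2.0 * np.pi * freq * x
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)

def filter_bank(x, n_levels=4):
    """Each level stores one doubling frequency band; the per-level outputs
    are accumulated low-to-high and summed, wavelet-style. Random fixed
    projections stand in for the method's learned layers."""
    rng = np.random.RandomState(0)
    out = np.zeros((len(x), 1))
    for level in range(n_levels):
        feat = fourier_features(x, freq=2.0 ** level)   # (N, 2) for 1D input
        w = rng.randn(feat.shape[-1], 1) / feat.shape[-1]
        out = out + feat @ w                            # accumulate bands
    return out

y = filter_bank(np.linspace(0.0, 1.0, 100)[:, None])
```

The decomposition means coarse structure lives in the low-frequency levels and detail in the high ones, which is what makes the representation compact.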

Self-Supervised Non-Uniform Kernel Estimation With Flow-Based Motion Prior for Blind Image Deblurring
Fang, Zhenxuan and Wu, Fangfang and Dong, Weisheng and Li, Xin and Wu, Jinjian and Shi, Guangming



Research question: Existing deep learning methods for blurry image restoration ignore important prior information about motion blur, degrading performance in real-world scenarios.
Motivation: To address this, the paper proposes representing the field of motion blur kernels in a latent space via normalizing flows, and designing convolutional neural networks to predict latent codes instead of motion kernels.
Method: Uncertainty learning is introduced to improve the accuracy and robustness of non-uniform kernel estimation, and a multi-scale kernel attention module is proposed to better integrate image features with the estimated kernels.
Results: Extensive experiments, especially on real-world blur datasets, show the method achieves state-of-the-art results in both subjective and objective quality as well as in generalization for non-uniform image deblurring.

Many deep learning-based solutions to blind image deblurring estimate the blur representation and reconstruct the target image from its blurry observation. However, these methods suffer from severe performance degradation in real-world scenarios because they ignore important prior information about motion blur (e.g., real-world motion blur is diverse and spatially varying). Some methods have attempted to explicitly estimate non-uniform blur kernels by CNNs, but accurate estimation is still challenging due to the lack of ground truth about spatially varying blur kernels in real-world images. To address these issues, we propose to represent the field of motion blur kernels in a latent space by normalizing flows, and design CNNs to predict the latent codes instead of motion kernels. To further improve the accuracy and robustness of non-uniform kernel estimation, we introduce uncertainty learning into the process of estimating latent codes and propose a multi-scale kernel attention module to better integrate image features with estimated kernels. Extensive experimental results, especially on real-world blur datasets, demonstrate that our method achieves state-of-the-art results in terms of both subjective and objective quality as well as excellent generalization performance for non-uniform image deblurring. The code is available at https://see.xidian.edu.cn/faculty/wsdong/Projects/UFPNet.htm.

Burstormer: Burst Image Restoration and Enhancement Transformer
Dudhane, Akshay and Zamir, Syed Waqas and Khan, Salman and Khan, Fahad Shahbaz and Yang, Ming-Hsuan



Research question: How to correctly align the frames of a burst, which are misaligned by inevitable motion, and merge their complementary information into a single high-quality image.
Motivation: Existing techniques handle degradations such as blur from camera shake and misfocus poorly.
Method: The paper proposes Burstormer, a novel transformer-based architecture for burst image restoration and enhancement, which exploits multi-scale local and non-local features to achieve improved alignment and feature fusion.
Results: Experiments show Burstormer outperforms existing methods on burst super-resolution, burst denoising, and burst low-light enhancement.

On a shutter press, modern handheld cameras capture multiple images in rapid succession and merge them to generate a single image. However, individual frames in a burst are misaligned due to inevitable motions and contain multiple degradations. The challenge is to properly align the successive image shots and merge their complementary information to achieve high-quality outputs. Towards this direction, we propose Burstormer: a novel transformer-based architecture for burst image restoration and enhancement. In comparison to existing works, our approach exploits multi-scale local and non-local features to achieve improved alignment and feature fusion. Our key idea is to enable inter-frame communication in the burst neighborhoods for information aggregation and progressive fusion while modeling the burst-wide context. However, the input burst frames need to be properly aligned before fusing their information. Therefore, we propose an enhanced deformable alignment module for aligning burst features with regards to the reference frame. Unlike existing methods, the proposed alignment module not only aligns burst features but also exchanges feature information and maintains focused communication with the reference frame through the proposed reference-based feature enrichment mechanism, which facilitates handling complex motions. After multi-level alignment and enrichment, we re-emphasize inter-frame communication within the burst using a cyclic burst sampling module. Finally, the inter-frame information is aggregated using the proposed burst feature fusion module followed by progressive upsampling. Our Burstormer outperforms state-of-the-art methods on burst super-resolution, burst denoising and burst low-light enhancement. Our codes and pre-trained models are available at https://github.com/akshaydudhane16/Burstormer.

DeepMapping2: Self-Supervised Large-Scale LiDAR Map Optimization
Chen, Chao and Liu, Xinhao and Li, Yiming and Ding, Li and Feng, Chen



Research question: How to optimize large-scale LiDAR maps in a self-supervised manner; DeepMapping cannot produce satisfactory results on large-scale datasets with thousands of frames.
Motivation: The failure is due to the lack of loop closures and exact cross-frame point correspondences, and to the slow convergence of DeepMapping's global localization network.
Method: DeepMapping2 adds two novel techniques: (1) organizing training batches based on map topology obtained from loop closing, and (2) a self-supervised local-to-global point consistency loss that leverages pairwise registration.
Results: Experiments and ablation studies on public datasets such as KITTI, NCLT, and Nebula demonstrate the effectiveness of the method.

LiDAR mapping is important yet challenging in self-driving and mobile robotics. To tackle such a global point cloud registration problem, DeepMapping converts the complex map estimation into a self-supervised training of simple deep networks. Despite its broad convergence range on small datasets, DeepMapping still cannot produce satisfactory results on large-scale datasets with thousands of frames. This is due to the lack of loop closures and exact cross-frame point correspondences, and the slow convergence of its global localization network. We propose DeepMapping2 by adding two novel techniques to address these issues: (1) organization of training batch based on map topology from loop closing, and (2) self-supervised local-to-global point consistency loss leveraging pairwise registration. Our experiments and ablation studies on public datasets such as KITTI, NCLT, and Nebula, demonstrate the effectiveness of our method.

WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding
Li, Mengze and Wang, Han and Zhang, Wenqiao and Miao, Jiaxu and Zhao, Zhou and Zhang, Shengyu and Ji, Wei and Wu, Fei



Research question: This paper addresses video-language alignment, i.e., localizing the visual tube corresponding to a language query.
Motivation: Existing techniques require dense boundary and bounding box annotations to achieve such alignment, which can be prohibitively expensive. To bridge the gap, the weakly-supervised setting is studied, where the model learns from easily accessible video-language data without annotations.
Method: A novel framework, WINNER, is proposed for hierarchical video-text understanding. WINNER first builds a language decomposition tree in a bottom-up manner; on top of it, a structural attention mechanism and top-down feature backtracking jointly build a multi-modal decomposition tree, enabling hierarchical understanding of unstructured videos. The multi-modal decomposition tree then serves as the basis for multi-hierarchy language-tube matching.
Results: A hierarchical contrastive learning objective is designed to learn multi-hierarchy correspondence and distinguishment with intra-sample and inter-sample video-text decomposition structures, achieving video-language decomposition structure alignment. Extensive experiments show the method surpasses state-of-the-art weakly-supervised methods, and even some supervised methods.

Spatio-temporal video grounding aims to localize the aligned visual tube corresponding to a language query. Existing techniques achieve such alignment by exploiting dense boundary and bounding box annotations, which can be prohibitively expensive. To bridge the gap, we investigate the weakly-supervised setting, where models learn from easily accessible video-language data without annotations. We identify that intra-sample spurious correlations among video-language components can be alleviated if the model captures the decomposed structures of video and language data. In this light, we propose a novel framework, namely WINNER, for hierarchical video-text understanding. WINNER first builds the language decomposition tree in a bottom-up manner, upon which the structural attention mechanism and top-down feature backtracking jointly build a multi-modal decomposition tree, permitting a hierarchical understanding of unstructured videos. The multi-modal decomposition tree serves as the basis for multi-hierarchy language-tube matching. A hierarchical contrastive learning objective is proposed to learn the multi-hierarchy correspondence and distinguishment with intra-sample and inter-sample video-text decomposition structures, achieving video-language decomposition structure alignment. Extensive experiments demonstrate the rationality of our design and its effectiveness beyond state-of-the-art weakly supervised methods, even some supervised methods.

Decompose More and Aggregate Better: Two Closer Looks at Frequency Representation Learning for Human Motion Prediction
Gao, Xuehao and Du, Shaoyi and Wu, Yang and Yang, Yang



Research question: How to effectively convert motion representations from the original pose space into the frequency space for human motion prediction.
Motivation: Encouraged by the effectiveness of encoding temporal dynamics in the frequency domain, recent systems prefer to first convert the motion representation from the original pose space into the frequency space.
Method: Two powerful units factorize the frequency representation learning task with a novel decomposition-aggregation two-stage strategy: (1) a frequency decomposition unit unweaves multi-view frequency representations from the input body motion by embedding its frequency features into multiple spaces; (2) a feature aggregation unit deploys a series of intra-space and inter-space feature aggregation layers to collect comprehensive frequency representations from these spaces for robust human motion prediction.
Results: Evaluated on large-scale datasets, the resulting strong baseline model for human motion prediction outperforms state-of-the-art methods by large margins: 8%-12% on Human3.6M, 3%-7% on CMU MoCap, and 7%-10% on 3DPW.

Encouraged by the effectiveness of encoding temporal dynamics within the frequency domain, recent human motion prediction systems prefer to first convert the motion representation from the original pose space into the frequency space. In this paper, we introduce two closer looks at effective frequency representation learning for robust motion prediction and summarize them as: decompose more and aggregate better. Motivated by these two insights, we develop two powerful units that factorize the frequency representation learning task with a novel decomposition-aggregation two-stage strategy: (1) frequency decomposition unit unweaves multi-view frequency representations from an input body motion by embedding its frequency features into multiple spaces; (2) feature aggregation unit deploys a series of intra-space and inter-space feature aggregation layers to collect comprehensive frequency representations from these spaces for robust human motion prediction. As evaluated on large-scale datasets, we develop a strong baseline model for the human motion prediction task that outperforms state-of-the-art methods by large margins: 8%-12% on Human3.6M, 3%-7% on CMU MoCap, and 7%-10% on 3DPW.
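The pose-to-frequency conversion this literature starts from is typically a fixed orthonormal transform such as the discrete cosine transform; the abstract does not state which transform this paper uses, so the DCT sketch below only illustrates the general pose-space-to-frequency-space step that the decomposition units then operate on:

```python
import numpy as np

def dct_basis(t):
    """Orthonormal DCT-II basis; rows are frequency components. Projecting a
    joint trajectory onto such a basis moves it from pose space (one value
    per frame) to frequency space (one coefficient per band)."""
    k = np.arange(t)[:, None]
    n = np.arange(t)[None, :]
    basis = np.sqrt(2.0 / t) * np.cos(np.pi * (2 * n + 1) * k / (2 * t))
    basis[0] /= np.sqrt(2.0)   # DC row scaling makes the basis orthonormal
    return basis

traj = np.sin(np.linspace(0.0, 3.0, 16))[:, None]  # toy 1-DoF joint trajectory
B = dct_basis(16)
coeffs = B @ traj        # pose space -> frequency space
recon = B.T @ coeffs     # exact inverse, since B is orthonormal
```

Because the transform is lossless and orthonormal, any "decompose more" strategy is free to split the coefficients into bands without losing information about the motion.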

BAAM: Monocular 3D Pose and Shape Reconstruction With Bi-Contextual Attention Module and Attention-Guided Modeling
Lee, Hyo-Jun and Kim, Hanul and Choi, Su-Min and Jeong, Seong-Gyun and Koh, Yeong Jun



Research question: How to effectively reconstruct the shape and pose of car objects in a 3D traffic scene while accounting for relative inter-object context and the scene environment.
Motivation: Current research pays relatively little attention to reconstructing detailed shapes, and most methods treat each 3D object as independent, losing the relative inter-object context and the scene context reflecting road circumstances.
Method: This paper proposes a novel monocular 3D pose and shape reconstruction algorithm based on bi-contextual attention and attention-guided modeling (BAAM). First, given 2D primitives, 3D object shapes are reconstructed via attention-guided modeling that considers the relevance between detected objects and vehicle shape priors. Next, 3D object poses are estimated through bi-contextual attention, which leverages the relational context between objects and the scene context between an object and the road environment. Finally, a 3D non-maximum suppression algorithm based on bird's-eye-view distance is proposed to eliminate spurious objects.
Results: Extensive experiments show that BAAM achieves state-of-the-art performance on ApolloCar3D. Moreover, BAAM can be plugged into any mature monocular 3D object detector on KITTI and significantly boost its performance.

A 3D traffic scene comprises various 3D information about car objects, including their pose and shape. However, most recent studies pay relatively less attention to reconstructing detailed shapes. Furthermore, most of them treat each 3D object as an independent one, resulting in the loss of relative inter-object context and of scene context reflecting road circumstances. A novel monocular 3D pose and shape reconstruction algorithm, based on bi-contextual attention and attention-guided modeling (BAAM), is proposed in this work. First, given 2D primitives, we reconstruct 3D object shape based on attention-guided modeling that considers the relevance between detected objects and vehicle shape priors. Next, we estimate 3D object pose through bi-contextual attention, which leverages relational context between objects and scene context between an object and the road environment. Finally, we propose a 3D non-maximum suppression algorithm to eliminate spurious objects based on their bird's-eye-view distance. Extensive experiments demonstrate that the proposed BAAM yields state-of-the-art performance on ApolloCar3D. Also, they show that the proposed BAAM can be plugged into any mature monocular 3D object detector on KITTI and significantly boost its performance.

Visual Dependency Transformers: Dependency Tree Emerges From Reversed Attention
Ding, Mingyu and Shen, Yikang and Fan, Lijie and Chen, Zhenfang and Chen, Zitian and Luo, Ping and Tenenbaum, Joshua B. and Gan, Chuang



Research question: How to mimic the human visual system in extracting structured representations of the entities and parts in an image and obtaining the dependencies between them.
Motivation: Current models lack the ability to capture long-range dependencies when extracting structured information from images.
Method: A novel reversed attention mechanism is proposed: formulated as a dependency graph, child tokens attend to their parent tokens and send information following a normalized probability distribution, naturally capturing long-range visual dependencies between image patches.
Results: The method progressively induces a dependency tree from leaf nodes to the root node without supervision, represents entities and their parts in an image by different subtrees, supports dynamic visual pooling, and performs well across multiple datasets and tasks.

Humans possess a versatile mechanism for extracting structured representations of our visual world. When looking at an image, we can decompose the scene into entities and their parts as well as obtain the dependencies between them. To mimic such capability, we propose Visual Dependency Transformers (DependencyViT) that can induce visual dependencies without any labels. We achieve that with a novel neural operator called reversed attention that can naturally capture long-range visual dependencies between image patches. Specifically, we formulate it as a dependency graph where a child token in reversed attention is trained to attend to its parent tokens and send information following a normalized probability distribution rather than gathering information in conventional self-attention. With such a design, hierarchies naturally emerge from reversed attention layers, and a dependency tree is progressively induced from leaf nodes to the root node unsupervisedly. DependencyViT offers several appealing benefits. (i) Entities and their parts in an image are represented by different subtrees, enabling part partitioning from dependencies; (ii) Dynamic visual pooling is made possible. The leaf nodes which rarely send messages can be pruned without hindering the model performance, based on which we propose the lightweight DependencyViT-Lite to reduce the computational and memory footprints; (iii) DependencyViT works well on both self- and weakly-supervised pretraining paradigms on ImageNet, and demonstrates its effectiveness on 8 datasets and 5 tasks, such as unsupervised part and saliency segmentation, recognition, and detection.
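The reversal relative to standard self-attention amounts to normalizing the routing matrix over candidate parents for each child and then transposing it before aggregation, so messages flow child to parent rather than being gathered. A NumPy sketch without the learned projections (illustrative only):

```python
import numpy as np

def reversed_attention(x):
    """Each child token forms a probability distribution over candidate
    parents and *sends* its feature along it, instead of gathering as in
    conventional self-attention. x: (N tokens, C channels)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    send = e / e.sum(axis=1, keepdims=True)  # row i: child i's parent distribution
    # Transposing the routing matrix makes parent j accumulate exactly what
    # every child routed to it.
    return send.T @ x

x = np.random.RandomState(0).rand(10, 4)
out = reversed_attention(x)
```

A useful property of this send-style aggregation is that total feature mass is conserved: since each child distributes exactly its own feature, the column sums of the output equal those of the input, and tokens that send almost nothing anywhere (near-uniform, low-weight rows) are the prunable leaves.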

SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy
Li, Jiafeng and Wen, Ying and He, Lianghua



Research question: How to reduce the computational cost of feature extraction in convolutional neural networks.
Motivation: Although existing convolutional neural networks perform well on computer vision tasks, they demand enormous computational resources, partly because convolutional layers extract redundant features.
Method: An efficient convolution module, SCConv (Spatial and Channel reconstruction Convolution), is proposed to exploit the spatial and channel redundancy among features, decreasing redundant computation and facilitating representative feature learning.
Results: Experiments show that SCConv-embedded models achieve better performance by reducing redundant features with significantly lower complexity and computational cost.

Convolutional Neural Networks (CNNs) have achieved remarkable performance in various computer vision tasks but this comes at the cost of tremendous computational resources, partly due to convolutional layers extracting redundant features. Recent works either compress well-trained large-scale models or explore well-designed lightweight models. In this paper, we make an attempt to exploit spatial and channel redundancy among features for CNN compression and propose an efficient convolution module, called SCConv (Spatial and Channel reconstruction Convolution), to decrease redundant computing and facilitate representative feature learning. The proposed SCConv consists of two units: spatial reconstruction unit (SRU) and channel reconstruction unit (CRU). SRU utilizes a separate-and-reconstruct method to suppress the spatial redundancy while CRU uses a split-transform-and-fuse strategy to diminish the channel redundancy. In addition, SCConv is a plug-and-play architectural unit that can be used to replace standard convolution in various convolutional neural networks directly. Experimental results show that SCConv-embedded models are able to achieve better performance by reducing redundant features with significantly lower complexity and computational costs.

Probability-Based Global Cross-Modal Upsampling for Pansharpening
Zhu, Zeyu and Cao, Xiangyong and Zhou, Man and Huang, Junhao and Meng, Deyu



Research question: In the upsampling step of current deep learning methods for remote sensing image processing, only the local information of the low-resolution multispectral image is used, neglecting its global information and the cross-modal information of the guiding panchromatic image, which limits performance.
Motivation: To address this issue, this paper develops a novel probability-based global cross-modal upsampling (PGCU) method for pan-sharpening.
Method: The PGCU method is first formulated from a probabilistic perspective, and an efficient network module is then designed to implement it, fully utilizing the information above while accounting for channel specificity. The PGCU module consists of three blocks: information extraction (IE), distribution and expectation estimation (DEE), and fine adjustment (FA).
Results: Extensive experiments verify that PGCU outperforms other popular upsampling methods, and the PGCU module also helps improve existing state-of-the-art deep learning pansharpening methods. Code is available at https://github.com/Zeyu-Zhu/PGCU.

Pansharpening is an essential preprocessing step for remote sensing image processing. Although deep learning (DL) approaches performed well on this task, current upsampling methods used in these approaches only utilize the local information of each pixel in the low-resolution multispectral (LRMS) image while neglecting to exploit its global information as well as the cross-modal information of the guiding panchromatic (PAN) image, which limits their performance improvement. To address this issue, this paper develops a novel probability-based global cross-modal upsampling (PGCU) method for pan-sharpening. Precisely, we first formulate the PGCU method from a probabilistic perspective and then design an efficient network module to implement it by fully utilizing the information mentioned above while simultaneously considering the channel specificity. The PGCU module consists of three blocks, i.e., information extraction (IE), distribution and expectation estimation (DEE), and fine adjustment (FA). Extensive experiments verify the superiority of the PGCU method compared with other popular upsampling methods. Additionally, experiments also show that the PGCU module can help improve the performance of existing SOTA deep learning pansharpening methods. The codes are available at https://github.com/Zeyu-Zhu/PGCU.

PHA: Patch-Wise High-Frequency Augmentation for Transformer-Based Person Re-Identification
Zhang, Guiwei and Zhang, Yongfei and Zhang, Tianyu and Li, Bo and Pu, Shiliang



Research question: Although empirical studies show that injecting convolutional neural networks (CNNs) into Vision Transformers (ViTs) improves person re-identification, the rationale behind this remains unclear.
Motivation: From a frequency perspective, we find that ViTs are worse than CNNs at preserving key high-frequency components (e.g., clothing texture details), because high-frequency components are inevitably diluted by low-frequency ones due to the intrinsic self-attention within ViTs.
Method: We propose a Patch-wise High-frequency Augmentation (PHA) method with two core designs. First, to enhance the feature representation of high-frequency components, we split patches containing high-frequency components with the Discrete Haar Wavelet Transform and let the ViT take the split patches as auxiliary input. Second, to prevent high-frequency components from being diluted by low-frequency ones when the entire sequence is taken as input during optimization, we propose a novel patch-wise contrastive loss, which, from the view of gradient optimization, acts as an implicit augmentation that improves the representation of key high-frequency components.
Results: Extensive experiments on widely used ReID datasets validate the effectiveness of our method.

Although recent studies empirically show that injecting Convolutional Neural Networks (CNNs) into Vision Transformers (ViTs) can improve the performance of person re-identification, the rationale behind it remains elusive. From a frequency perspective, we reveal that ViTs perform worse than CNNs in preserving key high-frequency components (e.g, clothes texture details) since high-frequency components are inevitably diluted by low-frequency ones due to the intrinsic Self-Attention within ViTs. To remedy such inadequacy of the ViT, we propose a Patch-wise High-frequency Augmentation (PHA) method with two core designs. First, to enhance the feature representation ability of high-frequency components, we split patches with high-frequency components by the Discrete Haar Wavelet Transform, then empower the ViT to take the split patches as auxiliary input. Second, to prevent high-frequency components from being diluted by low-frequency ones when taking the entire sequence as input during network optimization, we propose a novel patch-wise contrastive loss. From the view of gradient optimization, it acts as an implicit augmentation to improve the representation ability of key high-frequency components. This benefits the ViT to capture key high-frequency components to extract discriminative person representations. PHA is necessary during training and can be removed during inference, without bringing extra complexity. Extensive experiments on widely-used ReID datasets validate the effectiveness of our method.
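The Discrete Haar Wavelet Transform used above to isolate high-frequency content can be sketched in a few lines of numpy (a single-level, orthogonality-unnormalized variant for illustration; the paper's exact transform and patch handling may differ). The LH/HL/HH bands carry the high-frequency detail that PHA feeds to the ViT as auxiliary input.

```python
import numpy as np

def haar_dwt2(img):
    """Single-level 2D Haar transform: returns the low-frequency band (LL)
    and the three high-frequency bands (LH, HL, HH), each at half resolution."""
    a = (img[0::2] + img[1::2]) / 2.0   # row averages
    d = (img[0::2] - img[1::2]) / 2.0   # row differences
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

img = np.ones((4, 4))                   # a flat patch has no high frequencies
ll, lh, hl, hh = haar_dwt2(img)
```

On a constant patch the three high-frequency bands are exactly zero, which is the property the frequency-based analysis above relies on: texture detail lives only in LH/HL/HH.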

AnchorFormer: Point Cloud Completion From Discriminative Nodes
Chen, Zhikai and Long, Fuchen and Qiu, Zhaofan and Yao, Ting and Zhou, Wengang and Luo, Jiebo and Mei, Tao



Research question: This paper addresses the shape-generation quality problem in point cloud completion caused by a single global feature vector failing to characterize the diverse patterns of an object.
Motivation: Current completion methods typically encode the observed points into a global feature vector and predict the complete points through a generative process on that vector, which can degrade shape quality because one global vector cannot sufficiently characterize an object's diverse patterns.
Method: We present AnchorFormer, a new completion architecture that leverages pattern-aware discriminative nodes (anchors) to dynamically capture regional information. AnchorFormer learns a set of anchors from the point features of the partial observation; these anchors are scattered to both observed and unobserved locations via estimated offsets and, together with down-sampled input points, form a set of sparse points. To reconstruct fine-grained object patterns, a modulation scheme morphs a canonical 2D grid at each sparse point into a detailed 3D structure.
Results: Extensive experiments on the PCN, ShapeNet-55/34, and KITTI datasets quantitatively and qualitatively demonstrate the superiority of AnchorFormer over state-of-the-art point cloud completion methods.

Point cloud completion aims to recover the completed 3D shape of an object from its partial observation. A common strategy is to encode the observed points to a global feature vector and then predict the complete points through a generative process on this vector. Nevertheless, the results may suffer from the high-quality shape generation problem due to the fact that a global feature vector cannot sufficiently characterize diverse patterns in one object. In this paper, we present a new shape completion architecture, namely AnchorFormer, that innovatively leverages pattern-aware discriminative nodes, i.e., anchors, to dynamically capture regional information of objects. Technically, AnchorFormer models the regional discrimination by learning a set of anchors based on the point features of the input partial observation. Such anchors are scattered to both observed and unobserved locations through estimating particular offsets, and form sparse points together with the down-sampled points of the input observation. To reconstruct the fine-grained object patterns, AnchorFormer further employs a modulation scheme to morph a canonical 2D grid at individual locations of the sparse points into a detailed 3D structure. Extensive experiments on the PCN, ShapeNet-55/34 and KITTI datasets quantitatively and qualitatively demonstrate the efficacy of AnchorFormer over the state-of-the-art point cloud completion approaches. Source code is available at https://github.com/chenzhik/AnchorFormer.

PillarNeXt: Rethinking Network Designs for 3D Object Detection in LiDAR Point Clouds
Li, Jinyu and Luo, Chenxu and Yang, Xiaodong



Research question: How to handle sparse, unstructured raw point clouds effectively and improve LiDAR-based 3D object detection.
Motivation: Most existing 3D detection research focuses on designing dedicated local point aggregators for fine-grained geometric modeling, yet this does not necessarily pay off in accuracy or latency.
Method: This paper revisits local point aggregators from the perspective of allocating computational resources and finds that the simplest pillar-based models perform surprisingly well in both accuracy and latency. Minimal adaptations borrowed from the success of 2D object detection, such as enlarging the receptive field, further boost performance significantly.
Results: Extensive experiments show that pillar-based networks with modernized architecture and training achieve state-of-the-art performance on two popular benchmarks (Waymo Open Dataset and nuScenes), challenging the common intuition that detailed geometric modeling is essential for high-performance 3D object detection.

In order to deal with the sparse and unstructured raw point clouds, most LiDAR based 3D object detection research focuses on designing dedicated local point aggregators for fine-grained geometrical modeling. In this paper, we revisit the local point aggregators from the perspective of allocating computational resources. We find that the simplest pillar based models perform surprisingly well considering both accuracy and latency. Additionally, we show that minimal adaptions from the success of 2D object detection, such as enlarging receptive field, significantly boost the performance. Extensive experiments reveal that our pillar based networks with modernized designs in terms of architecture and training render the state-of-the-art performance on two popular benchmarks: Waymo Open Dataset and nuScenes. Our results challenge the common intuition that detailed geometry modeling is essential to achieve high performance for 3D object detection.
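The pillar representation the paper builds on can be sketched minimally in numpy (illustrative only; real pillar encoders like PointPillars learn per-point features before pooling, whereas this sketch just mean-pools one attribute per BEV cell). Each point is binned by its (x, y) coordinates into a 2D grid, turning the sparse point cloud into a dense BEV map that standard 2D convolutions can process.

```python
import numpy as np

def pillarize(points, pillar_size, grid_shape):
    """Scatter (x, y, z) points into BEV pillars and mean-pool z per pillar."""
    ix = np.clip((points[:, 0] / pillar_size).astype(int), 0, grid_shape[0] - 1)
    iy = np.clip((points[:, 1] / pillar_size).astype(int), 0, grid_shape[1] - 1)
    flat = ix * grid_shape[1] + iy                 # flattened pillar index
    counts = np.zeros(grid_shape[0] * grid_shape[1])
    sums = np.zeros(grid_shape[0] * grid_shape[1])
    np.add.at(counts, flat, 1.0)                   # unbuffered scatter-add
    np.add.at(sums, flat, points[:, 2])            # pool the height attribute
    nz = counts > 0
    sums[nz] /= counts[nz]                         # mean over points per pillar
    return sums.reshape(grid_shape)

pts = np.array([[0.1, 0.1, 1.0],
                [0.2, 0.3, 3.0],                   # same pillar as the first point
                [0.8, 0.8, 5.0]])
bev = pillarize(pts, pillar_size=0.5, grid_shape=(2, 2))
```

Empty pillars stay zero, so the resulting BEV map is dense and fixed-size regardless of how unevenly the LiDAR points are distributed.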

BEV-LaneDet: An Efficient 3D Lane Detection Based on Virtual Camera via Key-Points
Wang, Ruihao and Qin, Jian and Li, Kaiying and Li, Yaochen and Cao, Dong and Xu, Jintao



Research question: How to perform 3D lane detection effectively, a key problem for vehicle path planning in autonomous driving.
Motivation: Existing 3D lane detection methods have limited practicality due to their complicated spatial transformations and rigid representations of 3D lanes.
Method: We propose an efficient and robust monocular 3D lane detection method, BEV-LaneDet, with three main contributions. First, we introduce a Virtual Camera that unifies the intrinsic and extrinsic parameters of cameras mounted on different vehicles, guaranteeing consistent spatial relationships among cameras. Second, we propose a simple but effective 3D lane representation, the Key-Points Representation, which is better suited to complicated and diverse 3D lane structures. Third, we present a lightweight, chip-friendly spatial transformation module, the Spatial Transformation Pyramid, to transform multi-scale front-view features into BEV features.
Results: Experiments show our method surpasses the state of the art in F-Score, by 10.6% on the OpenLane dataset and 4.0% on the Apollo 3D synthetic dataset, while running at 185 FPS. Code is released on GitHub.

3D lane detection which plays a crucial role in vehicle routing, has recently been a rapidly developing topic in autonomous driving. Previous works struggle with practicality due to their complicated spatial transformations and inflexible representations of 3D lanes. Faced with the issues, our work proposes an efficient and robust monocular 3D lane detection called BEV-LaneDet with three main contributions. First, we introduce the Virtual Camera that unifies the in/extrinsic parameters of cameras mounted on different vehicles to guarantee the consistency of the spatial relationship among cameras. It can effectively promote the learning procedure due to the unified visual space. We secondly propose a simple but efficient 3D lane representation called Key-Points Representation. This module is more suitable to represent the complicated and diverse 3D lane structures. At last, we present a light-weight and chip-friendly spatial transformation module named Spatial Transformation Pyramid to transform multiscale front-view features into BEV features. Experimental results demonstrate that our work outperforms the state-of-the-art approaches in terms of F-Score, being 10.6% higher on the OpenLane dataset and 4.0% higher on the Apollo 3D synthetic dataset, with a speed of 185 FPS. Code is released at https://github.com/gigo-team/bev_lane_det.

Self-Supervised 3D Scene Flow Estimation Guided by Superpoints
Shen, Yaqi and Hui, Le and Xie, Jin and Yang, Jian



Research question: This paper addresses the problem that superpoints generated by offline clustering in existing 3D scene flow estimation methods cannot accurately capture local regions with similar motion in complex 3D scenes.
Motivation: Because current methods generate superpoints with offline clustering, they fail to characterize similarly moving local regions in complex scenes, leading to inaccurate scene flow estimation.
Method: We propose an end-to-end iterative superpoint-based scene flow estimation framework with a flow-guided superpoint generation module and a superpoint-guided flow refinement module. In the generation module, bidirectional flow from the previous iteration is used to match points to superpoint centers, building soft point-to-superpoint associations and generating superpoints for the paired point clouds. The generated superpoints are then used to reconstruct each point's flow and to encode the consistency between the reconstructed flows of the paired clouds; finally, the consistency encoding and the reconstructed flow are fed into a GRU to refine the point-level flow.
Results: Extensive experiments on several datasets show that the method achieves promising performance.

3D scene flow estimation aims to estimate point-wise motions between two consecutive frames of point clouds. Superpoints, i.e., points with similar geometric features, are usually employed to capture similar motions of local regions in 3D scenes for scene flow estimation. However, in existing methods, superpoints are generated with the offline clustering methods, which cannot characterize local regions with similar motions for complex 3D scenes well, leading to inaccurate scene flow estimation. To this end, we propose an iterative end-to-end superpoint based scene flow estimation framework, where the superpoints can be dynamically updated to guide the point-level flow prediction. Specifically, our framework consists of a flow guided superpoint generation module and a superpoint guided flow refinement module. In our superpoint generation module, we utilize the bidirectional flow information at the previous iteration to obtain the matching points of points and superpoint centers for soft point-to-superpoint association construction, in which the superpoints are generated for pairwise point clouds. With the generated superpoints, we first reconstruct the flow for each point by adaptively aggregating the superpoint-level flow, and then encode the consistency between the reconstructed flow of pairwise point clouds. Finally, we feed the consistency encoding along with the reconstructed flow into GRU to refine point-level flow. Extensive experiments on several different datasets show that our method can achieve promising performance.

Guided Depth Super-Resolution by Deep Anisotropic Diffusion
Metzger, Nando and Daudt, Rodrigo Caye and Schindler, Konrad



Research question: How to super-resolve a depth image using guidance from an RGB image.
Motivation: Although deep learning methods achieve good results on this problem, recent work shows that combining modern methods with more formal frameworks can further improve performance.
Method: A novel approach that combines guided anisotropic diffusion with a deep convolutional network, advancing the state of the art in guided depth super-resolution. The edge-transferring/enhancing properties of the diffusion are boosted by the contextual reasoning of modern networks, and a strict adjustment step guarantees adherence to the source image.
Results: The method achieves unprecedented results on three commonly used guided depth super-resolution benchmarks; the gain over other methods is largest at large scales such as x32. Code will be released to promote reproducibility.

Performing super-resolution of a depth image using the guidance from an RGB image is a problem that concerns several fields, such as robotics, medical imaging, and remote sensing. While deep learning methods have achieved good results in this problem, recent work highlighted the value of combining modern methods with more formal frameworks. In this work we propose a novel approach which combines guided anisotropic diffusion with a deep convolutional network and advances the state of the art for guided depth super-resolution. The edge transferring/enhancing properties of the diffusion are boosted by the contextual reasoning capabilities of modern networks, and a strict adjustment step guarantees perfect adherence to the source image. We achieve unprecedented results in three commonly used benchmarks for guided depth super resolution. The performance gain compared to other methods is the largest at larger scales, such as x32 scaling. Code for the proposed method will be made available to promote reproducibility of our results.
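The guided anisotropic diffusion at the core of the method can be sketched as a single explicit update step in numpy (a Perona-Malik-style toy with periodic boundaries, not the paper's scheme; in the paper the conduction weights come from learned network features rather than raw guide gradients). Smoothing of the depth map is suppressed wherever the guide image has strong edges.

```python
import numpy as np

def guided_diffusion_step(depth, guide, lam=0.1, kappa=0.1):
    """One explicit anisotropic diffusion step: 4-neighbor smoothing of
    `depth`, with conduction weights derived from the guide's gradients so
    that diffusion stops at guide (RGB) edges. np.roll wraps at borders."""
    out = depth.copy()
    for axis, shift in [(0, 1), (0, -1), (1, 1), (1, -1)]:
        d_n = np.roll(depth, shift, axis=axis) - depth   # neighbor difference
        g_n = np.roll(guide, shift, axis=axis) - guide
        c = np.exp(-((g_n / kappa) ** 2))                # small across guide edges
        out += lam * c * d_n
    return out

rng = np.random.default_rng(0)
flat_depth = np.ones((8, 8))
guide = rng.standard_normal((8, 8))
out_flat = guided_diffusion_step(flat_depth, guide)      # flat input: no change

noisy_depth = rng.standard_normal((8, 8))
flat_guide = np.zeros((8, 8))
out_smooth = guided_diffusion_step(noisy_depth, flat_guide)  # pure smoothing
```

Iterating this step upsamples coarsely initialized depth while transferring the guide's edge structure; the paper's strict adjustment step additionally re-imposes consistency with the low-resolution source.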

Edge-Aware Regional Message Passing Controller for Image Forgery Localization
Li, Dong and Zhu, Jiaying and Wang, Menglu and Liu, Jiawei and Fu, Xueyang and Zha, Zheng-Jun



Research question: This paper addresses the feature-coupling problem in deep-learning-based image forgery localization.
Motivation: Despite the remarkable progress of deep learning methods, severe feature coupling between forged and authentic regions persists in image forgery localization.
Method: A two-step edge-aware regional message passing controlling strategy is proposed. The first step fully exploits edge information through context-enhanced graph construction and a threshold-adaptive differentiable binarization edge algorithm; in the second step, guided by the learnable edges, a region message passing controller is devised to weaken message passing between the forged and authentic regions.
Results: Experiments show the method outperforms state-of-the-art image forgery localization methods on several challenging benchmarks.

Digital image authenticity has promoted research on image forgery localization. Although deep learning-based methods achieve remarkable progress, most of them usually suffer from severe feature coupling between the forged and authentic regions. In this work, we propose a two-step Edge-aware Regional Message Passing Controlling strategy to address the above issue. Specifically, the first step is to account for fully exploiting the edge information. It consists of two core designs: context-enhanced graph construction and threshold-adaptive differentiable binarization edge algorithm. The former assembles the global semantic information to distinguish the features between the forged and authentic regions, while the latter stands on the output of the former to provide the learnable edges. In the second step, guided by the learnable edges, a region message passing controller is devised to weaken the message passing between the forged and authentic regions. In this way, our ERMPC is capable of explicitly modeling the inconsistency between the forged and authentic regions and enabling it to perform well on refined forged images. Extensive experiments on several challenging benchmarks show that our method is superior to state-of-the-art image forgery localization methods qualitatively and quantitatively.

Frequency-Modulated Point Cloud Rendering With Easy Editing
Zhang, Yi and Huang, Xiaoyang and Ni, Bingbing and Li, Teng and Zhang, Wenjun



Research question: How to develop an effective point cloud rendering pipeline for novel view synthesis that offers high-fidelity local detail reconstruction, real-time rendering, and user-friendly editing.
Motivation: Existing methods lack frequency expressive ability and require substantial computational resources.
Method: We propose an adaptive frequency modulation module, Adaptive Frequency Net (AFNet), which uses a hypernetwork to learn local texture frequency encodings that are injected into adaptive frequency activation layers to modulate the implicit radiance signal. A preprocessing module further optimizes the point cloud geometry via point opacity estimation.
Results: Extensive experiments on the NeRF-Synthetic, ScanNet, DTU, and Tanks and Temples datasets show our method outperforms the state of the art in PSNR, SSIM, and LPIPS.

We develop an effective point cloud rendering pipeline for novel view synthesis, which enables high fidelity local detail reconstruction, real-time rendering and user-friendly editing. In the heart of our pipeline is an adaptive frequency modulation module called Adaptive Frequency Net (AFNet), which utilizes a hypernetwork to learn the local texture frequency encoding that is consecutively injected into adaptive frequency activation layers to modulate the implicit radiance signal. This mechanism improves the frequency expressive ability of the network with richer frequency basis support, only at a small computational budget. To further boost performance, a preprocessing module is also proposed for point cloud geometry optimization via point opacity estimation. In contrast to implicit rendering, our pipeline supports high-fidelity interactive editing based on point cloud manipulation. Extensive experimental results on NeRF-Synthetic, ScanNet, DTU and Tanks and Temples datasets demonstrate the superior performances achieved by our method in terms of PSNR, SSIM and LPIPS, in comparison to the state-of-the-art.

SE-ORNet: Self-Ensembling Orientation-Aware Network for Unsupervised Point Cloud Shape Correspondence
Deng, Jiacheng and Wang, Chuxin and Lu, Jiahao and He, Jianfeng and Zhang, Tianzhu and Yu, Jiyang and Zhang, Zhe



Research question: How to obtain dense point-to-point correspondences between point clouds while handling both the severe mispredictions caused by the bilateral symmetry and varied orientations of humans and animals, and the disruption from point cloud noise.
Motivation: Current unsupervised point cloud shape correspondence methods suffer from severe mispredictions of symmetrical parts and are sensitive to noise.
Method: We propose SE-ORNet, a self-ensembling orientation-aware network. It exploits an orientation estimation module with a domain adaptive discriminator to align the orientations of point cloud pairs, significantly alleviating mispredictions of symmetrical parts. We also design a self-ensembling framework that perturbs the inputs of the student and teacher networks with different data augmentations and constrains the consistency of their predictions, overcoming the disturbance of point cloud noise.
Results: Extensive experiments on both human and animal datasets show that SE-ORNet surpasses state-of-the-art unsupervised point cloud shape correspondence methods.

Unsupervised point cloud shape correspondence aims to obtain dense point-to-point correspondences between point clouds without manually annotated pairs. However, humans and some animals have bilateral symmetry and various orientations, which leads to severe mispredictions of symmetrical parts. Besides, point cloud noise disrupts consistent representations for point cloud and thus degrades the shape correspondence accuracy. To address the above issues, we propose a Self-Ensembling ORientation-aware Network termed SE-ORNet. The key of our approach is to exploit an orientation estimation module with a domain adaptive discriminator to align the orientations of point cloud pairs, which significantly alleviates the mispredictions of symmetrical parts. Additionally, we design a self-ensembling framework for unsupervised point cloud shape correspondence. In this framework, the disturbances of point cloud noise are overcome by perturbing the inputs of the student and teacher networks with different data augmentations and constraining the consistency of predictions. Extensive experiments on both human and animal datasets show that our SE-ORNet can surpass state-of-the-art unsupervised point cloud shape correspondence methods.

Raw Image Reconstruction With Learned Compact Metadata
Wang, Yufei and Yu, Yi and Yang, Wenhan and Guo, Lanqing and Chau, Lap-Pui and Kot, Alex C. and Wen, Bihan



Research question: How to compress raw images effectively while preserving high-quality image reconstruction.
Motivation: Although raw images offer advantages such as linearity and fine-grained quantization levels, they are not widely used by common users due to their large storage requirements.
Method: A novel framework is proposed that learns a compact latent representation serving as metadata in an end-to-end manner, together with a novel sRGB-guided context model with improved entropy estimation strategies.
Results: Experiments show the method achieves superior raw image reconstruction with a smaller metadata size on both uncompressed sRGB images and JPEG images.

While raw images exhibit advantages over sRGB images (e.g. linearity and fine-grained quantization level), they are not widely used by common users due to the large storage requirements. Very recent works propose to compress raw images by designing the sampling masks in the raw image pixel space, leading to suboptimal image representations and redundant metadata. In this paper, we propose a novel framework to learn a compact representation in the latent space serving as the metadata in an end-to-end manner. Furthermore, we propose a novel sRGB-guided context model with the improved entropy estimation strategies, which leads to better reconstruction quality, smaller size of metadata, and faster speed. We illustrate how the proposed raw image compression scheme can adaptively allocate more bits to image regions that are important from a global perspective. The experimental results show that the proposed method can achieve superior raw image reconstruction results using a smaller size of the metadata on both uncompressed sRGB images and JPEG images.

MonoATT: Online Monocular 3D Object Detection With Adaptive Token Transformer
Zhou, Yunsong and Zhu, Hongzi and Liu, Quan and Chang, Shan and Guo, Minyi



Research question: How to perform accurate, low-latency monocular 3D object detection (Mono3D) on mobile platforms such as vehicles, drones, and robots.
Motivation: Existing transformer-based offline Mono3D models adopt grid-based vision tokens, which is suboptimal when only coarse tokens can be afforded under limited computational power.
Method: We propose MonoATT, an online Mono3D framework that leverages a novel vision transformer with heterogeneous tokens of varying shapes and sizes, adaptively assigning finer tokens to more significant image areas. A scoring network built on prior knowledge selects the most important areas; a token clustering and merging network with an attention mechanism gradually merges tokens around the selected areas over multiple stages; a pixel-level feature map is then reconstructed from the heterogeneous tokens before a state-of-the-art Mono3D detector serves as the detection core.
Results: Experiments on the real-world KITTI dataset show MonoATT improves Mono3D accuracy for both near and far objects while guaranteeing low latency, outperforming the state of the art by a large margin and ranking first on the KITTI 3D benchmark.

Mobile monocular 3D object detection (Mono3D) (e.g., on a vehicle, a drone, or a robot) is an important yet challenging task. Existing transformer-based offline Mono3D models adopt grid-based vision tokens, which is suboptimal when using coarse tokens due to the limited available computational power. In this paper, we propose an online Mono3D framework, called MonoATT, which leverages a novel vision transformer with heterogeneous tokens of varying shapes and sizes to facilitate mobile Mono3D. The core idea of MonoATT is to adaptively assign finer tokens to areas of more significance before utilizing a transformer to enhance Mono3D. To this end, we first use prior knowledge to design a scoring network for selecting the most important areas of the image, and then propose a token clustering and merging network with an attention mechanism to gradually merge tokens around the selected areas in multiple stages. Finally, a pixel-level feature map is reconstructed from heterogeneous tokens before employing a SOTA Mono3D detector as the underlying detection core. Experiment results on the real-world KITTI dataset demonstrate that MonoATT can effectively improve the Mono3D accuracy for both near and far objects and guarantee low latency. MonoATT yields the best performance compared with the state-of-the-art methods by a large margin and is ranked number one on the KITTI 3D benchmark.

Object Discovery From Motion-Guided Tokens
Bao, Zhipeng and Tokmakov, Pavel and Wang, Yu-Xiong and Gaidon, Adrien and Hebert, Martial



Research question: How to separate objects from the background without manual labels?
Motivation: Existing methods struggle to go beyond clustering of low-level cues, whether handcrafted (e.g., color, texture) or learned (e.g., from auto-encoders).
Method: This paper augments the auto-encoder representation learning framework with two key components: motion guidance and mid-level feature tokenization. A new transformer decoder shows that their benefits compound through motion-guided vector quantization.
Results: The method effectively leverages the synergy between motion and tokenization, surpassing the state of the art on both synthetic and real datasets. It also yields interpretable object-specific mid-level features, demonstrating the benefits of motion guidance (no labeling) and quantization (interpretability, memory efficiency).

Object discovery -- separating objects from the background without manual labels -- is a fundamental open challenge in computer vision. Previous methods struggle to go beyond clustering of low-level cues, whether handcrafted (e.g., color, texture) or learned (e.g., from auto-encoders). In this work, we augment the auto-encoder representation learning framework with two key components: motion-guidance and mid-level feature tokenization. Although both have been separately investigated, we introduce a new transformer decoder showing that their benefits can compound thanks to motion-guided vector quantization. We show that our architecture effectively leverages the synergy between motion and tokenization, improving upon the state of the art on both synthetic and real datasets. Our approach enables the emergence of interpretable object-specific mid-level features, demonstrating the benefits of motion-guidance (no labeling) and quantization (interpretability, memory efficiency).
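The tokenization step above rests on vector quantization, which can be sketched in numpy (a nearest-neighbor assignment only; the paper's motion-guided variant additionally shapes the codebook with motion cues, which is not modeled here). Each continuous feature is replaced by its nearest codebook entry, yielding a discrete mid-level token.

```python
import numpy as np

def vector_quantize(features, codebook):
    """Assign each feature vector to its nearest codebook entry and return
    both the discrete token indices and the quantized vectors."""
    # pairwise squared distances: (n_features, n_codes)
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)           # discrete token per feature
    return idx, codebook[idx]        # quantized (tokenized) features

rng = np.random.default_rng(0)
codebook = rng.standard_normal((4, 3))       # 4 mid-level tokens, dim 3
features = codebook[[2, 0, 1]]               # features exactly on code vectors
idx, quantized = vector_quantize(features, codebook)
```

Because many features collapse onto few codes, the tokens are memory-efficient and inspectable, which is the interpretability benefit the abstract refers to.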

Hyperspherical Embedding for Point Cloud Completion
Zhang, Junming and Zhang, Haomeng and Vasudevan, Ram and Johnson-Roberson, Matthew



Research question: How to predict the complete shape of an object from incomplete 3D depth-sensor measurements.
Motivation: Existing point cloud completion methods typically adopt an encoder-decoder architecture, but the learned embeddings are sparsely distributed in the feature space, leading to poor generalization at test time.
Method: A hyperspherical module is proposed that transforms and normalizes the encoder's embeddings onto a unit hypersphere. Only the directional information is optimized, yielding more stable training and more compact embedding distributions.
Results: Experiments show consistent improvement of point cloud completion in both single-task and multi-task learning.

Most real-world 3D measurements from depth sensors are incomplete, and to address this issue the point cloud completion task aims to predict the complete shapes of objects from partial observations. Previous works often adapt an encoder-decoder architecture, where the encoder is trained to extract embeddings that are used as inputs to generate predictions from the decoder. However, the learned embeddings have sparse distribution in the feature space, which leads to worse generalization results during testing. To address these problems, this paper proposes a hyperspherical module, which transforms and normalizes embeddings from the encoder to be on a unit hypersphere. With the proposed module, the magnitude and direction of the output hyperspherical embedding are decoupled and only the directional information is optimized. We theoretically analyze the hyperspherical embedding and show that it enables more stable training with a wider range of learning rates and more compact embedding distributions. Experiment results show consistent improvement of point cloud completion in both single-task and multi-task learning, which demonstrates the effectiveness of the proposed method.
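The core normalization is simple enough to sketch directly in numpy (the projection step only; the paper's module also includes a learned transformation before normalization, omitted here). Projecting onto the unit hypersphere decouples magnitude from direction, so the optimization sees only the directional component.

```python
import numpy as np

def hyperspherical(embed, eps=1e-8):
    """Project each embedding onto the unit hypersphere: the magnitude is
    discarded and only the direction is kept (and hence optimized)."""
    norm = np.linalg.norm(embed, axis=-1, keepdims=True)
    return embed / np.maximum(norm, eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))      # 5 embeddings, dim 16
u = hyperspherical(x)
```

A direct consequence, and the point of the decoupling, is scale invariance: rescaling an embedding leaves its hyperspherical projection unchanged.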

Unsupervised Deep Asymmetric Stereo Matching With Spatially-Adaptive Self-Similarity
Song, Taeyong and Kim, Sunok and Sohn, Kwanghoon



Research question: Most existing unsupervised stereo matching algorithms assume the left and right images have consistent visual properties, i.e., are symmetric, and easily fail on asymmetric stereo images.
Motivation: We propose a novel spatially-adaptive self-similarity (SASS) method for unsupervised asymmetric stereo matching.
Method: The concept of self-similarity is extended to generate deep features that are robust to asymmetries. The sampling patterns used to compute self-similarities are adaptively generated across image regions to effectively encode diverse patterns, and a contrastive similarity loss with positive and negative weights further encourages SASS to encode asymmetry-agnostic features while maintaining distinctiveness for stereo correspondence.
Results: Extensive experimental results, including ablation studies and comparisons with different methods, demonstrate the effectiveness of the method under resolution and noise asymmetries.

Unsupervised stereo matching has received a lot of attention since it enables the learning of disparity estimation without ground-truth data. However, most of the unsupervised stereo matching algorithms assume that the left and right images have consistent visual properties, i.e., symmetric, and easily fail when the stereo images are asymmetric. In this paper, we present a novel spatially-adaptive self-similarity (SASS) for unsupervised asymmetric stereo matching. It extends the concept of self-similarity and generates deep features that are robust to the asymmetries. The sampling patterns to calculate self-similarities are adaptively generated throughout the image regions to effectively encode diverse patterns. In order to learn the effective sampling patterns, we design a contrastive similarity loss with positive and negative weights. Consequently, SASS is further encouraged to encode asymmetry-agnostic features, while maintaining the distinctiveness for stereo correspondence. We present extensive experimental results including ablation studies and comparisons with different methods, demonstrating effectiveness of the proposed method under resolution and noise asymmetries.

MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training
Xu, Runsen and Wang, Tai and Zhang, Wenwei and Chen, Runjian and Cao, Jinkun and Pang, Jiangmiao and Lin, Dahua



Research question: How to design an effective LiDAR-based self-supervised pre-training method, together with a reliable data-efficient benchmark for 3D object detection.
Motivation: Inspired by the scene-voxel-point hierarchy in downstream 3D detectors, masking and reconstruction strategies should account for voxel distributions in the scene and local point distributions within each voxel; moreover, previous data-efficient experiments uniformly sample fine-tuning splits from each LiDAR sequence, yielding similar data diversity across splits and thus limited evaluation value.
Method: The paper proposes Masked Voxel Jigsaw and Reconstruction (MV-JAR), which employs a Reversed-Furthest-Voxel-Sampling strategy to address the uneven distribution of LiDAR points and combines two techniques for modeling the scene-level voxel and voxel-level point distributions. A new benchmark on the Waymo dataset samples scene sequences for diverse fine-tuning splits, ensuring adequate model convergence and a more accurate evaluation of pre-training methods.
Results: Experiments on the Waymo benchmark and the KITTI dataset show MV-JAR consistently and significantly improves 3D detection across data scales, achieving up to a 6.3% mAPH gain over training from scratch. Code and the benchmark are available at https://github.com/SmartBot-PJLab/MV-JAR.

This paper introduces the Masked Voxel Jigsaw and Reconstruction (MV-JAR) method for LiDAR-based self-supervised pre-training and a carefully designed data-efficient 3D object detection benchmark on the Waymo dataset. Inspired by the scene-voxel-point hierarchy in downstream 3D object detectors, we design masking and reconstruction strategies accounting for voxel distributions in the scene and local point distributions within the voxel. We employ a Reversed-Furthest-Voxel-Sampling strategy to address the uneven distribution of LiDAR points and propose MV-JAR, which combines two techniques for modeling the aforementioned distributions, resulting in superior performance. Our experiments reveal limitations in previous data-efficient experiments, which uniformly sample fine-tuning splits with varying data proportions from each LiDAR sequence, leading to similar data diversity across splits. To address this, we propose a new benchmark that samples scene sequences for diverse fine-tuning splits, ensuring adequate model convergence and providing a more accurate evaluation of pre-training methods. Experiments on our Waymo benchmark and the KITTI dataset demonstrate that MV-JAR consistently and significantly improves 3D detection performance across various data scales, achieving up to a 6.3% increase in mAPH compared to training from scratch. Codes and the benchmark are available at https://github.com/SmartBot-PJLab/MV-JAR.
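The furthest-sampling idea underlying the Reversed-Furthest-Voxel-Sampling strategy can be illustrated with classic greedy farthest-point sampling in numpy (a simplification: the paper samples voxels with a reversed variant, while this sketch samples raw points). Greedily picking the point farthest from everything already chosen spreads samples evenly despite uneven point density.

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Greedy farthest-point sampling: start from `seed`, then repeatedly
    pick the point with the largest distance to the selected set."""
    n = len(points)
    chosen = [seed]
    dist = np.full(n, np.inf)        # distance to nearest selected point
    for _ in range(k - 1):
        d_new = np.linalg.norm(points - points[chosen[-1]], axis=1)
        dist = np.minimum(dist, d_new)
        chosen.append(int(dist.argmax()))
    return np.array(chosen)

pts = np.array([[0.0, 0.0],
                [0.1, 0.0],          # close to the seed, should be skipped
                [10.0, 0.0]])        # far away, picked next
idx = farthest_point_sampling(pts, k=2)
```

This counteracts the fact that LiDAR returns are dense near the sensor and sparse far away, which is the uneven distribution the paper's strategy targets.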

Learning a Sparse Transformer Network for Effective Image Deraining
Chen, Xiang and Li, Hao and Li, Mingqiang and Pan, Jinshan



Research question: In image deraining, existing Transformer models aggregate features using all query-key similarities, which can interfere with clear image restoration.
Motivation: To overcome this interference in existing Transformers and improve image reconstruction quality.
Method: We propose an effective deraining network, the Sparse Transformer (DRSformer). A learnable top-k selection operator adaptively retains the most crucial attention scores for each query, enabling better feature aggregation, and a mixed-scale feed-forward network generates better features for image deraining.
Results: Experiments show the method outperforms the current state of the art on commonly used benchmarks.

Transformers-based methods have achieved significant performance in image deraining as they can model the non-local information which is vital for high-quality image reconstruction. In this paper, we find that most existing Transformers usually use all similarities of the tokens from the query-key pairs for the feature aggregation. However, if the tokens from the query are different from those of the key, the self-attention values estimated from these tokens also involve in feature aggregation, which accordingly interferes with the clear image restoration. To overcome this problem, we propose an effective DeRaining network, Sparse Transformer (DRSformer) that can adaptively keep the most useful self-attention values for feature aggregation so that the aggregated features better facilitate high-quality image reconstruction. Specifically, we develop a learnable top-k selection operator to adaptively retain the most crucial attention scores from the keys for each query for better feature aggregation. Simultaneously, as the naive feed-forward network in Transformers does not model the multi-scale information that is important for latent clear image restoration, we develop an effective mixed-scale feed-forward network to generate better features for image deraining. To learn an enriched set of hybrid features, which combines local context from CNN operators, we equip our model with mixture of experts feature compensator to present a cooperation refinement deraining scheme. Extensive experimental results on the commonly used benchmarks demonstrate that the proposed method achieves favorable performance against state-of-the-art approaches. The source code and trained models are available at https://github.com/cschenxiang/DRSformer.
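The top-k selection described above can be sketched in numpy (illustrative: a fixed k and single head, whereas DRSformer learns an adjustable selection inside a full restoration network). Scores below each query's k-th largest are masked to negative infinity before the softmax, so only the most relevant keys contribute to aggregation.

```python
import numpy as np

def topk_sparse_attention(q, k, v, topk):
    """Scaled dot-product attention that keeps only the top-k scores per
    query; all other positions get zero weight after the softmax."""
    logits = q @ k.T / np.sqrt(k.shape[-1])
    # per-query threshold: the k-th largest score
    thresh = np.sort(logits, axis=-1)[:, -topk][:, None]
    masked = np.where(logits >= thresh, logits, -np.inf)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    attn = e / e.sum(axis=-1, keepdims=True)
    return attn @ v, attn

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 4))      # 5 queries
k = rng.standard_normal((6, 4))      # 6 keys
v = rng.standard_normal((6, 4))
out, attn = topk_sparse_attention(q, k, v, topk=2)
```

Compared with dense attention, the aggregation ignores dissimilar tokens entirely instead of merely down-weighting them, which is the interference the paper seeks to remove.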

DA-DETR: Domain Adaptive Detection Transformer With Information Fusion
Zhang, Jingyi and Huang, Jiaxing and Luo, Zhipeng and Zhang, Gongjie and Zhang, Xiaoqin and Lu, Shijian



Research question: How to leverage the simple yet effective DETR architecture for domain adaptive object detection.
Motivation: Although DETR simplifies the object detection pipeline, how to apply it to domain adaptive object detection has been largely neglected.
Method: We design DA-DETR, a domain adaptive detection transformer that introduces information fusion for effective transfer from a labeled source domain to an unlabeled target domain. Specifically, DA-DETR introduces a novel CNN-Transformer Blender (CTBlender) that ingeniously fuses CNN features and Transformer features for effective feature alignment and knowledge transfer across domains.
Results: Extensive experiments show DA-DETR consistently achieves superior detection performance across multiple widely adopted domain adaptation benchmarks.

The recent detection transformer (DETR) simplifies the object detection pipeline by removing hand-crafted designs and hyperparameters as employed in conventional two-stage object detectors. However, how to leverage the simple yet effective DETR architecture in domain adaptive object detection is largely neglected. Inspired by the unique DETR attention mechanisms, we design DA-DETR, a domain adaptive object detection transformer that introduces information fusion for effective transfer from a labeled source domain to an unlabeled target domain. DA-DETR introduces a novel CNN-Transformer Blender (CTBlender) that fuses the CNN features and Transformer features ingeniously for effective feature alignment and knowledge transfer across domains. Specifically, CTBlender employs the Transformer features to modulate the CNN features across multiple scales where the high-level semantic information and the low-level spatial information are fused for accurate object identification and localization. Extensive experiments show that DA-DETR achieves superior detection performance consistently across multiple widely adopted domain adaptation benchmarks.

Global-to-Local Modeling for Video-Based 3D Human Pose and Shape Estimation
Shen, Xiaolong and Yang, Zongxin and Wang, Xiaohan and Ma, Jianxin and Zhou, Chang and Yang, Yi



Research question: This paper addresses 3D human pose and shape estimation from video, in particular the challenge of handling short-term and long-term temporal correlations.
Motivation: Existing state-of-the-art methods treat the two evaluation metrics (intra-frame accuracy and inter-frame smoothness) as a unified problem and design their networks with a single kind of modeling structure (e.g., RNN or attention-based blocks), which makes it hard to balance the learning of short-term and long-term temporal correlations and may cause global location shift, temporal inconsistency, and insufficient local details in the predictions.
Method: To solve these problems, we propose an end-to-end framework, the Global-to-Local Transformer (GLoT), which structurally decouples the modeling of long-term and short-term correlations. First, a global transformer with a Masked Pose and Shape Estimation strategy is introduced for long-term modeling. Second, a local transformer exploits local details on the human mesh and interacts with the global transformer via cross-attention. In addition, a Hierarchical Spatial Correlation Regressor refines intra-frame estimations through decoupled global-local representations and implicit kinematic constraints.
Results: GLoT surpasses previous state-of-the-art methods on the popular 3DPW, MPI-INF-3DHP, and Human3.6M benchmarks while having the lowest model parameters.

Video-based 3D human pose and shape estimations are evaluated by intra-frame accuracy and inter-frame smoothness. Although these two metrics are responsible for different ranges of temporal consistency, existing state-of-the-art methods treat them as a unified problem and use monotonous modeling structures (e.g., RNN or attention-based block) to design their networks. However, using a single kind of modeling structure is difficult to balance the learning of short-term and long-term temporal correlations, and may bias the network to one of them, leading to undesirable predictions like global location shift, temporal inconsistency, and insufficient local details. To solve these problems, we propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, Global-to-Local Transformer (GLoT). First, a global transformer is introduced with a Masked Pose and Shape Estimation strategy for long-term modeling. The strategy stimulates the global transformer to learn more inter-frame correlations by randomly masking the features of several frames. Second, a local transformer is responsible for exploiting local details on the human mesh and interacting with the global transformer by leveraging cross-attention. Moreover, a Hierarchical Spatial Correlation Regressor is further introduced to refine intra-frame estimations by decoupled global-local representation and implicit kinematic constraints. Our GLoT surpasses previous state-of-the-art methods with the lowest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M. Codes are available at https://github.com/sxl142/GLoT.

Point2Pix: Photo-Realistic Point Cloud Rendering via Neural Radiance Fields
Hu, Tao and Xu, Xiaogang and Liu, Shu and Jia, Jiaya



Research question: How to effectively link sparse point cloud representations with dense 2D image pixels to synthesize high-quality images.
Motivation: Synthesizing photo-realistic images from a point cloud is challenging because of the sparsity of the point cloud representation.
Method: We present Point2Pix, a novel point renderer that exploits point cloud 3D priors and the NeRF rendering pipeline to synthesize high-quality images from colored point clouds.
Results: The proposed point-guided sampling, point encoding for multi-scale radiance fields, and fusion encoding markedly improve the efficiency and quality of image synthesis. Extensive experiments on the ScanNet and ArkitScenes datasets demonstrate the effectiveness and generalization ability of the method.

Synthesizing photo-realistic images from a point cloud is challenging because of the sparsity of point cloud representation. Recent Neural Radiance Fields and extensions are proposed to synthesize realistic images from 2D input. In this paper, we present Point2Pix as a novel point renderer to link the 3D sparse point clouds with 2D dense image pixels. Taking advantage of the point cloud 3D prior and NeRF rendering pipeline, our method can synthesize high-quality images from colored point clouds, generally for novel indoor scenes. To improve the efficiency of ray sampling, we propose point-guided sampling, which focuses on valid samples. Also, we present Point Encoding to build Multi-scale Radiance Fields that provide discriminative 3D point features. Finally, we propose Fusion Encoding to efficiently synthesize high-quality images. Extensive experiments on the ScanNet and ArkitScenes datasets demonstrate the effectiveness and generalization.

Multiplicative Fourier Level of Detail
Dou, Yishun and Zheng, Zhong and Jin, Qiaoqiao and Ni, Bingbing



Research question: This paper develops a simple yet effective implicit representation scheme called Multiplicative Fourier Level of Detail (MFLOD).
Motivation: MFLOD is motivated by the recent success of multiplicative filter networks.
Method: Built on a multi-resolution feature grid/volume (e.g., a sparse voxel octree), each level's feature is first modulated by a sinusoidal function and then element-wise multiplied with a linear transformation of the previous layer's representation in a layer-to-layer recursive manner, yielding scale-aggregated encodings that a simple final linear layer maps to the output.
Results: Experiments on implicit neural representation learning tasks, including image fitting, 3D shape representation, and neural radiance fields, demonstrate the superiority and generality of the MFLOD scheme.

We develop a simple yet surprisingly effective implicit representing scheme called Multiplicative Fourier Level of Detail (MFLOD) motivated by the recent success of multiplicative filter network. Built on multi-resolution feature grid/volume (e.g., the sparse voxel octree), each level's feature is first modulated by a sinusoidal function and then element-wisely multiplied by a linear transformation of previous layer's representation in a layer-to-layer recursive manner, yielding the scale-aggregated encodings for a subsequent simple linear forward to get final output. In contrast to previous hybrid representations relying on interleaved multilevel fusion and nonlinear activation-based decoding, MFLOD could be elegantly characterized as a linear combination of sine basis functions with varying amplitude, frequency, and phase upon the learned multilevel features, thus offering great feasibility in Fourier analysis. Comprehensive experimental results on implicit neural representation learning tasks including image fitting, 3D shape representation, and neural radiance fields well demonstrate the superior quality and generalizability achieved by the proposed MFLOD scheme.
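The layer-to-layer recursion can be sketched in a few lines. This is only the multiplicative-filter-network-style skeleton: random weights stand in for the learned per-level grid features, and the widths, level count, and seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the random "weights" are reproducible

def mflod_forward(x, n_levels=3, width=8):
    """Multiplicative recursion sketch: each level's feature is passed through
    a sinusoid and multiplied element-wise with a linear map of the previous
    level's representation; a final linear layer produces the output.
    Random weights stand in for learned per-level grid features (assumption)."""
    d = x.shape[-1]
    z = np.sin(rng.normal(size=(width, d)) @ x + rng.normal(size=width))
    for _ in range(n_levels - 1):
        filt = np.sin(rng.normal(size=(width, d)) @ x + rng.normal(size=width))
        z = filt * (rng.normal(size=(width, width)) @ z + rng.normal(size=width))
    return rng.normal(size=width) @ z  # simple linear forward -> scalar output
```

Because every nonlinearity is a sinusoid entering multiplicatively, the output expands into a linear combination of sine basis functions, which is what makes the Fourier analysis in the abstract tractable.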

Low-Light Image Enhancement via Structure Modeling and Guidance
Xu, Xiaogang and Wang, Ruixing and Lu, Jiangbo



Research question: This paper proposes a new low-light image enhancement framework that conducts appearance and structure modeling simultaneously.
Motivation: When processing low-light images, existing methods often focus only on appearance information and ignore structural information.
Method: The framework uses structural features to guide appearance enhancement, covering both edge detection and image enhancement. Specifically, structure modeling is realized by designing a structure-aware feature extractor and generator, while appearance modeling is implemented with a simple U-Net.
Results: Experimental results show that the method achieves state-of-the-art performance on all datasets, demonstrating its effectiveness.

This paper proposes a new framework for low-light image enhancement by simultaneously conducting the appearance as well as structure modeling. It employs the structural feature to guide the appearance enhancement, leading to sharp and realistic results. The structure modeling in our framework is implemented as the edge detection in low-light images. It is achieved with a modified generative model via designing a structure-aware feature extractor and generator. The detected edge maps can accurately emphasize the essential structural information, and the edge prediction is robust towards the noises in dark areas. Moreover, to improve the appearance modeling, which is implemented with a simple U-Net, a novel structure-guided enhancement module is proposed with structure-guided feature synthesis layers. The appearance modeling, edge detector, and enhancement module can be trained end-to-end. The experiments are conducted on representative datasets (sRGB and RAW domains), showing that our model consistently achieves SOTA performance on all datasets with the same architecture.

Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection
Li, Long and Han, Junwei and Zhang, Ni and Liu, Nian and Khan, Salman and Cholakkal, Hisham and Anwer, Rao Muhammad and Khan, Fahad Shahbaz



Research question: Most previous co-salient object detection works focus on extracting co-salient cues by mining consistency relations across images while ignoring explicit exploration of background regions.
Motivation: This paper proposes a Discriminative co-saliency and background Mining Transformer framework (DMT) based on economical multi-grained correlation modules to explicitly mine both co-saliency and background information and effectively model their discrimination.
Method: First, we propose region-to-region correlation modules to economically model inter-image relations for pixel-wise segmentation features. Then, we use two types of predefined tokens to mine co-saliency and background information via the proposed contrast-induced pixel-to-token and co-saliency token-to-token correlation modules. We also design a token-guided feature refinement module to enhance the discriminability of the segmentation features under the guidance of the learned tokens, and perform iterative mutual promotion between segmentation feature extraction and token construction.
Results: Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed method. The source code is available at: https://github.com/dragonlee258079/DMT.

Most previous co-salient object detection works mainly focus on extracting co-salient cues via mining the consistency relations across images while ignoring the explicit exploration of background regions. In this paper, we propose a Discriminative co-saliency and background Mining Transformer framework (DMT) based on several economical multi-grained correlation modules to explicitly mine both co-saliency and background information and effectively model their discrimination. Specifically, we first propose region-to-region correlation modules to economically model inter-image relations for pixel-wise segmentation features. Then, we use two types of predefined tokens to mine co-saliency and background information via our proposed contrast-induced pixel-to-token and co-saliency token-to-token correlation modules. We also design a token-guided feature refinement module to enhance the discriminability of the segmentation features under the guidance of the learned tokens. We perform iterative mutual promotion for the segmentation feature extraction and token construction. Experimental results on three benchmark datasets demonstrate the effectiveness of our proposed method. The source code is available at: https://github.com/dragonlee258079/DMT.

Binary Latent Diffusion
Wang, Ze and Wang, Jiang and Liu, Zicheng and Qiu, Qiang



Research question: How to exploit a binary latent space for compact yet expressive image representations.
Motivation: Existing image representations typically require multi-stage latent hierarchies, or rely on pixels or continuous latent representations, which limits their efficiency and resolution.
Method: An auto-encoder with a Bernoulli encoding distribution is trained to model the bi-directional mappings between an image and its corresponding binary latent representation.
Results: Experiments show that the method performs comparably to state-of-the-art methods on multiple datasets while dramatically improving sampling efficiency, generating images in as few as 16 steps without any test-time acceleration. The method also scales seamlessly to 1024 x 1024 high-resolution image generation without resorting to latent hierarchies or multi-stage refinement.

In this paper, we show that a binary latent space can be explored for compact yet expressive image representations. We model the bi-directional mappings between an image and the corresponding latent binary representation by training an auto-encoder with a Bernoulli encoding distribution. On the one hand, the binary latent space provides a compact discrete image representation of which the distribution can be modeled more efficiently than pixels or continuous latent representations. On the other hand, we now represent each image patch as a binary vector instead of an index of a learned cookbook as in discrete image representations with vector quantization. In this way, we obtain binary latent representations that allow for better image quality and high-resolution image representations without any multi-stage hierarchy in the latent space. In this binary latent space, images can now be generated effectively using a binary latent diffusion model tailored specifically for modeling the prior over the binary image representations. We present both conditional and unconditional image generation experiments with multiple datasets, and show that the proposed method performs comparably to state-of-the-art methods while dramatically improving the sampling efficiency to as few as 16 steps without using any test-time acceleration. The proposed framework can also be seamlessly scaled to 1024 x 1024 high-resolution image generation without resorting to latent hierarchy or multi-stage refinements.
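The Bernoulli encoding step can be sketched as follows. This is only the sampling half of the idea: a trained encoder would produce the logits, and the gradient tricks needed to train through the discrete sample (e.g. a straight-through estimator) are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed for reproducible sampling

def encode_binary(logits):
    """Sample a binary latent code from a Bernoulli encoding distribution.

    logits: real-valued encoder outputs (here they would come from a trained
    encoder; any training-time gradient estimator is omitted in this sketch).
    Returns (Bernoulli probabilities, sampled 0/1 codes).
    """
    p = 1.0 / (1.0 + np.exp(-logits))                 # sigmoid -> Bernoulli params
    b = (rng.random(p.shape) < p).astype(np.float64)  # 0/1 latent codes
    return p, b
```

Each image patch thus becomes a binary vector rather than a codebook index, which is the property the diffusion prior in the paper is built on.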

Adaptive Assignment for Geometry Aware Local Feature Matching
Huang, Dihe and Chen, Ying and Liu, Yong and Liu, Jianlin and Xu, Shang and Wu, Wenlong and Ding, Yikang and Tang, Fan and Wang, Chengjie



Research question: Current detector-free feature matching methods struggle under large scale and viewpoint variations, because applying the mutual nearest neighbour criterion (i.e., one-to-one assignment) in patch-level matching causes geometric inconsistency.
Motivation: To address this problem, we propose AdaMatcher.
Method: AdaMatcher first accomplishes feature correlation and co-visible area estimation through an elaborately designed feature interaction module, then performs adaptive assignment in patch-level matching while estimating the scales between images, and finally refines the co-visible matches through scale alignment and sub-pixel regression modules.
Results: Experiments show that AdaMatcher outperforms solid baselines and achieves state-of-the-art results on many downstream tasks. Moreover, the adaptive assignment and sub-pixel refinement modules can serve as a refinement network for other matching methods, such as SuperGlue, to further boost their performance.

The detector-free feature matching approaches are currently attracting great attention thanks to their excellent performance. However, these methods still struggle at large-scale and viewpoint variations, due to the geometric inconsistency resulting from the application of the mutual nearest neighbour criterion (i.e., one-to-one assignment) in patch-level matching. Accordingly, we introduce AdaMatcher, which first accomplishes the feature correlation and co-visible area estimation through an elaborate feature interaction module, then performs adaptive assignment on patch-level matching while estimating the scales between images, and finally refines the co-visible matches through scale alignment and sub-pixel regression module. Extensive experiments show that AdaMatcher outperforms solid baselines and achieves state-of-the-art results on many downstream tasks. Additionally, the adaptive assignment and sub-pixel refinement module can be used as a refinement network for other matching methods, such as SuperGlue, to boost their performance further. The code will be publicly available at https://github.com/AbyssGaze/AdaMatcher.
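The one-to-one criterion that AdaMatcher relaxes is easy to state in code. The sketch below implements plain mutual-nearest-neighbour matching on a similarity matrix, i.e. the baseline assignment, not the paper's adaptive one.

```python
import numpy as np

def mutual_nearest_neighbors(sim):
    """One-to-one matching by the mutual nearest-neighbour criterion:
    (i, j) is a match iff j is i's best column AND i is j's best row.

    sim: (n, m) similarity matrix between two sets of patch descriptors.
    Returns the list of mutually consistent (i, j) index pairs.
    """
    row_best = sim.argmax(axis=1)  # best j for each i
    col_best = sim.argmax(axis=0)  # best i for each j
    return [(i, j) for i, j in enumerate(row_best) if col_best[j] == i]
```

Under a scale change, several patches in one image can legitimately correspond to a single patch in the other, which this one-to-one rule cannot express; that is the geometric inconsistency adaptive (one-to-many) assignment is meant to fix.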

FeatER: An Efficient Network for Human Reconstruction via Feature Map-Based TransformER
Zheng, Ce and Mendieta, Matias and Yang, Taojiannan and Qi, Guo-Jun and Chen, Chen



Research question: Existing vision transformers cannot process feature map inputs directly, forcing an unnatural flattening of location-sensitive human structural information, while computation and memory demands keep increasing.
Motivation: To address these problems, we propose FeatER, a novel transformer design that preserves the inherent structure of feature map representations while reducing memory and computational costs.
Method: Using FeatER, we build an efficient network for a set of human reconstruction tasks, including 2D human pose estimation, 3D human pose estimation, and human mesh reconstruction. A feature map reconstruction module is also applied to improve the performance of the estimated human pose and mesh.
Results: Experiments demonstrate the effectiveness of FeatER on various human pose and mesh datasets. For instance, on the Human3.6M and 3DPW datasets, FeatER outperforms the state-of-the-art MeshGraphormer while requiring only 5% of its parameters and 16% of its multiply-accumulate operations.

Recently, vision transformers have shown great success in a set of human reconstruction tasks such as 2D human pose estimation (2D HPE), 3D human pose estimation (3D HPE), and human mesh reconstruction (HMR) tasks. In these tasks, feature map representations of the human structural information are often extracted first from the image by a CNN (such as HRNet), and then further processed by transformer to predict the heatmaps (encodes each joint's location into a feature map with a Gaussian distribution) for HPE or HMR. However, existing transformer architectures are not able to process these feature map inputs directly, forcing an unnatural flattening of the location-sensitive human structural information. Furthermore, much of the performance benefit in recent HPE and HMR methods has come at the cost of ever-increasing computation and memory needs. Therefore, to simultaneously address these problems, we propose FeatER, a novel transformer design which preserves the inherent structure of feature map representations when modeling attention while reducing the memory and computational costs. Taking advantage of FeatER, we build an efficient network for a set of human reconstruction tasks including 2D HPE, 3D HPE, and HMR. A feature map reconstruction module is applied to improve the performance of the estimated human pose and mesh. Extensive experiments demonstrate the effectiveness of FeatER on various human pose and mesh datasets. For instance, FeatER outperforms the SOTA method MeshGraphormer by requiring 5% of Params (total parameters) and 16% of MACs (the Multiply-Accumulate Operations) on Human3.6M and 3DPW datasets. Code will be publicly available.

Residual Degradation Learning Unfolding Framework With Mixing Priors Across Spectral and Spatial for Compressive Spectral Imaging
Dong, Yubo and Gao, Dahua and Qiu, Tian and Li, Yuyan and Yang, Minxi and Shi, Guangming



Research question: How to recover a reliable and fine underlying 3D spectral cube from a 2D measurement.
Motivation: The core problem of the CASSI system is recovering the underlying 3D spectral cube from the 2D measurement, and existing deep unfolding methods fall short in both the data subproblem and the prior subproblem.
Method: We propose a Residual Degradation Learning Unfolding Framework (RDLUF) and a MixS2 Transformer: the former bridges the gap between the sensing matrix and the real degradation process, and the latter strengthens spectral-spatial representation ability by mixing priors across the spectral and spatial dimensions.
Results: Experimental results show that the proposed method outperforms existing ones.

To acquire a snapshot spectral image, coded aperture snapshot spectral imaging (CASSI) is proposed. A core problem of the CASSI system is to recover the reliable and fine underlying 3D spectral cube from the 2D measurement. By alternately solving a data subproblem and a prior subproblem, deep unfolding methods achieve good performance. However, in the data subproblem, the used sensing matrix is ill-suited for the real degradation process due to the device errors caused by phase aberration, distortion; in the prior subproblem, it is important to design a suitable model to jointly exploit both spatial and spectral priors. In this paper, we propose a Residual Degradation Learning Unfolding Framework (RDLUF), which bridges the gap between the sensing matrix and the degradation process. Moreover, a MixS2 Transformer is designed via mixing priors across spectral and spatial to strengthen the spectral-spatial representation capability. Finally, plugging the MixS2 Transformer into the RDLUF leads to an end-to-end trainable and interpretable neural network RDLUF-MixS2. Experimental results establish the superior performance of the proposed method over existing ones.
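The alternation between the data and prior subproblems mentioned above follows a standard unfolding recipe, sketched below in NumPy. This is the generic skeleton only: `denoise` is a stand-in callable for the learned prior network, `rho` is an assumed step size, and RDLUF's learned residual correction to the sensing matrix is not reproduced.

```python
import numpy as np

def unfolding_iterations(y, Phi, denoise, n_iters=5, rho=0.5):
    """Generic deep-unfolding skeleton: alternate a gradient step on the data
    term ||y - Phi @ x||^2 with a learned prior step.

    y: measurement, Phi: sensing matrix, denoise: stand-in for the prior
    network (assumption). RDLUF's residual degradation learning is omitted.
    """
    x = Phi.T @ y                                 # initial back-projection
    for _ in range(n_iters):
        x = x - rho * Phi.T @ (Phi @ x - y)       # data subproblem (gradient step)
        x = denoise(x)                            # prior subproblem (learned)
    return x
```

RDLUF's contribution in this picture is to replace the fixed `Phi` in the data step with the true degradation, learned as a residual on top of the nominal sensing matrix.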

PanelNet: Understanding 360 Indoor Environment via Panel Representation
Yu, Haozheng and He, Lu and Jian, Bing and Feng, Weiwei and Liu, Shan



Research question: How to exploit two essential properties of indoor 360 panoramas (horizontal continuity and seamlessness, and the importance of gravity in indoor environment design) for understanding indoor environments.
Motivation: Existing methods often overlook these properties, leading to less accurate understanding and analysis of indoor environments.
Method: We present PanelNet, a framework that represents an equirectangular projection as consecutive vertical panels together with their panel geometry. A panel geometry embedding network and a Local2Global Transformer are introduced to reduce the negative impact of panoramic distortion and to capture the geometric context of room design.
Results: Experimental results show that the method outperforms existing methods on indoor 360 depth estimation and achieves competitive results on indoor layout estimation and semantic segmentation, all with low training overhead.

Indoor 360 panoramas have two essential properties. (1) The panoramas are continuous and seamless in the horizontal direction. (2) Gravity plays an important role in indoor environment design. By leveraging these properties, we present PanelNet, a framework that understands indoor environments using a novel panel representation of 360 images. We represent an equirectangular projection (ERP) as consecutive vertical panels with corresponding 3D panel geometry. To reduce the negative impact of panoramic distortion, we incorporate a panel geometry embedding network that encodes both the local and global geometric features of a panel. To capture the geometric context in room design, we introduce Local2Global Transformer, which aggregates local information within a panel and panel-wise global context. It greatly improves the model performance with low training overhead. Our method outperforms existing methods on indoor 360 depth estimation and shows competitive results against state-of-the-art approaches on the task of indoor layout estimation and semantic segmentation.

Correspondence Transformers With Asymmetric Feature Learning and Matching Flow Super-Resolution
Sun, Yixuan and Zhao, Dongyang and Yin, Zhangyue and Huang, Yiwen and Gui, Tao and Zhang, Wenqiang and Ge, Weifeng



Research question: Learning dense visual correspondences between different object instances of the same category with only sparse annotations.
Motivation: Existing methods need large amounts of annotated data for accurate pixel-level semantic matching, whereas the proposed method requires only sparse annotations.
Method: The pixel-level semantic matching problem is decomposed into two easier subproblems: first, local feature descriptors of the source and target images are mapped into shared semantic spaces to obtain coarse matching flows; second, the low-resolution matching flows are refined to produce accurate point-to-point matching results. To this end, asymmetric feature learning and matching flow super-resolution based on vision transformers are proposed.
Results: Extensive experiments on several popular benchmarks, such as PF-PASCAL, PF-WILLOW, and SPair-71K, verify that the method can efficiently capture subtle semantic differences between pixels.

This paper solves the problem of learning dense visual correspondences between different object instances of the same category with only sparse annotations. We decompose this pixel-level semantic matching problem into two easier ones: (i) First, local feature descriptors of source and target images need to be mapped into shared semantic spaces to get coarse matching flows. (ii) Second, matching flows in low resolution should be refined to generate accurate point-to-point matching results. We propose asymmetric feature learning and matching flow super-resolution based on vision transformers to solve the above problems. The asymmetric feature learning module exploits a biased cross-attention mechanism to encode token features of source images with their target counterparts. Then matching flow in low resolutions is enhanced by a super-resolution network to get accurate correspondences. Our pipeline is built upon vision transformers and can be trained in an end-to-end manner. Extensive experimental results on several popular benchmarks, such as PF-PASCAL, PF-WILLOW, and SPair-71K, demonstrate that the proposed method can catch subtle semantic differences in pixels efficiently. Code is available on https://github.com/YXSUNMADMAX/ACTR.

Unsupervised 3D Point Cloud Representation Learning by Triangle Constrained Contrast for Autonomous Driving
Pang, Bo and Xia, Hongchi and Lu, Cewu



Research question: Because annotating 3D LiDAR data for autonomous driving is difficult, an effective unsupervised 3D representation learning method is needed.
Motivation: This paper designs the Triangle Constrained Contrast (TriCC) framework tailored for autonomous driving scenes, which learns 3D unsupervised representations from both multimodal information and the dynamics of temporal sequences.
Method: We treat one camera image and two LiDAR point clouds with different timestamps as a triplet. The key design is a consistency constraint that automatically finds matching relations within the triplet through a "self-cycle" and learns representations from them. With matching relations across time and modalities, we can further conduct a triplet contrast to improve learning efficiency.
Results: Experimental results show that TriCC is the first framework to unify temporal and multimodal semantics, i.e., it exploits almost all the information in autonomous driving scenes. Compared with previous contrastive methods, it automatically mines harder contrastive pairs instead of relying on handcrafted ones. Extensive experiments on several semantic segmentation and 3D detection datasets show that TriCC learns effective representations with far fewer training iterations and greatly improves state-of-the-art results on all downstream tasks. Code and models are available at https://bopang1996.github.io/.

Due to the difficulty of annotating the 3D LiDAR data of autonomous driving, an efficient unsupervised 3D representation learning method is important. In this paper, we design the Triangle Constrained Contrast (TriCC) framework tailored for autonomous driving scenes which learns 3D unsupervised representations through both the multimodal information and dynamic of temporal sequences. We treat one camera image and two LiDAR point clouds with different timestamps as a triplet. And our key design is the consistent constraint that automatically finds matching relationships among the triplet through "self-cycle" and learns representations from it. With the matching relations across the temporal dimension and modalities, we can further conduct a triplet contrast to improve learning efficiency. To the best of our knowledge, TriCC is the first framework that unifies both the temporal and multimodal semantics, which means it utilizes almost all the information in autonomous driving scenes. And compared with previous contrastive methods, it can automatically dig out contrasting pairs with higher difficulty, instead of relying on handcrafted ones. Extensive experiments are conducted with Minkowski-UNet and VoxelNet on several semantic segmentation and 3D detection datasets. Results show that TriCC learns effective representations with much fewer training iterations and improves the SOTA results greatly on all the downstream tasks. Code and models can be found at https://bopang1996.github.io/.
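One plausible reading of the "self-cycle" constraint is cycle consistency through the triplet: chain matches image -> cloud_t1 -> cloud_t2 -> image and keep the indices that return to themselves. The sketch below does this with plain nearest-neighbour matching in feature space; the paper's actual matching mechanism may differ.

```python
import numpy as np

def self_cycle_consistent(fa, fb, fc):
    """Chain nearest-neighbour matches through a triplet of feature sets and
    keep indices that map back to themselves (a simplified reading of the
    "self-cycle" constraint; the paper's matcher is an assumption here).

    fa, fb, fc: (n, d) feature matrices for the three triplet members.
    """
    def nn(x, y):  # for each row of x, index of its nearest row in y
        d = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

    cycle = nn(fc, fa)[nn(fb, fc)[nn(fa, fb)]]       # a -> b -> c -> a
    return np.flatnonzero(cycle == np.arange(len(fa)))
```

Indices that survive the cycle give automatically mined positive pairs across modality and time, without handcrafted augmentation.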

Controllable Mesh Generation Through Sparse Latent Point Diffusion Models
Lyu, Zhaoyang and Wang, Jinyi and An, Yuwei and Zhang, Ya and Lin, Dahua and Dai, Bo



Research question: Designing an effective generative model for meshes, which have an irregular data structure and inconsistent topology within the same category.
Motivation: The complexity and irregularity of meshes make designing an effective mesh generative model highly challenging.
Method: We propose a novel sparse latent point diffusion model for mesh generation. Point clouds are regarded as an intermediate representation of meshes, and their distribution is modeled instead. The point clouds are further encoded into a set of sparse latent points with point-wise, semantically meaningful features, and two DDPMs are trained in this sparse latent space to model the distributions of the latent point positions and features, respectively.
Results: Extensive experiments on the ShapeNet dataset show that the proposed sparse latent point diffusion model achieves superior generation quality and controllability compared with existing methods.

Mesh generation is of great value in various applications involving computer graphics and virtual content, yet designing generative models for meshes is challenging due to their irregular data structure and inconsistent topology of meshes in the same category. In this work, we design a novel sparse latent point diffusion model for mesh generation. Our key insight is to regard point clouds as an intermediate representation of meshes, and model the distribution of point clouds instead. While meshes can be generated from point clouds via techniques like Shape as Points (SAP), the challenges of directly generating meshes can be effectively avoided. To boost the efficiency and controllability of our mesh generation method, we propose to further encode point clouds to a set of sparse latent points with point-wise semantic meaningful features, where two DDPMs are trained in the space of sparse latent points to respectively model the distribution of the latent point positions and features at these latent points. We find that sampling in this latent space is faster than directly sampling dense point clouds. Moreover, the sparse latent points also enable us to explicitly control both the overall structures and local details of the generated meshes. Extensive experiments are conducted on the ShapeNet dataset, where our proposed sparse latent point diffusion model achieves superior performance in terms of generation quality and controllability when compared to existing methods.
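A common way to obtain a small set of well-spread anchor points from a dense cloud, as the sparse latent points require, is farthest point sampling; whether the paper uses exactly this procedure is an assumption of the sketch below.

```python
import numpy as np

def farthest_point_sampling(pts, k):
    """Pick k well-spread anchor points from a cloud via farthest point
    sampling (a standard choice for sparse latent points; the paper's exact
    selection procedure is an assumption). pts: (n, 3) array of coordinates."""
    idx = [0]                                   # start from an arbitrary point
    d = np.linalg.norm(pts - pts[0], axis=1)    # distance to the chosen set
    for _ in range(k - 1):
        nxt = int(d.argmax())                   # farthest from all chosen points
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(pts - pts[nxt], axis=1))
    return np.array(idx)
```

Each sampled index greedily maximizes the distance to the already-chosen set, so the k anchors cover the shape's overall structure, which is what lets moving a latent point control coarse geometry.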

SGLoc: Scene Geometry Encoding for Outdoor LiDAR Localization
Li, Wen and Yu, Shangshu and Wang, Cheng and Hu, Guosheng and Shen, Siqi and Wen, Chenglu



Research question: Existing LiDAR-based localization methods struggle to effectively encode scene geometry and suffer from unsatisfactory data quality, leaving accuracy with room to improve.
Motivation: We propose a novel LiDAR localization framework, SGLoc, which addresses this by decoupling pose estimation into point cloud correspondence regression and pose estimation via the correspondences.
Method: SGLoc decouples correspondence regression from pose estimation, and designs a tri-scale spatial feature aggregation module and an inter-geometric consistency constraint loss to effectively capture scene geometry. A pose quality evaluation and enhancement method is also proposed to measure and correct the ground-truth poses.
Results: Extensive experiments on the Oxford Radar RobotCar and NCLT datasets demonstrate the effectiveness of SGLoc, which outperforms state-of-the-art regression-based localization methods by 68.5% and 67.6% in position accuracy, respectively.

LiDAR-based absolute pose regression estimates the global pose through a deep network in an end-to-end manner, achieving impressive results in learning-based localization. However, the accuracy of existing methods still has room to improve due to the difficulty of effectively encoding the scene geometry and the unsatisfactory quality of the data. In this work, we propose a novel LiDAR localization framework, SGLoc, which decouples the pose estimation to point cloud correspondence regression and pose estimation via this correspondence. This decoupling effectively encodes the scene geometry because the decoupled correspondence regression step greatly preserves the scene geometry, leading to significant performance improvement. Apart from this decoupling, we also design a tri-scale spatial feature aggregation module and inter-geometric consistency constraint loss to effectively capture scene geometry. Moreover, we empirically find that the ground truth might be noisy due to GPS/INS measuring errors, greatly reducing the pose estimation performance. Thus, we propose a pose quality evaluation and enhancement method to measure and correct the ground truth pose. Extensive experiments on the Oxford Radar RobotCar and NCLT datasets demonstrate the effectiveness of SGLoc, which outperforms state-of-the-art regression-based localization methods by 68.5% and 67.6% on position accuracy, respectively.
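Once point correspondences are regressed, the classic way to turn them into a rigid pose is the Kabsch/SVD method, sketched below; the paper's actual solver and any robustness machinery (e.g. outlier rejection) are not specified here and this is only the textbook step.

```python
import numpy as np

def pose_from_correspondences(src, dst):
    """Rigid pose (R, t) from point correspondences via the Kabsch/SVD
    method (the textbook solver; the paper's exact estimator may differ).

    src, dst: (n, 3) corresponding points with dst ~ R @ src + t.
    """
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                            # proper rotation (det = +1)
    return R, cd - R @ cs
```

Because the geometry lives in the correspondences rather than in a pose regressed end-to-end, errors in individual matches degrade the pose gracefully instead of catastrophically, which is the intuition behind the decoupling.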

Bridging Search Region Interaction With Template for RGB-T Tracking
Hui, Tianrui and Xun, Zizheng and Peng, Fengguang and Huang, Junshi and Wei, Xiaoming and Wei, Xiaolin and Dai, Jiao and Han, Jizhong and Liu, Si



Research question: How to leverage the mutual enhancement and complementarity of the RGB and TIR modalities to improve tracking in various scenarios.
Motivation: Existing methods either directly concatenate RGB and TIR search region features for a coarse interaction, introducing redundant background noise, or perform various fusions on isolated pairs of RGB and TIR boxes within local regions, which limits cross-modal interaction and yields inadequate context modeling.
Method: We propose a novel Template-Bridged Search region Interaction (TBSI) module that uses templates as a medium to bridge cross-modal interaction between the RGB and TIR search regions by gathering and distributing target-relevant object and environment contexts.
Results: Extensive experiments on three popular RGB-T tracking benchmarks show that the method achieves new state-of-the-art performance.

RGB-T tracking aims to leverage the mutual enhancement and complement ability of RGB and TIR modalities for improving the tracking process in various scenarios, where cross-modal interaction is the key component. Some previous methods concatenate the RGB and TIR search region features directly to perform a coarse interaction process with redundant background noises introduced. Many other methods sample candidate boxes from search frames and conduct various fusion approaches on isolated pairs of RGB and TIR boxes, which limits the cross-modal interaction within local regions and brings about inadequate context modeling. To alleviate these limitations, we propose a novel Template-Bridged Search region Interaction (TBSI) module which exploits templates as the medium to bridge the cross-modal interaction between RGB and TIR search regions by gathering and distributing target-relevant object and environment contexts. Original templates are also updated with enriched multimodal contexts from the template medium. Our TBSI module is inserted into a ViT backbone for joint feature extraction, search-template matching, and cross-modal interaction. Extensive experiments on three popular RGB-T tracking benchmarks demonstrate our method achieves new state-of-the-art performances. Code is available at https://github.com/RyanHTR/TBSI.

MetaFusion: Infrared and Visible Image Fusion via Meta-Feature Embedding From Object Detection
Zhao, Wenda and Xie, Shigeng and Zhao, Fan and He, You and Lu, Huchuan



Research question: How to fuse infrared and visible images to provide richer texture details for the downstream object detection task, and how to use the object semantic information furnished by detection to improve infrared-visible image fusion.
Motivation: In existing methods, infrared-visible image fusion and object detection are tasks at different levels, and the feature gap between them hinders improvement of the fusion.
Method: This paper proposes infrared and visible image fusion via meta-feature embedding from object detection. A meta-feature embedding model is designed to generate object semantic features according to the fusion network's ability, so that the semantic features are naturally compatible with the fusion features. It is optimized by simulating meta learning, and a mutual promotion learning scheme between the fusion and detection tasks further improves both.
Results: Comprehensive experiments on three public datasets demonstrate the effectiveness of the method.

Fusing infrared and visible images can provide more texture details for the subsequent object detection task. Conversely, the detection task furnishes object semantic information to improve the infrared and visible image fusion. Thus, joint fusion and detection learning that exploits their mutual promotion is attracting more attention. However, the feature gap between these two different-level tasks hinders the progress. Addressing this issue, this paper proposes infrared and visible image fusion via meta-feature embedding from object detection. The core idea is that a meta-feature embedding model is designed to generate object semantic features according to the fusion network's ability, and thus the semantic features are naturally compatible with the fusion features. It is optimized by simulating meta learning. Moreover, we further implement a mutual promotion learning between the fusion and detection tasks to improve their performances. Comprehensive experiments on three public datasets demonstrate the effectiveness of our method. Code and model are available at: https://github.com/wdzhao123/MetaFusion.

Spectral Enhanced Rectangle Transformer for Hyperspectral Image Denoising
Li, Miaoyu and Liu, Ji and Fu, Ying and Zhang, Yulun and Dou, Dejing



Research question: Existing hyperspectral image (HSI) denoising methods are limited in capturing non-local self-similarity, and existing deep learning methods model spatial and spectral correlations poorly.
Motivation: To address these issues, this paper proposes a Transformer-based HSI denoising method that better explores the non-local spatial similarity and global spectral low-rank property of HSIs.
Method: Specifically, we design a spectral enhanced rectangle Transformer that exploits rectangle self-attention horizontally and vertically to capture non-local similarity in the spatial domain, together with a spectral enhancement module that extracts the global underlying low-rank property of spatial-spectral cubes to suppress noise while enabling interactions among non-overlapping spatial rectangles.
Results: Extensive experiments on both synthetic and real noisy HSIs show that our method performs well in terms of both objective metrics and subjective visual quality.

Denoising is a crucial step for hyperspectral image (HSI) applications. Though witnessing the great power of deep learning, existing HSI denoising methods suffer from limitations in capturing the non-local self-similarity. Transformers have shown potential in capturing long-range dependencies, but few attempts have been made with specifically designed Transformer to model the spatial and spectral correlation in HSIs. In this paper, we address these issues by proposing a spectral enhanced rectangle Transformer, driving it to explore the non-local spatial similarity and global spectral low-rank property of HSIs. For the former, we exploit the rectangle self-attention horizontally and vertically to capture the non-local similarity in the spatial domain. For the latter, we design a spectral enhancement module that is capable of extracting global underlying low-rank property of spatial-spectral cubes to suppress noise, while enabling the interactions among non-overlapping spatial rectangles. Extensive experiments have been conducted on both synthetic noisy HSIs and real noisy HSIs, showing the effectiveness of our proposed method in terms of both objective metric and subjective visual quality. The code is available at https://github.com/MyuLi/SERT.

End-to-End Vectorized HD-Map Construction With Piecewise Bezier Curve
Qiao, Limeng and Ding, Wenjie and Qiu, Xi and Zhang, Chi



Research question: This paper addresses the construction of vectorized high-definition maps (HD-maps), which perceive centimeter-level environmental information for autonomous driving.
Motivation: Existing approaches mostly obtain a rasterized map with a segmentation-based pipeline and then conduct heavy post-processing for downstream-friendly vectorization, which is inefficient and imprecise.
Method: This paper proposes a concise and elegant parameterization that adopts a unified piecewise Bezier curve to vectorize changeful map elements. Specifically, we design a simple yet effective architecture named Piecewise Bezier HD-map Network (BeMapNet), which performs direct set prediction and requires no post-processing.
Results: Experimental results show that our method surpasses other state-of-the-art methods by at least 18.0 mAP.

Vectorized high-definition map (HD-map) construction, which focuses on the perception of centimeter-level environmental information, has attracted significant research interest in the autonomous driving community. Most existing approaches first obtain a rasterized map with the segmentation-based pipeline and then conduct heavy post-processing for downstream-friendly vectorization. In this paper, by delving into parameterization-based methods, we pioneer a concise and elegant scheme that adopts a unified piecewise Bezier curve. In order to vectorize changeful map elements end-to-end, we elaborate a simple yet effective architecture, named Piecewise Bezier HD-map Network (BeMapNet), which is formulated as a direct set prediction paradigm and is post-processing free. Concretely, we first introduce a novel IPM-PE Align module to inject 3D geometry prior into BEV features through common position encoding in Transformer. Then a well-designed Piecewise Bezier Head is proposed to output the details of each map element, including the coordinates of control points and the segment number of curves. In addition, based on the progressive restoration of the Bezier curve, we also present an efficient Point-Curve-Region Loss for supervising more robust and precise HD-map modeling. Extensive comparisons show that our method is remarkably superior to other existing SOTAs by 18.0 mAP at least.
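The piecewise Bezier parameterization itself is straightforward to evaluate. The sketch below samples points from consecutive Bezier segments given their control points, which is the representation BeMapNet predicts (the network architecture that produces the control points is of course not reproduced).

```python
import numpy as np
from math import comb

def bezier_point(ctrl, t):
    """Evaluate one Bezier segment at t in [0, 1] via the Bernstein basis.
    ctrl: (n+1, 2) control points of a degree-n segment."""
    n = len(ctrl) - 1
    w = np.array([comb(n, i) * t**i * (1 - t)**(n - i) for i in range(n + 1)])
    return w @ ctrl

def piecewise_bezier(segments, samples=20):
    """Sample a map element represented as consecutive Bezier segments,
    i.e. the piecewise parameterization the paper predicts (control-point
    coordinates plus segment count)."""
    ts = np.linspace(0.0, 1.0, samples)
    return np.vstack([[bezier_point(np.asarray(seg, float), t) for t in ts]
                      for seg in segments])
```

A handful of control points per segment thus describes an arbitrarily smooth polyline, which is why the representation is downstream-friendly without rasterization or vectorization post-processing.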

PointListNet: Deep Learning on 3D Point Lists
Fan, Hehe and Zhu, Linchao and Yang, Yi and Kankanhalli, Mohan



Research question: How to handle data that exhibits both a regular 1D list structure (like natural language) and an irregular 3D set structure (like point clouds)?
Motivation: Some data exhibit both structures, for example proteins and non-coding RNAs.
Method: We propose a Transformer-style PointListNet to model such data. First, it employs non-parametric distance-based attention, because in some cases it is the distance between two points (e.g., amino acids), rather than their features or types, that mainly determines how correlated they are. Second, unlike the vanilla Transformer, which applies a simple linear transformation to inputs to generate values without explicitly modeling relative relations, PointListNet integrates the 1D order and 3D Euclidean displacements into the values.
Results: Experiments on protein fold classification and enzyme reaction classification demonstrate the effectiveness of the proposed PointListNet.

Deep neural networks on regular 1D lists (e.g., natural languages) and irregular 3D sets (e.g., point clouds) have made tremendous achievements. The key to natural language processing is to model words and their regular order dependency in texts. For point cloud understanding, the challenge is to understand the geometry via irregular point coordinates, in which point-feeding orders do not matter. However, there are a few kinds of data that exhibit both regular 1D list and irregular 3D set structures, such as proteins and non-coding RNAs. In this paper, we refer to them as 3D point lists and propose a Transformer-style PointListNet to model them. First, PointListNet employs non-parametric distance-based attention because we find sometimes it is the distance, instead of the feature or type, that mainly determines how much two points, e.g., amino acids, are correlated in the micro world. Second, different from the vanilla Transformer that directly performs a simple linear transformation on inputs to generate values and does not explicitly model relative relations, our PointListNet integrates the 1D order and 3D Euclidean displacements into values. We conduct experiments on protein fold classification and enzyme reaction classification. Experimental results show the effectiveness of the proposed PointListNet.

Spherical Transformer for LiDAR-Based 3D Recognition
Lai, Xin and Chen, Yukang and Lu, Fanbin and Liu, Jianhui and Jia, Jiaya



Research question: How to better exploit LiDAR point clouds for 3D recognition, especially for sparse distant points.
Motivation: Most current methods do not take the distribution of LiDAR points into account, which leads to information disconnection and a limited receptive field, especially for sparse distant points.
Method: We propose SphereFormer, which directly aggregates information from dense close points to sparse distant ones. We design radial window self-attention that partitions the space into multiple non-overlapping narrow and long windows, overcoming the disconnection issue and smoothly yet dramatically enlarging the receptive field. To fit the narrow and long windows, we further propose exponential splitting to yield fine-grained position encoding, and dynamic feature selection to increase the model's representation ability.
Results: The method ranks 1st on both the nuScenes and SemanticKITTI semantic segmentation benchmarks with 81.9% and 74.8% mIoU, respectively, and achieves 3rd place on the nuScenes object detection benchmark with 72.8% NDS and 68.5% mAP. The code is open-sourced on GitHub.

LiDAR-based 3D point cloud recognition has benefited various applications. Without specially considering the LiDAR point distribution, most current methods suffer from information disconnection and limited receptive field, especially for the sparse distant points. In this work, we study the varying-sparsity distribution of LiDAR points and present SphereFormer to directly aggregate information from dense close points to the sparse distant ones. We design radial window self-attention that partitions the space into multiple non-overlapping narrow and long windows. It overcomes the disconnection issue and enlarges the receptive field smoothly and dramatically, which significantly boosts the performance of sparse distant points. Moreover, to fit the narrow and long windows, we propose exponential splitting to yield fine-grained position encoding and dynamic feature selection to increase model representation ability. Notably, our method ranks 1st on both nuScenes and SemanticKITTI semantic segmentation benchmarks with 81.9% and 74.8% mIoU, respectively. Also, we achieve the 3rd place on nuScenes object detection benchmark with 72.8% NDS and 68.5% mAP. Code is available at https://github.com/dvlab-research/SphereFormer.git.
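The radial-window idea can be sketched in a few lines: every point whose bearing falls in the same angular bin shares a window, whatever its range, so sparse distant points can attend to dense close ones along the same ray. The bin counts below are illustrative, not the paper's configuration:

```python
import math

def radial_window_id(x, y, z, n_azimuth=64, n_polar=16):
    """Assign a point to a radial window: all points sharing an angular bin,
    regardless of range, fall into the same narrow-and-long window."""
    azimuth = math.atan2(y, x)          # bearing in the ground plane, [-pi, pi)
    polar = math.atan2(z, math.hypot(x, y))  # elevation angle, [-pi/2, pi/2]
    a = int((azimuth + math.pi) / (2 * math.pi) * n_azimuth) % n_azimuth
    p = min(int((polar + math.pi / 2) / math.pi * n_polar), n_polar - 1)
    return a * n_polar + p
```

A close point and a far point along the same ray land in the same window, which is exactly the grouping that lets self-attention bridge the sparsity gap.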

VisFusion: Visibility-Aware Online 3D Scene Reconstruction From Videos
Gao, Huiyu and Mao, Wei and Liu, Miaomiao



Research question: This paper proposes a visibility-aware online 3D scene reconstruction method that reconstructs scenes from monocular videos.
Motivation: Existing reconstruction methods aggregate features for each voxel without considering its visibility, which degrades feature fusion.
Method: We propose VisFusion, which explicitly infers the visibility of each voxel from a similarity matrix computed from its projected features in each image pair, thereby improving feature fusion. We also propose a local feature-volume sparsification scheme to preserve more fine details.
Results: Experimental results show that our method achieves superior performance on benchmarks and reconstructs more scene details.

We propose VisFusion, a visibility-aware online 3D scene reconstruction approach from posed monocular videos. In particular, we aim to reconstruct the scene from volumetric features. Unlike previous reconstruction methods which aggregate features for each voxel from input views without considering its visibility, we aim to improve the feature fusion by explicitly inferring its visibility from a similarity matrix, computed from its projected features in each image pair. Following previous works, our model is a coarse-to-fine pipeline including a volume sparsification process. Different from their works which sparsify voxels globally with a fixed occupancy threshold, we perform the sparsification on a local feature volume along each visual ray to preserve at least one voxel per ray for more fine details. The sparse local volume is then fused with a global one for online reconstruction. We further propose to predict TSDF in a coarse-to-fine manner by learning its residuals across scales leading to better TSDF predictions. Experimental results on benchmarks show that our method can achieve superior performance with more scene details. Code is available at: https://github.com/huiyu-gao/VisFusion

Feature Shrinkage Pyramid for Camouflaged Object Detection With Transformers
Huang, Zhou and Dai, Hang and Xiang, Tian-Zhu and Wang, Shuo and Chen, Huai-Xin and Qin, Jie and Xiong, Huan



Research question: Existing vision transformers for camouflaged object detection suffer from less effective locality modeling and insufficient feature aggregation in decoders.
Motivation: To address these issues, we propose a novel transformer-based Feature Shrinkage Pyramid Network (FSPNet) that hierarchically decodes locality-enhanced neighboring transformer features through progressive shrinking to improve camouflaged object detection.
Method: We design a non-local token enhancement module (NL-TEM) and a feature shrinkage decoder (FSD). The former employs the non-local mechanism to interact neighboring tokens and explore graph-based high-order relations within tokens to enhance the local representations of transformers; the latter progressively aggregates adjacent transformer features through a layer-by-layer shrinkage pyramid to accumulate as many imperceptible but effective cues as possible for object-information decoding.
Results: Extensive quantitative and qualitative experiments show that the proposed model significantly outperforms 24 existing competitors on three challenging COD benchmark datasets under six widely used evaluation metrics.

Vision transformers have recently shown strong global context modeling capabilities in camouflaged object detection. However, they suffer from two major limitations: less effective locality modeling and insufficient feature aggregation in decoders, which are not conducive to camouflaged object detection that explores subtle cues from indistinguishable backgrounds. To address these issues, in this paper, we propose a novel transformer-based Feature Shrinkage Pyramid Network (FSPNet), which aims to hierarchically decode locality-enhanced neighboring transformer features through progressive shrinking for camouflaged object detection. Specifically, we propose a non-local token enhancement module (NL-TEM) that employs the non-local mechanism to interact neighboring tokens and explore graph-based high-order relations within tokens to enhance local representations of transformers. Moreover, we design a feature shrinkage decoder (FSD) with adjacent interaction modules (AIM), which progressively aggregates adjacent transformer features through a layer-by-layer shrinkage pyramid to accumulate imperceptible but effective cues as much as possible for object information decoding. Extensive quantitative and qualitative experiments demonstrate that the proposed model significantly outperforms 24 existing competitors on three challenging COD benchmark datasets under six widely-used evaluation metrics. Our code is publicly available at https://github.com/ZhouHuang23/FSPNet.

MSF: Motion-Guided Sequential Fusion for Efficient 3D Object Detection From Point Cloud Sequences
He, Chenhang and Li, Ruihuang and Zhang, Yabin and Li, Shuai and Zhang, Lei



Research question: How to effectively exploit point cloud sequences to accurately detect 3D objects?
Motivation: Current multi-frame detectors incur substantial redundant computation when processing point cloud sequences, since adjacent frames are highly correlated.
Method: We propose a Motion-guided Sequential Fusion (MSF) method, which exploits the continuity of object motion to mine useful sequential contexts for detection in the current frame. We first generate 3D proposals on the current frame and propagate them to preceding frames based on the estimated velocities. Points of interest are then pooled from the sequence and encoded as proposal features. A novel Bidirectional Feature Aggregation (BiFA) module further facilitates the interaction of proposal features across frames. In addition, we optimize point cloud pooling with a voxel-based sampling technique so that millions of points can be processed within a few milliseconds.
Results: MSF is not only more efficient than other multi-frame detectors but also achieves leading accuracy, with 83.12% and 78.30% mAP on the LEVEL1 and LEVEL2 test sets of the Waymo Open Dataset, respectively.

Point cloud sequences are commonly used to accurately detect 3D objects in applications such as autonomous driving. Current top-performing multi-frame detectors mostly follow a Detect-and-Fuse framework, which extracts features from each frame of the sequence and fuses them to detect the objects in the current frame. However, this inevitably leads to redundant computation since adjacent frames are highly correlated. In this paper, we propose an efficient Motion-guided Sequential Fusion (MSF) method, which exploits the continuity of object motion to mine useful sequential contexts for object detection in the current frame. We first generate 3D proposals on the current frame and propagate them to preceding frames based on the estimated velocities. The points-of-interest are then pooled from the sequence and encoded as proposal features. A novel Bidirectional Feature Aggregation (BiFA) module is further proposed to facilitate the interactions of proposal features across frames. Besides, we optimize the point cloud pooling by a voxel-based sampling technique so that millions of points can be processed in several milliseconds. The proposed MSF method achieves not only better efficiency than other multi-frame detectors but also leading accuracy, with 83.12% and 78.30% mAP on the LEVEL1 and LEVEL2 test sets of Waymo Open Dataset, respectively. Codes can be found at https://github.com/skyhehe123/MSF.
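The proposal-propagation step can be sketched as constant-velocity backcasting of a current-frame box center to earlier frames; the 0.1 s frame interval and the numbers are illustrative assumptions, not values from the paper:

```python
def propagate_proposal(center, velocity, frame_offsets, dt=0.1):
    """Propagate a current-frame 3D proposal center to preceding frames
    using its estimated ground-plane velocity (constant-velocity assumption)."""
    cx, cy, cz = center
    vx, vy = velocity  # estimated velocity in m/s
    out = []
    for k in frame_offsets:  # k frames into the past
        t = k * dt
        out.append((cx - vx * t, cy - vy * t, cz))
    return out
```

Points of interest are then pooled inside these back-propagated boxes, which is what lets a single current-frame proposal gather context from the whole sequence.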

Kernel Aware Resampler
Bernasconi, Michael and Djelouah, Abdelaziz and Salehi, Farnood and Gross, Markus and Schroers, Christopher



Research question: This paper addresses open problems in deep-learning-based image super-resolution, such as handling fixed integer and non-integer scaling factors.
Motivation: Existing deep learning methods mainly target fixed integer scaling factors (e.g., x2 or x4) and do not yet provide a sound joint solution for non-integer scaling factors and blur-kernel modeling.
Method: We propose a framework for generic image resampling that not only addresses the above issues but also extends the set of possible transforms from upscaling to generic transformations. A key aspect is the faithful modeling of image warping and changes of the sampling rate during training data preparation, which allows a localized representation of the implicit image degradation that accounts for the reconstruction kernel, local geometric distortion, and the anti-aliasing kernel.
Results: Using this spatially variant degradation map as conditioning for our resampling model, the same model can handle both global transformations (such as upscaling or rotation) and locally varying ones (such as lens distortion or undistortion). We further achieve automatic estimation of the degradation map in the more complex blind image resampling setting. Experiments show that state-of-the-art results can be achieved by predicting kernels to apply to the input image instead of predicting colors directly, which makes our model applicable to data types unseen during training, such as normals.

Deep learning based methods for super-resolution have become state-of-the-art and outperform traditional approaches by a significant margin. From the initial models designed for fixed integer scaling factors (e.g. x2 or x4), efforts were made to explore different directions such as modeling blur kernels or addressing non-integer scaling factors. However, existing works do not provide a sound framework to handle them jointly. In this paper we propose a framework for generic image resampling that not only addresses all the above mentioned issues but extends the set of possible transforms from upscaling to generic transforms. A key aspect to unlock these capabilities is the faithful modeling of image warping and changes of the sampling rate during the training data preparation. This allows a localized representation of the implicit image degradation that takes into account the reconstruction kernel, the local geometric distortion and the anti-aliasing kernel. Using this spatially variant degradation map as conditioning for our resampling model, we can address with the same model both global transformations, such as upscaling or rotation, and locally varying transformations such as lens distortion or undistortion. Another important contribution is the automatic estimation of the degradation map in this more complex resampling setting (i.e. blind image resampling). Finally, we show that state-of-the-art results can be achieved by predicting kernels to apply on the input image instead of direct color prediction. This renders our model applicable for different types of data not seen during the training such as normals.

HypLiLoc: Towards Effective LiDAR Pose Regression With Hyperbolic Fusion
Wang, Sijie and Kang, Qiyu and She, Rui and Wang, Wei and Zhao, Kai and Song, Yang and Tay, WeePeng



Research question: How to improve the accuracy and efficiency of LiDAR relocalization.
Motivation: LiDAR relocalization plays a crucial role in robotics, autonomous driving, and computer vision, but traditional database-retrieval methods incur high computation and storage costs and can produce globally inaccurate pose estimates, while direct global pose regression is computationally efficient but less accurate.
Method: We propose HypLiLoc, which uses two branched backbones to extract 3D features and 2D projection features, respectively, and performs multi-modal feature fusion in both Euclidean and hyperbolic spaces to obtain more effective feature representations.
Results: Experimental results show that HypLiLoc achieves state-of-the-art performance on both outdoor and indoor datasets, and ablation studies on the framework design demonstrate the effectiveness of multi-modal feature extraction and multi-space embedding.

LiDAR relocalization plays a crucial role in many fields, including robotics, autonomous driving, and computer vision. LiDAR-based retrieval from a database typically incurs high computation storage costs and can lead to globally inaccurate pose estimations if the database is too sparse. On the other hand, pose regression methods take images or point clouds as inputs and directly regress global poses in an end-to-end manner. They do not perform database matching and are more computationally efficient than retrieval techniques. We propose HypLiLoc, a new model for LiDAR pose regression. We use two branched backbones to extract 3D features and 2D projection features, respectively. We consider multi-modal feature fusion in both Euclidean and hyperbolic spaces to obtain more effective feature representations. Experimental results indicate that HypLiLoc achieves state-of-the-art performance in both outdoor and indoor datasets. We also conduct extensive ablation studies on the framework design, which demonstrate the effectiveness of multi-modal feature extraction and multi-space embedding. Our code is released at: https://github.com/sijieaaa/HypLiLoc

Transformer-Based Unified Recognition of Two Hands Manipulating Objects
Cho, Hoseong and Kim, Chanwoo and Kim, Jihyeon and Lee, Seongyeong and Ismayilzada, Elkhan and Baek, Seungryul



Research question: How to better understand hand-object interactions from egocentric videos.
Motivation: Most current approaches combine convolutional neural network (CNN) features with temporal encoding via long short-term memory (LSTM) or graph convolution networks (GCN) to provide a unified understanding of two hands, an object, and their interactions, but there is room for improvement.
Method: We propose a Transformer-based unified framework that takes as input the whole image depicting two hands, an object, and their interactions, estimates three kinds of information from each frame (the poses of both hands, the pose of the object, and the object type), and then predicts the action class defined by the hand-object interaction from the entire video, based on the estimated information and a contact map encoding the interaction between the two hands and the object.
Results: Experiments on the H2O and FPHA benchmark datasets demonstrate the superiority of the method, achieving state-of-the-art accuracy; ablation studies further verify the effectiveness of each proposed module.

Understanding hand-object interactions from an egocentric video has received great attention recently. So far, most approaches are based on convolutional neural network (CNN) features combined with temporal encoding via long short-term memory (LSTM) or graph convolution networks (GCN) to provide a unified understanding of two hands, an object and their interactions. In this paper, we propose a Transformer-based unified framework that provides a better understanding of two hands manipulating objects. In our framework, we feed the whole image depicting two hands, an object and their interactions as input and jointly estimate three types of information from each frame: the poses of two hands, the pose of the object and the object type. Afterwards, the action class defined by the hand-object interactions is predicted from the entire video based on the estimated information combined with a contact map that encodes the interaction between the two hands and the object. Experiments are conducted on the H2O and FPHA benchmark datasets and demonstrate the superiority of our method, achieving state-of-the-art accuracy. Ablation studies further demonstrate the effectiveness of each proposed module.

Efficient Map Sparsification Based on 2D and 3D Discretized Grids
Zhang, Xiaoyu and Liu, Yun-Hui



Research question: How to efficiently perform map sparsification and localization for robot autonomous navigation.
Motivation: As maps grow larger, traditional map sparsification methods require higher memory capacity and heavier computation, and they do not consider how differing spatial distributions between the mapping and query sequences affect localization performance.
Method: We propose an efficient linear formulation of map sparsification that selects uniformly distributed landmarks based on 2D discretized grids, and introduce a space-constraint term based on 3D discretized grids to reduce the influence of differing spatial distributions.
Results: Experiments demonstrate that the method outperforms existing approaches in both efficiency and localization performance. The code will be released at https://github.com/fishmarch/SLAM_Map_Compression.

Localization in a pre-built map is a basic technique for robot autonomous navigation. Existing mapping and localization methods commonly work well in small-scale environments. As a map grows larger, however, more memory is required and localization becomes inefficient. To solve these problems, map sparsification becomes a practical necessity to acquire a subset of the original map for localization. Previous map sparsification methods add a quadratic term in mixed-integer programming to enforce a uniform distribution of selected landmarks, which requires high memory capacity and heavy computation. In this paper, we formulate map sparsification in an efficient linear form and select uniformly distributed landmarks based on 2D discretized grids. Furthermore, to reduce the influence of different spatial distributions between the mapping and query sequences, which is not considered in previous methods, we also introduce a space constraint term based on 3D discretized grids. The exhaustive experiments in different datasets demonstrate the superiority of the proposed methods in both efficiency and localization performance. The relevant codes will be released at https://github.com/fishmarch/SLAM_Map_Compression.
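The grid-based selection idea can be sketched as follows: discretize the map plane and keep at most one landmark per cell, so the surviving subset is uniformly distributed. The keep-the-most-observed tie-break here is our own assumption for illustration; the paper instead formulates selection as an efficient linear mixed-integer program:

```python
def sparsify_landmarks(landmarks, cell=5.0):
    """Keep at most one landmark per 2D grid cell (here: the most-observed one),
    yielding a uniformly distributed subset of the map.

    Each landmark is (x, y, n_observations, landmark_id)."""
    best = {}
    for (x, y, n_obs, idx) in landmarks:
        key = (int(x // cell), int(y // cell))  # 2D discretized grid cell
        if key not in best or n_obs > best[key][2]:
            best[key] = (x, y, n_obs, idx)
    return [v[3] for v in best.values()]
```

The 3D space-constraint term in the paper plays an analogous role with 3D grid cells, penalizing cells left empty by the selection.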

Generalizable Implicit Neural Representations via Instance Pattern Composers
Kim, Chiheon and Lee, Doyup and Kim, Saehoon and Cho, Minsu and Han, Wook-Shin



Research question: How to make the common representations learned by a coordinate-based multi-layer perceptron (MLP) generalize to unseen data instances.
Motivation: Despite recent advances in implicit neural representations (INRs), it remains challenging for a coordinate-based MLP to learn a common representation across data instances and generalize it to unseen ones.
Method: We propose a simple yet effective framework that enables a coordinate-based MLP to represent complex data instances by modulating only a small set of weights in an early MLP layer as an instance pattern composer; the remaining MLP weights learn pattern composition rules to build common representations across instances.
Results: Extensive experiments show that our method achieves high performance on a wide range of domains, such as audio, images, and 3D objects, while ablation studies validate our weight modulation.

Despite recent advances in implicit neural representations (INRs), it remains challenging for a coordinate-based multi-layer perceptron (MLP) of INRs to learn a common representation across data instances and generalize it for unseen instances. In this work, we introduce a simple yet effective framework for generalizable INRs that enables a coordinate-based MLP to represent complex data instances by modulating only a small set of weights in an early MLP layer as an instance pattern composer; the remaining MLP weights learn pattern composition rules to learn common representations across instances. Our generalizable INR framework is fully compatible with existing meta-learning and hypernetworks in learning to predict the modulated weight for unseen instances. Extensive experiments demonstrate that our method achieves high performance on a wide range of domains such as an audio, image, and 3D object, while the ablation study validates our weight modulation.

3D Registration With Maximal Cliques
Zhang, Xiyu and Yang, Jiaqi and Zhang, Shikun and Zhang, Yanning



Research question: This paper addresses a fundamental problem in computer vision, 3D point cloud registration (PCR): seeking the optimal pose to align a point cloud pair.
Motivation: Current 3D point cloud registration methods suffer from limited accuracy, motivating a new method to improve registration precision.
Method: We present a 3D registration method based on maximal cliques (MAC). It first constructs a compatibility graph to render the affinity relationships between initial correspondences, then searches for maximal cliques in the graph, each of which represents a consensus set. Transformation hypotheses are computed for the selected cliques by SVD, and the best hypothesis is used to perform registration.
Results: Extensive experiments on U3M, 3DMatch, 3DLoMatch, and KITTI show that MAC effectively increases registration accuracy, outperforms various state-of-the-art methods, and boosts the performance of deep-learned methods. Combined with deep-learned methods, MAC achieves a state-of-the-art registration recall of 95.7% / 78.9% on 3DMatch / 3DLoMatch.

As a fundamental problem in computer vision, 3D point cloud registration (PCR) aims to seek the optimal pose to align a point cloud pair. In this paper, we present a 3D registration method with maximal cliques (MAC). The key insight is to loosen the previous maximum clique constraint, and to mine more local consensus information in a graph for accurate pose hypotheses generation: 1) A compatibility graph is constructed to render the affinity relationship between initial correspondences. 2) We search for maximal cliques in the graph, each of which represents a consensus set. We perform node-guided clique selection then, where each node corresponds to the maximal clique with the greatest graph weight. 3) Transformation hypotheses are computed for the selected cliques by SVD algorithm and the best hypothesis is used to perform registration. Extensive experiments on U3M, 3DMatch, 3DLoMatch and KITTI demonstrate that MAC effectively increases registration accuracy, outperforms various state-of-the-art methods and boosts the performance of deep-learned methods. MAC combined with deep-learned methods achieves state-of-the-art registration recall of 95.7% / 78.9% on the 3DMatch / 3DLoMatch.
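The clique-search step can be illustrated with a plain Bron-Kerbosch enumeration (without the node-guided clique selection that MAC itself performs); each maximal clique is a candidate consensus set of mutually compatible correspondences, from which an SVD pose hypothesis would then be computed:

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques of a compatibility graph with the basic
    Bron-Kerbosch algorithm. adj maps each node to the set of its neighbours."""
    cliques = []

    def expand(r, p, x):
        # r: current clique; p: candidates to extend it; x: already-explored nodes.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)

    expand(set(), set(adj), set())
    return cliques
```

Loosening the single maximum-clique constraint to all maximal cliques is exactly what lets MAC mine more local consensus information for pose hypothesis generation.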

Efficient RGB-T Tracking via Cross-Modality Distillation
Zhang, Tianlu and Guo, Hongyuan and Jiao, Qiang and Zhang, Qiang and Han, Jungong



Research question: Most current RGB-T trackers adopt a two-stream structure to extract unimodal RGB and thermal features and complex fusion strategies to fuse them, which requires a huge number of parameters and hinders real-life application.
Motivation: To remedy this, a cross-modality distillation framework is presented to bridge the performance gap between a compact tracker and a powerful tracker.
Method: Specifically, a specific-common feature distillation module is proposed to transfer both modality-common and modality-specific information from a deeper two-stream network to a shallower single-stream network. In addition, a multi-path selection distillation module instructs a simple fusion module to learn more accurate multi-modal information from a well-designed fusion mechanism via multiple paths.
Results: Extensive experiments on three RGB-T benchmarks show that the method achieves state-of-the-art performance while consuming far fewer computational resources.

Most current RGB-T trackers adopt a two-stream structure to extract unimodal RGB and thermal features and complex fusion strategies to achieve multi-modal feature fusion, which require a huge number of parameters, thus hindering their real-life applications. On the other hand, a compact RGB-T tracker may be computationally efficient but encounter non-negligible performance degradation, due to the weakening of feature representation ability. To remedy this situation, a cross-modality distillation framework is presented to bridge the performance gap between a compact tracker and a powerful tracker. Specifically, a specific-common feature distillation module is proposed to transform the modality-common information as well as the modality-specific information from a deeper two-stream network to a shallower single-stream network. In addition, a multi-path selection distillation module is proposed to instruct a simple fusion module to learn more accurate multi-modal information from a well-designed fusion mechanism by using multiple paths. We validate the effectiveness of our method with extensive experiments on three RGB-T benchmarks, which achieves state-of-the-art performance but consumes much less computational resources.

VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking
Chen, Yukang and Liu, Jianhui and Zhang, Xiangyu and Qi, Xiaojuan and Jia, Jiaya



Research question: This paper tackles the reliance of 3D object detectors on hand-crafted proxies (e.g., anchors or centers) and on 2D frameworks translated to 3D, which require densifying sparse voxel features and processing them with dense prediction heads at extra computational cost.
Motivation: We propose VoxelNeXt, a new approach that predicts objects directly from sparse voxel features without relying on hand-crafted proxies.
Method: Our core idea is to detect and track objects directly from sparse voxel features, without hand-crafted proxies. Our strong sparse convolutional network VoxelNeXt detects and tracks 3D objects entirely through voxel features, in an elegant and efficient framework that needs no sparse-to-dense conversion or NMS post-processing.
Results: Our method achieves a better speed-accuracy trade-off than mainstream detectors on the nuScenes dataset. For the first time, we show that a fully sparse voxel-based representation works well for LiDAR 3D object detection and tracking. Extensive experiments on the nuScenes, Waymo, and Argoverse2 benchmarks validate the effectiveness of our approach.

3D object detectors usually rely on hand-crafted proxies, e.g., anchors or centers, and translate well-studied 2D frameworks to 3D. Thus, sparse voxel features need to be densified and processed by dense prediction heads, which inevitably costs extra computation. In this paper, we instead propose VoxelNeXt for fully sparse 3D object detection. Our core insight is to predict objects directly based on sparse voxel features, without relying on hand-crafted proxies. Our strong sparse convolutional network VoxelNeXt detects and tracks 3D objects through voxel features entirely. It is an elegant and efficient framework, with no need for sparse-to-dense conversion or NMS post-processing. Our method achieves a better speed-accuracy trade-off than other mainstream detectors on the nuScenes dataset. For the first time, we show that a fully sparse voxel-based representation works decently for LIDAR 3D object detection and tracking. Extensive experiments on nuScenes, Waymo, and Argoverse2 benchmarks validate the effectiveness of our approach. Without bells and whistles, our model outperforms all existing LIDAR methods on the nuScenes tracking test benchmark.

AttentionShift: Iteratively Estimated Part-Based Attention Map for Pointly Supervised Instance Segmentation
Liao, Mingxiang and Guo, Zonghao and Wang, Yuze and Yuan, Peng and Feng, Bailan and Wan, Fang



Research question: How to resolve the semantic bias and false segmentation in instance segmentation supervised by a single point.
Motivation: Existing pointly supervised instance segmentation methods use only a single point per object as supervision; the non-negligible semantic variance between object parts then causes semantic bias and false segmentation.
Method: We propose AttentionShift, which iteratively decomposes the instance attention map into parts and estimates fine-grained semantics for each part. It consists of two steps: (i) token querying for pointly supervised attention map generation, and (ii) key-point shift, which re-estimates part-based attention maps by key-point filtering in the feature space. The two steps are performed iteratively so that the part-based attention maps are optimized both spatially and in the feature space to cover the full object extent.
Results: Experiments on the PASCAL VOC and MS COCO 2017 datasets show that AttentionShift improves the state-of-the-art by 7.7% and 4.8% mAP@0.5, respectively, setting a solid baseline for pointly supervised instance segmentation with vision transformers.

Pointly supervised instance segmentation (PSIS) learns to segment objects using a single point within the object extent as supervision. Challenged by the non-negligible semantic variance between object parts, however, the single supervision point causes semantic bias and false segmentation. In this study, we propose an AttentionShift method, to solve the semantic bias issue by iteratively decomposing the instance attention map to parts and estimating fine-grained semantics of each part. AttentionShift consists of two modules plugged on the vision transformer backbone: (i) token querying for pointly supervised attention map generation, and (ii) key-point shift, which re-estimates part-based attention maps by key-point filtering in the feature space. These two steps are iteratively performed so that the part-based attention maps are optimized spatially as well as in the feature space to cover full object extent. Experiments on PASCAL VOC and MS COCO 2017 datasets show that AttentionShift respectively improves the state-of-the-art by 7.7% and 4.8% under mAP@0.5, setting a solid PSIS baseline using vision transformer. Code is enclosed in the supplementary material.

Spatial-Frequency Mutual Learning for Face Super-Resolution
Wang, Chenyang and Jiang, Junjun and Zhong, Zhiwei and Liu, Xianming



Research question: This paper addresses face super-resolution (FSR): reconstructing high-resolution (HR) face images from low-resolution (LR) ones.
Motivation: Although deep learning has brought major breakthroughs to FSR, existing methods either have a fixed receptive field or fail to maintain facial structure, limiting FSR performance.
Method: We propose a Fourier-transform-based spatial-frequency mutual network (SFMNet), which, to our knowledge, is the first FSR method to explore the correlations between the spatial and frequency domains. SFMNet is a two-branch network with a spatial branch and a frequency branch: the frequency branch uses the Fourier transform to achieve an image-size receptive field and capture global dependency, while the spatial branch extracts local dependency. Since the two kinds of dependency are complementary and both benefit FSR, we further develop a frequency-spatial interaction block (FSIB) that mutually amalgamates the complementary spatial and frequency information to enhance the model's capability.
Results: Quantitative and qualitative experimental results show that the method outperforms state-of-the-art FSR methods in recovering face images.

Face super-resolution (FSR) aims to reconstruct high-resolution (HR) face images from the low-resolution (LR) ones. With the advent of deep learning, the FSR technique has achieved significant breakthroughs. However, existing FSR methods either have a fixed receptive field or fail to maintain facial structure, limiting the FSR performance. To circumvent this problem, Fourier transform is introduced, which can capture global facial structure information and achieve image-size receptive field. Relying on the Fourier transform, we devise a spatial-frequency mutual network (SFMNet) for FSR, which is the first FSR method to explore the correlations between spatial and frequency domains as far as we know. To be specific, our SFMNet is a two-branch network equipped with a spatial branch and a frequency branch. Benefiting from the property of Fourier transform, the frequency branch can achieve image-size receptive field and capture global dependency while the spatial branch can extract local dependency. Considering that these dependencies are complementary and both favorable for FSR, we further develop a frequency-spatial interaction block (FSIB) which mutually amalgamates the complementary spatial and frequency information to enhance the capability of the model. Quantitative and qualitative experimental results show that the proposed method outperforms state-of-the-art FSR methods in recovering face images. The implementation and model will be released at https://github.com/wcy-cs/SFMNet.

Efficient Frequency Domain-Based Transformers for High-Quality Image Deblurring
Kong, Lingshun and Dong, Jiangxin and Ge, Jianjun and Li, Mingqiang and Pan, Jinshan



Research question: How to effectively exploit Transformers in the frequency domain for high-quality image deblurring.
Motivation: Inspired by the convolution theorem: the correlation or convolution of two signals in the spatial domain is equivalent to their element-wise product in the frequency domain.
Method: We develop an efficient frequency domain-based self-attention solver (FSAS) that estimates scaled dot-product attention by an element-wise product operation instead of matrix multiplication in the spatial domain. We also propose a simple yet effective discriminative frequency domain-based feed-forward network (DFFN), which introduces a gated mechanism based on the JPEG compression algorithm to discriminatively determine which low- and high-frequency information should be preserved for latent clear image restoration.
Results: Experimental results show that the method performs favorably against state-of-the-art approaches.

We present an effective and efficient method that explores the properties of Transformers in the frequency domain for high-quality image deblurring. Our method is motivated by the convolution theorem that the correlation or convolution of two signals in the spatial domain is equivalent to an element-wise product of them in the frequency domain. This inspires us to develop an efficient frequency domain-based self-attention solver (FSAS) to estimate the scaled dot-product attention by an element-wise product operation instead of the matrix multiplication in the spatial domain. In addition, we note that simply using the naive feed-forward network (FFN) in Transformers does not generate good deblurred results. To overcome this problem, we propose a simple yet effective discriminative frequency domain-based FFN (DFFN), where we introduce a gated mechanism in the FFN based on the Joint Photographic Experts Group (JPEG) compression algorithm to discriminatively determine which low- and high-frequency information of the features should be preserved for latent clear image restoration. We formulate the proposed FSAS and DFFN into an asymmetrical network based on an encoder and decoder architecture, where the FSAS is only used in the decoder module for better image deblurring. Experimental results show that the proposed method performs favorably against the state-of-the-art approaches.
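The convolution theorem that motivates FSAS can be checked numerically with a naive 1D DFT; this is a from-scratch demonstration of the identity only, while FSAS applies the idea to 2D feature maps with fast FFTs:

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real/complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def circular_conv(x, h):
    """Direct circular convolution in the spatial domain: O(n^2) products."""
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]

def conv_via_freq(x, h):
    """The same result via an element-wise product in the frequency domain."""
    X, H = dft(x), dft(h)
    return [c.real for c in idft([a * b for a, b in zip(X, H)])]
```

Replacing the quadratic spatial-domain product with this element-wise frequency-domain product is what makes the attention estimate in FSAS cheap.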

B-Spline Texture Coefficients Estimator for Screen Content Image Super-Resolution
Pak, Byeonghyun and Lee, Jaewon and Jin, KyongHwan



Research question: How to properly handle the edges and textures of screen content images (SCIs) so as to minimize content distortion when the display resolution differs from that of the SCIs.
Motivation: SCIs contain many informative components, such as text and graphics, so their pixel distribution differs from that of natural images, requiring proper handling of edges and textures.
Method: We propose an implicit neural representation using B-splines for arbitrary-scale screen content image super-resolution (SCI SR). The method extracts the scaling, translating, and smoothing parameters of B-splines, and a multi-layer perceptron (MLP) then uses the estimated B-splines to recover the high-resolution SCI.
Results: Thanks to the positive constraint and compact support of the B-spline basis, the method outperforms both a transformer-based reconstruction method and an implicit Fourier representation method at almost every upscaling factor. Moreover, our SR results are recognized as the correct text letters with the highest confidence by a pre-trained scene text recognition network.

Screen content images (SCIs) include many informative components, e.g., texts and graphics. Such content creates sharp edges or homogeneous areas, making the pixel distribution of SCIs different from natural images. Therefore, we need to properly handle the edges and textures to minimize information distortion of the contents when a display device's resolution differs from SCIs. To achieve this goal, we propose an implicit neural representation using B-splines for screen content image super-resolution (SCI SR) with arbitrary scales. Our method extracts scaling, translating, and smoothing parameters of B-splines. A subsequent multi-layer perceptron (MLP) uses the estimated B-splines to recover high-resolution SCI. Our network outperforms both a transformer-based reconstruction method and an implicit Fourier representation method at almost every upscaling factor, thanks to the positive constraint and compact support of the B-spline basis. Moreover, our SR results are recognized as correct text letters with the highest confidence by a pre-trained scene text recognition network. Source code is available at https://github.com/ByeongHyunPak/btc.

Towards End-to-End Generative Modeling of Long Videos With Memory-Efficient Bidirectional Transformers
Yoo, Jaehoon and Kim, Semin and Lee, Doyup and Kim, Chiheon and Hong, Seunghoon



Research question: How to improve the efficiency of learning long-term dependency and the inference speed in video generation.
Motivation: Autoregressive transformers perform well in video generation, but the quadratic complexity of self-attention prevents them from directly learning long-term dependency in videos, and the autoregressive process makes inference slow and prone to error propagation.
Method: We propose a Memory-efficient Bidirectional Transformer (MeBT) that projects observable context tokens into a fixed number of latent tokens and decodes the masked tokens via cross-attention, enabling parallel decoding of a video's entire spatio-temporal volume from partially observed patches.
Results: The method achieves linear time complexity in both encoding and decoding, and significantly outperforms autoregressive transformers in both quality and speed when generating moderately long videos.

Autoregressive transformers have shown remarkable success in video generation. However, these transformers are prohibited from directly learning the long-term dependency in videos due to the quadratic complexity of self-attention, and they inherently suffer from slow inference time and error propagation due to the autoregressive process. In this paper, we propose the Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of long-term dependency in videos and fast inference. Based on recent advances in bidirectional transformers, our method learns to decode the entire spatio-temporal volume of a video in parallel from partially observed patches. The proposed transformer achieves a linear time complexity in both encoding and decoding, by projecting observable context tokens into a fixed number of latent tokens and conditioning on them to decode the masked tokens through cross-attention. Empowered by linear complexity and bidirectional modeling, our method demonstrates significant improvement over the autoregressive transformers for generating moderately long videos in both quality and speed.

Masked Representation Learning for Domain Generalized Stereo Matching
Rao, Zhibo and Xiong, Bangshu and He, Mingyi and Dai, Yuchao and He, Renjie and Shen, Zhelun and Li, Xing



Research question: Existing deep stereo matching methods have achieved impressive cross-domain performance but suffer from significant volatility of generalization performance across training epochs.
Motivation: Inspired by masked representation learning and multi-task learning, we design a simple and effective masked representation for domain-generalized stereo matching.
Method: First, the masked left image and the complete right image are fed into the model. Then, a lightweight and simple decoder is added after the feature extraction module to recover the original left image. Finally, the model is trained on two tasks (stereo matching and image reconstruction) as a pseudo-multi-task learning framework, encouraging it to learn structural information and improving generalization performance.
Results: Experimental results show that: (1) the method can easily be plugged into various current stereo matching models to improve generalization; (2) it reduces the volatility of generalization performance across training epochs; (3) we find that current methods tend to report the best result among different training epochs as generalization performance, yet in practice the best epoch cannot be selected via ground truth.

Recently, many deep stereo matching methods have begun to focus on cross-domain performance, achieving impressive achievements. However, these methods did not deal with the significant volatility of generalization performance among different training epochs. Inspired by masked representation learning and multi-task learning, this paper designs a simple and effective masked representation for domain generalized stereo matching. First, we feed the masked left and complete right images as input into the models. Then, we add a lightweight and simple decoder following the feature extraction module to recover the original left image. Finally, we train the models with two tasks (stereo matching and image reconstruction) as a pseudo-multi-task learning framework, promoting models to learn structure information and to improve generalization performance. We implement our method on two well-known architectures (CFNet and LacGwcNet) to demonstrate its effectiveness. Experimental results on multi-datasets show that: (1) our method can be easily plugged into the current various stereo matching models to improve generalization performance; (2) our method can reduce the significant volatility of generalization performance among different training epochs; (3) we find that the current methods prefer to choose the best results among different training epochs as generalization performance, but it is impossible to select the best performance by ground truth in practice.

EqMotion: Equivariant Multi-Agent Motion Prediction With Invariant Interaction Reasoning
Xu, Chenxin and Tan, Robby T. and Tan, Yuhong and Chen, Siheng and Wang, Yu Guang and Wang, Xinchao and Wang, Yanfeng



Research question: How to predict agent motions with relational reasoning while guaranteeing motion equivariance under Euclidean geometric transformations and invariance of agent interactions.
Motivation: Existing methods overlook the equivariance and invariance properties, which are critical for many applications.
Method: We propose EqMotion, which achieves motion equivariance by learning Euclidean-transformable features through dedicated equivariant operations; an invariant interaction reasoning module yields more stable interaction modeling; and learned invariant pattern features further enhance network expressiveness.
Results: Experiments on four scenarios (particle dynamics, molecular dynamics, human skeleton motion prediction, and pedestrian trajectory prediction) show the method is broadly applicable and achieves state-of-the-art prediction performance on all tasks, with improvements of 24.0/30.1/8.6/9.2%.

Learning to predict agent motions with relationship reasoning is important for many applications. In motion prediction tasks, maintaining motion equivariance under Euclidean geometric transformations and invariance of agent interaction is a critical and fundamental principle. However, such equivariance and invariance properties are overlooked by most existing methods. To fill this gap, we propose EqMotion, an efficient equivariant motion prediction model with invariant interaction reasoning. To achieve motion equivariance, we propose an equivariant geometric feature learning module to learn a Euclidean transformable feature through dedicated designs of equivariant operations. To reason agent's interactions, we propose an invariant interaction reasoning module to achieve a more stable interaction modeling. To further promote more comprehensive motion features, we propose an invariant pattern feature learning module to learn an invariant pattern feature, which cooperates with the equivariant geometric feature to enhance network expressiveness. We conduct experiments for the proposed model on four distinct scenarios: particle dynamics, molecule dynamics, human skeleton motion prediction and pedestrian trajectory prediction. Experimental results show that our method is not only generally applicable, but also achieves state-of-the-art prediction performances on all the four tasks, improving by 24.0/30.1/8.6/9.2%. Code is available at https://github.com/MediaBrain-SJTU/EqMotion.
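The rotation-equivariance property the abstract relies on can be illustrated with a toy example: any layer that linearly mixes agents' coordinate vectors without mixing the spatial axes commutes with rotations. A minimal numpy check follows; the layer and shapes are hypothetical, not EqMotion's actual equivariant operations, and translation equivariance (handled in the paper) is omitted.

```python
import numpy as np

def equivariant_layer(X, W):
    """A linear layer mixing agents but not spatial axes: since W acts on the
    left and a rotation R acts on the right, f(X R^T) = f(X) R^T holds."""
    return W @ X    # X: (n_agents, 3) coordinates, W: (m, n_agents) weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                                 # agent coordinates
W = rng.normal(size=(4, 4))                                 # learned mixing weights
R = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])   # 90 deg rotation about z

# rotating the input then applying the layer equals applying then rotating
same = np.allclose(equivariant_layer(X @ R.T, W), equivariant_layer(X, W) @ R.T)
print(same)  # True
```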

FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation
Shi, Xiaoyu and Huang, Zhaoyang and Li, Dasong and Zhang, Manyuan and Cheung, Ka Chun and See, Simon and Qin, Hongwei and Dai, Jifeng and Li, Hongsheng



Research question: How to apply a Transformer architecture to optical flow estimation and achieve state-of-the-art performance.
Motivation: Inspired by the recent success of masked autoencoding (MAE) pretraining in unlocking transformers' capacity for encoding visual representations, we propose Masked Cost Volume Autoencoding (MCVA) to enhance FlowFormer by pretraining its cost-volume encoder with a novel MAE scheme.
Method: First, a block-sharing masking strategy prevents masked-information leakage, since the cost maps of neighboring source pixels are highly correlated. Second, a novel pre-text reconstruction task encourages the cost-volume encoder to aggregate long-range information and ensures pretraining-finetuning consistency. We also show how to modify the FlowFormer architecture to accommodate masks during pretraining.
Results: Pretrained with MCVA, our FlowFormer++ ranks first on both the Sintel and KITTI-2015 benchmarks. Specifically, it achieves 1.07 and 1.94 average end-point error (AEPE) on the clean and final passes of Sintel, reducing error by 7.76% and 7.18% relative to FlowFormer, and obtains 4.52 F1-all on the KITTI-2015 test set, improving over FlowFormer by 0.16.

FlowFormer introduces a transformer architecture into optical flow estimation and achieves state-of-the-art performance. The core component of FlowFormer is the transformer-based cost-volume encoder. Inspired by recent success of masked autoencoding (MAE) pretraining in unleashing transformers' capacity of encoding visual representation, we propose Masked Cost Volume Autoencoding (MCVA) to enhance FlowFormer by pretraining the cost-volume encoder with a novel MAE scheme. Firstly, we introduce a block-sharing masking strategy to prevent masked information leakage, as the cost maps of neighboring source pixels are highly correlated. Secondly, we propose a novel pre-text reconstruction task, which encourages the cost-volume encoder to aggregate long-range information and ensures pretraining-finetuning consistency. We also show how to modify the FlowFormer architecture to accommodate masks during pretraining. Pretrained with MCVA, our proposed FlowFormer++ ranks 1st among published methods on both Sintel and KITTI-2015 benchmarks. Specifically, FlowFormer++ achieves 1.07 and 1.94 average end-point-error (AEPE) on the clean and final pass of Sintel benchmark, leading to 7.76% and 7.18% error reductions from FlowFormer. FlowFormer++ obtains 4.52 F1-all on the KITTI-2015 test set, improving FlowFormer by 0.16.

STMixer: A One-Stage Sparse Action Detector
Wu, Tao and Cao, Mengqi and Gao, Ziteng and Wu, Gangshan and Wang, Limin



Research question: Traditional video action detectors adopt a two-stage pipeline that requires multi-stage training and inference and cannot capture context information outside the bounding box.
Motivation: To overcome these limitations, this paper proposes STMixer, a novel one-stage sparse action detector.
Method: STMixer rests on two core designs. First, a query-based adaptive feature sampling module lets STMixer mine a set of discriminative features from the entire spatiotemporal domain. Second, a dual-branch feature mixing module lets STMixer dynamically attend to and mix video features along the spatial and temporal dimensions separately for better feature decoding.
Results: Coupling these two designs with a video backbone yields an efficient and accurate action detector. Without extra tricks, STMixer achieves state-of-the-art results on the AVA, UCF101-24, and JHMDB datasets.

Traditional video action detectors typically adopt the two-stage pipeline, where a person detector is first employed to yield actor boxes and then 3D RoIAlign is used to extract actor-specific features for classification. This detection paradigm requires multi-stage training and inference and cannot capture context information outside the bounding box. Recently, a few query-based action detectors are proposed to predict action instances in an end-to-end manner. However, they still lack adaptability in feature sampling or decoding, thus suffering from the issue of inferior performance or slower convergence. In this paper, we propose a new one-stage sparse action detector, termed STMixer. STMixer is based on two core designs. First, we present a query-based adaptive feature sampling module, which endows our STMixer with the flexibility of mining a set of discriminative features from the entire spatiotemporal domain. Second, we devise a dual-branch feature mixing module, which allows our STMixer to dynamically attend to and mix video features along the spatial and the temporal dimension respectively for better feature decoding. Coupling these two designs with a video backbone yields an efficient and accurate action detector. Without bells and whistles, STMixer obtains the state-of-the-art results on the datasets of AVA, UCF101-24, and JHMDB.

HRDFuse: Monocular 360° Depth Estimation by Collaboratively Learning Holistic-With-Regional Depth Distributions
Ai, Hao and Cao, Zidong and Cao, Yan-Pei and Shan, Ying and Wang, Lin



Research question: Monocular 360° depth estimation is a burgeoning problem owing to its holistic sensing of a scene.
Motivation: Existing methods such as OmniFusion represent a 360° image with tangent projections (TP), predict depth via patch-wise regression, and merge the patch predictions into a depth map in equirectangular projection (ERP) format. These methods suffer from 1) the non-trivial process of merging a large number of patches, and 2) capturing less holistic-with-regional contextual information by directly regressing per-pixel depth values.
Method: We propose HRDFuse, a novel framework that subtly combines the potential of convolutional neural networks (CNNs) and transformers by collaboratively learning holistic contextual information from the ERP and regional structural information from the TP. First, a spatial feature alignment (SFA) module learns feature similarities between TP and ERP to aggregate TP features into a complete ERP feature map in a pixel-wise manner. Second, a collaborative depth distribution classification (CDDC) module learns holistic-with-regional histograms capturing the ERP and TP depth distributions, so the final depth values can be predicted as a linear combination of histogram bin centers. Finally, the depth predictions from ERP and TP are adaptively combined into the final depth map.
Results: Extensive experiments show that our method predicts smoother and more accurate depth while achieving favorably better results than state-of-the-art methods.

Depth estimation from a monocular 360 image is a burgeoning problem owing to its holistic sensing of a scene. Recently, some methods, e.g., OmniFusion, have applied the tangent projection (TP) to represent a 360 image and predicted depth values via patch-wise regressions, which are merged to get a depth map with equirectangular projection (ERP) format. However, these methods suffer from 1) non-trivial process of merging a large number of patches; 2) capturing less holistic-with-regional contextual information by directly regressing the depth value of each pixel. In this paper, we propose a novel framework, HRDFuse, that subtly combines the potential of convolutional neural networks (CNNs) and transformers by collaboratively learning the holistic contextual information from the ERP and the regional structural information from the TP. Firstly, we propose a spatial feature alignment (SFA) module that learns feature similarities between the TP and ERP to aggregate the TP features into a complete ERP feature map in a pixel-wise manner. Secondly, we propose a collaborative depth distribution classification (CDDC) module that learns the holistic-with-regional histograms capturing the ERP and TP depth distributions. As such, the final depth values can be predicted as a linear combination of histogram bin centers. Lastly, we adaptively combine the depth predictions from ERP and TP to obtain the final depth map. Extensive experiments show that our method predicts more smooth and accurate depth results while achieving favorably better results than the SOTA methods.
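The final step of the CDDC module, predicting depth as a linear combination of histogram bin centers, is the standard classification-style depth formulation and can be sketched directly; the bin range and resolution below are illustrative assumptions, not the paper's values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def depth_from_bins(logits, bin_centers):
    """Classification-style depth: per-pixel probabilities over depth bins,
    final depth is the probability-weighted sum of the bin centers."""
    probs = softmax(logits, axis=-1)   # (H, W, n_bins)
    return probs @ bin_centers         # (H, W)

bin_centers = np.linspace(0.5, 10.0, 64)   # hypothetical depth range in meters
logits = np.zeros((4, 4, 64))              # uniform logits -> mean of the centers
depth = depth_from_bins(logits, bin_centers)
print(depth[0, 0])  # 5.25 (mean of the bin centers)
```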

PATS: Patch Area Transportation With Subdivision for Local Feature Matching
Ni, Junjie and Li, Yijin and Huang, Zhaoyang and Li, Hongsheng and Bao, Hujun and Cui, Zhaopeng and Zhang, Guofeng



Research question: How to establish sparse correspondences between an image pair, especially under large scale differences.
Motivation: Existing detector-free methods perform poorly on image pairs with large scale differences.
Method: We propose Patch Area Transportation with Subdivision (PATS). The original image pair is split into equal-sized patches, which are gradually resized and subdivided into smaller patches of the same scale; patch area transportation is learned to estimate the scale differences between these patches.
Results: PATS improves both matching accuracy and coverage, and shows superior performance on downstream tasks such as relative pose estimation, visual localization, and optical flow estimation.

Local feature matching aims at establishing sparse correspondences between a pair of images. Recently, detector-free methods present generally better performance but are not satisfactory in image pairs with large scale differences. In this paper, we propose Patch Area Transportation with Subdivision (PATS) to tackle this issue. Instead of building an expensive image pyramid, we start by splitting the original image pair into equal-sized patches and gradually resizing and subdividing them into smaller patches with the same scale. However, estimating scale differences between these patches is non-trivial since the scale differences are determined by both relative camera poses and scene structures, and thus spatially varying over image pairs. Moreover, it is hard to obtain the ground truth for real scenes. To this end, we propose patch area transportation, which enables learning scale differences in a self-supervised manner. In contrast to bipartite graph matching, which only handles one-to-one matching, our patch area transportation can deal with many-to-many relationships. PATS improves both matching accuracy and coverage, and shows superior performance in downstream tasks, such as relative pose estimation, visual localization, and optical flow estimation. The source code will be released to benefit the community.

G-MSM: Unsupervised Multi-Shape Matching With Graph-Based Affinity Priors
Eisenberger, Marvin and Toker, Aysim and Leal-Taixé, Laura and Cremers, Daniel



Research question: This paper proposes G-MSM, a novel unsupervised learning approach for non-rigid shape correspondence.
Motivation: Existing methods treat the input pose collection as an unordered set of samples without explicitly modeling the underlying shape data manifold.
Method: An adaptive multi-shape matching architecture constructs an affinity graph on a given set of training shapes in a self-supervised manner. The key idea is to combine putative pairwise correspondences by propagating maps along shortest paths in the underlying shape graph.
Results: State-of-the-art performance is demonstrated on several recent shape correspondence benchmarks, including real-world 3D scan meshes with topological noise and challenging inter-class pairs.

We present G-MSM (Graph-based Multi-Shape Matching), a novel unsupervised learning approach for non-rigid shape correspondence. Rather than treating a collection of input poses as an unordered set of samples, we explicitly model the underlying shape data manifold. To this end, we propose an adaptive multi-shape matching architecture that constructs an affinity graph on a given set of training shapes in a self-supervised manner. The key idea is to combine putative, pairwise correspondences by propagating maps along shortest paths in the underlying shape graph. During training, we enforce cycle-consistency between such optimal paths and the pairwise matches which enables our model to learn topology-aware shape priors. We explore different classes of shape graphs and recover specific settings, like template-based matching (star graph) or learnable ranking/sorting (TSP graph), as special cases in our framework. Finally, we demonstrate state-of-the-art performance on several recent shape correspondence benchmarks, including real-world 3D scan meshes with topological noise and challenging inter-class pairs.
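The core idea of propagating maps along shortest paths amounts to composing pairwise correspondences edge by edge along the path. A toy numpy sketch follows, under the assumption that each pairwise correspondence is stored as an index map between toy 5-vertex shapes; the graph, shapes, and maps are made up for illustration.

```python
import numpy as np

# pairwise correspondences as index maps: maps[(a, b)][i] gives the vertex of
# shape b matched to vertex i of shape a (toy 5-vertex shapes)
maps = {
    (0, 1): np.array([1, 0, 2, 4, 3]),
    (1, 2): np.array([2, 1, 0, 3, 4]),
}

def compose(path, maps):
    """Propagate a correspondence along a path in the shape graph by
    composing the pairwise index maps edge by edge."""
    m = np.arange(len(maps[(path[0], path[1])]))
    for a, b in zip(path, path[1:]):
        m = maps[(a, b)][m]
    return m

# map from shape 0 to shape 2 obtained via the intermediate shape 1
m02 = compose([0, 1, 2], maps)
print(m02)  # [1 2 0 4 3]
```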

Enhancing Deformable Local Features by Jointly Learning To Detect and Describe Keypoints
Potje, Guilherme and Cadar, Felipe and Araujo, André and Martins, Renato and Nascimento, Erickson R.



Research question: This paper addresses important computer vision tasks such as image matching and retrieval, and in particular the problem of matching non-rigidly deforming surfaces.
Motivation: Most methods assume images undergo only affine transformations, ignoring more complicated non-rigid deformation effects. Moreover, recent methods tailored to non-rigid correspondence still rely on keypoint detectors designed for rigid transformations, which limits performance due to the detector's shortcomings.
Method: We propose DALF (Deformation-Aware Local Features), a novel deformation-aware network that jointly detects and describes keypoints to handle the challenge of matching deformable surfaces. All network components work cooperatively through a feature fusion approach that enforces descriptor distinctiveness and invariance.
Results: Experiments on real deforming objects showcase the advantages of our method, which improves matching scores by 8% over the previous best results. It also boosts two real-world applications: deformable object retrieval and non-rigid 3D surface registration. Our training, inference, and application code is publicly available at verlab.dcc.ufmg.br/descriptors/dalf_cvpr23.

Local feature extraction is a standard approach in computer vision for tackling important tasks such as image matching and retrieval. The core assumption of most methods is that images undergo affine transformations, disregarding more complicated effects such as non-rigid deformations. Furthermore, incipient works tailored for non-rigid correspondence still rely on keypoint detectors designed for rigid transformations, hindering performance due to the limitations of the detector. We propose DALF (Deformation-Aware Local Features), a novel deformation-aware network for jointly detecting and describing keypoints, to handle the challenging problem of matching deformable surfaces. All network components work cooperatively through a feature fusion approach that enforces the descriptors' distinctiveness and invariance. Experiments using real deforming objects showcase the superiority of our method, where it delivers 8% improvement in matching scores compared to the previous best results. Our approach also enhances the performance of two real-world applications: deformable object retrieval and non-rigid 3D surface registration. Code for training, inference, and applications is publicly available at verlab.dcc.ufmg.br/descriptors/dalf_cvpr23.

Neighborhood Attention Transformer
Hassani, Ali and Walton, Steven and Li, Jiachen and Li, Shen and Shi, Humphrey



Research question: How to design an efficient and scalable sliding-window attention mechanism for vision.
Motivation: Existing self attention (SA) has quadratic complexity, whereas localized Neighborhood Attention (NA) reduces it to linear while keeping attention on neighboring pixels.
Method: We propose Neighborhood Attention (NA), a pixel-wise operation that localizes self attention to the nearest neighboring pixels, giving it better time and space complexity than SA. Through its sliding-window pattern, NA's receptive field can grow without extra pixel shifts, and translational equivariance is preserved.
Results: Our NATTEN package lets NA run up to 40% faster than Swin's WSA while using up to 25% less memory. NAT, a new hierarchical transformer design based on NA, boosts both image classification and downstream vision performance. Experiments show that NAT-Tiny reaches 83.2% top-1 accuracy on ImageNet, 51.4% mAP on MS-COCO, and 48.4% mIoU on ADE20K, improvements of 1.9%, 1.0%, and 2.6% over a similarly sized Swin model.

We present Neighborhood Attention (NA), the first efficient and scalable sliding window attention mechanism for vision. NA is a pixel-wise operation, localizing self attention (SA) to the nearest neighboring pixels, and therefore enjoys a linear time and space complexity compared to the quadratic complexity of SA. The sliding window pattern allows NA's receptive field to grow without needing extra pixel shifts, and preserves translational equivariance, unlike Swin Transformer's Window Self Attention (WSA). We develop NATTEN (Neighborhood Attention Extension), a Python package with efficient C++ and CUDA kernels, which allows NA to run up to 40% faster than Swin's WSA while using up to 25% less memory. We further present Neighborhood Attention Transformer (NAT), a new hierarchical transformer design based on NA that boosts image classification and downstream vision performance. Experimental results on NAT are competitive; NAT-Tiny reaches 83.2% top-1 accuracy on ImageNet, 51.4% mAP on MS-COCO and 48.4% mIoU on ADE20K, which is 1.9% ImageNet accuracy, 1.0% COCO mAP, and 2.6% ADE20K mIoU improvement over a Swin model with similar size. To support more research based on sliding window attention, we open source our project and release our checkpoints.
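The neighborhood attention operation itself is simple to state: each query attends only to a fixed-size window of its nearest neighbors, clamped at the borders so every token sees a full window. Below is a naive 1D numpy sketch of this idea; NATTEN's real kernels are 2D, batched, multi-headed, and written in C++/CUDA, so this is only a conceptual illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def neighborhood_attention_1d(x, window=3):
    """Naive 1D neighborhood attention: each token attends only to its
    `window` nearest neighbors, so cost grows linearly with length
    rather than quadratically as in full self attention."""
    n, d = x.shape
    half = window // 2
    out = np.empty_like(x)
    for i in range(n):
        # clamp the window at the edges so every token sees `window` neighbors
        lo = min(max(i - half, 0), n - window)
        neigh = x[lo:lo + window]            # (window, d) keys/values
        scores = neigh @ x[i] / np.sqrt(d)   # (window,) similarity to query i
        out[i] = softmax(scores) @ neigh
    return out

x = np.random.default_rng(0).normal(size=(10, 4))
y = neighborhood_attention_1d(x)
print(y.shape)  # (10, 4)
```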

Trap Attention: Monocular Depth Estimation With Manual Traps
Ning, Chao and Gan, Hongping



Research question: Predicting a high-quality depth map from a single image is a challenging task, because a 2D scene can be lifted to infinitely many corresponding 3D scenes.
Motivation: Recent studies introduced multi-head attention (MHA) modules to perform long-range interaction, showing significant progress in regressing depth maps. However, due to MHA's quadratic complexity, these methods cannot leverage MHA to compute high-resolution depth features at an appropriate computational cost.
Method: This paper proposes a novel trap attention mechanism that sets traps in an extended space for each pixel and forms the attention from the feature retention ratio of a convolution window, converting the quadratic computational complexity to a linear form. An encoder-decoder trap depth estimation network is then built, which uses a vision transformer as the encoder and trap attention in the decoder to estimate depth from a single image.
Results: Extensive experiments show that the proposed network outperforms state-of-the-art monocular depth estimation methods on the NYU Depth-v2 and KITTI datasets, with a significantly reduced number of parameters.

Predicting a high-quality depth map from a single image is a challenging task, because there are infinitely many ways to project a 2D scene to a corresponding 3D scene. Recently, some studies introduced multi-head attention (MHA) modules to perform long-range interaction, which have shown significant progress in regressing depth maps. The main functions of MHA can be loosely summarized as capturing long-distance information and reporting the attention map via the relationships between pixels. However, due to the quadratic complexity of MHA, these methods cannot leverage MHA to compute high-resolution depth features at an appropriate computational complexity. In this paper, we exploit a depth-wise convolution to obtain long-range information, and propose a novel trap attention, which sets some traps on the extended space for each pixel and forms the attention mechanism from the feature retention ratio of the convolution window, so that the quadratic computational complexity is converted to a linear form. We then build an encoder-decoder trap depth estimation network, which introduces a vision transformer as the encoder and uses the trap attention to estimate depth from a single image in the decoder. Extensive experimental results demonstrate that our proposed network outperforms state-of-the-art monocular depth estimation methods on the NYU Depth-v2 and KITTI datasets, with a significantly reduced number of parameters. Code is available at: https://github.com/ICSResearch/TrapAttention.

Rethinking Few-Shot Medical Segmentation: A Vector Quantization View
Huang, Shiqi and Xu, Tingfa and Shen, Ning and Mu, Feng and Li, Jianan



Research question: Existing few-shot medical segmentation networks show a positive correlation between the number of prototypes and performance, but the clusterization of feature points and the adaptation to unseen tasks have not received enough attention.
Motivation: Motivated by this observation, we propose a learning Vector Quantization (VQ) mechanism consisting of grid-format VQ (GFVQ), self-organized VQ (SOVQ), and residual-oriented VQ (ROVQ).
Method: GFVQ generates the prototype matrix by averaging square grids over the spatial extent, uniformly quantizing local details; SOVQ adaptively assigns feature points to different local classes and creates a new representation space in which learnable local prototypes are updated with a global view; ROVQ introduces residual information to fine-tune the learned local prototypes without re-training, which benefits generalization to tasks unrelated to training.
Results: Our VQ framework achieves state-of-the-art performance on abdomen, cardiac, and prostate MRI datasets, and we expect this work to provoke a rethink of current few-shot medical segmentation model design.

The existing few-shot medical segmentation networks share the same practice that the more prototypes, the better performance. This phenomenon can be theoretically interpreted from a Vector Quantization (VQ) view: the more prototypes, the more clusters are separated from pixel-wise feature points distributed over the full space. However, as we further think about few-shot segmentation from this perspective, we find that the clusterization of feature points and the adaptation to unseen tasks have not received enough attention. Motivated by this observation, we propose a learning VQ mechanism consisting of grid-format VQ (GFVQ), self-organized VQ (SOVQ) and residual oriented VQ (ROVQ). To be specific, GFVQ generates the prototype matrix by averaging square grids over the spatial extent, which uniformly quantizes the local details; SOVQ adaptively assigns the feature points to different local classes and creates a new representation space where the learnable local prototypes are updated with a global view; ROVQ introduces residual information to fine-tune the aforementioned learned local prototypes without re-training, which benefits the generalization performance on tasks unrelated to training. We empirically show that our VQ framework yields state-of-the-art performance on abdomen, cardiac and prostate MRI datasets, and we expect this work will provoke a rethink of the current few-shot medical segmentation model design. Our code will soon be publicly available.

ARO-Net: Learning Implicit Fields From Anchored Radial Observations
Wang, Yizhi and Huang, Zeyu and Shamir, Ariel and Huang, Hui and Zhang, Hao and Hu, Ruizhen



Research question: This paper proposes a new shape encoding for learning implicit field representations of 3D shapes that is category-agnostic and generalizes across significant shape variations.
Motivation: By reasoning about shapes through partial observations from a set of viewpoints called anchors, we aim to develop a general and unified shape representation.
Method: We employ a fixed set of anchors, placed via Fibonacci sampling, and design a coordinate-based deep neural network to predict the occupancy value of a query point in space. Unlike prior neural implicit models that use a global shape feature, our shape encoder operates on contextual, query-specific features.
Results: We demonstrate the quality and generality of our network on surface reconstruction from sparse point clouds, with tests on novel and unseen object categories, "one-shape" training, and comparisons to state-of-the-art neural and classical methods for reconstruction and tessellation.

We introduce anchored radial observations (ARO), a novel shape encoding for learning implicit field representation of 3D shapes that is category-agnostic and generalizable amid significant shape variations. The main idea behind our work is to reason about shapes through partial observations from a set of viewpoints, called anchors. We develop a general and unified shape representation by employing a fixed set of anchors, via Fibonacci sampling, and designing a coordinate-based deep neural network to predict the occupancy value of a query point in space. Differently from prior neural implicit models that use a global shape feature, our shape encoder operates on contextual, query-specific features. To predict point occupancy, locally observed shape information from the perspective of the anchors surrounding the input query point is encoded and aggregated through an attention module, before implicit decoding is performed. We demonstrate the quality and generality of our network, coined ARO-Net, on surface reconstruction from sparse point clouds, with tests on novel and unseen object categories, "one-shape" training, and comparisons to state-of-the-art neural and classical methods for reconstruction and tessellation.
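Fibonacci sampling, used above to place the fixed anchors, is a standard way to distribute near-uniform points on a sphere. A short sketch follows; the anchor count is an arbitrary choice, and the paper's exact anchor placement may differ.

```python
import numpy as np

def fibonacci_sphere(n):
    """Near-uniform anchor points on the unit sphere via the Fibonacci lattice:
    latitudes evenly spaced in z, longitudes advanced by the golden angle."""
    i = np.arange(n)
    golden = (1 + 5 ** 0.5) / 2
    theta = 2 * np.pi * i / golden      # longitude
    z = 1 - (2 * i + 1) / n             # latitude, evenly spaced in z
    r = np.sqrt(1 - z ** 2)
    return np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)

anchors = fibonacci_sphere(48)
print(anchors.shape)                                      # (48, 3)
print(np.allclose(np.linalg.norm(anchors, axis=1), 1.0))  # True
```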

Learnable Skeleton-Aware 3D Point Cloud Sampling
Wen, Cheng and Yu, Baosheng and Tao, Dacheng



Research question: How to perform efficient large-scale point cloud analysis while preserving object geometry and topology information during sampling.
Motivation: Existing task-specific sampling methods usually fail to explore object geometry explicitly, so a new approach is needed.
Method: We propose a novel skeleton-aware learning-to-sample method that learns object skeletons as prior knowledge to preserve geometry and topology during sampling. Specifically, category-agnostic object skeletons are first learned in an unsupervised manner via the medial axis transform definition; the histogram of local feature size is then evaluated against this skeleton as prior knowledge to formulate skeleton-aware sampling from a probabilistic perspective. Furthermore, the reparameterization trick makes the skeleton-aware sampling pipeline end-to-end trainable together with the task network.
Results: Extensive experiments on three common downstream tasks (point cloud classification, retrieval, and reconstruction) demonstrate that our method enables efficient large-scale point cloud analysis.

Point cloud sampling is crucial for efficient large-scale point cloud analysis, where learning-to-sample methods have recently received increasing attention from the community for jointly training with downstream tasks. However, the above-mentioned task-specific sampling methods usually fail to explore the geometries of objects in an explicit manner. In this paper, we introduce a new skeleton-aware learning-to-sample method by learning object skeletons as the prior knowledge to preserve the object geometry and topology information during sampling. Specifically, without labor-intensive annotations per object category, we first learn category-agnostic object skeletons via the medial axis transform definition in an unsupervised manner. With object skeleton, we then evaluate the histogram of the local feature size as the prior knowledge to formulate skeleton-aware sampling from a probabilistic perspective. Additionally, the proposed skeleton-aware sampling pipeline with the task network is thus end-to-end trainable by exploring the reparameterization trick. Extensive experiments on three popular downstream tasks, point cloud classification, retrieval, and reconstruction, demonstrate the effectiveness of the proposed method for efficient point cloud analysis.

Rotation-Invariant Transformer for Point Cloud Matching
Yu, Hao and Qin, Zheng and Hou, Ji and Saleh, Mahdi and Li, Dongsheng and Busam, Benjamin and Ilic, Slobodan



Research question: Existing deep point cloud matchers obtain rotation invariance through data augmentation and behave unstably when facing rarely seen rotations.
Motivation: To address this problem, we propose RoITr, a Rotation-Invariant Transformer, to cope with pose variations in the point cloud matching task.
Method: At the local level, we introduce an attention mechanism embedded with Point Pair Feature (PPF)-based coordinates, on which a novel attention-based encoder-decoder architecture is built. We further propose a global transformer whose rotation-invariant cross-frame spatial awareness is learned via self-attention, significantly improving feature distinctiveness and making the model robust to low overlap.
Results: Experiments on rigid and non-rigid public benchmarks show that RoITr outperforms all state-of-the-art models by a large margin in low-overlap scenarios. In particular, when rotations are enlarged on the challenging 3DLoMatch benchmark, RoITr surpasses existing methods by at least 13 and 5 percentage points in Inlier Ratio and Registration Recall, respectively.

The intrinsic rotation invariance lies at the core of matching point clouds with handcrafted descriptors. However, it is widely despised by recent deep matchers that obtain the rotation invariance extrinsically via data augmentation. As the finite number of augmented rotations can never span the continuous SO(3) space, these methods usually show instability when facing rotations that are rarely seen. To this end, we introduce RoITr, a Rotation-Invariant Transformer to cope with the pose variations in the point cloud matching task. We contribute both on the local and global levels. Starting from the local level, we introduce an attention mechanism embedded with Point Pair Feature (PPF)-based coordinates to describe the pose-invariant geometry, upon which a novel attention-based encoder-decoder architecture is constructed. We further propose a global transformer with rotation-invariant cross-frame spatial awareness learned by the self-attention mechanism, which significantly improves the feature distinctiveness and makes the model robust with respect to the low overlap. Experiments are conducted on both the rigid and non-rigid public benchmarks, where RoITr outperforms all the state-of-the-art models by a considerable margin in the low-overlapping scenarios. Especially when the rotations are enlarged on the challenging 3DLoMatch benchmark, RoITr surpasses the existing methods by at least 13 and 5 percentage points in terms of Inlier Ratio and Registration Recall, respectively.
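The Point Pair Feature (PPF) coordinates underlying the local attention are rotation-invariant by construction: they consist of a distance and three angles between the two points' normals and their offset vector. A small numpy sketch with an invariance check follows; this is the standard PPF definition, not necessarily RoITr's exact parameterization.

```python
import numpy as np

def angle(u, v):
    """Unsigned angle between two vectors."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return np.arccos(np.clip(u @ v, -1.0, 1.0))

def ppf(p1, n1, p2, n2):
    """Point Pair Feature: (distance, angle(n1, d), angle(n2, d), angle(n1, n2)),
    with d = p2 - p1; invariant to rigid rotations applied to both points."""
    d = p2 - p1
    return np.array([np.linalg.norm(d), angle(n1, d), angle(n2, d), angle(n1, n2)])

p1, n1 = np.array([0., 0., 0.]), np.array([0., 0., 1.])
p2, n2 = np.array([1., 0., 0.]), np.array([0., 1., 0.])
f = ppf(p1, n1, p2, n2)

# rotating both points and normals by the same rotation leaves the feature unchanged
R = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])  # 90 deg about z
f_rot = ppf(R @ p1, R @ n1, R @ p2, R @ n2)
print(np.allclose(f, f_rot))  # True
```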

ViPLO: Vision Transformer Based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection
Park, Jeeseung and Park, Jin-Woo and Lee, Jong-Seok



Research question: How to improve the performance of Human-Object Interaction (HOI) detection.
Motivation: Although two-stage HOI detectors are efficient in training and inference, they underperform one-stage methods because of their old backbone networks and the interaction classifier's lack of consideration for the human HOI perception process.
Method: We propose a Vision Transformer based Pose-Conditioned Self-Loop Graph (ViPLO) to resolve these problems. First, a novel feature extraction method suited to the Vision Transformer backbone, the masking with overlapped area (MOA) module, is proposed. Second, a graph with a pose-conditioned self-loop structure is designed, which updates the human node encoding with local features of human joints, allowing the classifier to focus on specific joints to effectively identify the interaction type.
Results: Experiments show that ViPLO achieves state-of-the-art results on two public benchmarks, notably a +2.07 mAP performance gain on the HICO-DET dataset.

Human-Object Interaction (HOI) detection, which localizes and infers relationships between human and objects, plays an important role in scene understanding. Although two-stage HOI detectors have advantages of high efficiency in training and inference, they suffer from lower performance than one-stage methods due to the old backbone networks and the lack of considerations for the HOI perception process of humans in the interaction classifiers. In this paper, we propose Vision Transformer based Pose-Conditioned Self-Loop Graph (ViPLO) to resolve these problems. First, we propose a novel feature extraction method suitable for the Vision Transformer backbone, called masking with overlapped area (MOA) module. The MOA module utilizes the overlapped area between each patch and the given region in the attention function, which addresses the quantization problem when using the Vision Transformer backbone. In addition, we design a graph with a pose-conditioned self-loop structure, which updates the human node encoding with local features of human joints. This allows the classifier to focus on specific human joints to effectively identify the type of interaction, which is motivated by the human perception process for HOI. As a result, ViPLO achieves the state-of-the-art results on two public benchmarks, especially obtaining a +2.07 mAP performance gain on the HICO-DET dataset.

Improving Table Structure Recognition With Visual-Alignment Sequential Coordinate Modeling
Huang, Yongshuai and Lu, Ning and Chen, Dapeng and Li, Yibo and Xie, Zecheng and Zhu, Shenggao and Gao, Liangcai and Peng, Wei



Research question: How to extract the logical and physical structure of unstructured table images more accurately.
Motivation: Existing end-to-end image-to-text methods often produce imprecise bounding boxes when predicting the physical structure (cell bounding boxes), because the logical representation lacks local visual information.
Method: We propose VAST, an end-to-end sequential modeling framework for table structure recognition. It contains a novel coordinate sequence decoder triggered by the non-empty-cell representation from the logical structure decoder; the bounding box coordinates are modeled as a language sequence, decoding the left, top, right, and bottom coordinates in order to exploit inter-coordinate dependency. An auxiliary visual-alignment loss further forces the logical representation of non-empty cells to contain more local visual detail, producing better cell bounding boxes.
Results: Extensive experiments show state-of-the-art results in both logical and physical structure recognition; ablation studies confirm that the coordinate sequence decoder and the visual-alignment loss are key to the method's success.

Table structure recognition aims to extract the logical and physical structure of unstructured table images into a machine-readable format. The latest end-to-end image-to-text approaches simultaneously predict the two structures by two decoders, where the prediction of the physical structure (the bounding boxes of the cells) is based on the representation of the logical structure. However, as the logical representation lacks the local visual information, the previous methods often produce imprecise bounding boxes. To address this issue, we propose an end-to-end sequential modeling framework for table structure recognition called VAST. It contains a novel coordinate sequence decoder triggered by the representation of the non-empty cell from the logical structure decoder. In the coordinate sequence decoder, we model the bounding box coordinates as a language sequence, where the left, top, right and bottom coordinates are decoded sequentially to leverage the inter-coordinate dependency. Furthermore, we propose an auxiliary visual-alignment loss to enforce the logical representation of the non-empty cells to contain more local visual details, which helps produce better cell bounding boxes. Extensive experiments demonstrate that our proposed method can achieve state-of-the-art results in both logical and physical structure recognition. The ablation study also validates that the proposed coordinate sequence decoder and the visual-alignment loss are the keys to the success of our method.

WIRE: Wavelet Implicit Neural Representations
Saragadam, Vishwanath and LeJeune, Daniel and Tan, Jasper and Balakrishnan, Guha and Veeraraghavan, Ashok and Baraniuk, Richard G.



Research question: How to improve the accuracy and robustness of implicit neural representations (INRs) while avoiding their sensitivity to signal noise, parameter variation, and the like.
Motivation: Current INRs designed for high accuracy also suffer from poor robustness.
Method: Inspired by harmonic analysis, we develop a new, highly accurate and robust INR, the Wavelet Implicit neural REpresentation (WIRE). It uses the complex Gabor wavelet as its activation function, which is optimally concentrated in space-frequency and has excellent biases for representing images.
Results: A wide range of experiments (image denoising, inpainting, super-resolution, computed tomography reconstruction, image overfitting, and novel view synthesis with neural radiance fields) demonstrates that WIRE defines the new state of the art in INR accuracy, training time, and robustness.

Implicit neural representations (INRs) have recently advanced numerous vision-related areas. INR performance depends strongly on the choice of activation function employed in its MLP network. A wide range of nonlinearities have been explored, but, unfortunately, current INRs designed to have high accuracy also suffer from poor robustness (to signal noise, parameter variation, etc.). Inspired by harmonic analysis, we develop a new, highly accurate and robust INR that does not exhibit this tradeoff. Our Wavelet Implicit neural REpresentation (WIRE) uses as its activation function the complex Gabor wavelet that is well-known to be optimally concentrated in space-frequency and to have excellent biases for representing images. A wide range of experiments (image denoising, image inpainting, super-resolution, computed tomography reconstruction, image overfitting, and novel view synthesis with neural radiance fields) demonstrate that WIRE defines the new state of the art in INR accuracy, training time, and robustness.
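The complex Gabor wavelet activation at the heart of WIRE is simply a complex sinusoid under a Gaussian envelope, so it is both oscillatory and spatially localized. A short numpy sketch follows; the frequency and spread parameters are illustrative defaults, not the paper's chosen values.

```python
import numpy as np

def gabor_wavelet(x, omega0=10.0, s0=10.0):
    """Complex Gabor wavelet used as an INR nonlinearity: a complex
    sinusoid exp(j*omega0*x) windowed by a Gaussian envelope."""
    return np.exp(1j * omega0 * x) * np.exp(-(s0 * x) ** 2)

x = np.linspace(-1, 1, 5)
y = gabor_wavelet(x)
print(np.abs(y[2]))  # 1.0 at x = 0, the peak of the Gaussian envelope
```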

Bi-Directional Feature Fusion Generative Adversarial Network for Ultra-High Resolution Pathological Image Virtual Re-Staining
Sun, Kexin and Chen, Zhineng and Wang, Gongwei and Liu, Jun and Ye, Xiongjun and Jiang, Yu-Gang



Research question: Because of the ultra-high resolution of pathological images, conventional virtual re-staining methods produce differences in color, brightness, and contrast when generating large images, which we call the "square effect."
Motivation: The cost of pathological examination makes virtual re-staining of pathological images practically meaningful. However, due to their ultra-high resolution, conventional methods must split a WSI into patches for model training and inference, losing global information and causing visible color, brightness, and contrast differences when the re-stained patches are merged into a larger image.
Method: To eliminate the square effect, we design a bi-directional feature fusion generative adversarial network (BFF-GAN) with a global branch and a local branch; it learns inter-patch connections through the fusion of global and local features plus patch-wise attention.
Results: Experiments on the private RCC dataset and the public ANHIR dataset show that our model achieves competitive performance and generates extremely realistic images that even experienced pathologists find hard to distinguish, indicating significant clinical value.

The cost of pathological examination makes virtual re-staining of pathological images meaningful. However, due to the ultra-high resolution of pathological images, traditional virtual re-staining methods have to divide a WSI image into patches for model training and inference. Such a limitation leads to the lack of global information, resulting in observable differences in color, brightness and contrast when the re-stained patches are merged to generate an image of larger size. We summarize this issue as the square effect. Some existing methods try to solve this issue through overlapping between patches or simple post-processing. But the former is not that effective, while the latter requires careful tuning. In order to eliminate the square effect, we design a bi-directional feature fusion generative adversarial network (BFF-GAN) with a global branch and a local branch. It learns the inter-patch connections through the fusion of global and local features plus patch-wise attention. We perform experiments on both the private dataset RCC and the public dataset ANHIR. The results show that our model achieves competitive performance and is able to generate extremely real images that are deceptive even for experienced pathologists, which means it is of great clinical significance.

Feature Representation Learning With Adaptive Displacement Generation and Transformer Fusion for Micro-Expression Recognition
Zhai, Zhijun and Zhao, Jianhui and Long, Chengjiang and Xu, Wenju and He, Shuangjiang and Zhao, Huijuan



Research question: How to effectively recognize micro-expressions, a challenging task because micro-expressions are transient and of low intensity and thus hard to recognize.
Motivation: Given the importance of micro-expressions in nonverbal communication, a method that can recognize them effectively is needed.
Method: We propose a novel framework, Feature Representation Learning with adaptive Displacement Generation and Transformer fusion (FRL-DGT). A convolutional Displacement Generation Module (DGM) with self-supervised learning extracts dynamic features, and a carefully designed Transformer fusion mechanism (comprising Transformer-based local, global, and full-face fusion modules) extracts multi-level informative features from the DGM output for the final micro-expression prediction.
Results: Extensive experiments with solid leave-one-subject-out (LOSO) evaluation strongly demonstrate the superiority of the proposed FRL-DGT over state-of-the-art methods.

Micro-expressions are spontaneous, rapid and subtle facial movements that can neither be forged nor suppressed. They are very important nonverbal communication clues, but are transient and of low intensity thus difficult to recognize. Recently deep learning based methods have been developed for micro-expression recognition using feature extraction and fusion techniques, however, targeted feature learning and efficient feature fusion still lack further study according to micro-expression characteristics. To address these issues, we propose a novel framework Feature Representation Learning with adaptive Displacement Generation and Transformer fusion (FRL-DGT), in which a convolutional Displacement Generation Module (DGM) with self-supervised learning is used to extract dynamic feature targeted to the subsequent ME recognition task, and a well-designed Transformer fusion mechanism composed of the Transformer-based local fusion module, global fusion module, and full-face fusion module is applied to extract the multi-level informative feature from the output of the DGM for the final micro-expression prediction. Extensive experiments with solid leave-one-subject-out (LOSO) evaluation results have strongly demonstrated the superiority of our proposed FRL-DGT to state-of-the-art methods.

ViewNet: A Novel Projection-Based Backbone With View Pooling for Few-Shot Point Cloud Classification
Chen, JiajingandYang, MinminandVelipasalar, Senem



Research question: Although many methods have been proposed for 3D point cloud tasks, few-shot learning (FSL) on 3D point clouds remains under-explored.
Motivation: Existing FSL methods mostly adopt point-based models as their backbone; however, extensive experiments and analysis reveal drawbacks of this choice, such as discarding the features of many points and sensitivity to occlusion.
Method: To address these issues, we propose ViewNet, a novel projection- and 2D-CNN-based backbone. It first projects a 3D point cloud onto six different views to alleviate the missing-point problem. We also propose View Pooling, which combines the projection planes into five groups and performs max-pooling within each group to produce more descriptive and discriminative features.
Results: Experiments on the ModelNet40, ScanObjectNN, and ModelNet40-C datasets show that our method consistently outperforms state-of-the-art baselines. Moreover, compared with traditional image classification backbones such as ResNet, ViewNet extracts more discriminative features from multiple views of a point cloud.

Although different approaches have been proposed for 3D point cloud-related tasks, few-shot learning (FSL) of 3D point clouds remains under-explored. In FSL, unlike traditional supervised learning, the classes of training and test data do not overlap, and a model needs to recognize unseen classes from only a few samples. Existing FSL methods for 3D point clouds employ point-based models as their backbone. Yet, based on our extensive experiments and analysis, we first show that using a point-based backbone is not the most suitable FSL approach, since (i) a large number of points' features are discarded by the max pooling operation used in 3D point-based backbones, decreasing the ability to represent shape information; (ii) point-based backbones are sensitive to occlusion. To address these issues, we propose employing a projection- and 2D Convolutional Neural Network-based backbone, referred to as the ViewNet, for FSL from 3D point clouds. Our approach first projects a 3D point cloud onto six different views to alleviate the issue of missing points. Also, to generate more descriptive and distinguishing features, we propose View Pooling, which combines different projected plane combinations into five groups and performs max-pooling on each of them. The experiments performed on the ModelNet40, ScanObjectNN and ModelNet40-C datasets, with cross validation, show that our method consistently outperforms the state-of-the-art baselines. Moreover, compared to traditional image classification backbones, such as ResNet, the proposed ViewNet can extract more distinguishing features from multiple views of a point cloud. We also show that ViewNet can be used as a backbone with different FSL heads and provides improved performance compared to traditionally used backbones.
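The projection and View Pooling steps described above can be sketched in a few lines. This is a hedched toy version: the resolution, depth convention, and the particular five-group assignment are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def project_to_views(points, res=32):
    """Project a point cloud onto six axis-aligned depth maps.
    A simplified stand-in for ViewNet's projection step."""
    lo, hi = points.min(0), points.max(0)
    p = (points - lo) / (hi - lo + 1e-8)           # normalize to [0, 1]^3
    views = []
    for axis in range(3):                          # project along x, y, z
        u, v = [a for a in range(3) if a != axis]
        iu = np.clip((p[:, u] * (res - 1)).astype(int), 0, res - 1)
        iv = np.clip((p[:, v] * (res - 1)).astype(int), 0, res - 1)
        for sign in (1.0, -1.0):                   # front and back views
            depth = np.zeros((res, res))
            d = p[:, axis] if sign > 0 else 1.0 - p[:, axis]
            np.maximum.at(depth, (iu, iv), d)      # keep nearest surface
            views.append(depth)
    return np.stack(views)                         # (6, res, res)

def view_pooling(view_feats, groups):
    """Max-pool per-view maps within each group of projection planes."""
    return np.stack([view_feats[list(g)].max(axis=0) for g in groups])

pts = np.random.default_rng(0).random((1024, 3))
views = project_to_views(pts)
# hypothetical grouping of the six planes into five groups
groups = [(0, 1), (2, 3), (4, 5), (0, 2, 4), (1, 3, 5)]
pooled = view_pooling(views, groups)
```

In the actual backbone, each depth map would be fed through a shared 2D CNN before pooling; max-pooling over grouped views is what makes the final descriptor robust to a single occluded viewpoint.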

HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
Kuo, Chia-WenandKira, Zsolt



Research question: How to effectively exploit heterogeneous encodings to improve image captioning performance?
Motivation: As more advanced encodings become available and are incorporated, a natural question is how to leverage this heterogeneous set of encodings efficiently and effectively.
Method: This paper proposes a new image captioning model that regards the encodings as augmented views of the input image and encodes each view independently with a shared encoder. A contrastive loss is then incorporated across the encoded views in a novel way to improve their representation quality and the model's data efficiency.
Results: Experiments show improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k over the state of the art.

A great deal of progress has been made in image captioning, driven by research into how to encode the image using pre-trained models. This includes visual encodings (e.g. image grid features or detected objects) and more recently textual encodings (e.g. image tags or text descriptions of image regions). As more advanced encodings are available and incorporated, it is natural to ask: how to efficiently and effectively leverage the heterogeneous set of encodings? In this paper, we propose to regard the encodings as augmented views of the input image. The image captioning model encodes each view independently with a shared encoder efficiently, and a contrastive loss is incorporated across the encoded views in a novel way to improve their representation quality and the model's data efficiency. Our proposed hierarchical decoder then adaptively weighs the encoded views according to their effectiveness for caption generation by first aggregating within each view at the token level, and then across views at the view level. We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k compared to the state of the art.
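A cross-view contrastive objective of the kind described above can be sketched as follows. This is a generic InfoNCE-style loss where views of the same image are positives and views of other images in the batch are negatives; the paper's exact formulation may differ.

```python
import numpy as np

def view_contrastive_loss(view_embs, temperature=0.1):
    """Contrastive loss across encoded views.
    view_embs: (batch, n_views, dim); views of the same image are
    pulled together, views of other images are pushed apart."""
    b, v, d = view_embs.shape
    z = view_embs.reshape(b * v, d)
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                 # exclude self-pairs
    log_prob = sim - np.log(np.exp(sim).sum(1, keepdims=True))
    image_id = np.repeat(np.arange(b), v)          # image id per embedding
    pos = image_id[:, None] == image_id[None, :]
    np.fill_diagonal(pos, False)
    return -log_prob[pos].mean()

# two images, two views each; perfectly aligned views give a near-zero loss
aligned = np.stack([np.stack([np.eye(4)[i]] * 2) for i in range(2)])
loss_aligned = view_contrastive_loss(aligned)
```

Minimizing this loss pushes the shared encoder to produce view embeddings that agree per image, which is one plausible reading of "improve their representation quality and the model's data efficiency".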

End-to-End 3D Dense Captioning With Vote2Cap-DETR
Chen, SijinandZhu, HongyuanandChen, XinandLei, YinjieandYu, GangandChen, Tao



Research question: This paper addresses 3D dense captioning, i.e., generating multiple captions localized with their associated object regions.
Motivation: Existing methods follow a sophisticated "detect-then-describe" pipeline equipped with numerous hand-crafted components; these components can yield suboptimal performance given the cluttered object spatial and class distributions across different scenes.
Method: This paper proposes Vote2Cap-DETR, a simple yet effective transformer framework based on the recently popular DEtection TRansformer (DETR). Compared with prior art, the framework has several appealing advantages: 1) instead of relying on numerous hand-crafted components, it is a full transformer encoder-decoder architecture with a learnable vote-query-driven object decoder and a caption decoder that produces dense captions in a set-prediction manner; 2) unlike two-stage schemes, it performs detection and captioning in one stage.
Results: Without bells and whistles, extensive experiments on two commonly used datasets, ScanRefer and Nr3D, demonstrate that Vote2Cap-DETR surpasses the current state of the art by 11.13% and 7.11% in CIDEr@0.5IoU, respectively.

3D dense captioning aims to generate multiple captions localized with their associated object regions. Existing methods follow a sophisticated "detect-then-describe" pipeline equipped with numerous hand-crafted components. However, these hand-crafted components would yield suboptimal performance given cluttered object spatial and class distributions among different scenes. In this paper, we propose a simple-yet-effective transformer framework Vote2Cap-DETR based on the recent popular DEtection TRansformer (DETR). Compared with prior arts, our framework has several appealing advantages: 1) Without resorting to numerous hand-crafted components, our method is based on a full transformer encoder-decoder architecture with a learnable vote query driven object decoder, and a caption decoder that produces the dense captions in a set-prediction manner. 2) In contrast to the two-stage scheme, our method can perform detection and captioning in one-stage. 3) Without bells and whistles, extensive experiments on two commonly used datasets, ScanRefer and Nr3D, demonstrate that our Vote2Cap-DETR surpasses the current state of the art by 11.13% and 7.11% in CIDEr@0.5IoU, respectively. Codes will be released soon.

Optimization-Inspired Cross-Attention Transformer for Compressive Sensing
Song, JiechongandMou, ChongandWang, ShiqiandMa, SiweiandZhang, Jian



Research question: Existing deep unfolding networks often improve image quality at the cost of a large number of parameters, and they suffer feature information loss during iteration.
Motivation: By integrating optimization solvers with deep neural networks, deep unfolding networks can be designed with good interpretability and high performance.
Method: An Optimization-inspired Cross-attention Transformer (OCT) module is proposed as an iterative process, leading to a lightweight OCT-based Unfolding Framework (OCTUF) for image compressive sensing. A novel Dual Cross Attention (Dual-CA) sub-module is designed, consisting of an Inertia-Supplied Cross Attention (ISCA) block and a Projection-Guided Cross Attention (PGCA) block.
Results: Experiments show that OCTUF achieves superior performance compared with state-of-the-art methods while requiring lower training complexity.

By integrating certain optimization solvers with deep neural networks, deep unfolding network (DUN) with good interpretability and high performance has attracted growing attention in compressive sensing (CS). However, existing DUNs often improve the visual quality at the price of a large number of parameters and have the problem of feature information loss during iteration. In this paper, we propose an Optimization-inspired Cross-attention Transformer (OCT) module as an iterative process, leading to a lightweight OCT-based Unfolding Framework (OCTUF) for image CS. Specifically, we design a novel Dual Cross Attention (Dual-CA) sub-module, which consists of an Inertia-Supplied Cross Attention (ISCA) block and a Projection-Guided Cross Attention (PGCA) block. ISCA block introduces multi-channel inertia forces and increases the memory effect by a cross attention mechanism between adjacent iterations. And, PGCA block achieves an enhanced information interaction, which introduces the inertia force into the gradient descent step through a cross attention block. Extensive CS experiments manifest that our OCTUF achieves superior performance compared to state-of-the-art methods while requiring lower training complexity. Codes are available at https://github.com/songjiechong/OCTUF.

TruFor: Leveraging All-Round Clues for Trustworthy Image Forgery Detection and Localization
Guillaro, FabrizioandCozzolino, DavideandSud, AvneeshandDufour, NicholasandVerdoliva, Luisa



Research question: Developing a forensic framework applicable to a wide variety of image manipulation methods.
Motivation: Existing forensic methods cannot cope effectively with image manipulations ranging from classic low-cost cheapfakes to recent deep-learning-based ones.
Method: High-level and low-level traces are extracted through a transformer-based fusion architecture that combines the RGB image with a learned noise-sensitive fingerprint, which embeds artifacts related to camera-internal and camera-external processing by training only on real data in a self-supervised manner.
Results: The method reliably detects and localizes a variety of local manipulations and outperforms the state of the art in experiments on several datasets.

In this paper we present TruFor, a forensic framework that can be applied to a large variety of image manipulation methods, from classic cheapfakes to more recent manipulations based on deep learning. We rely on the extraction of both high-level and low-level traces through a transformer-based fusion architecture that combines the RGB image and a learned noise-sensitive fingerprint. The latter learns to embed the artifacts related to the camera internal and external processing by training only on real data in a self-supervised manner. Forgeries are detected as deviations from the expected regular pattern that characterizes each pristine image. Looking for anomalies makes the approach able to robustly detect a variety of local manipulations, ensuring generalization. In addition to a pixel-level localization map and a whole-image integrity score, our approach outputs a reliability map that highlights areas where localization predictions may be error-prone. This is particularly important in forensic applications in order to reduce false alarms and allow for a large scale analysis. Extensive experiments on several datasets show that our method is able to reliably detect and localize both cheapfakes and deepfakes manipulations outperforming state-of-the-art works. Code is publicly available at https://grip-unina.github.io/TruFor/.

Topology-Guided Multi-Class Cell Context Generation for Digital Pathology
Abousamra, ShahiraandGupta, RajarsiandKurc, TahsinandSamaras, DimitrisandSaltz, JoelandChen, Chao



Research question: How to effectively exploit the spatial context of cells for cell classification, cancer diagnosis, and prognosis.
Motivation: Cells form diverse mixtures, lineages, clusters, and holes; learning to model such complex structural patterns of cells is challenging.
Method: Mathematical tools from spatial statistics and topological data analysis are introduced, incorporating structural descriptors into a deep generative model as both conditional inputs and a differentiable loss.
Results: High-quality multi-class cell layouts are generated for the first time, and the topology-rich layouts are shown to serve as data augmentation and to improve downstream tasks such as cell classification.

In digital pathology, the spatial context of cells is important for cell classification, cancer diagnosis and prognosis. To model such complex cell context, however, is challenging. Cells form different mixtures, lineages, clusters and holes. To model such structural patterns in a learnable fashion, we introduce several mathematical tools from spatial statistics and topological data analysis. We incorporate such structural descriptors into a deep generative model as both conditional inputs and a differentiable loss. This way, we are able to generate high quality multi-class cell layouts for the first time. We show that the topology-rich cell layouts can be used for data augmentation and improve the performance of downstream tasks such as cell classification.

Learning Steerable Function for Efficient Image Resampling
Li, JiachengandChen, ChangandHuang, WeiandLang, ZhiqiangandSong, FenglongandYan, YouliangandXiong, Zhiwei



Research question: How to improve the efficiency and continuity of image resampling?
Motivation: Existing deep networks have made impressive progress in resampling performance, but efficiency and continuous resampling remain problematic.
Method: A novel method of Learning Resampling Function (LeRF) is proposed, which combines the structural priors learned by deep networks with the locally continuous assumption of interpolation methods. It assigns spatially varying steerable resampling functions to input image pixels and uses a neural network to predict the hyper-parameters that determine the orientations of these resampling functions.
Results: Experiments show that the method runs as fast as interpolation, generalizes to arbitrary transformations, and outperforms interpolation, e.g., with up to a 3 dB PSNR gain over bicubic for x2 upsampling on Manga109.

Image resampling is a basic technique that is widely employed in daily applications. Existing deep neural networks (DNNs) have made impressive progress in resampling performance. Yet these methods are still not the perfect substitute for interpolation, due to the issues of efficiency and continuous resampling. In this work, we propose a novel method of Learning Resampling Function (termed LeRF), which takes advantage of both the structural priors learned by DNNs and the locally continuous assumption of interpolation methods. Specifically, LeRF assigns spatially-varying steerable resampling functions to input image pixels and learns to predict the hyper-parameters that determine the orientations of these resampling functions with a neural network. To achieve highly efficient inference, we adopt look-up tables (LUTs) to accelerate the inference of the learned neural network. Furthermore, we design a directional ensemble strategy and edge-sensitive indexing patterns to better capture local structures. Extensive experiments show that our method runs as fast as interpolation, generalizes well to arbitrary transformations, and outperforms interpolation significantly, e.g., up to 3dB PSNR gain over bicubic for x2 upsampling on Manga109.
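A "steerable resampling function" of the kind the abstract describes can be illustrated with an anisotropic Gaussian whose orientation and per-axis scales are hyper-parameters. In LeRF these hyper-parameters are predicted per pixel by a network and the inference is accelerated with look-up tables; here they are plain arguments, so treat this as an illustrative toy rather than the paper's kernel.

```python
import numpy as np

def steerable_kernel(theta, sigma_u, sigma_v, size=5):
    """Anisotropic Gaussian resampling kernel steered by an angle
    `theta` and two axis scales, normalized to sum to one."""
    r = np.arange(size) - size // 2
    X, Y = np.meshgrid(r, r)
    c, s = np.cos(theta), np.sin(theta)
    u = c * X + s * Y                      # rotated coordinates
    v = -s * X + c * Y
    k = np.exp(-0.5 * ((u / sigma_u) ** 2 + (v / sigma_v) ** 2))
    return k / k.sum()

def resample_at(img, y, x, kernel):
    """Resampled value at (y, x): kernel-weighted average of the patch."""
    h = kernel.shape[0] // 2
    patch = img[y - h:y + h + 1, x - h:x + h + 1]
    return float((patch * kernel).sum())

k = steerable_kernel(0.0, 1.0, 1.0)        # isotropic special case
```

Stretching `sigma_u` relative to `sigma_v` and rotating `theta` lets the kernel follow a local edge direction, which is the intuition behind predicting orientations per pixel.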

TokenHPE: Learning Orientation Tokens for Efficient Head Pose Estimation via Transformers
Zhang, ChengandLiu, HaiandDeng, YongjianandXie, BochenandLi, Youfu



Research question: How to handle extreme head pose randomness and severe occlusions in head pose estimation.
Motivation: Existing methods cannot deal with the extreme randomness and severe occlusions encountered in head pose estimation.
Method: A novel critical-minority-relationship-aware method based on the Transformer architecture is proposed. Orientation tokens are designed to explicitly encode the basic orientation regions, and a novel token-guided multi-loss function guides the orientation tokens to learn the desired regional similarities and relationships.
Results: Evaluation on three challenging benchmark HPE datasets shows that the method outperforms existing approaches.

Head pose estimation (HPE) has been widely used in the fields of human machine interaction, self-driving, and attention estimation. However, existing methods cannot deal with extreme head pose randomness and serious occlusions. To address these challenges, we identify three cues from head images, namely, neighborhood similarities, significant facial changes, and critical minority relationships. To leverage the observed findings, we propose a novel critical minority relationship-aware method based on the Transformer architecture in which the facial part relationships can be learned. Specifically, we design several orientation tokens to explicitly encode the basic orientation regions. Meanwhile, a novel token guide multi-loss function is designed to guide the orientation tokens as they learn the desired regional similarities and relationships. We evaluate the proposed method on three challenging benchmark HPE datasets. Experiments show that our method achieves better performance compared with state-of-the-art methods. Our code is publicly available at https://github.com/zc2023/TokenHPE.

RIFormer: Keep Your Vision Backbone Effective but Removing Token Mixer
Wang, JiahaoandZhang, SongyangandLiu, YongandWu, TaiqiangandYang, YujiuandLiu, XihuiandChen, KaiandLuo, PingandLin, Dahua



Research question: How to keep a vision backbone effective while removing the token mixers in its basic building blocks.
Motivation: Token mixers, such as the self-attention in vision transformers, enable information exchange between spatial tokens but incur considerable computational cost and latency. Removing them directly leaves an incomplete structural prior and causes a significant accuracy drop.
Method: A RepIdentityFormer is first developed based on the re-parameterization idea to study token-mixer-free architectures. An improved learning paradigm is then explored to break the limitations of the simple token-mixer-free backbone, and the empirical practice is summarized into 5 guidelines. With the proposed optimization strategy, an extremely simple yet effective vision backbone is built that enjoys high efficiency during inference. Extensive experiments and ablations further show that the inductive bias of a network architecture can be incorporated into a simple structure with an appropriate optimization strategy.
Results: The results show that removing token mixers while applying the proposed optimization strategy preserves model effectiveness and improves efficiency. The authors hope this work serves as a starting point for optimization-driven efficient network design.

This paper studies how to keep a vision backbone effective while removing token mixers in its basic building blocks. Token mixers, as self-attention for vision transformers (ViTs), are intended to perform information communication between different spatial tokens but suffer from considerable computational cost and latency. However, directly removing them will lead to an incomplete model structure prior and thus bring a significant accuracy drop. To this end, we first develop a RepIdentityFormer based on the re-parameterizing idea to study the token-mixer-free model architecture. We then explore the improved learning paradigm to break the limitation of the simple token-mixer-free backbone, and summarize the empirical practice into 5 guidelines. Equipped with the proposed optimization strategy, we are able to build an extremely simple vision backbone with encouraging performance, while enjoying high efficiency during inference. Extensive experiments and ablative analysis also demonstrate that the inductive bias of network architecture can be incorporated into a simple network structure with an appropriate optimization strategy. We hope this work can serve as a starting point for the exploration of optimization-driven efficient network design.
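The "re-parameterizing idea" the abstract builds on can be shown in its simplest algebraic form: at training time a block may compute a linear branch plus an identity shortcut, and at inference time the two collapse into a single linear operation. RIFormer applies this style of idea (among other techniques) to study token-mixer-free blocks; the toy below shows only the algebra.

```python
import numpy as np

def merge_identity_branch(weight, bias):
    """Collapse `W x + b` plus an identity shortcut into one linear
    op `(W + I) x + b` for inference-time re-parameterization."""
    return weight + np.eye(weight.shape[0]), bias

rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(4, 4)), rng.normal(size=4), rng.normal(size=4)
W_m, b_m = merge_identity_branch(W, b)
two_branch = W @ x + b + x        # training-time: branch + shortcut
merged = W_m @ x + b_m            # inference-time: single linear op
```

Because the merged operator is mathematically identical, the extra training-time structure costs nothing at deployment, which is what makes re-parameterization attractive for studying simplified backbones.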

Context-Based Trit-Plane Coding for Progressive Image Compression
Jeon, SeungminandChoi, KwangPyoandPark, YoungoandKim, Chang-Su



Research question: How to achieve deep progressive image compression with an autoregressive context model.
Motivation: Trit-plane coding enables deep progressive image compression, but it cannot use autoregressive context models.
Method: The context-based trit-plane coding (CTC) algorithm is proposed. A context-based rate reduction module and a context-based distortion reduction module are developed, and a decoder retraining scheme is designed, to achieve more compact progressive compression.
Results: Experiments show that CTC reduces BD-rate by 14.84% relative to the baseline trit-plane codec on the Kodak lossless dataset, while increasing time complexity only marginally.

Trit-plane coding enables deep progressive image compression, but it cannot use autoregressive context models. In this paper, we propose the context-based trit-plane coding (CTC) algorithm to achieve progressive compression more compactly. First, we develop the context-based rate reduction module to estimate trit probabilities of latent elements accurately and thus encode the trit-planes compactly. Second, we develop the context-based distortion reduction module to refine partial latent tensors from the trit-planes and improve the reconstructed image quality. Third, we propose a retraining scheme for the decoder to attain better rate-distortion tradeoffs. Extensive experiments show that CTC outperforms the baseline trit-plane codec significantly, e.g. by -14.84% in BD-rate on the Kodak lossless dataset, while increasing the time complexity only marginally. The source codes are available at https://github.com/seungminjeon-github/CTC.
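The trit-plane representation underlying this work can be illustrated directly: latents are decomposed into base-3 digit planes and transmitted from most to least significant, so a prefix of the planes yields a coarse reconstruction. This is a minimal illustration; the actual codec additionally centers/quantizes latents and entropy-codes each plane using estimated trit probabilities, which is where CTC's context-based rate reduction applies.

```python
import numpy as np

def to_trit_planes(latent, n_planes):
    """Decompose non-negative integer latents into base-3 digit
    planes, most significant first."""
    return np.stack([(latent // 3 ** i) % 3
                     for i in range(n_planes - 1, -1, -1)])

def from_trit_planes(planes, n_received=None):
    """Progressive reconstruction from the first `n_received` planes."""
    n = planes.shape[0]
    n_received = n if n_received is None else n_received
    return sum(planes[j] * 3 ** (n - 1 - j) for j in range(n_received))

x = np.array([[0, 7], [13, 26]])
planes = to_trit_planes(x, 3)
full = from_trit_planes(planes)          # all planes -> exact values
coarse = from_trit_planes(planes, 1)     # one plane -> coarse preview
```

Each additional plane refines every latent by a factor-of-3 step, which is what makes the bitstream progressive.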

Self-Supervised Learning for Multimodal Non-Rigid 3D Shape Matching
Cao, DongliangandBernard, Florian



Research question: How to improve 3D shape matching quality, especially across the two representations of point clouds and meshes.
Motivation: Point clouds are a common representation of raw real-world 3D data (e.g., from laser scanners), while meshes encode rich and expressive topological information but typically require some form of (often manual) curation to create. In turn, methods that rely purely on point clouds cannot match the quality of mesh-based methods that exploit the additional topological structure.
Method: We introduce a self-supervised multimodal learning strategy that combines mesh-based functional map regularization with a contrastive loss coupling mesh and point cloud data. Our shape matching approach obtains intramodal correspondences for triangle meshes, complete point clouds, and partially observed point clouds, as well as correspondences across these data modalities.
Results: We demonstrate state-of-the-art results on several challenging benchmark datasets, even in comparison with recent supervised methods, and our method reaches previously unseen cross-dataset generalization ability.

The matching of 3D shapes has been extensively studied for shapes represented as surface meshes, as well as for shapes represented as point clouds. While point clouds are a common representation of raw real-world 3D data (e.g. from laser scanners), meshes encode rich and expressive topological information, but their creation typically requires some form of (often manual) curation. In turn, methods that purely rely on point clouds are unable to meet the matching quality of mesh-based methods that utilise the additional topological structure. In this work we close this gap by introducing a self-supervised multimodal learning strategy that combines mesh-based functional map regularisation with a contrastive loss that couples mesh and point cloud data. Our shape matching approach allows to obtain intramodal correspondences for triangle meshes, complete point clouds, and partially observed point clouds, as well as correspondences across these data modalities. We demonstrate that our method achieves state-of-the-art results on several challenging benchmark datasets even in comparison to recent supervised methods, and that our method reaches previously unseen cross-dataset generalisation ability.
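The functional-map machinery that the method regularizes can be reduced to a least-squares fit: a small matrix C maps the spectral coefficients of descriptor functions on the source shape to those on the target. The sketch below shows only this bare fit under a shared toy eigenbasis; the paper adds learned features, structural regularization, and the cross-modal contrastive loss on top.

```python
import numpy as np

def fit_functional_map(desc_src, desc_tgt, basis_src, basis_tgt):
    """Least-squares functional map C with C @ A ~= B, where A and B
    are corresponding descriptors projected into each shape's
    (e.g. Laplacian) eigenbasis."""
    A = basis_src.T @ desc_src          # (k, d) spectral coefficients
    B = basis_tgt.T @ desc_tgt
    C_T, *_ = np.linalg.lstsq(A.T, B.T, rcond=None)
    return C_T.T                        # (k, k)

rng = np.random.default_rng(0)
n, k, d = 50, 4, 12
basis = np.linalg.qr(rng.normal(size=(n, k)))[0]   # toy orthonormal basis
C_true = rng.normal(size=(k, k))
desc_src = rng.normal(size=(n, d))
# build target descriptors whose spectral coefficients are C_true @ A
desc_tgt = basis @ (C_true @ (basis.T @ desc_src))
C_est = fit_functional_map(desc_src, desc_tgt, basis, basis)
```

When the descriptors are consistent, the recovered C converts functions between shapes and can be decoded into point-to-point correspondences.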

Recurrent Vision Transformers for Object Detection With Event Cameras
Gehrig, MathiasandScaramuzza, Davide



Research question: This paper proposes Recurrent Vision Transformers (RVTs), a novel backbone for object detection with event cameras.
Motivation: Event cameras offer unique properties such as sub-millisecond latency, high dynamic range, and strong robustness to motion blur, giving them great potential for low-latency object detection and tracking.
Method: By revisiting the design of recurrent vision backbones, inference time is reduced by a factor of 6 while retaining similar performance. Three key concepts are adopted: first, a convolutional prior that can be regarded as a conditional positional embedding; second, local and dilated global self-attention for spatial feature interaction; third, recurrent temporal feature aggregation to minimize latency while retaining temporal information.
Results: Experiments show that RVTs reach state-of-the-art performance on event-based object detection, achieving 47.2% mAP on the Gen1 automotive dataset, with fast inference (<12 ms on a T4 GPU) and favorable parameter efficiency (5x fewer parameters than prior art). The study brings new insights into effective design choices beyond event-based vision.

We present Recurrent Vision Transformers (RVTs), a novel backbone for object detection with event cameras. Event cameras provide visual information with sub-millisecond latency at a high-dynamic range and with strong robustness against motion blur. These unique properties offer great potential for low-latency object detection and tracking in time-critical scenarios. Prior work in event-based vision has achieved outstanding detection performance but at the cost of substantial inference time, typically beyond 40 milliseconds. By revisiting the high-level design of recurrent vision backbones, we reduce inference time by a factor of 6 while retaining similar performance. To achieve this, we explore a multi-stage design that utilizes three key concepts in each stage: First, a convolutional prior that can be regarded as a conditional positional embedding. Second, local- and dilated global self-attention for spatial feature interaction. Third, recurrent temporal feature aggregation to minimize latency while retaining temporal information. RVTs can be trained from scratch to reach state-of-the-art performance on event-based object detection - achieving an mAP of 47.2% on the Gen1 automotive dataset. At the same time, RVTs offer fast inference (<12 ms on a T4 GPU) and favorable parameter efficiency (5 times fewer than prior art). Our study brings new insights into effective design choices that can be fruitful for research beyond event-based vision.
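The third concept, recurrent temporal feature aggregation, amounts to blending each frame's spatial features with a running temporal state. RVTs use LSTM cells inside each backbone stage; the single-gate update below is a deliberately minimal stand-in for that idea, with randomly initialized gate weights standing in for learned parameters.

```python
import numpy as np

def gated_temporal_aggregation(feat, state, W_gate, U_gate):
    """Blend current features with the running temporal state via a
    learned sigmoid gate (a reduced sketch of an LSTM-style update)."""
    z = 1.0 / (1.0 + np.exp(-(feat @ W_gate + state @ U_gate)))  # gate
    return z * feat + (1.0 - z) * state

rng = np.random.default_rng(0)
dim = 8
W_gate, U_gate = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
state = np.zeros(dim)
for _ in range(5):                      # aggregate five event frames
    feat = rng.normal(size=dim)
    state = gated_temporal_aggregation(feat, state, W_gate, U_gate)
```

Because the state is updated in constant time per frame, temporal context accumulates without reprocessing past frames, which is what keeps per-frame latency low.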

METransformer: Radiology Report Generation by Transformer With Multiple Learnable Expert Tokens
Wang, ZhanyuandLiu, LingqiaoandWang, LeiandZhou, Luping



Research question: How a multi-expert joint diagnosis mechanism can improve on the single-expert frameworks common in clinical scenarios.
Motivation: Multi-specialist consultation significantly benefits the diagnosis of intricate cases, inspiring a "multi-expert joint diagnosis" mechanism to upgrade the "single expert" framework seen in the current literature.
Method: METransformer is proposed to realize this idea with a transformer-based backbone. The key design is introducing multiple learnable "expert" tokens into both the transformer encoder and decoder. In the encoder, each expert token interacts with the visual tokens and the other expert tokens, learning to attend to different image regions for image representation. The expert tokens are encouraged to capture complementary information by an orthogonal loss that minimizes their overlap. In the decoder, each attended expert token guides the cross-attention between input words and visual tokens, thereby influencing the generated report. A metrics-based expert voting strategy is further developed to produce the final report.
Results: Experiments demonstrate the promising performance of the proposed model on two widely used benchmarks. Last but not least, the framework-level innovation makes this work ready to incorporate advances from existing "single-expert" models to further improve its performance.

In clinical scenarios, multi-specialist consultation could significantly benefit the diagnosis, especially for intricate cases. This inspires us to explore a "multi-expert joint diagnosis" mechanism to upgrade the existing "single expert" framework commonly seen in the current literature. To this end, we propose METransformer, a method to realize this idea with a transformer-based backbone. The key design of our method is the introduction of multiple learnable "expert" tokens into both the transformer encoder and decoder. In the encoder, each expert token interacts with both vision tokens and other expert tokens to learn to attend different image regions for image representation. These expert tokens are encouraged to capture complementary information by an orthogonal loss that minimizes their overlap. In the decoder, each attended expert token guides the cross-attention between input words and visual tokens, thus influencing the generated report. A metrics-based expert voting strategy is further developed to generate the final report. By the multi-experts concept, our model enjoys the merits of an ensemble-based approach but through a manner that is computationally more efficient and supports more sophisticated interactions among experts. Experimental results demonstrate the promising performance of our proposed model on two widely used benchmarks. Last but not least, the framework-level innovation makes our work ready to incorporate advances on existing "single-expert" models to further improve its performance.
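One plausible realization of "an orthogonal loss that minimizes their overlap" is to drive pairwise cosine similarities between the expert tokens toward zero; the paper's exact formulation may differ, so read this as a hedged sketch of the idea.

```python
import numpy as np

def expert_orthogonality_loss(expert_tokens):
    """Penalize overlap between expert tokens so each expert captures
    complementary information. expert_tokens: (n_experts, dim)."""
    t = expert_tokens / (np.linalg.norm(expert_tokens, axis=1,
                                        keepdims=True) + 1e-8)
    gram = t @ t.T                                  # cosine similarities
    off_diag = gram - np.diag(np.diag(gram))        # zero the diagonal
    return float((off_diag ** 2).sum())

orthogonal = np.eye(3)          # three mutually orthogonal experts
collapsed = np.ones((3, 4))     # three identical experts
```

The loss is zero when every pair of experts is orthogonal and grows as they collapse onto the same direction, giving the ensemble-like diversity the method relies on.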

Omni Aggregation Networks for Lightweight Image Super-Resolution
Wang, HangandChen, XuanhongandNi, BingbingandLiu, YutianandLiu, Jinfan



Research question: Lightweight ViT frameworks have made tremendous progress in image super-resolution, but their uni-dimensional self-attention modeling and homogeneous aggregation scheme limit the effective receptive field (ERF), preventing more comprehensive interactions across both the spatial and channel dimensions.
Motivation: To tackle these drawbacks, this paper proposes a new Omni-SR architecture with two enhanced components.
Method: First, an Omni Self-Attention (OSA) paradigm is proposed based on the dense interaction principle, which simultaneously models pixel interactions from both the spatial and channel dimensions, mining potential correlations across the omni-axis (i.e., spatial and channel). Coupled with mainstream window partitioning strategies, OSA achieves superior performance within compelling computational budgets. Second, a multi-scale interaction scheme is proposed to mitigate the sub-optimal ERF (i.e., premature saturation) in shallow models, facilitating local propagation and meso-/global-scale interactions and yielding an omni-scale aggregation building block.
Results: Extensive experiments demonstrate that Omni-SR achieves record-high performance on lightweight super-resolution benchmarks (e.g., 26.95 dB on Urban100 x4 with only 792K parameters).

While the lightweight ViT framework has made tremendous progress in image super-resolution, its uni-dimensional self-attention modeling, as well as homogeneous aggregation scheme, limit its effective receptive field (ERF) to include more comprehensive interactions from both spatial and channel dimensions. To tackle these drawbacks, this work proposes two enhanced components under a new Omni-SR architecture. First, an Omni Self-Attention (OSA) paradigm is proposed based on dense interaction principle, which can simultaneously model pixel-interaction from both spatial and channel dimensions, mining the potential correlations across omni-axis (i.e., spatial and channel). Coupled with mainstream window partitioning strategies, OSA can achieve superior performance with compelling computational budgets. Second, a multi-scale interaction scheme is proposed to mitigate sub-optimal ERF (i.e., premature saturation) in shallow models, which facilitates local propagation and meso-/global-scale interactions, rendering an omni-scale aggregation building block. Extensive experiments demonstrate that Omni-SR achieves record-high performance on lightweight super-resolution benchmarks (e.g., 26.95dB@Urban100 x4 with only 792K parameters). Our code is available at https://github.com/Francis0625/Omni-SR.
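The "omni-axis" (spatial plus channel) interaction can be illustrated by applying attention along the token axis and then along the channel axis of the same feature map. Learned Q/K/V projections and window partitioning are omitted here, so this only conveys the two-axis idea behind OSA, not its actual block.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def omni_self_attention(tokens):
    """Spatial then channel attention on a (n_tokens, dim) map."""
    n, d = tokens.shape
    # spatial attention: each token aggregates over all tokens
    spatial = softmax(tokens @ tokens.T / np.sqrt(d)) @ tokens
    # channel attention: each channel aggregates over all channels
    attn_c = softmax(spatial.T @ spatial / np.sqrt(n))   # (d, d)
    return spatial @ attn_c.T

rng = np.random.default_rng(0)
out = omni_self_attention(rng.normal(size=(16, 8)))
```

Attending along both axes is what lets correlations flow across the full feature map rather than only between spatial positions.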

Correlational Image Modeling for Self-Supervised Visual Pre-Training
Li, WeiandXie, JiahaoandLoy, ChenChange



Research question: This paper introduces Correlational Image Modeling (CIM), a novel and effective approach to self-supervised visual pre-training.
Motivation: Current pre-trained models still leave room for improvement on visual tasks; the authors therefore pre-train on images in a self-supervised manner.
Method: CIM performs a simple pretext task: image regions (exemplars) are randomly cropped from an input image (context), and correlation maps between the exemplars and the context are predicted. Three key designs make correlational image modeling a nontrivial and meaningful self-supervisory task. First, to generate useful exemplar-context pairs, image regions are cropped with various scales, shapes, rotations, and transformations. Second, a bootstrap learning framework with online and target networks is employed; during pre-training, the former takes exemplars as inputs while the latter converts the context. Third, the output correlation maps are modeled with a simple cross-attention block, in which the context serves as queries and the exemplars offer values and keys.
Results: Experiments show that CIM performs on par with or better than the current state of the art on self-supervised and transfer benchmarks.

We introduce Correlational Image Modeling (CIM), a novel but surprisingly effective approach to self-supervised visual pre-training. Our CIM performs a simple pretext task: we randomly crop image regions (exemplar) from an input image (context) and predict correlation maps between the exemplars and the context. Three key designs enable correlational image modeling as a nontrivial and meaningful self-supervisory task. First, to generate useful exemplar-context pairs, we consider cropping image regions with various scales, shapes, rotations, and transformations. Second, we employ a bootstrap learning framework that involves online and target networks. During pre-training, the former takes exemplars as inputs while the latter converts the context. Third, we model the output correlation maps via a simple cross-attention block, within which the context serves as queries and the exemplars offer values and keys. We show that CIM performs on par or better than the current state of the art on self-supervised and transfer benchmarks.
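The cross-attention arrangement the abstract describes, with the context as queries and the exemplar as keys/values, can be sketched directly. Projection matrices are omitted for brevity, so this is an illustration of the block's data flow rather than its exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def correlation_cross_attention(context, exemplar):
    """Cross-attention where context tokens query exemplar tokens.
    context: (n_ctx, d), exemplar: (n_ex, d)."""
    d = context.shape[-1]
    corr = softmax(context @ exemplar.T / np.sqrt(d))  # (n_ctx, n_ex)
    attended = corr @ exemplar                         # (n_ctx, d)
    return corr, attended

rng = np.random.default_rng(0)
corr, attended = correlation_cross_attention(rng.normal(size=(49, 16)),
                                             rng.normal(size=(9, 16)))
```

Each row of `corr` says where an exemplar patch "lights up" within the context, which is exactly the correlation-map prediction the pretext task supervises.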

Self-Supervised Implicit Glyph Attention for Text Recognition
Guan, TongkunandGu, ChaochenandTu, JingzhengandYang, XueandFeng, QiandZhao, YudiandShen, Wei



Research question: How to improve the effectiveness of attention mechanisms in scene text recognition (STR).
Motivation: Current attention mechanisms, whether implicit or supervised, have drawbacks: implicit attention may extract coarse or even incorrect attention regions, while supervised attention requires laborious character-level bounding-box annotations and is category-specific.
Method: A novel attention mechanism, self-supervised implicit glyph attention (SIGA), is proposed. SIGA delineates the glyph structures of text images by jointly self-supervised text segmentation and implicit attention alignment, which serve as supervision to improve attention correctness without extra character-level annotations.
Results: Experiments show that SIGA performs consistently and significantly better than previous attention-based STR methods, in both attention correctness and final recognition performance.

The attention mechanism has become the de facto module in scene text recognition (STR) methods, due to its capability of extracting character-level representations. These methods can be summarized into implicit attention based and supervised attention based, depended on how the attention is computed, i.e., implicit attention and supervised attention are learned from sequence-level text annotations and character-level bounding box annotations, respectively. Implicit attention, as it may extract coarse or even incorrect spatial regions as character attention, is prone to suffering from an alignment-drifted issue. Supervised attention can alleviate the above issue, but it is category-specific, which requires extra laborious character-level bounding box annotations and would be memory-intensive when the number of character categories is large. To address the aforementioned issues, we propose a novel attention mechanism for STR, self-supervised implicit glyph attention (SIGA). SIGA delineates the glyph structures of text images by jointly self-supervised text segmentation and implicit attention alignment, which serve as the supervision to improve attention correctness without extra character-level annotations. Experimental results demonstrate that SIGA performs consistently and significantly better than previous attention-based STR methods, in terms of both attention correctness and final recognition performance on publicly available context benchmarks and our contributed contextless benchmarks.

ACL-SPC: Adaptive Closed-Loop System for Self-Supervised Point Cloud Completion
Hong, SangminandYavartanoo, MohsenandNeshatavar, ReyhanehandLee, KyoungMu



Research question: This paper addresses point cloud completion, i.e., filling in the missing parts of a partial point cloud obtained from depth sensors to generate a complete point cloud.
Motivation: Although supervised methods have made steep progress on the synthetic point cloud completion task, they are hardly applicable to real-world scenarios due to the domain gap between synthetic and real-world datasets or the requirement of prior information.
Method: To overcome these limitations, we propose ACL-SPC, a novel self-supervised framework for point cloud completion that trains and tests on the same data. ACL-SPC takes a single partial input and attempts to output the complete point cloud using an adaptive closed-loop (ACL) system that enforces the same output under variations of the input.
Results: We evaluate ACL-SPC on various datasets and show that it successfully learns to complete partial point clouds as the first self-supervised scheme. The results show our method is comparable with unsupervised methods and outperforms supervised methods trained on synthetic datasets when evaluated on real-world data. Extensive experiments justify the necessity of self-supervised learning and the effectiveness of the proposed method for real-world point cloud completion.

Point cloud completion addresses filling in the missing parts of a partial point cloud obtained from depth sensors and generating a complete point cloud. Although there has been steep progress in the supervised methods on the synthetic point cloud completion task, it is hardly applicable in real-world scenarios due to the domain gap between the synthetic and real-world datasets or the requirement of prior information. To overcome these limitations, we propose a novel self-supervised framework ACL-SPC for point cloud completion to train and test on the same data. ACL-SPC takes a single partial input and attempts to output the complete point cloud using an adaptive closed-loop (ACL) system that enforces the output same for the variation of an input. We evaluate our ACL-SPC on various datasets to prove that it can successfully learn to complete a partial point cloud as the first self-supervised scheme. Results show that our method is comparable with unsupervised methods and achieves superior performance on the real-world dataset compared to the supervised methods trained on the synthetic dataset. Extensive experiments justify the necessity of self-supervised learning and the effectiveness of our proposed method for the real-world point cloud completion task. The code is publicly available from this link.
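The closed-loop constraint, "the output should be the same under variations of the input", can be expressed as a consistency penalty between completions predicted from different variants of the same shape. The sketch below uses mean pairwise Chamfer distance as that penalty; the paper's actual loop structure and loss are richer, so this only conveys the spirit.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a:(n,3), b:(m,3)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return d2.min(1).mean() + d2.min(0).mean()

def closed_loop_consistency(completions):
    """Mean pairwise Chamfer distance across completions predicted
    from variations of the same partial input."""
    losses = [chamfer(completions[i], completions[j])
              for i in range(len(completions))
              for j in range(i + 1, len(completions))]
    return float(np.mean(losses))

rng = np.random.default_rng(0)
cloud = rng.normal(size=(64, 3))
consistent = [cloud, cloud.copy()]        # identical predictions
inconsistent = [cloud, cloud + 1.0]       # one prediction shifted
```

Because the Chamfer distance compares unordered point sets, the penalty does not require any correspondence between the predictions, matching the self-supervised setting.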

Focus on Details: Online Multi-Object Tracking With Diverse Fine-Grained Representation
Ren, HaoandHan, ShoudongandDing, HuilinandZhang, ZiwenandWang, HongweiandWang, Faquan



Research question: How to extract discriminative feature representations in multi-object tracking (MOT) to maintain a unique identifier for each target.
Motivation: Existing MOT methods mostly embed identity using features of the bounding-box region or the center point, but these coarse-grained global representations become unreliable when targets are occluded.
Method: A diverse fine-grained representation is proposed that comprehensively describes target appearance from both global and local perspectives. To effectively alleviate the semantic misalignment caused by indiscriminate aggregation of contextual information, Flow Alignment FPN (FAFPN) is proposed for multi-scale feature alignment and aggregation. A Multi-head Part Mask Generator (MPMG) is further presented to extract fine-grained representations from the aligned feature maps.
Results: The method achieves state-of-the-art performance on the MOT17 and MOT20 test sets. Even on DanceTrack, where target appearances are extremely similar, it outperforms ByteTrack by 5.0% on HOTA and 5.6% on IDF1. Extensive experiments prove that diverse fine-grained representation makes Re-ID great again in MOT.

Discriminative representation is essential to keep a unique identifier for each target in Multiple object tracking (MOT). Some recent MOT methods extract features of the bounding box region or the center point as identity embeddings. However, when targets are occluded, these coarse-grained global representations become unreliable. To this end, we propose exploring diverse fine-grained representation, which describes appearance comprehensively from global and local perspectives. This fine-grained representation requires high feature resolution and precise semantic information. To effectively alleviate the semantic misalignment caused by indiscriminate contextual information aggregation, Flow Alignment FPN (FAFPN) is proposed for multi-scale feature alignment aggregation. It generates semantic flow among feature maps from different resolutions to transform their pixel positions. Furthermore, we present a Multi-head Part Mask Generator (MPMG) to extract fine-grained representation based on the aligned feature maps. Multiple parallel branches of MPMG allow it to focus on different parts of targets to generate local masks without label supervision. The diverse details in target masks facilitate fine-grained representation. Eventually, benefiting from a Shuffle-Group Sampling (SGS) training strategy with positive and negative samples balanced, we achieve state-of-the-art performance on MOT17 and MOT20 test sets. Even on DanceTrack, where the appearance of targets is extremely similar, our method significantly outperforms ByteTrack by 5.0% on HOTA and 5.6% on IDF1. Extensive experiments have proved that diverse fine-grained representation makes Re-ID great again in MOT.

Structure Aggregation for Cross-Spectral Stereo Image Guided Denoising
Sheng, Zehua and Yu, Zhu and Liu, Xiongwei and Cao, Si-Yuan and Liu, Yuqi and Shen, Hui-Liang and Zhang, Huaqi



Research question: How to recover clean images with salient structures from noisy observations?
Motivation: A common practice in current denoising research is to exploit an additional guidance image with a high signal-to-noise ratio, but current cross-spectral stereo matching methods cannot fully guarantee pixel-level registration accuracy and rarely consider noise contamination.
Method: The first guided denoising framework for cross-spectral stereo images. Instead of aligning the input images via conventional stereo matching, structures are aggregated from the guidance image to estimate a clean structure map for the noisy target image, which is then used to regress the final denoising result with a spatially variant linear representation model. On this basis, a neural network called SANet is designed to complete the entire guided denoising process.
Results: Experiments show that SANet effectively transfers structures from an unaligned guidance image to the restoration result and outperforms state-of-the-art denoisers on various stereo image datasets. The structure aggregation strategy also shows potential for other unaligned guided restoration tasks such as super-resolution and deblurring.

To obtain clean images with salient structures from noisy observations, a growing trend in current denoising studies is to seek the help of additional guidance images with high signal-to-noise ratios, which are often acquired in different spectral bands such as near infrared. Although previous guided denoising methods basically require the input images to be well-aligned, a more common way to capture the paired noisy target and guidance images is to exploit a stereo camera system. However, current studies on cross-spectral stereo matching cannot fully guarantee the pixel-level registration accuracy, and rarely consider the case of noise contamination. In this work, for the first time, we propose a guided denoising framework for cross-spectral stereo images. Instead of aligning the input images via conventional stereo matching, we aggregate structures from the guidance image to estimate a clean structure map for the noisy target image, which is then used to regress the final denoising result with a spatially variant linear representation model. Based on this, we design a neural network, called SANet, to complete the entire guided denoising process. Experimental results show that our SANet can effectively transfer structures from an unaligned guidance image to the restoration result, and outperforms state-of-the-art denoisers on various stereo image datasets. Besides, our structure aggregation strategy also shows its potential to handle other unaligned guided restoration tasks such as super-resolution and deblurring. The source code is available at https://github.com/lustrouselixir/SANet.
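A spatially variant linear representation model in the guided-filter spirit can be sketched as follows: each output pixel is an affine function of the structure map, with coefficients fit per local window. This closed-form fit is illustrative only (the function name and window fitting are assumptions); SANet regresses the result with a learned network.

```python
import numpy as np

def local_linear_regress(structure, target, radius=2, eps=1e-4):
    """Regress target from structure with a per-pixel linear model
    target ~= a * structure + b, where (a, b) are fit by least squares
    in a local window around each pixel (guided-filter-style estimate)."""
    def box(x):
        # Local window mean via an integral image (summed-area table).
        k = 2 * radius + 1
        pad = np.pad(x, radius, mode='edge')
        c = np.cumsum(np.cumsum(pad, 0), 1)
        c = np.pad(c, ((1, 0), (1, 0)))  # zero row/col for the integral image
        return (c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]) / k**2
    mu_s, mu_t = box(structure), box(target)
    var_s = box(structure * structure) - mu_s ** 2
    cov_st = box(structure * target) - mu_s * mu_t
    a = cov_st / (var_s + eps)   # per-pixel slope
    b = mu_t - a * mu_s          # per-pixel offset
    return a * structure + b

rng = np.random.default_rng(6)
s = rng.normal(size=(32, 32))      # toy structure map
t = 3.0 * s + 2.0                  # target that truly is affine in s
recon = local_linear_regress(s, t)
```

When the target really is an affine function of the structure map, the regression recovers it almost exactly; in the denoising setting the per-window fit instead acts as an edge-preserving projection of the noisy target onto the clean structure.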

One-Stage 3D Whole-Body Mesh Recovery With Component Aware Transformer
Lin, Jing and Zeng, Ailing and Wang, Haoqian and Zhang, Lei and Li, Yu



Research question: How to estimate 3D body, face, and hand parameters from a single image.
Motivation: Performing this task with a single network is challenging because of resolution issues: the face and hands usually occupy extremely small regions.
Method: A one-stage whole-body mesh recovery pipeline named OSX that needs no separate network for each part. A Component Aware Transformer (CAT) is designed, composed of a global body encoder and a local face/hand decoder.
Results: Experiments demonstrate the effectiveness of OSX; the whole pipeline is simple yet effective, requires no manual post-processing, and naturally avoids implausible predictions. A large-scale Upper-Body dataset (UBody) is also built, containing persons with partially visible bodies in diverse real-life scenarios, to bridge the gap between the basic task and downstream applications.

Whole-body mesh recovery aims to estimate the 3D human body, face, and hands parameters from a single image. It is challenging to perform this task with a single network due to resolution issues, i.e., the face and hands are usually located in extremely small regions. Existing works usually detect hands and faces, enlarge their resolution to feed into a specific network to predict the parameters, and finally fuse the results. While this copy-paste pipeline can capture the fine-grained details of the face and hands, the connections between different parts cannot be easily recovered in late fusion, leading to implausible 3D rotation and unnatural pose. In this work, we propose a one-stage pipeline for expressive whole-body mesh recovery, named OSX, without separate networks for each part. Specifically, we design a Component Aware Transformer (CAT) composed of a global body encoder and a local face/hand decoder. The encoder predicts the body parameters and provides a high-quality feature map for the decoder, which performs a feature-level upsample-crop scheme to extract high-resolution part-specific features and adopts keypoint-guided deformable attention to estimate hand and face precisely. The whole pipeline is simple yet effective without any manual post-processing and naturally avoids implausible prediction. Comprehensive experiments demonstrate the effectiveness of OSX. Lastly, we build a large-scale Upper-Body dataset (UBody) with high-quality 2D and 3D whole-body annotations. It contains persons with partially visible bodies in diverse real-life scenarios to bridge the gap between the basic task and downstream applications.

Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers
Ren, Bin and Liu, Yahui and Song, Yue and Bi, Wei and Cucchiara, Rita and Sebe, Nicu and Wang, Wei



Research question: This paper addresses the privacy leakage that Position Embeddings (PEs) may cause in Vision Transformers.
Motivation: Although PEs play a key role in the performance of Vision Transformers, they may expose the spatial information of the input patches and thus raise privacy concerns.
Method: A Masked Jigsaw Puzzle (MJP) position embedding method. Selected patches are first shuffled with a block-wise random jigsaw puzzle shuffle algorithm, and their corresponding PEs are occluded. For the non-occluded patches, the PEs remain unchanged, but their spatial relations are strengthened with a dense absolute localization regressor.
Results: Experiments show that 1) PEs explicitly encode 2D spatial relationships and lead to severe privacy leakage under gradient inversion attacks; 2) training ViTs with naively shuffled patches alleviates the problem but harms accuracy; 3) under a certain shuffle ratio, the proposed MJP not only boosts performance and robustness on large-scale datasets (e.g., ImageNet-1K and ImageNet-C, -A/O) but also substantially improves privacy preservation under typical gradient attacks.

Position Embeddings (PEs), an arguably indispensable component in Vision Transformers (ViTs), have been shown to improve the performance of ViTs on many vision tasks. However, PEs have a potentially high risk of privacy leakage since the spatial information of the input patches is exposed. This caveat naturally raises a series of interesting questions about the impact of PEs on accuracy, privacy, prediction consistency, etc. To tackle these issues, we propose a Masked Jigsaw Puzzle (MJP) position embedding method. In particular, MJP first shuffles the selected patches via our block-wise random jigsaw puzzle shuffle algorithm, and their corresponding PEs are occluded. Meanwhile, for the non-occluded patches, the PEs remain the original ones but their spatial relation is strengthened via our dense absolute localization regressor. The experimental results reveal that 1) PEs explicitly encode the 2D spatial relationship and lead to severe privacy leakage problems under gradient inversion attack; 2) Training ViTs with the naively shuffled patches can alleviate the problem, but it harms the accuracy; 3) Under a certain shuffle ratio, the proposed MJP not only boosts the performance and robustness on large-scale datasets (i.e., ImageNet-1K and ImageNet-C, -A/O) but also improves the privacy preservation ability under typical gradient attacks by a large margin. The source code and trained models are available at https://github.com/yhlleo/MJP.
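The shuffling step can be sketched as follows. This is a minimal NumPy version on a toy patch sequence; the function name and single-sequence setting are illustrative, and MJP additionally occludes the PEs at the shuffled positions, which is only indicated here by the returned mask.

```python
import numpy as np

def jigsaw_shuffle(patches, shuffle_ratio, rng):
    """Shuffle a random subset of patch positions; the rest stay in place.

    Returns the shuffled patches and a boolean mask marking the shuffled
    positions (whose position embeddings MJP would occlude)."""
    n = len(patches)
    idx = rng.choice(n, size=int(n * shuffle_ratio), replace=False)
    out = patches.copy()
    out[idx] = patches[rng.permutation(idx)]  # jigsaw permutation among idx
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    return out, mask

rng = np.random.default_rng(0)
patches = np.arange(16, dtype=float).reshape(16, 1)  # 16 toy patch embeddings
shuffled, mask = jigsaw_shuffle(patches, 0.5, rng)
```

The shuffle destroys positional information only for the selected subset, which is what lets the accuracy/privacy trade-off be controlled by the shuffle ratio.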

Robust Multiview Point Cloud Registration With Reliable Pose Graph Initialization and History Reweighting
Wang, Haiping and Liu, Yuan and Dong, Zhen and Guo, Yulan and Liu, Yu-Shen and Wang, Wenping and Yang, Bisheng



Research question: This paper addresses multiview registration of point clouds.
Motivation: Existing multiview registration methods rely on exhaustive pairwise registration to construct a densely-connected pose graph and apply Iteratively Reweighted Least Squares (IRLS) on the graph to compute scan poses; however, building a densely-connected graph is time-consuming and contains many outlier edges, making it hard for the subsequent IRLS to find correct poses.
Method: A neural network is first used to estimate the overlap between scan pairs, enabling the construction of a sparse but reliable pose graph. A novel history reweighting function is then designed for the IRLS scheme, providing strong robustness to outlier edges on the graph.
Results: Compared with existing multiview registration methods, the approach achieves 11% higher registration recall on the 3DMatch dataset and 13% lower registration errors on the ScanNet dataset while reducing the required pairwise registrations by 70%. Comprehensive ablation studies demonstrate the effectiveness of the designs. The source code is available at https://github.com/WHU-USI3DV/SGHR.

In this paper, we present a new method for the multiview registration of point clouds. Previous multiview registration methods rely on exhaustive pairwise registration to construct a densely-connected pose graph and apply Iteratively Reweighted Least Square (IRLS) on the pose graph to compute the scan poses. However, constructing a densely-connected graph is time-consuming and contains lots of outlier edges, which makes the subsequent IRLS struggle to find correct poses. To address the above problems, we first propose to use a neural network to estimate the overlap between scan pairs, which enables us to construct a sparse but reliable pose graph. Then, we design a novel history reweighting function in the IRLS scheme, which has strong robustness to outlier edges on the graph. In comparison with existing multiview registration methods, our method achieves 11% higher registration recall on the 3DMatch dataset and 13% lower registration errors on the ScanNet dataset while reducing 70% required pairwise registrations. Comprehensive ablation studies are conducted to demonstrate the effectiveness of our designs. The source code is available at https://github.com/WHU-USI3DV/SGHR.
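For reference, the vanilla IRLS scheme that the history reweighting function improves upon alternates between a weighted least-squares solve and a residual-driven weight update. A minimal robust linear fit with a Huber-style weight, as an illustration only: the paper's variant additionally incorporates the weight history across iterations and operates on pose graphs rather than a plain linear system.

```python
import numpy as np

def irls(A, b, iters=30, delta=1.0):
    """Robust fit of A @ x ~= b via Iteratively Reweighted Least Squares
    with a Huber-style weight w_i = delta / max(|r_i|, delta)."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]      # plain least-squares init
    for _ in range(iters):
        r = A @ x - b
        w = delta / np.maximum(np.abs(r), delta)  # downweight large residuals
        sw = np.sqrt(w)
        x = np.linalg.lstsq(sw[:, None] * A, sw * b, rcond=None)[0]
    return x

rng = np.random.default_rng(1)
x_true = np.array([2.0, -1.0])
A = rng.normal(size=(200, 2))
b = A @ x_true + 0.01 * rng.normal(size=200)
b[:20] += 10.0                                    # 10% gross outliers
x_est = irls(A, b)
```

The weight update is what makes outlier measurements (here, the corrupted 10% of `b`) lose influence over the iterations, analogous to outlier pose-graph edges being downweighted.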

PointCMP: Contrastive Mask Prediction for Self-Supervised Learning on Point Cloud Videos
Shen, Zhiqiang and Sheng, Xiaoxiao and Wang, Longguang and Guo, Yulan and Liu, Qiong and Zhou, Xi



Research question: How to extract high-quality representations from point cloud videos using unlabeled data.
Motivation: Self-supervised learning is appealing for point cloud videos because of their high labelling cost.
Method: A contrastive mask prediction (PointCMP) framework for self-supervised learning on point cloud videos. Specifically, PointCMP employs a two-branch structure to learn local and global spatio-temporal information simultaneously, and on top of it develops a mutual-similarity-based augmentation module that synthesizes hard samples at the feature level. By masking dominant tokens and erasing principal channels, hard samples are generated to facilitate learning representations with better discrimination and generalization.
Results: Extensive experiments show that PointCMP achieves state-of-the-art performance on benchmark datasets and outperforms existing fully-supervised counterparts. Transfer learning results demonstrate the superiority of the learned representations across different datasets and tasks.

Self-supervised learning can extract representations of good quality from solely unlabeled data, which is appealing for point cloud videos due to their high labelling cost. In this paper, we propose a contrastive mask prediction (PointCMP) framework for self-supervised learning on point cloud videos. Specifically, our PointCMP employs a two-branch structure to achieve simultaneous learning of both local and global spatio-temporal information. On top of this two-branch structure, a mutual similarity based augmentation module is developed to synthesize hard samples at the feature level. By masking dominant tokens and erasing principal channels, we generate hard samples to facilitate learning representations with better discrimination and generalization performance. Extensive experiments show that our PointCMP achieves state-of-the-art performance on benchmark datasets and outperforms existing fully-supervised counterparts. Transfer learning results demonstrate the superiority of the learned representations across different datasets and tasks.

Multimodal Industrial Anomaly Detection via Hybrid Fusion
Wang, Yue and Peng, Jinlong and Zhang, Jiangning and Yi, Ran and Wang, Yabiao and Wang, Chengjie



Research question: How to perform effective multimodal industrial anomaly detection based on 3D point clouds and RGB images.
Motivation: Existing multimodal industrial anomaly detection methods directly concatenate the multimodal features, which causes strong interference between features and harms detection performance.
Method: A novel multimodal anomaly detection method, Multi-3D-Memory (M3DM), with a hybrid fusion scheme: first, an unsupervised feature fusion with patch-wise contrastive learning encourages interaction between features of different modalities; second, a decision-layer fusion with multiple memory banks avoids information loss, and additional novelty classifiers make the final decision. A point feature alignment operation is further proposed to better align the point cloud and RGB features.
Results: Extensive experiments show that the model outperforms state-of-the-art methods in both detection and segmentation precision on the MVTec-3D AD dataset.

2D-based Industrial Anomaly Detection has been widely discussed, however, multimodal industrial anomaly detection based on 3D point clouds and RGB images still has many untouched fields. Existing multimodal industrial anomaly detection methods directly concatenate the multimodal features, which leads to a strong disturbance between features and harms the detection performance. In this paper, we propose Multi-3D-Memory (M3DM), a novel multimodal anomaly detection method with a hybrid fusion scheme: firstly, we design an unsupervised feature fusion with patch-wise contrastive learning to encourage the interaction of different modal features; secondly, we use a decision layer fusion with multiple memory banks to avoid loss of information and additional novelty classifiers to make the final decision. We further propose a point feature alignment operation to better align the point cloud and RGB features. Extensive experiments show that our multimodal industrial anomaly detection model outperforms the state-of-the-art (SOTA) methods on both detection and segmentation precision on the MVTec-3D AD dataset. Code at github.com/nomewang/M3DM.
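The patch-wise contrastive objective is in the InfoNCE family; a generic NumPy sketch follows. This is illustrative only: M3DM's exact loss and feature extractors are not specified in the abstract, and the pairing of anchors and positives here is an assumption (e.g., the two modalities' features of the same patch).

```python
import numpy as np

def info_nce(anchor, positive, temperature=0.1):
    """InfoNCE loss for paired patch features: anchor[i] should match
    positive[i] against all other positives in the batch."""
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # cross-entropy on the diagonal

rng = np.random.default_rng(3)
feats = rng.normal(size=(8, 16))
# Well-aligned pairs (small perturbation) vs. unrelated pairs.
loss_aligned = info_nce(feats, feats + 0.01 * rng.normal(size=(8, 16)))
loss_random = info_nce(feats, rng.normal(size=(8, 16)))
```

Minimizing such a loss pulls the two modalities' features of the same patch together while pushing apart features of different patches, which is what "encouraging interaction of different modal features" amounts to.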

BEV@DC: Bird's-Eye View Assisted Training for Depth Completion
Zhou, Wending and Yan, Xu and Liao, Yinghong and Lin, Yuankai and Huang, Jin and Zhao, Gangming and Cui, Shuguang and Li, Zhen



Research question: How to improve image-guided depth completion for autonomous driving.
Motivation: Existing methods exploit the spatial geometric constraints of LiDAR to enhance image-guided depth completion, but achieve only low efficiency and poor generalization.
Method: The BEV@DC model makes full use of LiDAR, with its rich geometric details, during training, while adopting an enhanced depth completion scheme at inference that takes only images (RGB and depth) as input. Specifically, geometry-aware LiDAR features are projected onto a unified BEV space and combined with RGB features to perform BEV completion. Equipped with the newly proposed point-voxel spatial propagation network (PV-SPN), this auxiliary branch provides strong guidance to the original image branch via 3D dense supervision and feature consistency.
Results: Experiments show significant improvements with image-only inputs, achieving state-of-the-art performance on several benchmarks, e.g., ranking first on the challenging KITTI depth completion benchmark.

Depth completion plays a crucial role in autonomous driving, in which cameras and LiDARs are two complementary sensors. Recent approaches attempt to exploit spatial geometric constraints hidden in LiDARs to enhance image-guided depth completion. However, these approaches achieve only low efficiency and poor generalization. In this paper, we propose BEV@DC, a more efficient and powerful multi-modal training scheme, to boost the performance of image-guided depth completion. In practice, the proposed BEV@DC model comprehensively takes advantage of LiDARs with rich geometric details in training, employing an enhanced depth completion manner in inference, which takes only images (RGB and depth) as input. Specifically, the geometric-aware LiDAR features are projected onto a unified BEV space, combining with RGB features to perform BEV completion. By equipping a newly proposed point-voxel spatial propagation network (PV-SPN), this auxiliary branch introduces strong guidance to the original image branches via 3D dense supervision and feature consistency. As a result, our baseline model demonstrates significant improvements with the sole image inputs. Concretely, it achieves state-of-the-art performance on several benchmarks, e.g., ranking Top-1 on the challenging KITTI depth completion benchmark.

LiDAR2Map: In Defense of LiDAR-Based Semantic Map Construction Using Online Camera Distillation
Wang, Song and Li, Wentong and Liu, Wenyu and Liu, Xiaolu and Zhu, Jianke



Research question: How to effectively construct semantic maps in bird's-eye view (BEV) using LiDAR.
Motivation: Compared with camera images, LiDAR provides accurate 3D observations that can be naturally projected onto BEV space. However, vanilla LiDAR-based BEV features often contain much indefinite noise, and the spatial features have little texture and few semantic cues.
Method: An effective LiDAR-based method for semantic map construction. Specifically, a BEV pyramid feature decoder is introduced to learn robust multi-scale BEV features for semantic map construction, which greatly boosts the accuracy of the LiDAR-based method. To mitigate the lack of semantic cues in LiDAR data, an online Camera-to-LiDAR distillation scheme is proposed to facilitate semantic learning from images to point clouds.
Results: Experiments on the challenging nuScenes dataset demonstrate that the proposed LiDAR2Map is highly effective for semantic map construction, outperforming previous LiDAR-based methods by 27.9% mIoU and even surpassing state-of-the-art camera-based approaches.

Semantic map construction under bird's-eye view (BEV) plays an essential role in autonomous driving. In contrast to camera images, LiDAR provides the accurate 3D observations to project the captured 3D features onto BEV space inherently. However, the vanilla LiDAR-based BEV feature often contains much indefinite noise, where the spatial features have little texture and few semantic cues. In this paper, we propose an effective LiDAR-based method to build semantic map. Specifically, we introduce a BEV pyramid feature decoder that learns the robust multi-scale BEV features for semantic map construction, which greatly boosts the accuracy of the LiDAR-based method. To mitigate the defects caused by lacking semantic cues in LiDAR data, we present an online Camera-to-LiDAR distillation scheme to facilitate the semantic learning from image to point cloud. Our distillation scheme consists of feature-level and logit-level distillation to absorb the semantic information from camera in BEV. The experimental results on the challenging nuScenes dataset demonstrate the efficacy of our proposed LiDAR2Map on semantic map construction, which significantly outperforms the previous LiDAR-based methods by 27.9% mIoU and even performs better than the state-of-the-art camera-based approaches. Source code is available at: https://github.com/songw-zju/LiDAR2Map.
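The basic LiDAR-to-BEV projection underlying such methods can be sketched as a simple rasterization. This is a minimal occupancy version with illustrative range and resolution parameters; LiDAR2Map itself projects learned point features rather than raw counts.

```python
import numpy as np

def points_to_bev(points, x_range=(-50, 50), y_range=(-50, 50), resolution=0.5):
    """Rasterize LiDAR points (N, 3) into a BEV grid.

    Each point is dropped into the (x, y) cell it falls in; the cell value
    is the point count (a stand-in for learned per-cell BEV features)."""
    nx = int((x_range[1] - x_range[0]) / resolution)
    ny = int((y_range[1] - y_range[0]) / resolution)
    ix = ((points[:, 0] - x_range[0]) / resolution).astype(int)
    iy = ((points[:, 1] - y_range[0]) / resolution).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)  # drop out-of-range
    grid = np.zeros((nx, ny))
    np.add.at(grid, (ix[keep], iy[keep]), 1.0)  # unbuffered scatter-add
    return grid

pts = np.array([[0.0, 0.0, 1.2], [0.1, 0.1, 0.3], [60.0, 0.0, 0.0]])  # last is out of range
bev = points_to_bev(pts)
```

Note that the height (z) coordinate is simply discarded here; real pipelines usually keep it as a feature channel (e.g., pillar-style encodings) before the BEV decoder.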

PSVT: End-to-End Multi-Person 3D Pose and Shape Estimation With Progressive Video Transformers
Qiu, Zhongwei and Yang, Qiansheng and Wang, Jian and Feng, Haocheng and Han, Junyu and Ding, Errui and Xu, Chang and Fu, Dongmei and Wang, Jingdong



Research question: Existing multi-person video 3D human pose and shape estimation methods typically adopt a two-stage strategy that first detects human instances in each frame and then performs single-person pose and shape estimation with a temporal model. Under this strategy, the global spatio-temporal context among spatial instances cannot be captured.
Motivation: To address this problem, a new end-to-end multi-person 3D pose and shape estimation framework, PSVT, is proposed.
Method: In PSVT, a spatio-temporal encoder (STE) first captures the global feature dependencies among spatial objects. A spatio-temporal pose decoder (STPD) and shape decoder (STSD) then capture the global dependencies between pose queries and feature tokens, and between shape queries and feature tokens, respectively. To handle object variation over time, a novel progressive decoding scheme updates the pose and shape queries at each frame. In addition, a novel pose-guided attention (PGA) mechanism is proposed to better predict shape parameters. These two components strengthen the decoder of PSVT and improve performance.
Results: Extensive experiments on four datasets show that PSVT achieves state-of-the-art results.

Existing methods of multi-person video 3D human Pose and Shape Estimation (PSE) typically adopt a two-stage strategy, which first detects human instances in each frame and then performs single-person PSE with a temporal model. However, the global spatio-temporal context among spatial instances cannot be captured. In this paper, we propose a new end-to-end multi-person 3D Pose and Shape estimation framework with a progressive Video Transformer, termed PSVT. In PSVT, a spatio-temporal encoder (STE) captures the global feature dependencies among spatial objects. Then, a spatio-temporal pose decoder (STPD) and shape decoder (STSD) capture the global dependencies between pose queries and feature tokens, and between shape queries and feature tokens, respectively. To handle the variances of objects as time proceeds, a novel scheme of progressive decoding is used to update pose and shape queries at each frame. Besides, we propose a novel pose-guided attention (PGA) for the shape decoder to better predict shape parameters. The two components strengthen the decoder of PSVT to improve performance. Extensive experiments on four datasets show that PSVT achieves state-of-the-art results.

VoxFormer: Sparse Voxel Transformer for Camera-Based 3D Semantic Scene Completion
Li, Yiming and Yu, Zhiding and Choy, Christopher and Xiao, Chaowei and Alvarez, Jose M. and Fidler, Sanja and Feng, Chen and Anandkumar, Anima



Research question: How can AI systems, like humans, imagine the complete 3D geometry of occluded objects and scenes from 2D images alone?
Motivation: This ability is vital for recognition and understanding, but current AI systems do not possess it.
Method: VoxFormer, a Transformer-based semantic scene completion framework, starts from a sparse set of visible and occupied voxel queries obtained from depth estimation, followed by a densification stage that generates dense 3D voxels from the sparse ones.
Results: Experiments on SemanticKITTI show that VoxFormer outperforms the state of the art, with relative improvements of 20.0% in geometry and 18.1% in semantics, while reducing GPU memory during training to less than 16GB.

Humans can easily imagine the complete 3D geometry of occluded objects and scenes. This appealing ability is vital for recognition and understanding. To enable such capability in AI systems, we propose VoxFormer, a Transformer-based semantic scene completion framework that can output complete 3D volumetric semantics from only 2D images. Our framework adopts a two-stage design where we start from a sparse set of visible and occupied voxel queries from depth estimation, followed by a densification stage that generates dense 3D voxels from the sparse ones. A key idea of this design is that the visual features on 2D images correspond only to the visible scene structures rather than the occluded or empty spaces. Therefore, starting with the featurization and prediction of the visible structures is more reliable. Once we obtain the set of sparse queries, we apply a masked autoencoder design to propagate the information to all the voxels by self-attention. Experiments on SemanticKITTI show that VoxFormer outperforms the state of the art with a relative improvement of 20.0% in geometry and 18.1% in semantics and reduces GPU memory during training to less than 16GB. Our code is available on https://github.com/NVlabs/VoxFormer.
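The first stage, turning depth estimates into sparse occupied voxel queries, reduces to back-projection plus voxelization. A minimal camera-frame sketch (the intrinsics, voxel size, and function name are illustrative; VoxFormer additionally transforms points into a scene frame and attaches learnable query embeddings):

```python
import numpy as np

def depth_to_occupied_voxels(depth, K, voxel_size=0.5):
    """Back-project a depth map into 3D and return the unique occupied
    voxel indices -- the sparse voxel queries such pipelines start from."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth.ravel()
    valid = z > 0                                   # pixels with a depth estimate
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]         # pinhole back-projection
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)[valid]
    vox = np.floor(pts / voxel_size).astype(int)
    return np.unique(vox, axis=0)

K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
depth = np.zeros((48, 64))
depth[24, 32] = 2.0     # a single valid pixel at the principal point, 2 m away
occ = depth_to_occupied_voxels(depth, K)
```

Only voxels hit by back-projected depth become queries, which is the "visible and occupied" restriction the abstract motivates: features on 2D images correspond only to visible structures.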

NeuDA: Neural Deformable Anchor for High-Fidelity Implicit Surface Reconstruction
Cai, Bowen and Huang, Jinchi and Jia, Rongfei and Lv, Chengfei and Fu, Huan



Research question: When predicting and rendering surfaces in 3D space, prior methods such as IDR and NeuS overlook spatial context and may therefore fail to capture sharp local topologies such as small holes and thin structures.
Motivation: To mitigate this limitation, a flexible neural implicit representation leveraging hierarchical voxel grids, Neural Deformable Anchor (NeuDA), is proposed for high-fidelity surface reconstruction.
Method: NeuDA maintains hierarchical anchor grids in which each vertex stores a 3D position (an anchor) rather than a direct embedding (feature). The anchor grids are optimized so that different local geometric structures can be adaptively encoded. In addition, frequency encoding strategies are explored, and a simple hierarchical positional encoding method is introduced to flexibly exploit the properties of high-frequency and low-frequency geometry and appearance.
Results: Experiments on the DTU and BlendedMVS datasets show that NeuDA produces promising mesh surfaces.

This paper studies implicit surface reconstruction leveraging differentiable ray casting. Previous works such as IDR and NeuS overlook the spatial context in 3D space when predicting and rendering the surface, and thereby may fail to capture sharp local topologies such as small holes and structures. To mitigate the limitation, we propose a flexible neural implicit representation leveraging hierarchical voxel grids, namely Neural Deformable Anchor (NeuDA), for high-fidelity surface reconstruction. NeuDA maintains the hierarchical anchor grids where each vertex stores a 3D position (or anchor) instead of the direct embedding (or feature). We optimize the anchor grids such that different local geometry structures can be adaptively encoded. Besides, we dig into frequency encoding strategies and introduce a simple hierarchical positional encoding method for the hierarchical anchor structure to flexibly exploit the properties of high-frequency and low-frequency geometry and appearance. Experiments on both the DTU and BlendedMVS datasets demonstrate that NeuDA can produce promising mesh surfaces.
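The hierarchical positional encoding builds on the standard frequency (NeRF-style) encoding of 3D positions. A flat, minimal sketch (the hierarchical variant applies such frequency bands per anchor-grid level, which is omitted here; the band count is illustrative):

```python
import numpy as np

def positional_encoding(x, num_freqs=4):
    """NeRF-style frequency encoding: map each coordinate to
    [sin(pi*x), cos(pi*x), sin(2*pi*x), cos(2*pi*x), ...]."""
    bands = 2.0 ** np.arange(num_freqs) * np.pi   # geometric frequency ladder
    scaled = x[..., None] * bands                 # (..., 3, F)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)         # (..., 3 * 2F)

p = np.array([[0.5, 0.25, 0.0]])   # one 3D point (e.g., a deformed anchor)
feat = positional_encoding(p)
```

Low bands let the MLP fit smooth, low-frequency geometry while high bands expose fine detail; tying bands to grid levels is what makes the encoding "hierarchical".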

DINER: Disorder-Invariant Implicit Neural Representation
Xie, Shaowen and Zhu, Hao and Liu, Zhen and Zhang, Qi and Zhou, You and Cao, Xun and Ma, Zhan



Research question: This paper addresses the spectral bias in network training that limits existing implicit neural representations (INRs).
Motivation: The capacity of INRs is limited by the spectral bias in network training, which hampers their ability to solve inverse problems.
Method: By augmenting a hash table onto a traditional INR backbone, the disorder-invariant implicit neural representation (DINER) is proposed. For discrete signals sharing the same attribute histogram but different arrangement orders, the hash table projects the coordinates into the same distribution, so that the mapped signal can be better modeled by the subsequent INR network, significantly alleviating the spectral bias.
Results: Experiments show that DINER generalizes to different INR backbones (MLP and SIREN) and various tasks (image/video representation, phase retrieval, and refractive index recovery), and outperforms state-of-the-art algorithms in both quality and speed.

Implicit neural representation (INR) characterizes the attributes of a signal as a function of corresponding coordinates which emerges as a sharp weapon for solving inverse problems. However, the capacity of INR is limited by the spectral bias in the network training. In this paper, we find that such a frequency-related problem could be largely solved by re-arranging the coordinates of the input signal, for which we propose the disorder-invariant implicit neural representation (DINER) by augmenting a hash-table to a traditional INR backbone. Given discrete signals sharing the same histogram of attributes and different arrangement orders, the hash-table could project the coordinates into the same distribution for which the mapped signal can be better modeled using the subsequent INR network, leading to significantly alleviated spectral bias. Experiments not only reveal the generalization of the DINER for different INR backbones (MLP vs. SIREN) and various tasks (image/video representation, phase retrieval, and refractive index recovery) but also show the superiority over the state-of-the-art algorithms both in quality and speed.
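The effect of re-arranging a signal with a fixed attribute histogram can be illustrated directly: sorting (one particular arrangement) concentrates spectral energy at low frequencies, which is the property a learned coordinate mapping can exploit. A small NumPy demonstration (illustrative only; DINER learns the hash-table mapping jointly with the INR rather than sorting):

```python
import numpy as np

def high_freq_energy(signal, cutoff):
    """Fraction of spectral energy at or above a frequency bin (DC excluded)."""
    spec = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
    return spec[cutoff:].sum() / spec[1:].sum()

rng = np.random.default_rng(4)
values = rng.normal(size=256)          # a 1D signal: fixed value histogram
shuffled = rng.permutation(values)     # same histogram, disordered arrangement
sorted_sig = np.sort(values)           # same histogram, smooth arrangement
hf_shuffled = high_freq_energy(shuffled, 16)
hf_sorted = high_freq_energy(sorted_sig, 16)
```

The disordered arrangement is nearly white (most energy above the cutoff), while the smooth arrangement is dominated by low frequencies, so a spectrum-biased network fits the latter far more easily.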

Deep Graph-Based Spatial Consistency for Robust Non-Rigid Point Cloud Registration
Qin, Zheng and Yu, Hao and Wang, Changjian and Peng, Yuxing and Xu, Kai



Research question: This paper addresses outlier correspondence pruning for non-rigid point cloud registration.
Motivation: In rigid registration, spatial consistency is widely used to discriminate outliers from inliers, but it no longer holds in non-rigid cases, and outlier rejection for non-rigid registration has not been well studied.
Method: A Graph-based Spatial Consistency Network (GraphSCNet) is proposed to filter outlier correspondences in non-rigid registration. The method builds on the fact that non-rigid deformations are usually locally rigid, or locally shape-preserving. A local spatial consistency measure is first designed over the deformation graph of the point cloud, which evaluates the spatial compatibility only between correspondences in the vicinity of a graph node. An attention-based non-rigid correspondence embedding module is then devised to learn robust representations of non-rigid correspondences from the local spatial consistency.
Results: Despite its simplicity, GraphSCNet effectively improves the quality of putative correspondences and attains state-of-the-art performance on three challenging benchmarks.

We study the problem of outlier correspondence pruning for non-rigid point cloud registration. In rigid registration, spatial consistency has been a commonly used criterion to discriminate outliers from inliers. It measures the compatibility of two correspondences by the discrepancy between the respective distances in two point clouds. However, spatial consistency no longer holds in non-rigid cases and outlier rejection for non-rigid registration has not been well studied. In this work, we propose Graph-based Spatial Consistency Network (GraphSCNet) to filter outliers for non-rigid registration. Our method is based on the fact that non-rigid deformations are usually locally rigid, or local shape preserving. We first design a local spatial consistency measure over the deformation graph of the point cloud, which evaluates the spatial compatibility only between the correspondences in the vicinity of a graph node. An attention-based non-rigid correspondence embedding module is then devised to learn a robust representation of non-rigid correspondences from local spatial consistency. Despite its simplicity, GraphSCNet effectively improves the quality of the putative correspondences and attains state-of-the-art performance on three challenging benchmarks. Our code and models are available at https://github.com/qinzheng93/GraphSCNet.
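The rigid-case spatial consistency measure described above (compatibility of two correspondences via the discrepancy between their respective distances) can be sketched as follows. A minimal NumPy version with an illustrative Gaussian kernel; GraphSCNet's local measure restricts such comparisons to correspondences near a deformation-graph node.

```python
import numpy as np

def spatial_compatibility(src, tgt, sigma=0.1):
    """Pairwise spatial consistency of correspondences (src[i] <-> tgt[i]).

    Two correspondences are compatible if the distance between their source
    points matches the distance between their target points (rigid case).
    Returns an (N, N) matrix in [0, 1]; 1 = perfectly length-preserving."""
    d_src = np.linalg.norm(src[:, None] - src[None], axis=-1)
    d_tgt = np.linalg.norm(tgt[:, None] - tgt[None], axis=-1)
    return np.exp(-((d_src - d_tgt) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(2)
src = rng.normal(size=(20, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal transform
tgt = src @ Q.T + np.array([1.0, 2.0, 3.0])    # rigidly moved copy
tgt[0] += 5.0                                  # corrupt one correspondence
C = spatial_compatibility(src, tgt)
inlier_score = C[1:, 1:].mean()
outlier_score = C[0, 1:].mean()
```

Inlier pairs score near 1 while the corrupted correspondence scores near 0 against everything else; in the non-rigid setting this separation only survives locally, which is exactly why the paper evaluates it per graph node.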

Slide-Transformer: Hierarchical Vision Transformer With Local Self-Attention
Pan, Xuran and Ye, Tianzhu and Xia, Zhuofan and Song, Shiji and Huang, Gao



Research question: Existing self-attention methods may compromise local feature learning while reducing computational complexity, and rely on some handcrafted designs.
Motivation: To address these issues, this paper proposes a novel local attention module, Slide Attention.
Method: The Slide Attention module first re-interprets the column-based Im2Col function from a row-based perspective and uses depthwise convolution as an efficient substitute. On this basis, a deformed shifting module based on the re-parameterization technique is proposed, which further relaxes the fixed key/value positions to deformed features within the local region.
Results: Experiments show that the Slide Attention module is applicable to a variety of advanced Vision Transformer models, is compatible with various hardware devices, and achieves consistent performance gains on comprehensive benchmarks.

Self-attention mechanism has been a key factor in the recent progress of Vision Transformer (ViT), which enables adaptive feature extraction from global contexts. However, existing self-attention methods either adopt sparse global attention or window attention to reduce the computation complexity, which may compromise the local feature learning or are subject to some handcrafted designs. In contrast, local attention, which restricts the receptive field of each query to its own neighboring pixels, enjoys the benefits of both convolution and self-attention, namely local inductive bias and dynamic feature selection. Nevertheless, current local attention modules either use the inefficient Im2Col function or rely on specific CUDA kernels that are hard to generalize to devices without CUDA support. In this paper, we propose a novel local attention module, Slide Attention, which leverages common convolution operations to achieve high efficiency, flexibility and generalizability. Specifically, we first re-interpret the column-based Im2Col function from a new row-based perspective and use Depthwise Convolution as an efficient substitution. On this basis, we propose a deformed shifting module based on the re-parameterization technique, which further relaxes the fixed key/value positions to deformed features in the local region. In this way, our module realizes the local attention paradigm in an efficient and flexible manner. Extensive experiments show that our slide attention module is applicable to a variety of advanced Vision Transformer models and compatible with various hardware devices, and achieves consistently improved performance on comprehensive benchmarks.
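The row-based reinterpretation of Im2Col amounts to gathering neighbors by shifting the whole feature map once per offset, which is exactly what a depthwise convolution with a one-hot kernel computes. A minimal single-channel check that the two views agree (function names are illustrative, and the real module operates on batched multi-channel tensors):

```python
import numpy as np

def im2col_3x3(x):
    """Column-based view: gather each pixel's 3x3 neighborhood (zero-padded)."""
    h, w = x.shape
    pad = np.pad(x, 1)
    cols = np.empty((h, w, 9))
    for i in range(h):
        for j in range(w):
            cols[i, j] = pad[i:i + 3, j:j + 3].ravel()
    return cols

def shift_gather_3x3(x):
    """Row-based view: one spatial shift of the whole map per offset.

    Each shift equals a depthwise convolution with a one-hot kernel,
    so no explicit Im2Col buffer is needed."""
    h, w = x.shape
    pad = np.pad(x, 1)
    shifts = [pad[di:di + h, dj:dj + w] for di in range(3) for dj in range(3)]
    return np.stack(shifts, axis=-1)

x = np.arange(12, dtype=float).reshape(3, 4)
cols = im2col_3x3(x)
gathered = shift_gather_3x3(x)
```

Because the shifted-map view needs only nine cheap, contiguous slices (or nine depthwise convolutions), it runs on any backend, which is the portability argument the abstract makes against CUDA-specific kernels.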

Neural Intrinsic Embedding for Non-Rigid Point Cloud Matching
Jiang, Puhua and Sun, Mingze and Huang, Ruqi



Research question: How to directly establish correspondences between point clouds sampled from deformable shapes.
Motivation: As a raw 3D data representation, point clouds lack the intrinsic structural information of the underlying objects, which poses great challenges for directly establishing correspondences.
Method: Neural Intrinsic Embedding (NIE) is proposed to embed each vertex into a high-dimensional space in a way that respects the intrinsic structure. Based on NIE, a weakly-supervised learning framework for non-rigid point cloud registration is further presented.
Results: Experiments show that the framework performs on par with or even better than state-of-the-art baselines that generally require more supervision and/or more structural geometric input.

As a primitive 3D data representation, point clouds are prevailing in 3D sensing, yet short of intrinsic structural information of the underlying objects. Such discrepancy poses great challenges in directly establishing correspondences between point clouds sampled from deformable shapes. In light of this, we propose Neural Intrinsic Embedding (NIE) to embed each vertex into a high-dimensional space in a way that respects the intrinsic structure. Based upon NIE, we further present a weakly-supervised learning framework for non-rigid point cloud registration. Unlike the prior works, we do not require expensive and sensitive off-line basis construction (e.g., eigen-decomposition of Laplacians), nor do we require ground-truth correspondence labels for supervision. We empirically show that our framework performs on par with or even better than the state-of-the-art baselines, which generally require more supervision and/or more structural geometric input.

SHS-Net: Learning Signed Hyper Surfaces for Oriented Normal Estimation of Point Clouds
Li, Qing and Feng, Huifang and Shi, Kanle and Gao, Yue and Fang, Yi and Liu, Yu-Shen and Han, Zhizhong



Research question: This paper proposes a new method, SHS-Net, for oriented normal estimation of point clouds by learning signed hyper surfaces.
Motivation: Existing methods usually estimate oriented normals through a two-stage pipeline (unoriented normal estimation followed by normal orientation), with each step implemented by a separate algorithm. These methods are sensitive to parameter settings and yield poor results on point clouds with noise, density variations, and complex geometries.
Method: Signed hyper surfaces (SHS), parameterized by multi-layer perceptron (MLP) layers, are introduced to learn oriented normal estimation from point clouds in an end-to-end manner. The signed hyper surfaces are implicitly learned in a high-dimensional feature space where local and global information is aggregated. Specifically, a patch encoding module and a shape encoding module encode a 3D point cloud into a local latent code and a global latent code, respectively. An attention-weighted normal prediction module is then proposed as the decoder, which takes the local and global latent codes as input to predict oriented normals.
Results: Experiments show that SHS-Net outperforms state-of-the-art methods in both unoriented and oriented normal estimation on widely used benchmarks.

We propose a novel method called SHS-Net for oriented normal estimation of point clouds by learning signed hyper surfaces, which can accurately predict normals with global consistent orientation from various point clouds. Almost all existing methods estimate oriented normals through a two-stage pipeline, i.e., unoriented normal estimation and normal orientation, and each step is implemented by a separate algorithm. However, previous methods are sensitive to parameter settings, resulting in poor results from point clouds with noise, density variations and complex geometries. In this work, we introduce signed hyper surfaces (SHS), which are parameterized by multi-layer perceptron (MLP) layers, to learn to estimate oriented normals from point clouds in an end-to-end manner. The signed hyper surfaces are implicitly learned in a high-dimensional feature space where the local and global information is aggregated. Specifically, we introduce a patch encoding module and a shape encoding module to encode a 3D point cloud into a local latent code and a global latent code, respectively. Then, an attention-weighted normal prediction module is proposed as a decoder, which takes the local and global latent codes as input to predict oriented normals. Experimental results show that our SHS-Net outperforms the state-of-the-art methods in both unoriented and oriented normal estimation on the widely used benchmarks. The code, data and pretrained models are available at https://github.com/LeoQLi/SHS-Net.
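For context, the classical unoriented normal estimate that the two-stage pipelines mentioned above start from is a per-point PCA; its inherent sign ambiguity is what the separate orientation step, and SHS-Net's end-to-end alternative, must resolve. A minimal sketch (the neighborhood is assumed given; real pipelines select it by k-nearest neighbors):

```python
import numpy as np

def pca_normal(neighbors):
    """Unoriented normal of a point from its neighborhood: the eigenvector
    of the local covariance with the smallest eigenvalue. The sign is
    arbitrary -- exactly the orientation ambiguity a second stage fixes."""
    centered = neighbors - neighbors.mean(axis=0)
    cov = centered.T @ centered
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvecs[:, 0]                    # direction of least variance

rng = np.random.default_rng(5)
xy = rng.normal(size=(50, 2))
patch = np.column_stack([xy, 0.01 * rng.normal(size=50)])  # noisy z = 0 plane
n = pca_normal(patch)
```

For a near-planar patch the estimate is accurate up to sign, but it degrades with noise, density variation, and sharp geometry, which is the failure mode the abstract attributes to two-stage pipelines.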

Think Twice Before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving
Jia, Xiaosong and Wu, Penghao and Chen, Li and Xie, Jiangwei and He, Conghui and Yan, Junchi and Li, Hongyang



Research question: Existing autonomous driving methods usually adopt a decoupled encoder-decoder paradigm, where the encoder extracts hidden features from raw sensor data and the decoder outputs the ego-vehicle's future trajectories or actions. Under this paradigm, the encoder has no access to the intended behavior of the ego agent, leaving the entire burden of identifying safety-critical regions and inferring future situations to the decoder.
Motivation: To address this problem, the paper proposes two principles: fully utilize the capacity of the encoder, and increase the capacity of the decoder. Concretely, a coarse future position and action are first predicted from the encoder features; conditioned on this position and action, the future scene is then imagined to check the ramifications of driving accordingly.
Method: Encoder features around the predicted coordinate are also retrieved to obtain fine-grained information about the safety-critical region. Finally, based on the predicted future and the retrieved salient features, the coarse position and action are refined by predicting their offset from the ground truth.
Results: Experiments on the CARLA simulator achieve state-of-the-art performance in closed-loop benchmarks. Extensive ablation studies demonstrate the effectiveness of each proposed module. Code and models are available at https://github.com/opendrivelab/ThinkTwice.

End-to-end autonomous driving has made impressive progress in recent years. Existing methods usually adopt the decoupled encoder-decoder paradigm, where the encoder extracts hidden features from raw sensor data, and the decoder outputs the ego-vehicle's future trajectories or actions. Under such a paradigm, the encoder does not have access to the intended behavior of the ego agent, leaving the burden of finding out safety-critical regions from the massive receptive field and inferring about future situations to the decoder. Even worse, the decoder is usually composed of several simple multi-layer perceptrons (MLP) or GRUs while the encoder is delicately designed (e.g., a combination of heavy ResNets or Transformer). Such an imbalanced resource-task division hampers the learning process. In this work, we aim to alleviate the aforementioned problem by two principles: (1) fully utilizing the capacity of the encoder; (2) increasing the capacity of the decoder. Concretely, we first predict a coarse-grained future position and action based on the encoder features. Then, conditioned on the position and action, the future scene is imagined to check the ramification if we drive accordingly. We also retrieve the encoder features around the predicted coordinate to obtain fine-grained information about the safety-critical region. Finally, based on the predicted future and the retrieved salient feature, we refine the coarse-grained position and action by predicting its offset from ground-truth. The above refinement module could be stacked in a cascaded fashion, which extends the capacity of the decoder with spatial-temporal prior knowledge about the conditioned future. We conduct experiments on the CARLA simulator and achieve state-of-the-art performance in closed-loop benchmarks. Extensive ablation studies demonstrate the effectiveness of each proposed module. Code and models are available at https://github.com/opendrivelab/ThinkTwice.

DSVT: Dynamic Sparse Voxel Transformer With Rotated Sets
Wang, Haiyang and Shi, Chen and Shi, Shaoshuai and Lei, Meng and Wang, Sen and He, Di and Schiele, Bernt and Wang, Liwei



Research question: Designing an efficient yet deployment-friendly 3D backbone to handle sparse point clouds is a fundamental problem in 3D perception.
Motivation: Compared with customized sparse convolution, the attention mechanism in Transformers is better suited to flexibly modeling long-range relationships and is easier to deploy in real-world applications. However, due to the sparse nature of point clouds, applying a standard Transformer to sparse points is non-trivial.
Method: This paper presents the Dynamic Sparse Voxel Transformer (DSVT), a single-stride, window-based voxel Transformer backbone for outdoor 3D perception. To process sparse points efficiently in parallel, Dynamic Sparse Window Attention partitions a series of local regions in each window according to its sparsity and then computes the features of all regions in a fully parallel manner. To allow cross-set connections, a rotated set partitioning strategy alternates between two partitioning configurations in consecutive self-attention layers. To support effective downsampling and better encode geometric information, an attention-style 3D pooling module on sparse points is also proposed, which achieves strong performance and easy deployment without any customized CUDA operations.
Results: The model achieves state-of-the-art performance on a broad range of 3D perception tasks. More importantly, DSVT can be easily deployed with TensorRT for real-time inference (27 Hz).

Designing an efficient yet deployment-friendly 3D backbone to handle sparse point clouds is a fundamental problem in 3D perception. Compared with the customized sparse convolution, the attention mechanism in Transformers is more appropriate for flexibly modeling long-range relationships and is easier to be deployed in real-world applications. However, due to the sparse characteristics of point clouds, it is non-trivial to apply a standard transformer on sparse points. In this paper, we present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception. In order to efficiently process sparse points in parallel, we propose Dynamic Sparse Window Attention, which partitions a series of local regions in each window according to its sparsity and then computes the features of all regions in a fully parallel manner. To allow the cross-set connection, we design a rotated set partitioning strategy that alternates between two partitioning configurations in consecutive self-attention layers. To support effective downsampling and better encode geometric information, we also propose an attention-style 3D pooling module on sparse points, which is powerful and deployment-friendly without utilizing any customized CUDA operations. Our model achieves state-of-the-art performance with a broad range of 3D perception tasks. More importantly, DSVT can be easily deployed by TensorRT with real-time inference speed (27Hz). Code will be available at https://github.com/Haiyang-W/DSVT.
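The set partitioning with rotated orderings can be sketched in a few lines. This is a minimal illustration under assumed inputs (2D voxel coordinates, toy set size), not the official DSVT implementation: voxels inside a window are sorted along one axis, chunked into fixed-size sets for parallel attention, and the sort axis alternates between consecutive layers.

```python
# Sketch of dynamic set partitioning with rotated orderings (assumption:
# 2D voxel coordinates stand in for 3D ones; set_size is illustrative).

def partition_sets(voxels, set_size, axis):
    """voxels: list of (x, y) coordinates of non-empty voxels.
    Sort along `axis`, then chunk into fixed-size sets."""
    ordered = sorted(voxels, key=lambda v: (v[axis], v[1 - axis]))
    return [ordered[i:i + set_size] for i in range(0, len(ordered), set_size)]

voxels = [(0, 3), (1, 0), (2, 2), (3, 1), (0, 1), (2, 0)]
sets_x = partition_sets(voxels, set_size=3, axis=0)  # layer l: x-major order
sets_y = partition_sets(voxels, set_size=3, axis=1)  # layer l+1: y-major order

print(sets_x)
print(sets_y)
```

Because the two layers group different voxels together, information propagates across sets without any sparse gather/scatter CUDA kernels.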

Disentangling Orthogonal Planes for Indoor Panoramic Room Layout Estimation With Cross-Scale Distortion Awareness
Shen, Zhijie and Zheng, Zishuo and Lin, Chunyu and Nie, Lang and Liao, Kang and Zheng, Shuai and Zhao, Yao



Research question: Existing indoor layout estimation schemes mainly focus on recovering layouts from vertically compressed 1D sequences, but the compression procedure confuses the semantics of different planes, yielding inferior performance and ambiguous interpretability.
Motivation: To address this problem, we propose to disentangle the 1D representation by pre-segmenting the orthogonal (vertical and horizontal) planes of a complex scene, explicitly capturing the geometric cues for indoor layout estimation.
Method: We design a soft-flipping fusion strategy to assist the pre-segmentation and present a feature assembling mechanism that effectively integrates shallow and deep features while accounting for the distortion distribution. We further leverage triple attention to reconstruct the disentangled sequences, compensating for potential pre-segmentation errors.
Results: Experiments on four popular benchmarks show that our method outperforms existing state-of-the-art solutions, especially on the 3DIoU metric.

Based on the Manhattan World assumption, most existing indoor layout estimation schemes focus on recovering layouts from vertically compressed 1D sequences. However, the compression procedure confuses the semantics of different planes, yielding inferior performance with ambiguous interpretability. To address this issue, we propose to disentangle this 1D representation by pre-segmenting orthogonal (vertical and horizontal) planes from a complex scene, explicitly capturing the geometric cues for indoor layout estimation. Considering the symmetry between the floor boundary and ceiling boundary, we also design a soft-flipping fusion strategy to assist the pre-segmentation. Besides, we present a feature assembling mechanism to effectively integrate shallow and deep features with distortion distribution awareness. To compensate for the potential errors in pre-segmentation, we further leverage triple attention to reconstruct the disentangled sequences for better performance. Experiments on four popular benchmarks demonstrate our superiority over existing SoTA solutions, especially on the 3DIoU metric. The code is available at https://github.com/zhijieshen-bjtu/DOPNet.

PEAL: Prior-Embedded Explicit Attention Learning for Low-Overlap Point Cloud Registration
Yu, Junle and Ren, Luwei and Zhang, Yu and Zhou, Wenhui and Lin, Lili and Dai, Guojun



Research question: How to improve the performance of low-overlap point cloud registration.
Motivation: In geometric space, global dependencies can be ambiguous and lack distinctiveness, especially in indoor low-overlap scenarios, where dependence on a large number of non-overlapping points introduces ambiguity.
Method: We propose PEAL, a Prior-embedded Explicit Attention Learning model. By incorporating prior knowledge into the learning process, the points are divided into two parts: points lying in the putative overlapping region and points lying in the putative non-overlapping region. PEAL then explicitly learns one-way attention with the putative overlapping points.
Results: The method improves Registration Recall by more than 6% on the challenging 3DLoMatch benchmark and achieves state-of-the-art Feature Matching Recall, Inlier Ratio, and Registration Recall on both 3DMatch and 3DLoMatch.

Learning distinctive point-wise features is critical for low-overlap point cloud registration. Recently, incorporating Transformers into point cloud feature representation has achieved huge success; such methods usually adopt a self-attention module to learn intra-point-cloud features first, then utilize a cross-attention module to perform feature exchange between input point clouds. Self-attention is computed by capturing the global dependency in geometric space. However, this global dependency can be ambiguous and lacks distinctiveness, especially in indoor low-overlap scenarios, where dependence on an extensive range of non-overlapping points introduces ambiguity. To address this issue, we present PEAL, a Prior-embedded Explicit Attention Learning model. By incorporating prior knowledge into the learning process, the points are divided into two parts. One includes points lying in the putative overlapping region and the other includes points lying in the putative non-overlapping region. Then PEAL explicitly learns one-way attention with the putative overlapping points. This simplistic design attains surprising performance, significantly relieving the aforementioned feature ambiguity. Our method improves the Registration Recall by 6+% on the challenging 3DLoMatch benchmark and achieves state-of-the-art performance on Feature Matching Recall, Inlier Ratio, and Registration Recall on both 3DMatch and 3DLoMatch. Code will be made publicly available.
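The one-way attention idea above can be sketched as masked attention where every point attends only to the putative overlapping subset. This is a hedged toy, not the paper's implementation: features are 1-D scalars and the overlap set is a hypothetical prior.

```python
import math

# Toy sketch of one-way attention: each point attends only to the
# putative overlapping points (a prior-provided index set), never to
# putative non-overlapping ones. Scalar "features" are illustrative.

def one_way_attention(feats, overlap_idx):
    out = []
    for q in feats:
        # softmax attention restricted to the putative overlapping points
        scores = [q * feats[j] for j in overlap_idx]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append(sum(wi / z * feats[j] for wi, j in zip(w, overlap_idx)))
    return out

feats = [1.0, 2.0, -1.0, 0.5]
refined = one_way_attention(feats, overlap_idx=[0, 1])  # points 0, 1 assumed overlapping
```

Every refined feature is a convex combination of the overlapping points' features only, which is how the design suppresses ambiguity from non-overlapping regions.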

GeoVLN: Learning Geometry-Enhanced Visual Representation With Slot Attention for Vision-and-Language Navigation
Huo, Jingyang and Sun, Qiang and Jiang, Boyan and Lin, Haitao and Fu, Yanwei



Research question: Existing approaches to the Room-to-Room VLN problem use only RGB images and ignore the local context around candidate views, lacking sufficient visual cues about the surrounding environment.
Motivation: Natural language carries complex semantic information, so its correlations with visual inputs are hard to model with cross-attention alone.
Method: We propose GeoVLN, which learns a geometry-enhanced visual representation based on slot attention for robust vision-and-language navigation. RGB images are combined with the corresponding depth and normal maps predicted by Omnidata as visual inputs. A two-stage module combining local slot attention and the CLIP model produces geometry-enhanced representations from these inputs, and V&L BERT learns a cross-modal representation that fuses language and vision. A novel multiway attention module further encourages different phrases of the input instruction to exploit the most relevant visual features.
Results: Extensive experiments demonstrate the effectiveness of the newly designed modules and the compelling performance of the proposed method.

Most existing works solving the Room-to-Room VLN problem only utilize RGB images and do not consider local context around candidate views, which lacks sufficient visual cues about the surrounding environment. Moreover, natural language contains complex semantic information, thus its correlations with visual inputs are hard to model merely with cross attention. In this paper, we propose GeoVLN, which learns Geometry-enhanced visual representation based on slot attention for robust Visual-and-Language Navigation. The RGB images are compensated with the corresponding depth maps and normal maps predicted by Omnidata as visual inputs. Technically, we introduce a two-stage module that combines local slot attention and the CLIP model to produce geometry-enhanced representation from such input. We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information. Additionally, a novel multiway attention module is designed, encouraging different phrases of the input instruction to exploit the most related features from visual input. Extensive experiments demonstrate the effectiveness of our newly designed modules and show the compelling performance of the proposed method.

Progressive Neighbor Consistency Mining for Correspondence Pruning
Liu, Xin and Yang, Jufeng



Research question: This paper addresses correspondence pruning: identifying correct correspondences from an initial set in feature matching tasks.
Motivation: Because the distribution of false correspondences is highly irregular, neighbors found in the coordinate and feature spaces are not guaranteed to be consistent.
Method: We propose a novel global-graph space that searches for consistent neighbors via a weighted global graph, explicitly exploring long-range dependencies among correspondences. We further construct three neighbor embeddings progressively, according to different neighbor search spaces, and design a Neighbor Consistency block to extract neighbor context and sequentially explore their interactions. Finally, we develop the Neighbor Consistency Mining Network (NCMNet) for accurately recovering camera poses and identifying inliers.
Results: Experimental results show that NCMNet significantly outperforms state-of-the-art competitors on challenging outdoor and indoor matching scenes.

The goal of correspondence pruning is to recognize correct correspondences (inliers) from initial ones, with applications to various feature matching based tasks. Seeking neighbors in the coordinate and feature spaces is a common strategy in many previous methods. However, it is difficult to ensure that these neighbors are always consistent, since the distribution of false correspondences is extremely irregular. For addressing this problem, we propose a novel global-graph space to search for consistent neighbors based on a weighted global graph that can explicitly explore long-range dependencies among correspondences. On top of that, we progressively construct three neighbor embeddings according to different neighbor search spaces, and design a Neighbor Consistency block to extract neighbor context and explore their interactions sequentially. In the end, we develop a Neighbor Consistency Mining Network (NCMNet) for accurately recovering camera poses and identifying inliers. Experimental results indicate that our NCMNet achieves a significant performance advantage over state-of-the-art competitors on challenging outdoor and indoor matching scenes. The source code can be found at https://github.com/xinliu29/NCMNet.

From Node Interaction To Hop Interaction: New Effective and Scalable Graph Learning Paradigm
Chen, Jie and Li, Zilong and Zhu, Yin and Zhang, Junping and Pu, Jian



Research question: Existing Graph Neural Networks (GNNs) suffer from scalability and over-smoothing problems in large-scale industrial applications.
Motivation: Address the scalability and over-smoothing problems of GNNs and improve the discriminative power of node representations.
Method: We propose a novel hop-interaction paradigm that converts the interaction target from between nodes to pre-processed multi-hop features inside each node, design a HopGNN framework that readily leverages existing GNNs to realize hop interaction, and add a multi-task learning strategy with a self-supervised objective to enhance HopGNN.
Results: Extensive experiments on 12 benchmark datasets spanning diverse domains, scales, and degrees of graph smoothness show superior performance while maintaining high scalability and efficiency.

Existing Graph Neural Networks (GNNs) follow the message-passing mechanism that conducts information interaction among nodes iteratively. While considerable progress has been made, such node interaction paradigms still have the following limitation. First, the scalability limitation precludes the broad application of GNNs in large-scale industrial settings since the node interaction among rapidly expanding neighbors incurs high computation and memory costs. Second, the over-smoothing problem restricts the discrimination ability of nodes, i.e., node representations of different classes will converge to indistinguishable after repeated node interactions. In this work, we propose a novel hop interaction paradigm to address these limitations simultaneously. The core idea is to convert the interaction target among nodes to pre-processed multi-hop features inside each node. We design a simple yet effective HopGNN framework that can easily utilize existing GNNs to achieve hop interaction. Furthermore, we propose a multi-task learning strategy with a self-supervised learning objective to enhance HopGNN. We conduct extensive experiments on 12 benchmark datasets in a wide range of domains, scales, and smoothness of graphs. Experimental results show that our methods achieve superior performance while maintaining high scalability and efficiency. The code is at https://github.com/JC-202/HopGNN.
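The core pre-processing step, computing multi-hop features once per node so that training needs no message passing, can be sketched as below. This is an illustrative sketch under toy assumptions (a tiny adjacency dict, scalar features, mean aggregation), not the HopGNN code.

```python
# Sketch of the hop-interaction paradigm's pre-processing: compute each
# node's k-hop mean-aggregated features once, so the downstream model
# only interacts among hops *inside* a node. Graph and features are toy.

def precompute_hop_features(adj, feats, num_hops):
    """adj: {node: [neighbors]}; feats: {node: float}.
    Returns {node: [h0, h1, ..., hK]} of per-hop aggregated features."""
    hops = {v: [feats[v]] for v in adj}
    cur = dict(feats)
    for _ in range(num_hops):
        nxt = {}
        for v, nbrs in adj.items():
            nxt[v] = sum(cur[u] for u in nbrs) / len(nbrs) if nbrs else 0.0
        for v in adj:
            hops[v].append(nxt[v])
        cur = nxt
    return hops

adj = {0: [1], 1: [0, 2], 2: [1]}          # path graph 0 - 1 - 2
feats = {0: 1.0, 1: 0.0, 2: 2.0}
hops = precompute_hop_features(adj, feats, num_hops=2)
print(hops[1])  # node 1: [own feature, 1-hop mean, 2-hop mean]
```

Because this runs once before training, the per-epoch cost no longer depends on neighborhood expansion, which is the source of the scalability gain.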

Understanding and Improving Features Learned in Deep Functional Maps
Attaiki, Souhaib and Ovsjanikov, Maks



Research question: Deep functional maps are a successful paradigm for non-rigid 3D shape correspondence, but the precise nature of the information learned and stored in these functions is not yet well understood.
Motivation: The central question is whether these features can serve purposes beyond their purely algebraic role in solving for functional map matrices.
Method: This paper shows that, under some mild conditions, the features learned within deep functional map approaches can be used as point-wise descriptors and are thus directly comparable across different shapes, even without solving for a functional map at test time.
Results: Informed by this analysis, we propose effective modifications to the standard deep functional map pipeline that promote structural properties of the learned features and significantly improve matching results. We also show that previously unsuccessful attempts to use extrinsic architectures for deep functional map feature extraction can be remedied by simple architectural changes that promote the theoretical properties suggested by our analysis.

Deep functional maps have recently emerged as a successful paradigm for non-rigid 3D shape correspondence tasks. An essential step in this pipeline consists in learning feature functions that are used as constraints to solve for a functional map inside the network. However, the precise nature of the information learned and stored in these functions is not yet well understood. Specifically, a major question is whether these features can be used for any other objective, apart from their purely algebraic role, in solving for functional map matrices. In this paper, we show that under some mild conditions, the features learned within deep functional map approaches can be used as point-wise descriptors and thus are directly comparable across different shapes, even without the necessity of solving for a functional map at test time. Furthermore, informed by our analysis, we propose effective modifications to the standard deep functional map pipeline, which promotes structural properties of learned features, significantly improving the matching results. Finally, we demonstrate that previously unsuccessful attempts at using extrinsic architectures for deep functional map feature extraction can be remedied via simple architectural changes, which promote the theoretical properties suggested by our analysis. We thus bridge the gap between intrinsic and extrinsic surface-based learning, suggesting the necessary and sufficient conditions for successful shape matching. Our code is available at https://github.com/pvnieo/clover.

High-Frequency Stereo Matching Network
Zhao, Haoliang and Zhou, Huizhou and Zhang, Yongjun and Chen, Jie and Yang, Yitong and Zhao, Yong



Research question: In binocular stereo matching, iterative methods such as RAFT-Stereo and CREStereo have made remarkable progress, but they lose information during iteration and struggle to produce detailed disparity maps that fully exploit high-frequency information.
Motivation: To alleviate data coupling and allow features containing subtle details to be transferred across iterations, we propose a Decouple module; to further capture high-frequency details, we propose a Normalization Refinement module.
Method: Our approach comprises the Decouple module, the Normalization Refinement module, and a multi-scale, multi-stage feature extractor with channel-wise self-attention.
Results: Our method (DLNR) ranks 1st on the Middlebury leaderboard, outperforming the next best method by 13.04%, and also achieves state-of-the-art D1-fg performance on the KITTI-2015 benchmark.

In the field of binocular stereo matching, remarkable progress has been made by iterative methods like RAFT-Stereo and CREStereo. However, most of these methods lose information during the iterative process, making it difficult to generate more detailed disparity maps that take full advantage of high-frequency information. We propose the Decouple module to alleviate the problem of data coupling and allow features containing subtle details to transfer across the iterations, which proves to alleviate the problem significantly in the ablations. To further capture high-frequency details, we propose a Normalization Refinement module that unifies the disparities as a proportion of the disparities over the width of the image, which addresses the problem of module failure in cross-domain scenarios. Further, with the above improvements, the ResNet-like feature extractor that has not been changed for years becomes a bottleneck. Towards this end, we propose a multi-scale and multi-stage feature extractor that introduces the channel-wise self-attention mechanism, which greatly alleviates this bottleneck. Our method (DLNR) ranks 1st on the Middlebury leaderboard, significantly outperforming the next best method by 13.04%. Our method also achieves SOTA performance on the KITTI-2015 benchmark for D1-fg.
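The width-normalization step the abstract describes is simple enough to spell out. This is a hedged sketch of the idea only (toy disparity values; the real module operates on dense disparity maps inside the network), not the DLNR code.

```python
# Sketch of disparity normalization: express disparities as a fraction
# of image width so refinement operates in a domain-independent [0, 1]
# range, then map back to pixels. Values are illustrative.

def normalize_disparity(disp, width):
    return [d / width for d in disp]

def denormalize_disparity(norm, width):
    return [n * width for n in norm]

disp = [32.0, 64.0, 128.0]
norm = normalize_disparity(disp, width=256)
print(norm)
assert denormalize_disparity(norm, 256) == disp  # round trip is lossless
```

Normalizing by width is what keeps the refinement well-behaved across datasets with very different image resolutions and disparity ranges.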

Spatial-Then-Temporal Self-Supervised Learning for Video Correspondence
Li, Rui and Liu, Dong



Research question: This paper addresses the insufficient exploitation of the synergy between spatial and temporal cues in existing video correspondence learning.
Motivation: Existing methods concentrate on either spatial-discriminative or temporal-repetitive features, paying little attention to the synergy between spatial and temporal cues.
Method: We propose a novel spatial-then-temporal self-supervised learning method: first extract spatial features from unlabeled images via contrastive learning, then enhance them by exploiting the temporal cues in unlabeled videos via reconstructive learning. A global correlation distillation loss ensures that the learning does not forget spatial cues, and a local correlation distillation loss combats the temporal discontinuity that harms reconstruction.
Results: Experiments show that the method outperforms state-of-the-art self-supervised methods on a series of correspondence-based video analysis tasks; ablation studies verify the effectiveness of the two-step design and the distillation losses.

In low-level video analyses, effective representations are important to derive the correspondences between video frames. These representations have been learned in a self-supervised fashion from unlabeled images/videos, using carefully designed pretext tasks in some recent studies. However, the previous work concentrates on either spatial-discriminative features or temporal-repetitive features, with little attention to the synergy between spatial and temporal cues. To address this issue, we propose a novel spatial-then-temporal self-supervised learning method. Specifically, we firstly extract spatial features from unlabeled images via contrastive learning, and secondly enhance the features by exploiting the temporal cues in unlabeled videos via reconstructive learning. In the second step, we design a global correlation distillation loss to ensure the learning not to forget the spatial cues, and we design a local correlation distillation loss to combat the temporal discontinuity that harms the reconstruction. The proposed method outperforms the state-of-the-art self-supervised methods, as established by the experimental results on a series of correspondence-based video analysis tasks. Also, we performed ablation studies to verify the effectiveness of the two-step design as well as the distillation losses.

Super-Resolution Neural Operator
Wei, Min and Zhang, Xuesong



Research question: This paper proposes the Super-Resolution Neural Operator (SRNO), a deep operator learning framework for reconstructing high-resolution images from their low-resolution counterparts.
Motivation: Existing super-resolution methods are typically tied to a fixed grid size, limiting their ability to handle arbitrary scales.
Method: SRNO treats LR-HR image pairs as continuous functions approximated with different grid sizes. It embeds the LR input into a higher-dimensional latent representation space, iteratively approximates the implicit image function with a kernel integral mechanism, and finally reduces dimensionality to generate the RGB representation at the target coordinates.
Results: By implementing the kernel integral in each layer with efficient Galerkin-type attention and enabling dynamic latent basis updates in a multilayer attention architecture, SRNO outperforms existing continuous super-resolution methods in both accuracy and running time.

We propose Super-resolution Neural Operator (SRNO), a deep operator learning framework that can resolve high-resolution (HR) images at arbitrary scales from the low-resolution (LR) counterparts. Treating the LR-HR image pairs as continuous functions approximated with different grid sizes, SRNO learns the mapping between the corresponding function spaces. From the perspective of approximation theory, SRNO first embeds the LR input into a higher-dimensional latent representation space, trying to capture sufficient basis functions, and then iteratively approximates the implicit image function with a kernel integral mechanism, followed by a final dimensionality reduction step to generate the RGB representation at the target coordinates. The key characteristics distinguishing SRNO from prior continuous SR works are: 1) the kernel integral in each layer is efficiently implemented via the Galerkin-type attention, which possesses non-local properties in the spatial domain and therefore benefits the grid-free continuum; and 2) the multilayer attention architecture allows for the dynamic latent basis update, which is crucial for SR problems to "hallucinate" high-frequency information from the LR image. Experiments show that SRNO outperforms existing continuous SR methods in terms of both accuracy and running time. Our code is at https://github.com/2y7c3/Super-Resolution-Neural-Operator.
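Galerkin-type attention, mentioned above as the efficient kernel-integral implementation, drops the softmax and reassociates the product so cost is linear in the number of tokens. The sketch below illustrates that reassociation on tiny matrices; shapes and values are toy assumptions, not SRNO's actual layers.

```python
# Sketch of Galerkin-type (softmax-free) attention:
#   out = Q @ (K^T @ V) / n
# K^T @ V is a small (d x d) summary, so the cost is linear in n tokens.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def galerkin_attention(q, k, v):
    n = len(k)
    kt = [list(c) for c in zip(*k)]            # transpose K
    kv = matmul(kt, v)                         # (d x d) summary, O(n d^2)
    scaled = [[e / n for e in row] for row in kv]
    return matmul(q, scaled)                   # (n x d), O(n d^2)

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 2.0], [3.0, 4.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
out = galerkin_attention(q, k, v)
print(out)
```

Standard attention forms an n-by-n score matrix first; reassociating as Q(KᵀV) avoids that quadratic term, which is what makes grid-free querying at arbitrary output resolutions affordable.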

LP-DIF: Learning Local Pattern-Specific Deep Implicit Function for 3D Objects and Scenes
Wang, Meng and Liu, Yu-Shen and Gao, Yue and Shi, Kanle and Fang, Yi and Han, Zhizhong



Research question: How to effectively capture the geometric details of 3D shapes.
Motivation: Current mainstream methods capture geometric details by dividing 3D shapes into local regions and learning a local latent code for each region with a single decoder that shares geometric similarities, but a single decoder struggles to treat all regions well and handles the diversity and imbalanced distribution of local regions poorly.
Method: We propose a novel Local Pattern-specific Implicit Function (LP-DIF) that uses multiple decoders, each focusing on one cluster of local regions sharing a common pattern, and introduces a region re-weighting module based on a kernel density estimator for each pattern-specific decoder to dynamically re-weight regions during learning, simplifying the learning of fine geometric details.
Results: Experiments demonstrate that LP-DIF restores more geometric details, improves the quality of 3D reconstruction, and outperforms previous methods.

Deep Implicit Function (DIF) has gained much popularity as an efficient 3D shape representation. To capture geometry details, current mainstream methods divide 3D shapes into local regions and then learn each one with a local latent code via a decoder, where the decoder shares the geometric similarities among different local regions. Although such local methods can capture more local details, a large diversity of different local regions increases the difficulty of learning an implicit function when treating all regions equally using only a single decoder. In addition, these local regions often exhibit imbalanced distributions, where certain regions have significantly fewer observations. As a result, fine geometric details cannot be preserved well. To solve this problem, we propose a novel Local Pattern-specific Implicit Function, named LP-DIF, for representing a shape with some clusters of local regions and multiple decoders, where each decoder only focuses on one cluster of local regions which share a certain pattern. Specifically, we first extract local codes for all regions, and then cluster them into multiple groups in the latent space, where similar regions sharing a common pattern fall into one group. After that, we train multiple decoders for mining local patterns of different groups, which simplifies learning of fine geometric details by reducing the diversity of local regions seen by each decoder. To further alleviate the data-imbalance problem, we introduce a region re-weighting module to each pattern-specific decoder by kernel density estimator, which dynamically re-weights the regions during learning. Our LP-DIF can restore more geometry details, and thus improve the quality of 3D reconstruction. Experiments demonstrate that our method can achieve the state-of-the-art performance over previous methods. Code is available at https://github.com/gtyxyz/lpdif.
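The kernel-density re-weighting idea can be sketched directly: estimate how densely populated each region's neighborhood of latent codes is, and give rare regions larger training weight. This is a hedged illustration with 1-D codes and an arbitrary bandwidth, not the LP-DIF module.

```python
import math

# Sketch of KDE-based region re-weighting: Gaussian kernel density over
# latent codes, then inverse-density weights so under-represented
# regions count more. 1-D codes and the bandwidth are toy assumptions.

def kde_density(codes, bandwidth=0.5):
    n = len(codes)
    norm = n * bandwidth * math.sqrt(2 * math.pi)
    return [sum(math.exp(-((c - o) ** 2) / (2 * bandwidth ** 2))
                for o in codes) / norm
            for c in codes]

def region_weights(codes):
    inv = [1.0 / d for d in kde_density(codes)]
    z = sum(inv)
    return [w / z for w in inv]

codes = [0.0, 0.05, 0.1, 3.0]   # three common regions plus one rare one
w = region_weights(codes)
```

The isolated code at 3.0 has the lowest density, so it receives the largest weight, countering the imbalanced observation problem the abstract describes.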

PeakConv: Learning Peak Receptive Field for Radar Semantic Segmentation
Zhang, Liwen and Zhang, Xinyan and Zhang, Youcheng and Guo, Yufei and Chen, Yuanpei and Huang, Xuhui and Ma, Zhe



Research question: How to apply modern machine learning to radar scene understanding, in particular radar semantic segmentation.
Motivation: Existing convolution operations are not specific to interpreting radar signals, so a method tailored to the characteristics of radar signals is needed.
Method: We propose the peak convolution operation (PeakConv), which redefines the receptive field of convolution as a peak receptive field and learns object signatures in an end-to-end network.
Results: By incorporating PeakConv layers into the encoder, our radar semantic segmentation network outperforms other recent methods on a multi-view real-measured dataset.

The modern machine learning-based technologies have shown considerable potential in automatic radar scene understanding. Among these efforts, radar semantic segmentation (RSS) can provide more refined and detailed information including the moving objects and background clutters within the effective receptive field of the radar. Motivated by the success of convolutional networks in various visual computing tasks, these networks have also been introduced to solve RSS task. However, neither the regular convolution operation nor the modified ones are specific to interpret radar signals. The receptive fields of existing convolutions are defined by the object presentation in optical signals, but these two signals have different perception mechanisms. In classic radar signal processing, the object signature is detected according to a local peak response, i.e., CFAR detection. Inspired by this idea, we redefine the receptive field of the convolution operation as the peak receptive field (PRF) and propose the peak convolution operation (PeakConv) to learn the object signatures in an end-to-end network. By incorporating the proposed PeakConv layers into the encoders, our RSS network can achieve better segmentation results compared with other SoTA methods on a multi-view real-measured dataset collected from an FMCW radar. Our code for PeakConv is available at https://github.com/zlw9161/PKC.
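The classic CFAR detection that inspires the peak receptive field compares a cell under test against a noise level estimated from reference cells, skipping the guard cells next to it. Below is a hedged 1-D illustration of that idea; the signal, window sizes, and threshold factor are toy values, and PeakConv itself replaces the fixed threshold with learned convolution over this receptive field.

```python
# Illustrative cell-averaging CFAR check (the idea behind PeakConv's
# peak receptive field): a cell is a "peak" if it exceeds a factor times
# the mean of reference cells, with guard cells excluded.

def cfar_detect(signal, guard=1, ref=2, factor=2.0):
    hits = []
    for i in range(len(signal)):
        refs = []
        for off in range(guard + 1, guard + ref + 1):
            if i - off >= 0:
                refs.append(signal[i - off])
            if i + off < len(signal):
                refs.append(signal[i + off])
        noise = sum(refs) / len(refs)
        if signal[i] > factor * noise:
            hits.append(i)
    return hits

signal = [1, 1, 1, 9, 1, 1, 1, 1]
print(cfar_detect(signal))  # only the local peak at index 3 is detected
```

The guard cells keep the target's own energy out of the noise estimate, which is exactly the sampling pattern PeakConv bakes into its receptive field.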

Parallel Diffusion Models of Operator and Image for Blind Inverse Problems
Chung, Hyungjin and Kim, Jeongsol and Kim, Sehui and Ye, JongChul



Research question: Diffusion models achieve state-of-the-art performance on inverse problems when the forward operator is known (non-blind), but their applicability to blind inverse problems remains unexplored.
Motivation: By constructing an additional diffusion prior for the forward operator, we can solve a family of blind inverse problems.
Method: Specifically, parallel reverse diffusion guided by gradients from the intermediate stages jointly optimizes the forward operator parameters and the image, so that both are estimated at the end of the parallel reverse diffusion procedure.
Results: We demonstrate the efficacy of the method on two representative tasks, blind deblurring and imaging through turbulence, achieving state-of-the-art performance; the method also applies flexibly to general blind inverse problems whenever the functional form is known.

Diffusion model-based inverse problem solvers have demonstrated state-of-the-art performance in cases where the forward operator is known (i.e. non-blind). However, the applicability of the method to blind inverse problems has yet to be explored. In this work, we show that we can indeed solve a family of blind inverse problems by constructing another diffusion prior for the forward operator. Specifically, parallel reverse diffusion guided by gradients from the intermediate stages enables joint optimization of both the forward operator parameters as well as the image, such that both are jointly estimated at the end of the parallel reverse diffusion procedure. We show the efficacy of our method on two representative tasks --- blind deblurring, and imaging through turbulence --- and show that our method yields state-of-the-art performance, while also being flexible to be applicable to general blind inverse problems when we know the functional forms. Code available: https://github.com/BlindDPS/blind-dps

Local-Guided Global: Paired Similarity Representation for Visual Reinforcement Learning
Choi, Hyesong and Lee, Hunsang and Song, Wonil and Jeon, Sangryul and Sohn, Kwanghoon and Min, Dongbo



Research question: Existing vision-based reinforcement learning methods focus on extracting high-level global features from raw pixels and disregard the local spatial structures present in consecutively stacked frames.
Motivation: This paper proposes a novel self-supervised approach, Paired Similarity Representation Learning (PSRL), to effectively encode spatial structures in an unsupervised manner.
Method: An encoder first generates latent volumes for the input frames individually; these volumes capture the variance of local spatial structures, i.e., correspondence maps among multiple frames. A global prediction module then learns global semantic representations by predicting future state representations, using the action vector as a medium.
Results: Experimental results on complex tasks in Atari games and the DeepMind Control Suite show that learning structured representations with the proposed method significantly boosts the performance of RL methods.

Recent vision-based reinforcement learning (RL) methods have found extracting high-level features from raw pixels with self-supervised learning to be effective in learning policies. However, these methods focus on learning global representations of images, and disregard local spatial structures present in the consecutively stacked frames. In this paper, we propose a novel approach, termed self-supervised Paired Similarity Representation Learning (PSRL) for effectively encoding spatial structures in an unsupervised manner. Given the input frames, the latent volumes are first generated individually using an encoder, and they are used to capture the variance in terms of local spatial structures, i.e., correspondence maps among multiple frames. This enables for providing plenty of fine-grained samples for training the encoder of deep RL. We further attempt to learn the global semantic representations in the global prediction module that predicts future state representations using action vector as a medium. The proposed method imposes similarity constraints on the three latent volumes; transformed query representations by estimated pixel-wise correspondence, predicted query representations from the global prediction model, and target representations of future state, guiding global prediction with locality-inherent volume. Experimental results on complex tasks in Atari Games and DeepMind Control Suite demonstrate that the RL methods are significantly boosted by the proposed self-supervised learning of structured representations.

LargeKernel3D: Scaling Up Kernels in 3D Sparse CNNs
Chen, Yukang and Liu, Jianhui and Zhang, Xiangyu and Qi, Xiaojuan and Jia, Jiaya



Research question: Directly applying large convolutional kernels in 3D CNNs meets severe difficulties; module designs that succeed in 2D become surprisingly ineffective in 3D networks.
Motivation: To address this key challenge, we propose spatial-wise partition convolution and its large-kernel module.
Method: We avoid the optimization and efficiency issues of naive large 3D kernels by means of spatial-wise partition convolution and the large-kernel module.
Results: The proposed LargeKernel3D network yields notable improvements on 3D semantic segmentation and object detection tasks and ranks 1st on the nuScenes LiDAR leaderboard.

Recent advance in 2D CNNs has revealed that large kernels are important. However, when directly applying large convolutional kernels in 3D CNNs, severe difficulties are met, where those successful module designs in 2D become surprisingly ineffective on 3D networks, including the popular depth-wise convolution. To address this vital challenge, we instead propose the spatial-wise partition convolution and its large-kernel module. As a result, it avoids the optimization and efficiency issues of naive 3D large kernels. Our large-kernel 3D CNN network, LargeKernel3D, yields notable improvement in 3D tasks of semantic segmentation and object detection. It achieves 73.9% mIoU on the ScanNetv2 semantic segmentation and 72.8% NDS nuScenes object detection benchmarks, ranking 1st on the nuScenes LIDAR leaderboard. The performance further boosts to 74.2% NDS with a simple multi-modal fusion. In addition, LargeKernel3D can be scaled to 17x17x17 kernel size on Waymo 3D object detection. For the first time, we show that large kernels are feasible and essential for 3D visual tasks.

Long Range Pooling for 3D Large-Scale Scene Understanding
Li, Xiang-Li and Guo, Meng-Hao and Mu, Tai-Jiang and Martin, Ralph R. and Hu, Shi-Min



Research question: This paper analyzes and explores the key factors behind the success of vision transformers and large kernel designs in convolutional neural networks (CNNs).
Motivation: Drawing on the success of vision transformers and large kernel designs, the authors argue that a larger receptive field and operations with greater non-linearity are two key factors for large-scale 3D scene understanding.
Method: The authors propose a simple yet effective long range pooling (LRP) module that uses dilated max pooling to give the network a large adaptive receptive field, and, based on LRP, present a complete network architecture for 3D understanding, LRPNet.
Results: Ablation studies show that the LRP module achieves better results than large kernel convolution with reduced computation, thanks to its non-linearity; LRPNet also performs favorably on various benchmarks, demonstrating its effectiveness.

Inspired by the success of recent vision transformers and large kernel design in convolutional neural networks (CNNs), in this paper, we analyze and explore essential reasons for their success. We claim two factors that are critical for 3D large-scale scene understanding: a larger receptive field and operations with greater non-linearity. The former is responsible for providing long range contexts and the latter can enhance the capacity of the network. To achieve the above properties, we propose a simple yet effective long range pooling (LRP) module using dilation max pooling, which provides a network with a large adaptive receptive field. LRP has few parameters, and can be readily added to current CNNs. Also, based on LRP, we present an entire network architecture, LRPNet, for 3D understanding. Ablation studies are presented to support our claims, and show that the LRP module achieves better results than large kernel convolution yet with reduced computation, due to its non-linearity. We also demonstrate the superiority of LRPNet on various benchmarks: LRPNet performs the best on ScanNet and surpasses other CNN-based methods on S3DIS and Matterport3D. Code will be avalible at https://github.com/li-xl/LRPNet.
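The dilated max pooling at the heart of LRP is easy to sketch in one dimension. This is an illustrative toy under stated assumptions (1-D input, same-padding by clipping at the borders), not the LRPNet implementation, which operates on sparse 3D features.

```python
# Sketch of dilated max pooling (the LRP building block): with kernel
# size k and dilation d, each output position sees taps spaced d apart,
# enlarging the receptive field with zero extra parameters and a
# non-linear (max) operation. 1-D input and sizes are illustrative.

def dilated_max_pool(x, kernel=3, dilation=2):
    out = []
    half = kernel // 2
    for i in range(len(x)):
        taps = [x[i + j * dilation]
                for j in range(-half, half + 1)
                if 0 <= i + j * dilation < len(x)]
        out.append(max(taps))
    return out

x = [0, 1, 0, 5, 0, 2, 0, 1]
print(dilated_max_pool(x))
```

Stacking such layers with growing dilation multiplies the receptive field while the max keeps the operation non-linear, which is the paper's claimed advantage over large kernel convolution.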

TriVol: Point Cloud Rendering via Triple Volumes
Hu, Tao and Xu, Xiaogang and Chu, Ruihang and Jia, Jiaya



Research question: Existing learning-based point cloud rendering methods struggle to extract continuous and discriminative 3D features, causing artifacts in the rendered images.
Motivation: To address this problem, the paper proposes TriVol, a dense yet lightweight 3D representation that can be combined with NeRF to render photo-realistic images from point clouds.
Method: TriVol consists of three parts, each encoded from the input point cloud. The representation has two advantages: it fuses the respective fields at different scales to extract local and non-local features for discriminative representation, and, since the volume size is greatly reduced, the 3D decoder can be inferred efficiently, allowing a higher-resolution 3D space that renders more point details.
Results: Extensive experiments on different scenes/objects and comparisons with current methods demonstrate the framework's effectiveness; it also generalizes well, rendering a whole category of scenes or objects without fine-tuning.

Existing learning-based methods for point cloud rendering adopt various 3D representations and feature querying mechanisms to alleviate the sparsity problem of point clouds. However, artifacts still appear in the rendered images, due to the challenges in extracting continuous and discriminative 3D features from point clouds. In this paper, we present a dense while lightweight 3D representation, named TriVol, that can be combined with NeRF to render photo-realistic images from point clouds. Our TriVol consists of triple slim volumes, each of which is encoded from the input point cloud. Our representation has two advantages. First, it fuses the respective fields at different scales and thus extracts local and non-local features for discriminative representation. Second, since the volume size is greatly reduced, our 3D decoder can be efficiently inferred, allowing us to increase the resolution of the 3D space to render more point details. Extensive experiments on different benchmarks with varying kinds of scenes/objects demonstrate our framework's effectiveness compared with current approaches. Moreover, our framework has excellent generalization ability to render a category of scenes or objects without fine-tuning.

(ML)^2P-Encoder: On Exploration of Channel-Class Correlation for Multi-Label Zero-Shot Learning
Liu, Ziming and Guo, Song and Lu, Xiaocheng and Guo, Jingcai and Zhang, Jiewei and Zeng, Yue and Huo, Fushuo



Research question: This paper addresses multi-label zero-shot learning (MLZSL), where existing methods perform visual-semantic mapping on spatial-class correlation, which can be computationally costly and fails to capture fine-grained class-specific semantics.
Motivation: The authors observe that different channels usually have different sensitivities to classes; this intrinsic channel-class correlation offers a path toward more accurate and class-harmonious feature representations.
Method: The paper proposes a light yet efficient multi-layer perceptron-based encoder, (ML)^2P-Encoder, to extract and preserve channel-wise semantics. The generated feature maps are reorganized into several groups, each of which can be trained independently with the (ML)^2P-Encoder. A global group-wise attention module is further designed to build multi-label specific class relationships among different classes, yielding a novel Channel-Class Correlation MLZSL framework (C^3-MLZSL).
Results: On large-scale MLZSL benchmarks including NUS-WIDE and Open-Images-V4, the model outperforms other representative state-of-the-art models.

Recent studies usually approach multi-label zero-shot learning (MLZSL) with visual-semantic mapping on spatial-class correlation, which can be computationally costly, and worse still, fails to capture fine-grained class-specific semantics. We observe that different channels may usually have different sensitivities on classes, which can correspond to specific semantics. Such an intrinsic channel-class correlation suggests a potential alternative for the more accurate and class-harmonious feature representations. In this paper, our interest is to fully explore the power of channel-class correlation as the unique base for MLZSL. Specifically, we propose a light yet efficient Multi-Label Multi-Layer Perceptron-based Encoder, dubbed (ML)^2P-Encoder, to extract and preserve channel-wise semantics. We reorganize the generated feature maps into several groups, of which each of them can be trained independently with (ML)^2P-Encoder. On top of that, a global group-wise attention module is further designed to build the multi-label specific class relationships among different classes, which eventually fulfills a novel Channel-Class Correlation MLZSL framework (C^3-MLZSL). Extensive experiments on large-scale MLZSL benchmarks including NUS-WIDE and Open-Images-V4 demonstrate the superiority of our model against other representative state-of-the-art models.

MeMaHand: Exploiting Mesh-Mano Interaction for Single Image Two-Hand Reconstruction
Wang, Congyi and Zhu, Feida and Wen, Shilei



Research question: This paper tackles hand reconstruction, proposing to simultaneously reconstruct the meshes and estimate the MANO parameters of two hands from a single RGB image.
Motivation: Existing hand reconstruction methods adopt either parametric or non-parametric 3D hand models, each with its own strengths and weaknesses; the proposed method exploits the merits of both representations.
Method: The paper proposes Mesh-Mano Interaction Blocks (MMIBs), which take mesh vertex positions and MANO parameters as two kinds of query tokens. An MMIB consists of one graph residual block and two transformer encoders equipped with different asymmetric attention masks to model intra-hand and inter-hand attention, respectively. A mesh alignment refinement module is further introduced to improve mesh-image alignment.
Results: On the InterHand2.6M benchmark, the method outperforms state-of-the-art hand reconstruction methods across various hand reconstruction tasks.

Existing methods proposed for hand reconstruction tasks usually parameterize a generic 3D hand model or predict hand mesh positions directly. The parametric representations consisting of hand shapes and rotational poses are more stable, while the non-parametric methods can predict more accurate mesh positions. In this paper, we propose to reconstruct meshes and estimate MANO parameters of two hands from a single RGB image simultaneously to utilize the merits of two kinds of hand representations. To fulfill this target, we propose novel Mesh-Mano interaction blocks (MMIBs), which take mesh vertices positions and MANO parameters as two kinds of query tokens. MMIB consists of one graph residual block to aggregate local information and two transformer encoders to model long-range dependencies. The transformer encoders are equipped with different asymmetric attention masks to model the intra-hand and inter-hand attention, respectively. Moreover, we introduce the mesh alignment refinement module to further enhance the mesh-image alignment. Extensive experiments on the InterHand2.6M benchmark demonstrate promising results over the state-of-the-art hand reconstruction methods.

Asymmetric Feature Fusion for Image Retrieval
Wu, Hui and Wang, Min and Zhou, Wengang and Lu, Zhenbo and Li, Houqiang



Research question: This paper addresses the dilemma between retrieval efficiency and asymmetric accuracy in existing asymmetric retrieval systems.
Motivation: Because the lightweight query model has low capacity, existing approaches are caught between retrieval efficiency and asymmetric accuracy.
Method: The paper proposes an Asymmetric Feature Fusion (AFF) paradigm that improves existing asymmetric retrieval systems by exploiting the complementarity of different features on the gallery side only. Each gallery image is first embedded into various features, e.g., local and global features; a dynamic mixer then aggregates them into one compact embedding for efficient search. On the query side, only a lightweight feature-extraction model is deployed. The query model and the dynamic mixer are jointly trained by sharing a momentum-updated classifier. Notably, the method boosts asymmetric retrieval accuracy without adding any overhead on the query side.
Results: Extensive experiments on various landmark retrieval datasets demonstrate the superiority of the paradigm.

In asymmetric retrieval systems, models with different capacities are deployed on platforms with different computational and storage resources. Despite the great progress, existing approaches still suffer from a dilemma between retrieval efficiency and asymmetric accuracy due to the low capacity of the lightweight query model. In this work, we propose an Asymmetric Feature Fusion (AFF) paradigm, which advances existing asymmetric retrieval systems by considering the complementarity among different features just at the gallery side. Specifically, it first embeds each gallery image into various features, e.g., local features and global features. Then, a dynamic mixer is introduced to aggregate these features into a compact embedding for efficient search. On the query side, only a single lightweight model is deployed for feature extraction. The query model and dynamic mixer are jointly trained by sharing a momentum-updated classifier. Notably, the proposed paradigm boosts the accuracy of asymmetric retrieval without introducing any extra overhead to the query side. Exhaustive experiments on various landmark retrieval datasets demonstrate the superiority of our paradigm.
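The asymmetric pattern above (heavy fused embeddings on the gallery side, one lightweight model on the query side, comparison in a shared space) can be sketched numerically. Everything below is a stand-in, not the paper's implementation: the "dynamic mixer" is replaced by a fixed random projection, and both feature extractors are faked with random matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2n(x):
    # L2-normalize along the last axis so dot products are cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

d = 128  # shared embedding dimension

# Gallery side: each image yields several features (here: fake global + local),
# which a mixer fuses into one compact embedding for search.
gallery_global = rng.normal(size=(1000, 256))
gallery_local = rng.normal(size=(1000, 256))
W_mix = rng.normal(size=(512, d)) / np.sqrt(512)  # "dynamic mixer" stand-in
gallery_emb = l2n(np.concatenate([gallery_global, gallery_local], axis=1) @ W_mix)

# Query side: only a single lightweight model (here: one random projection).
query_feat = rng.normal(size=(1, 256))
W_query = rng.normal(size=(256, d)) / np.sqrt(256)  # lightweight query model stand-in
query_emb = l2n(query_feat @ W_query)

# Asymmetric search: query embeddings are compared directly against the
# fused gallery embeddings; no extra cost is paid on the query side.
scores = query_emb @ gallery_emb.T
top5 = np.argsort(-scores[0])[:5]
```

In the paper the two sides are kept compatible by joint training through the shared momentum-updated classifier; here compatibility is not modeled, only the search-time asymmetry.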

Context-Aware Pretraining for Efficient Blind Image Decomposition
Wang, Chao and Zheng, Zhedong and Quan, Ruijie and Sun, Yifan and Yang, Yi



Research question: This paper addresses Blind Image Decomposition (BID): removing multiple types of degradation at once without foreknowing the noise type.
Motivation: Existing methods typically require massive data supervision, making them infeasible in real-world scenarios. Moreover, the conventional paradigm focuses on mining the abnormal pattern of a superimposed image to separate the noise, which in fact conflicts with the primary image restoration task.
Method: The paper proposes an efficient and simplified paradigm, Context-aware Pretraining (CP), with two pretext tasks: mixed image separation and masked image reconstruction. This paradigm reduces annotation demands and explicitly facilitates context-aware feature learning. A Context-aware Pretrained network (CPNet) is also introduced.
Results: Extensive experiments show the method achieves competitive performance on various BID tasks.

In this paper, we study Blind Image Decomposition (BID), which is to uniformly remove multiple types of degradation at once without foreknowing the noise type. There remain two practical challenges: (1) Existing methods typically require massive data supervision, making them infeasible to real-world scenarios. (2) The conventional paradigm usually focuses on mining the abnormal pattern of a superimposed image to separate the noise, which de facto conflicts with the primary image restoration task. Therefore, such a pipeline compromises repairing efficiency and authenticity. In an attempt to solve the two challenges in one go, we propose an efficient and simplified paradigm, called Context-aware Pretraining (CP), with two pretext tasks: mixed image separation and masked image reconstruction. Such a paradigm reduces the annotation demands and explicitly facilitates context-aware feature learning. Assuming the restoration process follows a structure-to-texture manner, we also introduce a Context-aware Pretrained network (CPNet). In particular, CPNet contains two transformer-based parallel encoders, one information fusion module, and one multi-head prediction module. The information fusion module explicitly utilizes the mutual correlation in the spatial-channel dimension, while the multi-head prediction module facilitates texture-guided appearance flow. Moreover, a new sampling loss along with an attribute label constraint is also deployed to make use of the spatial context, leading to high-fidelity image restoration. Extensive experiments on both real and synthetic benchmarks show that our method achieves competitive performance for various BID tasks.

3D Line Mapping Revisited
Liu, Shaohui and Yu, Yifan and Pautrat, R\'emi and Pollefeys, Marc and Larsson, Viktor



Research question: Current line-based reconstruction methods lag far behind their point-based counterparts.
Motivation: Line segments concisely encode high-level scene layout and are omnipresent in urban landscapes and indoor scenes, yet current line reconstruction methods fail to exploit these advantages.
Method: The paper presents LIMAP, a library for creating 3D line maps from multi-view imagery. Efficient and robust 3D line mapping is achieved by revisiting the degeneracy problem of line triangulation, carefully crafted scoring and track building, and exploiting structural priors such as line coincidence, parallelism, and orthogonality.
Results: Experiments show that LIMAP significantly outperforms existing 3D line mapping approaches. As a byproduct, the method also recovers 3D association graphs between lines and points/vanishing points. In two example applications, visual localization and bundle adjustment, integrating lines alongside points yields the best results.

In contrast to sparse keypoints, a handful of line segments can concisely encode the high-level scene layout, as they often delineate the main structural elements. In addition to offering strong geometric cues, they are also omnipresent in urban landscapes and indoor scenes. Despite their apparent advantages, current line-based reconstruction methods are far behind their point-based counterparts. In this paper we aim to close the gap by introducing LIMAP, a library for 3D line mapping that robustly and efficiently creates 3D line maps from multi-view imagery. This is achieved through revisiting the degeneracy problem of line triangulation, carefully crafted scoring and track building, and exploiting structural priors such as line coincidence, parallelism, and orthogonality. Our code integrates seamlessly with existing point-based Structure-from-Motion methods and can leverage their 3D points to further improve the line reconstruction. Furthermore, as a byproduct, the method is able to recover 3D association graphs between lines and points / vanishing points (VPs). In thorough experiments, we show that LIMAP significantly outperforms existing approaches for 3D line mapping. Our robust 3D line maps also open up new research directions. We show two example applications: visual localization and bundle adjustment, where integrating lines alongside points yields the best results. Code is available at https://github.com/cvg/limap.

Self-Supervised Pre-Training With Masked Shape Prediction for 3D Scene Understanding
Jiang, Li and Yang, Zetong and Shi, Shaoshuai and Golyanik, Vladislav and Dai, Dengxin and Schiele, Bernt



Research question: This paper explores masked signal modeling for 3D scene understanding.
Motivation: Masked signal modeling has greatly advanced self-supervised pre-training for language and 2D images, but its application to 3D scene understanding remains underexplored.
Method: The paper introduces Masked Shape Prediction (MSP), a new framework for masked signal modeling in 3D scenes. MSP uses the essential 3D semantic cue, geometric shape, as the prediction target for masked points. A context-enhanced shape target, consisting of explicit shape context and implicit deep shape features, is proposed to exploit contextual cues in shape prediction. The pre-training architecture is also carefully designed to alleviate masked-shape leakage from point coordinates.
Results: Experiments on multiple 3D understanding tasks on both indoor and outdoor datasets demonstrate that MSP learns good feature representations that consistently boost downstream performance.

Masked signal modeling has greatly advanced self-supervised pre-training for language and 2D images. However, it is still not fully explored in 3D scene understanding. Thus, this paper introduces Masked Shape Prediction (MSP), a new framework to conduct masked signal modeling in 3D scenes. MSP uses the essential 3D semantic cue, i.e., geometric shape, as the prediction target for masked points. The context-enhanced shape target consisting of explicit shape context and implicit deep shape feature is proposed to facilitate exploiting contextual cues in shape prediction. Meanwhile, the pre-training architecture in MSP is carefully designed to alleviate the masked shape leakage from point coordinates. Experiments on multiple 3D understanding tasks on both indoor and outdoor datasets demonstrate the effectiveness of MSP in learning good feature representations to consistently boost downstream performance.

Efficient and Explicit Modelling of Image Hierarchies for Image Restoration
Li, Yawei and Fan, Yuchen and Xiang, Xiaoyu and Demandolx, Denis and Ranjan, Rakesh and Timofte, Radu and Van Gool, Luc



Research question: This paper proposes a mechanism to efficiently and explicitly model image hierarchies in the global, regional, and local range for image restoration.
Motivation: An analysis of two properties of natural images, cross-scale similarity and anisotropic image features, inspires an anchored stripe self-attention that strikes a good balance between the space and time complexity of self-attention and modeling capacity beyond the regional range.
Method: The authors propose a new network architecture, GRL, which explicitly models image hierarchies in the Global, Regional, and Local range via anchored stripe self-attention, window self-attention, and channel-attention-enhanced convolution.
Results: The network is applied to 7 image restoration types, covering both real and synthetic settings, and sets a new state of the art for several of them. Code will be available at https://github.com/ofsoundof/GRL-Image-Restoration.git.

The aim of this paper is to propose a mechanism to efficiently and explicitly model image hierarchies in the global, regional, and local range for image restoration. To achieve that, we start by analyzing two important properties of natural images including cross-scale similarity and anisotropic image features. Inspired by that, we propose the anchored stripe self-attention which achieves a good balance between the space and time complexity of self-attention and the modelling capacity beyond the regional range. Then we propose a new network architecture dubbed GRL to explicitly model image hierarchies in the Global, Regional, and Local range via anchored stripe self-attention, window self-attention, and channel attention enhanced convolution. Finally, the proposed network is applied to 7 image restoration types, covering both real and synthetic settings. The proposed method sets the new state-of-the-art for several of those. Code will be available at https://github.com/ofsoundof/GRL-Image-Restoration.git.

Progressive Random Convolutions for Single Domain Generalization
Choi, Seokeon and Das, Debasmit and Choi, Sungha and Yang, Seunghan and Park, Hyunsin and Yun, Sungrack



Research question: This paper addresses single domain generalization: training a model on a single source domain so that it performs well on arbitrary unseen target domains.
Motivation: Image augmentation based on Random Convolutions (RandConv) is simple and lightweight, but the generated image easily loses semantics as the kernel size increases, and a single convolution operation lacks inherent diversity.
Method: The paper proposes Progressive Random Convolutions (Pro-RandConv), which recursively stacks random convolution layers with a small kernel size instead of increasing the kernel size. This progressive approach mitigates semantic distortion by reducing the influence of pixels far from the center of the theoretical receptive field, and creates more effective virtual domains by gradually increasing style diversity. The basic random convolution layer is further developed into a random convolution block with deformable offsets and affine transformations, both also randomly initialized, to diversify texture and contrast.
Results: Without complex generators or adversarial learning, this simple yet effective augmentation strategy outperforms state-of-the-art methods on single domain generalization benchmarks.

Single domain generalization aims to train a generalizable model with only one source domain to perform well on arbitrary unseen target domains. Image augmentation based on Random Convolutions (RandConv), consisting of one convolution layer randomly initialized for each mini-batch, enables the model to learn generalizable visual representations by distorting local textures despite its simple and lightweight structure. However, RandConv has structural limitations in that the generated image easily loses semantics as the kernel size increases, and lacks the inherent diversity of a single convolution operation. To solve the problem, we propose a Progressive Random Convolution (Pro-RandConv) method that recursively stacks random convolution layers with a small kernel size instead of increasing the kernel size. This progressive approach can not only mitigate semantic distortions by reducing the influence of pixels away from the center in the theoretical receptive field, but also create more effective virtual domains by gradually increasing the style diversity. In addition, we develop a basic random convolution layer into a random convolution block including deformable offsets and affine transformation to support texture and contrast diversification, both of which are also randomly initialized. Without complex generators or adversarial learning, we demonstrate that our simple yet effective augmentation strategy outperforms state-of-the-art methods on single domain generalization benchmarks.
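The core augmentation idea (stack small random convolutions rather than enlarging the kernel) is easy to sketch. This is a minimal numpy toy, not the paper's Pro-RandConv block: it omits the deformable offsets and affine transformations, and it re-samples a fresh 3x3 kernel at each stacked layer, which is an assumption about a detail the paper controls more carefully.

```python
import numpy as np

def random_conv3x3(img, rng):
    """Apply one randomly initialized 3x3 convolution to an HxWxC image,
    shared across channels, with edge padding to keep the spatial size."""
    k = rng.normal(0.0, 1.0 / 3.0, size=(3, 3))
    pad = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(img)
    for dy in range(3):
        for dx in range(3):
            out += k[dy, dx] * pad[dy:dy + img.shape[0], dx:dx + img.shape[1], :]
    return out

def pro_randconv(img, max_depth=5, rng=None):
    """Progressive random convolution sketch: instead of enlarging the
    kernel, recursively re-apply a small random conv a random number of
    times, growing the effective receptive field (and style shift) gradually."""
    if rng is None:
        rng = np.random.default_rng()
    depth = rng.integers(1, max_depth + 1)  # sampled per image/mini-batch
    out = img
    for _ in range(depth):
        out = random_conv3x3(out, rng)
    return out
```

Stacking d layers of 3x3 kernels gives a (2d+1)x(2d+1) effective receptive field, but with the influence of far-away pixels attenuated, which is the semantic-preservation argument made above.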

OPE-SR: Orthogonal Position Encoding for Designing a Parameter-Free Upsampling Module in Arbitrary-Scale Image Super-Resolution
Song, Gaochao and Sun, Qian and Zhang, Luo and Su, Ran and Shi, Jianfeng and He, Ying



Research question: This paper addresses arbitrary-scale image super-resolution by introducing orthogonal position encoding (OPE) and an OPE-Upscale module to improve on existing implicit neural representation (INR) methods.
Motivation: Current arbitrary-scale super-resolution methods rely mainly on INR, which requires many training parameters and is computationally inefficient. The proposed OPE-Upscale module needs no training parameters, improving efficiency and performance.
Method: The paper replaces the INR-based upsampling module with the OPE-Upscale module, which directly performs linear combination operations without any training parameters, enabling continuous and arbitrary-scale image reconstruction.
Results: Experiments show the method achieves results comparable to existing methods on arbitrary-scale super-resolution while being more computationally efficient and consuming less memory. The authors also verify that OPE corresponds to a set of orthogonal bases, confirming the design principle.

Arbitrary-scale image super-resolution (SR) is often tackled using the implicit neural representation (INR) approach, which relies on a position encoding scheme to improve its representation ability. In this paper, we introduce orthogonal position encoding (OPE), an extension of position encoding, and an OPE-Upscale module to replace the INR-based upsampling module for arbitrary-scale image super-resolution. Our OPE-Upscale module takes 2D coordinates and latent code as inputs, just like INR, but does not require any training parameters. This parameter-free feature allows the OPE-Upscale module to directly perform linear combination operations, resulting in continuous image reconstruction and achieving arbitrary-scale image reconstruction. As a concise SR framework, our method is computationally efficient and consumes less memory than state-of-the-art methods, as confirmed by extensive experiments and evaluations. In addition, our method achieves comparable results with state-of-the-art methods in arbitrary-scale image super-resolution. Lastly, we show that OPE corresponds to a set of orthogonal basis, validating our design principle.
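The parameter-free linear-combination step can be illustrated in 1D. The basis below is a generic Fourier-style orthogonal family standing in for the paper's 2D OPE, and the latent code is random rather than produced by an encoder; the point is only that one latent code can be decoded at any sampling density with no trainable upsampling parameters.

```python
import numpy as np

def ope_basis(coords, n_freq):
    """Evaluate an orthogonal (Fourier-like) position-encoding basis at
    continuous coordinates in [-1, 1]. A 1D stand-in for the paper's OPE."""
    ks = np.arange(1, n_freq + 1)
    return np.concatenate(
        [np.cos(np.pi * np.outer(coords, ks)),
         np.sin(np.pi * np.outer(coords, ks))], axis=1)

rng = np.random.default_rng(0)
latent = rng.normal(size=16 * 2)  # hypothetical latent code from some encoder

# "Upsampling" is just a linear combination of basis functions with the
# latent code as coefficients -- no trainable parameters in this step.
coords_lo = np.linspace(-1, 1, 7)    # one sampling grid
signal_lo = ope_basis(coords_lo, 16) @ latent

coords_hi = np.linspace(-1, 1, 29)   # denser grid, same latent code
signal_hi = ope_basis(coords_hi, 16) @ latent
```

Because the same latent code defines one continuous function, samples taken at shared coordinates agree exactly across scales, which is what makes the reconstruction scale-consistent.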

Implicit Surface Contrastive Clustering for LiDAR Point Clouds
Zhang, Zaiwei and Bai, Min and Li, Erran



Research question: How to leverage self-supervised pretraining on large unlabeled datasets to improve performance on computer vision tasks.
Motivation: Although such pretraining has been highly successful on many computer vision tasks, it has not been widely applied to outdoor LiDAR point cloud perception due to scene complexity and wide sensing range.
Method: The paper proposes ISCC, a new self-supervised pretraining method built around two pretext tasks designed for LiDAR point clouds. The first learns semantic information by sorting local groups of points into a globally consistent set of semantically meaningful clusters via contrastive learning. The second learns geometric structure by reasoning about the precise surfaces of various parts of the scene through implicit surface reconstruction.
Results: Experiments show the method is highly effective for transfer learning on 3D object detection and semantic segmentation in real-world LiDAR scenes. An unsupervised semantic grouping task further showcases the highly semantically meaningful features the method learns.

Self-supervised pretraining on large unlabeled datasets has shown tremendous success on improving the task performance of many computer vision tasks. However, such techniques have not been widely used for outdoor LiDAR point cloud perception due to its scene complexity and wide range. This prevents impactful application from 2D pretraining frameworks. In this paper, we propose ISCC, a new self-supervised pretraining method, core of which are two pretext tasks newly designed for LiDAR point clouds. The first task focuses on learning semantic information by sorting local groups of points in the scene into a globally consistent set of semantically meaningful clusters using contrastive learning. This is augmented with a second task which reasons about precise surfaces of various parts of the scene through implicit surface reconstruction to learn geometric structures. We demonstrate their effectiveness on transfer learning performance on 3D object detection and semantic segmentation in real world LiDAR scenes. We further design an unsupervised semantic grouping task to showcase the highly semantically meaningful features learned by our approach.

Learning Compact Representations for LiDAR Completion and Generation
Xiong, Yuwen and Ma, Wei-Chiu and Wang, Jingkang and Urtasun, Raquel



Research question: How can low-cost, sparse LiDAR data be used to build high-fidelity models of the 3D world?
Motivation: Dense LiDAR sensors are very expensive, and the point clouds captured by low-beam LiDAR are often sparse.
Method: The paper presents UltraLiDAR, a data-driven framework that densifies sparse point clouds by aligning their representation with that of dense point clouds, and learns a discrete codebook to generate diverse, realistic LiDAR point clouds.
Results: Experiments show that UltraLiDAR significantly improves the performance of perception systems, and its generated point clouds are more realistic than prior art: in A/B tests, human participants prefer its results over 98.5% of the time.

LiDAR provides accurate geometric measurements of the 3D world. Unfortunately, dense LiDARs are very expensive and the point clouds captured by low-beam LiDAR are often sparse. To address these issues, we present UltraLiDAR, a data-driven framework for scene-level LiDAR completion, LiDAR generation, and LiDAR manipulation. The crux of UltraLiDAR is a compact, discrete representation that encodes the point cloud's geometric structure, is robust to noise, and is easy to manipulate. We show that by aligning the representation of a sparse point cloud to that of a dense point cloud, we can densify the sparse point clouds as if they were captured by a real high-density LiDAR, drastically reducing the cost. Furthermore, by learning a prior over the discrete codebook, we can generate diverse, realistic LiDAR point clouds for self-driving. We evaluate the effectiveness of UltraLiDAR on sparse-to-dense LiDAR completion and LiDAR generation. Experiments show that densifying real-world point clouds with our approach can significantly improve the performance of downstream perception systems. Compared to prior art on LiDAR generation, our approach generates much more realistic point clouds. According to A/B test, over 98.5% of the time human participants prefer our results over those of previous methods. Please refer to project page https://waabi.ai/research/ultralidar/ for more information.
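The compact discrete representation rests on codebook lookup. A minimal vector-quantization sketch (nearest-neighbor assignment of feature vectors to codebook entries, with the codebook simply given here) shows the encoding step; the actual UltraLiDAR codebook learning, BEV discretization, and generative prior are not reproduced.

```python
import numpy as np

def vq_encode(features, codebook):
    """Map each feature vector to its nearest codebook entry.

    features: (N, D) array of continuous features
    codebook: (K, D) array of discrete code vectors
    Returns the (N,) index array and the (N, D) quantized features.
    """
    # squared Euclidean distance between every feature and every code
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]
```

Once a scene is reduced to such discrete indices, it is compact, robust to small perturbations of the input features, and easy to manipulate or to model with a learned prior, which is the property the framework exploits for completion and generation.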

Improving Graph Representation for Point Cloud Segmentation via Attentive Filtering
Zhang, Nan and Pan, Zhiyi and Li, Thomas H. and Gao, Wei and Li, Ge



Research question: How to combine graph convolution and self-attention to improve point cloud segmentation.
Motivation: Although self-attention networks excel at point cloud segmentation, graph convolutions capture local geometric information more strongly and at lower computational cost.
Method: A hybrid architecture, the Graph Convolution Network with Attentive Filtering (AF-GCN), is proposed. Graph convolutions aggregate local features in the shallow encoder stages, while in the deeper stages a self-attention-like module, the Graph Attentive Filter (GAF), better models long-range context from distant neighbors. To further improve the graph representation, a Spatial Feature Projection (SFP) module for graph convolutions handles the spatial variations of unstructured point clouds. Finally, a graph-shared down-sampling and up-sampling strategy makes full use of graph structures in point cloud processing.
Results: Extensive experiments on S3DIS, ScanNetV2, Toronto-3D, and ShapeNetPart show that AF-GCN achieves competitive performance.

Recently, self-attention networks achieve impressive performance in point cloud segmentation due to their superiority in modeling long-range dependencies. However, compared to self-attention mechanism, we find graph convolutions show a stronger ability in capturing local geometry information with less computational cost. In this paper, we employ a hybrid architecture design to construct our Graph Convolution Network with Attentive Filtering (AF-GCN), which takes advantage of both graph convolution and self-attention mechanism. We adopt graph convolutions to aggregate local features in the shallow encoder stages, while in the deeper stages, we propose a self-attention-like module named Graph Attentive Filter (GAF) to better model long-range contexts from distant neighbors. Besides, to further improve graph representation for point cloud segmentation, we employ a Spatial Feature Projection (SFP) module for graph convolutions which helps to handle spatial variations of unstructured point clouds. Finally, a graph-shared down-sampling and up-sampling strategy is introduced to make full use of the graph structures in point cloud processing. We conduct extensive experiments on multiple datasets including S3DIS, ScanNetV2, Toronto-3D, and ShapeNetPart. Experimental results show our AF-GCN obtains competitive performance.

Activating More Pixels in Image Super-Resolution Transformer
Chen, Xiangyu and Wang, Xintao and Zhou, Jiantao and Qiao, Yu and Dong, Chao



Research question: Existing Transformer models perform well on low-level vision tasks, but attribution analysis shows they exploit only a limited spatial range of the input information.
Motivation: To fully exploit the potential of Transformers and improve reconstruction quality, a novel Hybrid Attention Transformer (HAT) is proposed.
Method: HAT combines channel attention and window-based self-attention to leverage both global statistics and strong local fitting capability. An overlapping cross-attention module is also introduced to enhance interaction between neighboring window features.
Results: Experiments show the proposed modules are effective, and a same-task pre-training strategy further improves the model. HAT outperforms state-of-the-art methods by more than 1 dB.

Transformer-based methods have shown impressive performance in low-level vision tasks, such as image super-resolution. However, we find that these networks can only utilize a limited spatial range of input information through attribution analysis. This implies that the potential of Transformer is still not fully exploited in existing networks. In order to activate more input pixels for better reconstruction, we propose a novel Hybrid Attention Transformer (HAT). It combines both channel attention and window-based self-attention schemes, thus making use of their complementary advantages of being able to utilize global statistics and strong local fitting capability. Moreover, to better aggregate the cross-window information, we introduce an overlapping cross-attention module to enhance the interaction between neighboring window features. In the training stage, we additionally adopt a same-task pre-training strategy to exploit the potential of the model for further improvement. Extensive experiments show the effectiveness of the proposed modules, and we further scale up the model to demonstrate that the performance of this task can be greatly improved. Our overall method significantly outperforms the state-of-the-art methods by more than 1dB. Codes and models are available at https://github.com/XPixelGroup/HAT.

BEV-SAN: Accurate BEV 3D Object Detection via Slice Attention Networks
Chi, Xiaowei and Liu, Jiaming and Lu, Ming and Zhang, Rongyu and Wang, Zhaoqing and Guo, Yandong and Zhang, Shanghang



Research question: How to effectively construct Bird's-Eye-View (BEV) features for 3D object detection.
Motivation: Existing methods construct BEV features by aggregating multi-view camera features onto a flattened grid, which fails to emphasize informative features at different heights.
Method: The paper proposes the BEV Slice Attention Network (BEV-SAN), which first samples along the height dimension to build global and local BEV slices, then aggregates slice features from the camera features and merges them with an attention mechanism, and finally fuses the merged local and global BEV features with a transformer to generate the final feature map for the task heads.
Results: Experiments show that, compared with uniform sampling, BEV-SAN identifies more informative heights, enabling effective BEV feature construction for 3D object detection.

Bird's-Eye-View (BEV) 3D Object Detection is a crucial multi-view technique for autonomous driving systems. Recently, plenty of works are proposed, following a similar paradigm consisting of three essential components, i.e., camera feature extraction, BEV feature construction, and task heads. Among the three components, BEV feature construction is BEV-specific compared with 2D tasks. Existing methods aggregate the multi-view camera features to the flattened grid in order to construct the BEV feature. However, flattening the BEV space along the height dimension fails to emphasize the informative features of different heights. For example, the barrier is located at a low height while the truck is located at a high height. In this paper, we propose a novel method named BEV Slice Attention Network (BEV-SAN) for exploiting the intrinsic characteristics of different heights. Instead of flattening the BEV space, we first sample along the height dimension to build the global and local BEV slices. Then, the features of BEV slices are aggregated from the camera features and merged by the attention mechanism. Finally, we fuse the merged local and global BEV features by a transformer to generate the final feature map for task heads. The purpose of local BEV slices is to emphasize informative heights. In order to find them, we further propose a LiDAR-guided sampling strategy to leverage the statistical distribution of LiDAR to determine the heights of local slices. Compared with uniform sampling, LiDAR-guided sampling can determine more informative heights. We conduct detailed experiments to demonstrate the effectiveness of BEV-SAN. Code will be released.

NeuMap: Neural Coordinate Mapping by Auto-Transdecoder for Camera Localization
Tang, Shitao and Tang, Sicong and Tagliasacchi, Andrea and Tan, Ping and Furukawa, Yasutaka



Research question: How to perform camera localization effectively?
Motivation: Existing feature-matching methods require large amounts of storage, and compression degrades their performance; coordinate regression methods compress well but are less robust.
Method: The paper proposes NeuMap, an end-to-end neural mapping method that encodes a whole scene into a grid of latent codes, with which a Transformer-based auto-decoder regresses the 3D coordinates of query pixels.
Results: On five benchmarks, NeuMap significantly outperforms other coordinate regression methods and, with the network weights kept fixed, rapidly optimizes codes for new scenes while achieving performance comparable to feature-matching methods.

This paper presents an end-to-end neural mapping method for camera localization, dubbed NeuMap, encoding a whole scene into a grid of latent codes, with which a Transformer-based auto-decoder regresses 3D coordinates of query pixels. State-of-the-art feature matching methods require each scene to be stored as a 3D point cloud with per-point features, consuming several gigabytes of storage per scene. While compression is possible, performance drops significantly at high compression rates. Conversely, coordinate regression methods achieve high compression by storing scene information in a neural network but suffer from reduced robustness. NeuMap combines the advantages of both approaches by utilizing 1) learnable latent codes for efficient scene representation and 2) a scene-agnostic Transformer-based auto-decoder to infer coordinates for query pixels. This scene-agnostic network design learns robust matching priors from large-scale data and enables rapid optimization of codes for new scenes while keeping the network weights fixed. Extensive evaluations on five benchmarks show that NeuMap significantly outperforms other coordinate regression methods and achieves comparable performance to feature matching methods while requiring a much smaller scene representation size. For example, NeuMap achieves 39.1% accuracy in the Aachen night benchmark with only 6MB of data, whereas alternative methods require 100MB or several gigabytes and fail completely under high compression settings. The codes are available at https://github.com/Tangshitao/NeuMap.

AShapeFormer: Semantics-Guided Object-Level Active Shape Encoding for 3D Object Detection via Transformers
Li, Zechuan and Yu, Hongshan and Yang, Zhengeng and Chen, Tongjia and Akhtar, Naveed



Research question: Current 3D object detection techniques compute candidate points by aggregating predicted object center-point features, but this ignores object-level shape information, leading to sub-optimal 3D object detection.
Motivation: To address this, the paper proposes AShapeFormer, a semantics-guided object-level shape encoding module that improves 3D object detection.
Method: AShapeFormer is a plug-and-play module that uses multi-head attention to encode object shape information. Shape tokens and object-scene positional encoding are proposed to ensure the shape information is fully exploited, and a semantic guidance sub-module is introduced to better perceive object shapes.
Results: Extensive experiments on the popular SUN RGB-D and ScanNetV2 datasets show that models enhanced with AShapeFormer achieve significantly better detection performance, by up to 8.1%.

3D object detection techniques commonly follow a pipeline that aggregates predicted object central point features to compute candidate points. However, these candidate points contain only positional information, largely ignoring the object-level shape information. This eventually leads to sub-optimal 3D object detection. In this work, we propose AShapeFormer, a semantics-guided object-level shape encoding module for 3D object detection. This is a plug-n-play module that leverages multi-head attention to encode object shape information. We also propose shape tokens and object-scene positional encoding to ensure that the shape information is fully exploited. Moreover, we introduce a semantic guidance sub-module to sample more foreground points and suppress the influence of background points for a better object shape perception. We demonstrate a straightforward enhancement of multiple existing methods with our AShapeFormer. Through extensive experiments on the popular SUN RGB-D and ScanNetV2 dataset, we show that our enhanced models are able to outperform the baselines by a considerable absolute margin of up to 8.1%. Code will be available at https://github.com/ZechuanLi/AShapeFormer

Adaptive Spot-Guided Transformer for Consistent Local Feature Matching
Yu, Jiahuan and Chang, Jiahao and He, Jianfeng and Zhang, Tianzhu and Yu, Jiyang and Wu, Feng



Research question: This paper addresses local feature matching, in particular maintaining local consistency and handling large scale variations.
Motivation: Although existing detector-free methods achieve impressive performance with Transformer architectures, few consider maintaining local consistency, and most struggle with large scale variations.
Method: The paper proposes the Adaptive Spot-Guided Transformer (ASTR) for local feature matching, which jointly models local consistency and scale variations in a unified coarse-to-fine architecture.
Results: ASTR enjoys several merits: a spot-guided aggregation module avoids interference from irrelevant areas during feature aggregation, and an adaptive scaling module adjusts grid sizes according to depth information computed at the fine stage. Extensive experiments on five standard benchmarks show that ASTR outperforms state-of-the-art methods.

Local feature matching aims at finding correspondences between a pair of images. Although current detector-free methods leverage Transformer architecture to obtain an impressive performance, few works consider maintaining local consistency. Meanwhile, most methods struggle with large scale variations. To deal with the above issues, we propose Adaptive Spot-Guided Transformer (ASTR) for local feature matching, which jointly models the local consistency and scale variations in a unified coarse-to-fine architecture. The proposed ASTR enjoys several merits. First, we design a spot-guided aggregation module to avoid interfering with irrelevant areas during feature aggregation. Second, we design an adaptive scaling module to adjust the size of grids according to the calculated depth information at fine stage. Extensive experimental results on five standard benchmarks demonstrate that our ASTR performs favorably against state-of-the-art methods. Our code will be released on https://astr2023.github.io.

Heat Diffusion Based Multi-Scale and Geometric Structure-Aware Transformer for Mesh Segmentation
Wong, Chi-Chong



Research question: How to adapt Transformer models from natural language processing to 3D mesh processing, particularly for 3D shape analysis tasks such as triangle mesh segmentation.
Motivation: The input permutation invariance of Transformers makes them a natural candidate for 3D mesh processing, but extracting multi-scale information from mesh data and capturing its shape-discriminative geometric structures are two main challenges.
Method: A heat-diffusion-based approach is proposed to address these problems. The new Transformer model, MeshFormer, integrates heat diffusion into the multi-head self-attention operation to adaptively capture features from local neighborhoods to global contexts, and applies a novel heat-kernel-signature-based structure encoding to embed the intrinsic geometric structure of mesh instances.
Results: Extensive experiments on triangle mesh segmentation validate the effectiveness of MeshFormer and show significant improvements over current state-of-the-art methods.

Triangle mesh segmentation is an important task in 3D shape analysis, especially in applications such as digital humans and AR/VR. Transformer model is inherently permutation-invariant to input, which makes it a suitable candidate model for 3D mesh processing. However, two main challenges involved in adapting Transformer from natural languages to 3D mesh are yet to be solved, such as i) extracting the multi-scale information of mesh data in an adaptive manner; ii) capturing geometric structures of mesh data as the discriminative characteristics of the shape. Current point based Transformer models fail to tackle such challenges and thus provide inferior performance for discretized surface segmentation. In this work, heat diffusion based method is exploited to tackle these problems. A novel Transformer model called MeshFormer is proposed, which i) integrates Heat Diffusion method into Multi-head Self-Attention operation (HDMSA) to adaptively capture the features from local neighborhood to global contexts; ii) applies a novel Heat Kernel Signature based Structure Encoding (HKSSE) to embed the intrinsic geometric structures of mesh instances into Transformer for structure-aware processing. Extensive experiments on triangle mesh segmentation validate the effectiveness of the proposed MeshFormer model and show significant improvements over current state-of-the-art methods.

Paired-Point Lifting for Enhanced Privacy-Preserving Visual Localization
Lee, Chunghwan and Kim, Jaihoon and Yun, Chanhyuk and Hong, Je Hyeong



Research question: Visual localization, recovering the camera pose from an input image relative to a known scene, is a cornerstone of many vision and robotics systems.
Motivation: Many algorithms localize against a sparse 3D point cloud obtained via structure-from-motion (SfM), but recent studies have shown that such sparse representations can reveal high-fidelity scene appearance, raising privacy concerns.
Method: The paper presents a lightweight alternative strategy, Paired-Point Lifting (PPL), for constructing 3D line clouds. Instead of drawing one randomly oriented line per 3D point as in previous methods, PPL splits the 3D points into pairs and joins each pair to form a 3D line.
Results: Experiments show that PPL effectively conceals scene details without compromising localization accuracy, improving protection against privacy attacks and unlocking the true potential of 3D line clouds.

Visual localization refers to the process of recovering camera pose from input image relative to a known scene, forming a cornerstone of numerous vision and robotics systems. While many algorithms utilize sparse 3D point cloud of the scene obtained via structure-from-motion (SfM) for localization, recent studies have raised privacy concerns by successfully revealing high-fidelity appearance of the scene from such sparse 3D representation. One prominent approach for bypassing this attack was to lift 3D points to randomly oriented 3D lines thereby hiding scene geometry, but latest work have shown such random line cloud has a critical statistical flaw that can be exploited to break through protection. In this work, we present an alternative lightweight strategy called Paired-Point Lifting (PPL) for constructing 3D line clouds. Instead of drawing one randomly oriented line per 3D point, PPL splits 3D points into pairs and joins each pair to form 3D lines. This seemingly simple strategy yields 3 benefits, i) new ambiguity in feature selection, ii) increased line cloud sparsity, and iii) non-trivial distribution of 3D lines, all of which contributes to enhanced protection against privacy attacks. Extensive experimental results demonstrate the strength of PPL in concealing scene details without compromising localization accuracy, unlocking the true potential of 3D line clouds.
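The lifting strategy itself is simple enough to sketch directly: pair the points up and keep only a line per pair (an origin plus a unit direction), discarding the endpoints. The random-permutation pairing policy below is an assumption; the paper may select pairs differently.

```python
import numpy as np

def paired_point_lifting(points, rng=None):
    """Paired-Point Lifting sketch: shuffle the 3D points, split them into
    pairs, and join each pair into a 3D line.

    points: (N, 3) array. Returns (N//2, 3) line origins and unit directions.
    """
    if rng is None:
        rng = np.random.default_rng()
    pts = points[rng.permutation(len(points))]
    pts = pts[: (len(pts) // 2) * 2]  # drop a leftover point if N is odd
    a, b = pts[0::2], pts[1::2]
    direction = b - a
    direction = direction / np.linalg.norm(direction, axis=1, keepdims=True)
    return a, direction  # each line: origin + unit direction, endpoints hidden
```

Because each stored line passes through two true scene points and the pairing is unknown to an attacker, feature-to-point recovery becomes ambiguous, and the line cloud is half the size of a one-line-per-point cloud.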

Depth Estimation From Camera Image and mmWave Radar Point Cloud
Singh, Akash Deep and Ba, Yunhao and Sarker, Ankur and Zhang, Howard and Kadambi, Achuta and Soatto, Stefano and Srivastava, Mani and Wong, Alex



Research question: How to infer dense depth from a camera image and a sparse, noisy radar point cloud?
Motivation: The mechanics of mmWave radar point cloud formation pose challenges: ambiguous elevation and noisy depth and azimuth components yield incorrect positions when projected onto the image, nuances that existing camera-radar fusion work has overlooked.
Method: A network is designed that maps each radar point to the possible surfaces it may project onto in the image plane. Unlike existing work, the raw radar point cloud is not processed as an erroneous depth map; instead, each raw point is queried independently and associated with likely pixels in the image, yielding a semi-dense radar depth map. To fuse radar depth with the image, a gated fusion scheme accounts for the confidence scores of the correspondences, selectively combining radar and camera embeddings into a dense depth map.
Results: On the NuScenes benchmark, the method improves root-mean-square error by 9.1% and mean absolute error by 10.3% over the best prior method.

We present a method for inferring dense depth from a camera image and a sparse, noisy radar point cloud. We first describe the mechanics behind mmWave radar point cloud formation and the challenges that it poses, i.e., ambiguous elevation and noisy depth and azimuth components that yield incorrect positions when projected onto the image, and how existing works have overlooked these nuances in camera-radar fusion. Our approach is motivated by these mechanics, leading to the design of a network that maps each radar point to the possible surfaces that it may project onto in the image plane. Unlike existing works, we do not process the raw radar point cloud as an erroneous depth map, but query each raw point independently to associate it with likely pixels in the image -- yielding a semi-dense radar depth map. To fuse radar depth with an image, we propose a gated fusion scheme that accounts for the confidence scores of the correspondence so that we selectively combine radar and camera embeddings to yield a dense depth map. We test our method on the NuScenes benchmark and show a 10.3% improvement in mean absolute error and a 9.1% improvement in root-mean-square error over the best method.
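The confidence-gated combination of radar and camera embeddings can be sketched as follows. This is a simplified stand-in, assuming a fixed per-pixel gate in [0, 1] (in the paper the gating is learned); the function name and shapes are illustrative:

```python
import numpy as np

def gated_fusion(camera_feat, radar_feat, confidence):
    """Confidence-gated fusion of camera and radar embeddings.

    A per-pixel confidence score gates how much the radar embedding
    contributes before the two streams are combined (hypothetical
    simplification of the learned gating network in the paper).
    """
    gate = np.clip(confidence, 0.0, 1.0)[..., None]  # broadcast over channels
    return camera_feat + gate * radar_feat

cam = np.ones((4, 4, 8))           # H x W x C camera embedding
rad = np.full((4, 4, 8), 2.0)      # H x W x C radar embedding
conf = np.zeros((4, 4))
conf[0, 0] = 1.0                   # only one pixel trusts the radar cue
fused = gated_fusion(cam, rad, conf)
print(fused[0, 0, 0], fused[1, 1, 0])   # 3.0 1.0
```

Where confidence is zero the output falls back to the camera embedding alone, which matches the abstract's motivation of suppressing unreliable radar correspondences.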

Prototypical Residual Networks for Anomaly Detection and Localization
Zhang, Hui and Wu, Zuxuan and Wang, Zheng and Chen, Zhineng and Jiang, Yu-Gang



Research question: Anomaly detection and localization are widely used in industrial manufacturing, but existing supervised models easily over-fit to the few available anomalous samples, and anomalies are subtle and hard to discern and locate.
Motivation: To address these issues, we propose a framework called Prototypical Residual Network (PRN), which learns residual features of varying scales and sizes between anomalous and normal patterns to accurately reconstruct segmentation maps of anomalous regions.
Method: PRN consists of two main parts: multi-scale prototypes that explicitly represent the residual features of anomalies relative to normal patterns, and a multi-size self-attention mechanism that enables variable-sized anomalous feature learning. In addition, we present several anomaly generation strategies that consider both seen and unseen appearance variance to enlarge and diversify anomalies.
Results: Extensive experiments on the challenging and widely used MVTec AD benchmark show that PRN outperforms current state-of-the-art unsupervised and supervised methods. We also achieve state-of-the-art results on three additional datasets, demonstrating PRN's effectiveness and generalizability.

Anomaly detection and localization are widely used in industrial manufacturing for their efficiency and effectiveness. Anomalies are rare and hard to collect, and supervised models easily over-fit to these seen anomalies with a handful of abnormal samples, producing unsatisfactory performance. On the other hand, anomalies are typically subtle, hard to discern, and of various appearance, making it difficult to detect anomalies, let alone locate anomalous regions. To address these issues, we propose a framework called Prototypical Residual Network (PRN), which learns feature residuals of varying scales and sizes between anomalous and normal patterns to accurately reconstruct the segmentation maps of anomalous regions. PRN mainly consists of two parts: multi-scale prototypes that explicitly represent the residual features of anomalies to normal patterns; a multi-size self-attention mechanism that enables variable-sized anomalous feature learning. Besides, we present a variety of anomaly generation strategies that consider both seen and unseen appearance variance to enlarge and diversify anomalies. Extensive experiments on the challenging and widely used MVTec AD benchmark show that PRN outperforms current state-of-the-art unsupervised and supervised methods. We further report SOTA results on three additional datasets to demonstrate the effectiveness and generalizability of PRN.
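The core residual signal PRN builds on can be illustrated at a single scale: each spatial feature is compared against the closest normal-pattern prototype, and the residual highlights anomalous regions. The sketch below assumes given (not learned) prototypes and a hypothetical `prototype_residual` helper:

```python
import numpy as np

def prototype_residual(feature_map, prototypes):
    """Residual between features and their nearest normal prototype.

    Single-scale sketch: large residuals mark regions that deviate
    from every learned normal pattern (prototypes here are fixed).
    """
    f = feature_map.reshape(-1, feature_map.shape[-1])       # (H*W, C)
    d = ((f[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    nearest = prototypes[d.argmin(axis=1)]
    return (f - nearest).reshape(feature_map.shape)

protos = np.array([[0.0, 0.0], [1.0, 1.0]])                  # two normal patterns
fmap = np.array([[[0.1, 0.0], [0.9, 1.0]]])                  # 1 x 2 map, 2 channels
res = prototype_residual(fmap, protos)
print(np.round(res, 2).tolist())   # [[[0.1, 0.0], [-0.1, 0.0]]]
```

In PRN these residuals are computed at multiple scales and fed to the segmentation decoder; here both locations sit near a prototype, so the residuals are small.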

Vector Quantization With Self-Attention for Quality-Independent Representation Learning
Yang, Zhou and Dong, Weisheng and Li, Xin and Huang, Mengluan and Sun, Yulin and Shi, Guangming



Research question: How to improve the recognition robustness of deep neural network models on low-quality images.
Motivation: The robustness of deep neural networks has drawn extensive attention due to the potential distribution shift between training and testing data.
Method: Introduce discrete vector quantization (VQ) to remove redundancy in recognition models: first add a codebook module to the network to quantize deep features, then concatenate them and design a self-attention module to enhance the representation. During training, features from clean and corrupted images are forced to be quantized into the same discrete embedding space, so that a quality-independent feature representation is learned and the recognition robustness on low-quality images improves.
Results: Qualitative and quantitative results show that the method achieves this goal effectively, reaching a new state-of-the-art 43.1% mCE on ImageNet-C with a ResNet50 backbone. On other robustness benchmarks, such as ImageNet-R, accuracy also improves by almost 2%.

Recently, the robustness of deep neural networks has drawn extensive attention due to the potential distribution shift between training and testing data (e.g., deep models trained on high-quality images are sensitive to corruption during testing). Many researchers attempt to make the model learn invariant representations from multiple corrupted data through data augmentation or image-pair-based feature distillation to improve the robustness. Inspired by sparse representation in image restoration, we opt to address this issue by learning image-quality-independent feature representation in a simple plug-and-play manner, that is, to introduce discrete vector quantization (VQ) to remove redundancy in recognition models. Specifically, we first add a codebook module to the network to quantize deep features. Then we concatenate them and design a self-attention module to enhance the representation. During training, we enforce the quantization of features from clean and corrupted images in the same discrete embedding space so that an invariant quality-independent feature representation can be learned to improve the recognition robustness of low-quality images. Qualitative and quantitative experimental results show that our method achieved this goal effectively, leading to a new state-of-the-art result of 43.1% mCE on ImageNet-C with ResNet50 as the backbone. On other robustness benchmark datasets, such as ImageNet-R, our method also has an accuracy improvement of almost 2%.
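The quantization step at the heart of this idea — snapping deep features to a shared discrete codebook so clean and corrupted versions of an image land on the same codes — can be sketched as a nearest-codeword lookup. Names and shapes below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def vector_quantize(features, codebook):
    """Quantize each feature vector to its nearest codebook entry.

    Minimal VQ sketch: squared Euclidean distances to every codeword,
    then an argmin lookup into the discrete embedding space.
    """
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = d.argmin(axis=1)
    return codebook[codes], codes

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
feats = np.array([[0.1, -0.1],    # e.g. a "clean" feature
                  [0.9, 1.2]])    # e.g. its "corrupted" counterpart elsewhere
quantized, codes = vector_quantize(feats, codebook)
print(codes)   # [0 1]
```

Perturbed features that stay within a codeword's Voronoi cell map to the same discrete code, which is what makes the downstream representation quality-independent.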

DeepSolo: Let Transformer Decoder With Explicit Points Solo for Text Spotting
Ye, Maoyuan and Zhang, Jing and Zhao, Shanshan and Liu, Juhua and Liu, Tongliang and Du, Bo and Tao, Dacheng



Research question: This paper addresses the integration of scene text detection and recognition in end-to-end text spotting and the handling of the relationship between the two sub-tasks.
Motivation: Although Transformer-based methods eliminate heuristic post-processing, they still suffer from the synergy issue between the sub-tasks and from low training efficiency.
Method: We present DeepSolo, a simple DETR-like baseline that lets a single decoder with explicit points perform text detection and recognition simultaneously. Specifically, for each text instance we represent the character sequence as ordered points and model them with learnable explicit point queries. After a single decoder, the point queries have encoded the requisite text semantics and locations and can be further decoded into the center line, boundary, script, and confidence of the text by very simple parallel prediction heads. We also introduce a text-matching criterion to deliver more accurate supervisory signals, enabling more efficient training.
Results: Quantitative experiments show that DeepSolo outperforms previous state-of-the-art methods and achieves better training efficiency. DeepSolo is also compatible with line annotations, which cost far less to produce than polygons.

End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Dealing with the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although Transformer-based methods eliminate the heuristic post-processing, they still suffer from the synergy issue between the sub-tasks and low training efficiency. In this paper, we present DeepSolo, a simple DETR-like baseline that lets a single Decoder with Explicit Points Solo for text detection and recognition simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing a single decoder, the point queries have encoded requisite text semantics and locations, thus can be further decoded to the center line, boundary, script, and confidence of text via very simple prediction heads in parallel. Besides, we also introduce a text-matching criterion to deliver more accurate supervisory signals, thus enabling more efficient training. Quantitative experiments on public benchmarks demonstrate that DeepSolo outperforms previous state-of-the-art methods and achieves better training efficiency. In addition, DeepSolo is also compatible with line annotations, which require much less annotation cost than polygons. The code is available at https://github.com/ViTAE-Transformer/DeepSolo.

TINC: Tree-Structured Implicit Neural Compression
Yang, Runzhao



Research question: How to effectively compress and represent complex, diverse data.
Motivation: Although implicit neural representation (INR) can describe target scenes with high fidelity using few parameters, its spectrum coverage is limited, and removing redundancy in complex, diverse data is difficult.
Method: Propose a Tree-structured Implicit Neural Compression (TINC) method that builds compact representations of local regions and extracts the shared features of these local representations hierarchically by spatial distance. Specifically, multi-layer perceptrons (MLPs) fit the partitioned local regions, and these MLPs are organized in a tree structure to share parameters according to spatial distance. This parameter-sharing scheme not only ensures continuity between adjacent regions but also jointly removes local and non-local redundancy.
Results: Experiments show that TINC improves the compression fidelity of INR and exhibits impressive compression ability compared with commercial tools and other deep-learning methods. The method is also highly flexible and can be tailored to different data and parameter settings.

Implicit neural representation (INR) can describe the target scenes with high fidelity using a small number of parameters, and is emerging as a promising data compression technique. However, limited spectrum coverage is intrinsic to INR, and it is non-trivial to remove redundancy in diverse complex data effectively. Preliminary studies can only exploit either global or local correlation in the target data and are thus of limited performance. In this paper, we propose a Tree-structured Implicit Neural Compression (TINC) to conduct compact representation for local regions and extract the shared features of these local representations in a hierarchical manner. Specifically, we use Multi-Layer Perceptrons (MLPs) to fit the partitioned local regions, and these MLPs are organized in tree structure to share parameters according to the spatial distance. The parameter sharing scheme not only ensures the continuity between adjacent regions, but also jointly removes the local and non-local redundancy. Extensive experiments show that TINC improves the compression fidelity of INR, and has shown impressive compression capabilities over commercial tools and other deep learning based methods. Besides, the approach is of high flexibility and can be tailored for different data and parameter settings. The source code can be found at https://github.com/RichealYoung/TINC.

NerVE: Neural Volumetric Edges for Parametric Curve Extraction From Point Cloud
Zhu, Xiangyu and Du, Dong and Chen, Weikai and Zhao, Zhiyou and Nie, Yinyu and Han, Xiaoguang



Research question: Extracting parametric edge curves from point clouds is a fundamental problem in 3D vision and geometry processing.
Motivation: Existing methods mainly rely on keypoint detection, a challenging procedure that tends to produce noisy output and makes subsequent edge extraction error-prone.
Method: We propose NerVE, a novel neural volumetric edge representation that can be easily learned through a volumetric learning framework. NerVE can be seamlessly converted to a versatile piece-wise linear (PWL) curve representation, enabling a unified learning strategy for all types of free-form curves.
Results: Evaluated on the challenging ABC dataset, a simple network based on NerVE already outperforms previous state-of-the-art methods by a large margin.

Extracting parametric edge curves from point clouds is a fundamental problem in 3D vision and geometry processing. Existing approaches mainly rely on keypoint detection, a challenging procedure that tends to generate noisy output, making the subsequent edge extraction error-prone. To address this issue, we propose to directly detect structured edges to circumvent the limitations of the previous point-wise methods. We achieve this goal by presenting NerVE, a novel neural volumetric edge representation that can be easily learned through a volumetric learning framework. NerVE can be seamlessly converted to a versatile piece-wise linear (PWL) curve representation, enabling a unified strategy for learning all types of free-form curves. Furthermore, as NerVE encodes rich structural information, we show that edge extraction based on NerVE can be reduced to a simple graph search problem. After converting NerVE to the PWL representation, parametric curves can be obtained via off-the-shelf spline fitting algorithms. We evaluate our method on the challenging ABC dataset. We show that a simple network based on NerVE can already outperform the previous state-of-the-art methods by a great margin.

Generalized Relation Modeling for Transformer Tracking
Gao, Shenyuan and Zhou, Chunluan and Zhang, Jun



Research question: Existing one-stream trackers let the template interact with all parts of the search region, which may lead to target-background confusion.
Motivation: To address this, we propose a generalized relation modeling method based on adaptive token division.
Method: The method is a generalized formulation of attention-based relation modeling for Transformer tracking; by selecting appropriate search tokens to interact with template tokens, it inherits the merits of both the two-stream and one-stream pipelines while enabling more flexible relation modeling. An attention masking strategy and the Gumbel-Softmax technique are introduced to facilitate parallel computation and end-to-end learning of the token division module.
Results: Extensive experiments show that our method is superior to both the two-stream and one-stream pipelines and achieves state-of-the-art performance on six challenging benchmarks while maintaining real-time running speed.

Compared with previous two-stream trackers, the recent one-stream tracking pipeline, which allows earlier interaction between the template and search region, has achieved a remarkable performance gain. However, existing one-stream trackers always let the template interact with all parts inside the search region throughout all the encoder layers. This could potentially lead to target-background confusion when the extracted feature representations are not sufficiently discriminative. To alleviate this issue, we propose a generalized relation modeling method based on adaptive token division. The proposed method is a generalized formulation of attention-based relation modeling for Transformer tracking, which inherits the merits of both previous two-stream and one-stream pipelines whilst enabling more flexible relation modeling by selecting appropriate search tokens to interact with template tokens. An attention masking strategy and the Gumbel-Softmax technique are introduced to facilitate the parallel computation and end-to-end learning of the token division module. Extensive experiments show that our method is superior to the two-stream and one-stream pipelines and achieves state-of-the-art performance on six challenging benchmarks with a real-time running speed.
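The Gumbel-Softmax technique mentioned above is what makes the discrete "interact / don't interact" token decision differentiable. A minimal NumPy sketch of the trick, with illustrative per-token logits (the actual token division module is a learned network):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable (soft) sampling of a one-hot selection.

    Gumbel noise is added to the logits and a temperature-scaled
    softmax yields a relaxed categorical sample.
    """
    rng = np.random.default_rng(rng)
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logits.shape)))
    y = (logits + g) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))
    return y / y.sum(axis=-1, keepdims=True)

# per-token logits for "interact with template" vs "search-region only"
logits = np.array([[2.0, -2.0],
                   [-2.0, 2.0]])
probs = gumbel_softmax(logits, tau=0.5, rng=0)
print(probs.shape)   # (2, 2)
```

At low temperature the rows approach one-hot vectors, so the division behaves like a hard token split at inference while staying differentiable during training.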

SmartAssign: Learning a Smart Knowledge Assignment Strategy for Deraining and Desnowing
Wang, Yinglong and Ma, Chao and Liu, Jianzhuang



Research question: Existing methods mainly handle single weather types, while the connections between different weather conditions at the deep representation level are usually ignored.
Motivation: Used properly, these connections can generate complementary representations that make up for insufficient training data, bringing performance gains and better generalization.
Method: This paper studies the connections between the closely related rain and snow at the deep representation level and proposes a smart knowledge assignment strategy, SmartAssign, to optimally assign the knowledge learned from both tasks to a specific one.
Results: Extensive experiments verify that SmartAssign explores effective connections between rain and snow and appreciably improves both deraining and desnowing performance.

Existing methods mainly handle single weather types. However, the connections between different weather conditions at the deep representation level are usually ignored. These connections, if used properly, can generate complementary representations for each other to make up for insufficient training data, obtaining positive performance gains and better generalization. In this paper, we focus on the closely correlated rain and snow to explore their connections at the deep representation level. Because sub-optimal connections may cause negative effects, another issue is how, when rain and snow are handled in a multi-task learning manner, to find an optimal connection strategy that simultaneously improves deraining and desnowing performance. To build the desired connection, we propose a smart knowledge assignment strategy, called SmartAssign, to optimally assign the knowledge learned from both tasks to a specific one. In order to further enhance the accuracy of knowledge assignment, we propose a novel knowledge contrast mechanism, so that the knowledge assigned to different tasks preserves better uniqueness. Since inherited inductive biases usually limit the modelling ability of CNNs, we introduce a novel transformer block to constitute the backbone of our network to effectively combine long-range context dependency and local image details. Extensive experiments on seven benchmark datasets verify that the proposed SmartAssign explores effective connections between rain and snow, and appreciably improves the performance of both deraining and desnowing. The implementation code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/SmartAssign.

Regularize Implicit Neural Representation by Itself
Li, Zhemin and Wang, Hongxia and Meng, Deyu



Research question: This paper proposes a regularizer called Implicit Neural Representation Regularizer (INRR) to improve the generalization ability of implicit neural representation (INR).
Motivation: Although INR is a fully connected network that can represent signal details unrestricted by grid resolution, its generalization ability needs improvement, especially on non-uniformly sampled data.
Method: The proposed INRR is based on a learned Dirichlet Energy (DE) that measures similarities between rows/columns of the matrix. By parameterizing DE with a tiny INR, the smoothness of the Laplacian matrix is further integrated.
Results: Through well-designed numerical experiments, the paper also reveals a series of properties derived from INRR, including a momentum-method-like convergence trajectory and multi-scale similarity. The proposed method can also improve the performance of other signal representation methods.

This paper proposes a regularizer called Implicit Neural Representation Regularizer (INRR) to improve the generalization ability of the Implicit Neural Representation (INR). The INR is a fully connected network that can represent signals with details not restricted by grid resolution. However, its generalization ability could be improved, especially with non-uniformly sampled data. The proposed INRR is based on learned Dirichlet Energy (DE) that measures similarities between rows/columns of the matrix. The smoothness of the Laplacian matrix is further integrated by parameterizing DE with a tiny INR. INRR improves the generalization of INR in signal representation by perfectly integrating the signal's self-similarity with the smoothness of the Laplacian matrix. Through well-designed numerical experiments, the paper also reveals a series of properties derived from INRR, including a momentum-method-like convergence trajectory and multi-scale similarity. Moreover, the proposed method could improve the performance of other signal representation methods.
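The Dirichlet energy that INRR builds on has a compact closed form: for a signal X on a graph with similarity matrix W and Laplacian L = D - W, the energy is tr(X^T L X), which is small when similar rows carry similar values. A minimal numeric sketch (W is given here, whereas INRR learns it):

```python
import numpy as np

def dirichlet_energy(X, W):
    """Dirichlet energy tr(X^T L X) of a signal X on a graph.

    W holds pairwise similarities between rows of the matrix;
    L = D - W is the combinatorial graph Laplacian.
    """
    L = np.diag(W.sum(axis=1)) - W
    return np.trace(X.T @ L @ X)

W = np.array([[0.0, 1.0],
              [1.0, 0.0]])            # two rows declared fully similar
X_same = np.array([[1.0], [1.0]])     # agreeing values -> zero energy
X_diff = np.array([[1.0], [-1.0]])    # disagreeing values -> penalized
print(dirichlet_energy(X_same, W), dirichlet_energy(X_diff, W))  # 0.0 4.0
```

Minimizing this quantity as a regularizer pushes the representation toward respecting the learned row/column similarities, which is how INRR injects self-similarity into INR training.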

DropKey for Vision Transformer
Li, Bonan and Hu, Yinhan and Nie, Xuecheng and Han, Congying and Jiang, Xiangjian and Guo, Tiande and Liu, Luoqi



Research question: This paper analyzes and improves the dropout technique for self-attention layers of Vision Transformers, an important issue overlooked by prior work.
Motivation: Unlike dropping attention weights as in the literature, we propose moving the dropout operation forward, ahead of the attention matrix calculation, with the Key as the dropout unit, yielding a novel dropout-before-softmax scheme.
Method: We address three core questions: first, what to drop in self-attention layers; second, how to schedule the drop ratio across consecutive layers; third, whether structured dropout operations, as in CNNs, are necessary.
Results: Experimental results show that the proposed DropKey method, which treats the Key as the dropout unit and uses a decreasing drop-ratio schedule, effectively improves performance across various ViT architectures and vision tasks.

In this paper, we focus on analyzing and improving the dropout technique for self-attention layers of Vision Transformers, which is important yet surprisingly ignored by prior works. In particular, we conduct research on three core questions: First, what to drop in self-attention layers? Different from dropping attention weights as in the literature, we propose to move the dropout operation forward, ahead of the attention matrix calculation, and set the Key as the dropout unit, yielding a novel dropout-before-softmax scheme. We theoretically verify that this scheme helps keep both the regularization and probability features of attention weights, alleviating the problem of overfitting to specific patterns and enhancing the model's ability to globally capture vital information. Second, how to schedule the drop ratio in consecutive layers? In contrast to exploiting a constant drop ratio for all layers, we present a new decreasing schedule that gradually decreases the drop ratio along the stack of self-attention layers. We experimentally validate that the proposed schedule can avoid overfitting to low-level features and loss of high-level semantics, thus improving the robustness and stability of model training. Third, do we need to perform structured dropout operations as in CNNs? We attempt a patch-based block version of the dropout operation and find that this useful trick for CNNs is not essential for ViTs. Given the exploration of the above three questions, we present the novel DropKey method that regards the Key as the drop unit and exploits a decreasing schedule for the drop ratio, improving ViTs in a general way. Comprehensive experiments demonstrate the effectiveness of DropKey for various ViT architectures, e.g., T2T, VOLO, CeiT and DeiT, as well as for various vision tasks, e.g., image classification, object detection, human-object interaction detection and human body shape recovery.
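The dropout-before-softmax scheme is concrete enough to sketch in a single head. Below, columns of the score matrix (one per Key) are masked to -inf before the softmax, so the surviving attention weights still sum to 1 — the property the scheme preserves relative to post-softmax dropout. The function name, per-Key masking granularity, and shapes are illustrative assumptions:

```python
import numpy as np

def dropkey_attention(q, k, v, drop_ratio=0.3, rng=None, train=True):
    """Single-head attention with DropKey-style key dropout.

    Keys are dropped *before* the softmax by masking their score
    column to -inf; the remaining weights renormalize automatically.
    """
    rng = np.random.default_rng(rng)
    scores = q @ k.T / np.sqrt(k.shape[-1])
    if train and drop_ratio > 0:
        keep = rng.uniform(size=k.shape[0]) >= drop_ratio   # per-Key mask
        scores = np.where(keep[None, :], scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

r = np.random.default_rng(1)
q, k, v = r.normal(size=(4, 8)), r.normal(size=(4, 8)), r.normal(size=(4, 8))
out = dropkey_attention(q, k, v, drop_ratio=0.5, rng=0)
print(out.shape)   # (4, 8)
```

With `train=False` the function reduces to plain scaled dot-product attention; a production version would also need a safeguard against dropping every key in a row.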

Meta Architecture for Point Cloud Analysis
Lin, Haojia and Zheng, Xiawu and Li, Lijiang and Chao, Fei and Wang, Shanshan and Wang, Yan and Tian, Yonghong and Ji, Rongrong



Research question: This paper addresses the lack of a unified interpretation framework in 3D point cloud analysis, which would enable systematic comparison, contrast, and analysis.
Motivation: Network architectures for 3D point cloud analysis are diverse, but without a unified interpretation framework, systematic comparison, contrast, and analysis are challenging, limiting the healthy development of the field.
Method: This paper proposes a unified interpretation framework called PointMeta, to which popular 3D point cloud analysis methods fit. The framework allows fair comparisons and quick experimental verification of any empirical observation or assumption drawn from them. PointMeta also lets us think across components and revisit common beliefs and key design decisions of popular methods.
Results: Based on the learnings from the two preceding analyses, simple tweaks to existing methods yield a basic building block, PointMetaBase. Extensive experiments show strong efficiency and effectiveness, surpassing previous state-of-the-art methods on challenging benchmarks. In particular, on the S3DIS dataset, PointMetaBase exceeds the previous state of the art by 0.7%/1.4%/2.1% mIoU with only 2%/11%/13% of the computation cost.

Recent advances in 3D point cloud analysis bring a diverse set of network architectures to the field. However, the lack of a unified framework to interpret those networks makes any systematic comparison, contrast, or analysis challenging, and practically limits healthy development of the field. In this paper, we take the initiative to explore and propose a unified framework called PointMeta, to which the popular 3D point cloud analysis approaches could fit. This brings three benefits. First, it allows us to compare different approaches in a fair manner, and use quick experiments to verify any empirical observations or assumptions summarized from the comparison. Second, the big picture brought by PointMeta enables us to think across different components, and revisit common beliefs and key design decisions made by the popular approaches. Third, based on the learnings from the previous two analyses, by doing simple tweaks on the existing approaches, we are able to derive a basic building block, termed PointMetaBase. It shows very strong performance in efficiency and effectiveness through extensive experiments on challenging benchmarks, and thus verifies the necessity and benefits of high-level interpretation, contrast, and comparison like PointMeta. In particular, PointMetaBase surpasses the previous state-of-the-art method by 0.7%/1.4%/2.1% mIoU with only 2%/11%/13% of the computation cost on the S3DIS dataset. Codes are available in the supplementary materials.

FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer
Liu, Zhijian and Yang, Xinyu and Tang, Haotian and Yang, Shang and Han, Song



Research question: How to improve the efficiency of 3D point cloud transformers for resource-constrained, latency-sensitive applications.
Motivation: Existing 3D point cloud transformers reach top accuracy, but their latency is 3x that of sparse convolutional models, hindering their use in resource-constrained, latency-sensitive applications such as autonomous driving.
Method: This paper presents FlatFormer, which closes the latency gap by trading spatial proximity for better computational regularity. We first flatten the point cloud with window-based sorting and partition points into groups of equal size rather than windows of equal shape. We then apply self-attention within groups to extract local features, alternate the sorting axis to gather features from different directions, and shift windows to exchange features across groups.
Results: FlatFormer achieves state-of-the-art accuracy on the Waymo Open Dataset, with 4.6x speedup over the transformer-based SST and 1.4x speedup over the sparse convolutional CenterPoint. It is the first point cloud transformer to achieve real-time performance on edge GPUs, running faster than sparse convolutional methods while achieving on-par or even superior accuracy on large-scale benchmarks.

Transformer, as an alternative to CNN, has been proven effective in many modalities (e.g., texts and images). For 3D point cloud transformers, existing efforts focus primarily on pushing their accuracy to the state-of-the-art level. However, their latency lags behind sparse convolution-based models (3x slower), hindering their usage in resource-constrained, latency-sensitive applications (such as autonomous driving). This inefficiency comes from point clouds' sparse and irregular nature, whereas transformers are designed for dense, regular workloads. This paper presents FlatFormer to close this latency gap by trading spatial proximity for better computational regularity. We first flatten the point cloud with window-based sorting and partition points into groups of equal sizes rather than windows of equal shapes. This effectively avoids expensive structuring and padding overheads. We then apply self-attention within groups to extract local features, alternate sorting axis to gather features from different directions, and shift windows to exchange features across groups. FlatFormer delivers state-of-the-art accuracy on Waymo Open Dataset with 4.6x speedup over (transformer-based) SST and 1.4x speedup over (sparse convolutional) CenterPoint. This is the first point cloud transformer that achieves real-time performance on edge GPUs and is faster than sparse convolutional methods while achieving on-par or even superior accuracy on large-scale benchmarks.
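The "sort, then split into equal-size groups" step that gives FlatFormer its regular workload can be sketched directly. This is a simplified illustration along a single sorting axis; the real method sorts within windows, alternates axes, and pads the last partial group, all omitted here:

```python
import numpy as np

def flatten_and_group(points, group_size, axis=0):
    """Sort points along one axis and split into equal-size groups.

    Equal point counts per group (rather than equal spatial windows)
    keep the per-group attention workload uniform; padding of the
    last partial group is skipped for brevity.
    """
    order = np.argsort(points[:, axis], kind="stable")
    sorted_pts = points[order]
    n_full = (len(points) // group_size) * group_size
    return sorted_pts[:n_full].reshape(-1, group_size, points.shape[1])

pts = np.random.default_rng(0).uniform(size=(10, 3))
groups = flatten_and_group(pts, group_size=4)
print(groups.shape)   # (2, 4, 3)
```

Each group now holds exactly `group_size` spatially adjacent (along the sort axis) points, so self-attention can run as a dense batched matmul with no ragged windows.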

Dynamic Graph Learning With Content-Guided Spatial-Frequency Relation Reasoning for Deepfake Detection
Wang, Yuan and Yu, Kun and Chen, Chen and Hu, Xiyuan and Peng, Silong



Research question: With the rise of face synthesis techniques, developing powerful face forgery detection methods has become a prominent problem.
Motivation: Out of security concerns, some existing methods combine auxiliary frequency-aware information with CNN backbones to discover forged clues. However, due to inadequate interaction with image content, the extracted frequency features are spatially irrelevant and struggle to generalize to increasingly realistic forgery types.
Method: To address this, we propose a Spatial-Frequency Dynamic Graph method that exploits relation-aware features in the spatial and frequency domains via dynamic graph learning, with three carefully designed components: 1) a Content-guided Adaptive Frequency Extraction module to mine content-adaptive forged frequency clues; 2) a Multiple Domains Attention Map Learning module to enrich spatial-frequency contextual features with multi-scale attention maps; 3) a Dynamic Graph Spatial-Frequency Feature Fusion Network to explore high-order relations of spatial and frequency features.
Results: Extensive experiments on several benchmark datasets show that our method consistently exceeds the state of the art by a considerable margin.

With the springing up of face synthesis techniques, there is a prominent need to develop powerful face forgery detection methods due to security concerns. Some existing methods attempt to employ auxiliary frequency-aware information combined with CNN backbones to discover the forged clues. Due to the inadequate information interaction with image content, the extracted frequency features are thus spatially irrelevant, struggling to generalize well on increasingly realistic counterfeit types. To address this issue, we propose a Spatial-Frequency Dynamic Graph method to exploit the relation-aware features in spatial and frequency domains via dynamic graph learning. To this end, we introduce three well-designed components: 1) a Content-guided Adaptive Frequency Extraction module to mine the content-adaptive forged frequency clues; 2) a Multiple Domains Attention Map Learning module to enrich the spatial-frequency contextual features with multiscale attention maps; 3) a Dynamic Graph Spatial-Frequency Feature Fusion Network to explore the high-order relations of spatial and frequency features. Extensive experiments on several benchmarks show that our proposed method consistently exceeds the state of the art by a considerable margin.

Learning Anchor Transformations for 3D Garment Animation
Zhao, Fang and Li, Zekun and Huang, Shaoli and Weng, Junwu and Zhou, Tianfei and Xie, Guo-Sen and Wang, Jue and Shan, Ying



Research question: This paper proposes an anchor-based deformation model to predict 3D garment animation from a body motion sequence.
Motivation: Existing 3D garment animation prediction methods perform poorly on loose-fitting garments, so a new approach is needed to improve prediction accuracy and stability.
Method: We propose AnchorDEF, an anchor-based deformation model that deforms a garment mesh template by a mixture of rigid transformations with extra nonlinear displacements. A set of anchors around the mesh surface guides the learning of the rigid transformation matrices. Once the anchor transformations are found, per-vertex nonlinear displacements of the garment template can be regressed in a canonical space, reducing the difficulty of deformation-space learning.
Results: Qualitative and quantitative experiments on different types of garments show that AnchorDEF achieves state-of-the-art performance, particularly in predicting motion deformations of loose-fitting garments.

This paper proposes an anchor-based deformation model, namely AnchorDEF, to predict 3D garment animation from a body motion sequence. It deforms a garment mesh template by a mixture of rigid transformations with extra nonlinear displacements. A set of anchors around the mesh surface is introduced to guide the learning of rigid transformation matrices. Once the anchor transformations are found, per-vertex nonlinear displacements of the garment template can be regressed in a canonical space, which reduces the complexity of deformation space learning. By explicitly constraining the transformed anchors to satisfy the consistencies of position, normal and direction, the physical meaning of learned anchor transformations in space is guaranteed for better generalization. Furthermore, an adaptive anchor updating is proposed to optimize the anchor position by being aware of local mesh topology for learning representative anchor transformations. Qualitative and quantitative experiments on different types of garments demonstrate that AnchorDEF achieves the state-of-the-art performance on 3D garment deformation prediction in motion, especially for loose-fitting garments.

Tree Instance Segmentation With Temporal Contour Graph
Firoze, Adnan and Wingren, Cameron and Yeh, Raymond A. and Benes, Bedrich and Aliaga, Daniel



Research question: How to perform instance segmentation and counting for densely packed, self-similar trees.
Motivation: Existing methods cannot effectively handle tightly packed trees, so a new method is needed to improve segmentation and counting accuracy.
Method: Using a top-view RGB image sequence, first perform an initial over-segmentation of the sequence and aggregate structural characteristics into a contour graph with temporal information incorporated. Then, using a graph convolutional network and its inherent local message-passing abilities, merge adjacent tree-crown patches into the final set of tree crowns.
Results: The method is superior to all prior methods, achieving high-accuracy instance segmentation and counting even for tightly packed trees. We also provide forest image sequence datasets captured at different altitudes and leaf conditions, suitable for subsequent benchmarking and evaluation.

We present a novel approach to perform instance segmentation, and counting, for densely packed self-similar trees using a top-view RGB image sequence. We propose a solution that leverages pixel content, shape, and self-occlusion. First, we perform an initial over-segmentation of the image sequence and aggregate structural characteristics into a contour graph with temporal information incorporated. Second, using a graph convolutional network and its inherent local message-passing abilities, we merge adjacent tree crown patches into a final set of tree crowns. Per various studies and comparisons, our method is superior to all prior methods and results in high-accuracy instance segmentation and counting, despite the trees being tightly packed. Finally, we provide various forest image sequence datasets suitable for subsequent benchmarking and evaluation captured at different altitudes and leaf conditions.

Grad-PU: Arbitrary-Scale Point Cloud Upsampling via Gradient Descent With Learned Distance Functions
He, Yun and Tang, Danhang and Zhang, Yinda and Xue, Xiangyang and Fu, Yanwei



Research question: Existing point cloud upsampling methods suffer from two key issues: fixed upsampling rates, and outliers or shrinkage artifacts caused by the difficulty of precisely predicting 3D coordinates.
Motivation: To address these issues, we propose a new framework for accurate point cloud upsampling that supports arbitrary upsampling rates.
Method: Our method first interpolates the low-resolution point cloud according to a given upsampling rate, then refines the positions of the interpolated points with an iterative optimization process guided by a trained model that estimates the difference between the current point cloud and the high-resolution target.
Results: Extensive quantitative and qualitative results on benchmarks and downstream tasks show that our method achieves state-of-the-art accuracy and efficiency.

Most existing point cloud upsampling methods have roughly three steps: feature extraction, feature expansion and 3D coordinate prediction. However, they usually suffer from two critical issues: (1) a fixed upsampling rate after one-time training, since the feature expansion unit is customized for each upsampling rate; (2) outliers or shrinkage artifacts caused by the difficulty of precisely predicting 3D coordinates or residuals of upsampled points. To address them, we propose a new framework for accurate point cloud upsampling that supports arbitrary upsampling rates. Our method first interpolates the low-res point cloud according to a given upsampling rate. It then refines the positions of the interpolated points with an iterative optimization process, guided by a trained model estimating the difference between the current point cloud and the high-res target. Extensive quantitative and qualitative results on benchmarks and downstream tasks demonstrate that our method achieves state-of-the-art accuracy and efficiency.
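The interpolate-then-refine recipe above can be sketched end to end. Here `distance_grad` is a hypothetical stand-in for the trained model: given points, it returns the gradient of their distance to the target surface (a toy unit circle in 2D below); the midpoint interpolation and all names are illustrative assumptions, not Grad-PU's exact design:

```python
import numpy as np

def upsample_and_refine(points, ratio, distance_grad, steps=50, lr=0.05, rng=None):
    """Arbitrary-ratio upsampling: interpolate, then refine by gradient descent.

    Stage 1 densifies by midpoint interpolation until the requested
    ratio is met; stage 2 iteratively moves points down the gradient
    of the (learned, here hand-made) distance to the target.
    """
    rng = np.random.default_rng(rng)
    n_new = int(len(points) * ratio) - len(points)
    i, j = rng.integers(len(points), size=(2, n_new))
    dense = np.vstack([points, (points[i] + points[j]) / 2])
    for _ in range(steps):                 # iterative position refinement
        dense = dense - lr * distance_grad(dense)
    return dense

def circle_grad(p):
    """Gradient of 0.5 * (|p| - 1)^2: pushes points onto the unit circle."""
    r = np.linalg.norm(p, axis=1, keepdims=True)
    return (r - 1.0) * p / np.maximum(r, 1e-9)

pts = np.random.default_rng(0).normal(size=(8, 2))
dense = upsample_and_refine(pts, ratio=2.5, distance_grad=circle_grad, rng=0)
print(dense.shape)   # (20, 2)
```

Because the rate only affects how many midpoints are drawn, any (even fractional) ratio works without retraining, which is the framework's central claim.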

InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions
Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and Wang, Xiaogang and Qiao, Yu



Research question: How to build large-scale vision foundation models based on convolutional neural networks (CNNs) that, like ViTs, gain from increasing parameters and training data.
Motivation: While large-scale vision transformers (ViTs) have made great progress in recent years, large-scale CNN-based models are still in an early state.
Method: We present InternImage, a new large-scale CNN-based foundation model that takes deformable convolution as its core operator. This gives the model both the large effective receptive field required by downstream tasks such as detection and segmentation and adaptive spatial aggregation conditioned on input and task information, reducing the strict inductive bias of traditional CNNs and enabling stronger, more robust patterns to be learned from massive data.
Results: The model's effectiveness is proven on challenging benchmarks including ImageNet, COCO, and ADE20K; InternImage-H achieves a new record of 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs.

Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs. Different from the recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also has the adaptive spatial aggregation conditioned by input and task information. As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs. The effectiveness of our model is proven on challenging benchmarks including ImageNet, COCO, and ADE20K. It is worth mentioning that InternImage-H achieved a new record 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs.

MED-VT: Multiscale Encoder-Decoder Video Transformer With Application To Object Segmentation
Karim, Rezaul and Zhao, He and Wildes, Richard P. and Siam, Mennatullah



Research question: This paper explores a unified multiscale encoder-decoder transformer focused on dense prediction tasks in videos.
Motivation: To date, multiscale processing has been confined to the encoder or decoder alone; this work proposes a unified multiscale encoder-decoder transformer that can both extract spatiotemporal features and perform high-level semantic detection.
Method: Multiscale representation at both the encoder and decoder enables implicit extraction of spatiotemporal features as well as temporal consistency at encoding and decoding. In addition, a transductive learning scheme with many-to-many label propagation provides temporally consistent predictions.
Results: On automatic video object segmentation and actor/action segmentation, the model outperforms state-of-the-art methods on multiple benchmarks without using optical flow.

Multiscale video transformers have been explored in a wide variety of vision tasks. To date, however, the multiscale processing has been confined to the encoder or decoder alone. We present a unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in videos. Multiscale representation at both encoder and decoder yields key benefits of implicit extraction of spatiotemporal features (i.e. without reliance on input optical flow) as well as temporal consistency at encoding and coarse-to-fine detection for high-level (e.g. object) semantics to guide precise localization at decoding. Moreover, we propose a transductive learning scheme through many-to-many label propagation to provide temporally consistent predictions. We showcase our Multiscale Encoder-Decoder Video Transformer (MED-VT) on Automatic Video Object Segmentation (AVOS) and actor/action segmentation, where we outperform state-of-the-art approaches on multiple benchmarks using only raw images, without using optical flow.

Spatially Adaptive Self-Supervised Learning for Real-World Image Denoising
Li, Junyi and Zhang, Zhilu and Liu, Xiaoyu and Feng, Chaoyu and Wang, Xiaotao and Lei, Lei and Zuo, Wangmeng



Research question: Existing image denoising methods mainly target spatially independent noise and are of little practical use on real-world sRGB images with spatially correlated noise.
Motivation: To address this, this paper proposes a new perspective: seeking spatially adaptive supervision for denoising real-world sRGB images.
Method: Specifically, we account for the respective characteristics of flat and textured regions in noisy images and construct supervision for them separately. For flat regions, supervision can be safely derived from non-adjacent pixels, excluding the influence of noise-correlated ones; we extend the blind-spot network to a blind-neighborhood network (BNN) to provide supervision for flat regions. For textured regions, supervision must be closely related to the content of adjacent pixels; we present a locally aware network (LAN) to meet this requirement, with LAN itself selectively supervised by the output of BNN. Combining the two supervisions, a denoising network (e.g., U-Net) can be well trained.
Results: Extensive experiments show that our method outperforms state-of-the-art self-supervised denoising methods on real-world sRGB photographs.

Significant progress has been made in self-supervised image denoising (SSID) in the recent few years. However, most methods focus on dealing with spatially independent noise, and they have little practicality on real-world sRGB images with spatially correlated noise. Although pixel-shuffle downsampling has been suggested for breaking the noise correlation, it breaks the original information of images, which limits the denoising performance. In this paper, we propose a novel perspective to solve this problem, i.e., seeking for spatially adaptive supervision for real-world sRGB image denoising. Specifically, we take into account the respective characteristics of flat and textured regions in noisy images, and construct supervisions for them separately. For flat areas, the supervision can be safely derived from non-adjacent pixels, which are much far from the current pixel for excluding the influence of the noise-correlated ones. And we extend the blind-spot network to a blind-neighborhood network (BNN) for providing supervision on flat areas. For textured regions, the supervision has to be closely related to the content of adjacent pixels. And we present a locally aware network (LAN) to meet the requirement, while LAN itself is selectively supervised with the output of BNN. Combining these two supervisions, a denoising network (e.g., U-Net) can be well-trained. Extensive experiments show that our method performs favorably against state-of-the-art SSID methods on real-world sRGB photographs. The code is available at https://github.com/nagejacob/SpatiallyAdaptiveSSID.
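As a toy illustration of the flat-region supervision idea, the sketch below builds a "blind-neighborhood" target by averaging only pixels whose Chebyshev distance from the anchor exceeds 1; the `inner`/`outer` radii and the reflect padding are illustrative choices, not the paper's BNN architecture:

```python
import numpy as np

def blind_neighborhood_target(img, inner=1, outer=2):
    """Average, for every pixel, the pixels in the square ring with
    Chebyshev distance in (inner, outer] -- i.e. excluding the pixel and
    its immediate, noise-correlated neighbors. Borders use reflect padding."""
    h, w = img.shape
    pad = np.pad(img, outer, mode="reflect")
    acc = np.zeros((h, w), dtype=np.float64)
    cnt = 0
    for dy in range(-outer, outer + 1):
        for dx in range(-outer, outer + 1):
            if max(abs(dy), abs(dx)) <= inner:
                continue  # skip the "blind" 3x3 neighborhood
            acc += pad[outer + dy:outer + dy + h, outer + dx:outer + dx + w]
            cnt += 1
    return acc / cnt

rng = np.random.default_rng(0)
flat = 0.5 + 0.01 * rng.standard_normal((8, 8))
target = blind_neighborhood_target(flat)
print(target.shape)  # (8, 8); on flat regions the target approximates the clean value
```

On a flat noisy patch the ring average is close to the clean intensity, which is why non-adjacent pixels are a safe supervision source there but not in textured regions.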

Binarizing Sparse Convolutional Networks for Efficient Point Cloud Analysis
Xu, XiuweiandWang, ZiweiandZhou, JieandLu, Jiwen



Research question: How to perform point cloud analysis efficiently.
Motivation: Existing sparse convolution operations incur large quantization errors when binarized, causing performance degradation.
Method: A binary sparse convolutional network named BSC-Net is proposed, which searches for the optimal positions at which to activate sparse convolution so as to alleviate quantization errors without additional computational complexity.
Results: Experiments on ScanNet and NYU Depth v2 show that BSC-Net achieves significant improvements in efficient point cloud analysis and outperforms state-of-the-art network binarization methods.

In this paper, we propose binary sparse convolutional networks called BSC-Net for efficient point cloud analysis. We empirically observe that sparse convolution operation causes larger quantization errors than standard convolution. However, conventional network quantization methods directly binarize the weights and activations in sparse convolution, resulting in performance drop due to the significant quantization loss. On the contrary, we search the optimal subset of convolution operation that activates the sparse convolution at various locations for quantization error alleviation, and the performance gap between real-valued and binary sparse convolutional networks is closed without complexity overhead. Specifically, we first present the shifted sparse convolution that fuses the information in the receptive field for the active sites that match the pre-defined positions. Then we employ the differentiable search strategies to discover the optimal positions for active site matching in the shifted sparse convolution, and the quantization errors are significantly alleviated for efficient point cloud analysis. For fair evaluation of the proposed method, we empirically select the recent advances that are beneficial for sparse convolution network binarization to construct a strong baseline. The experimental results on ScanNet and NYU Depth v2 show that our BSC-Net achieves significant improvement upon our strong baseline and outperforms the state-of-the-art network binarization methods by a remarkable margin without additional computation overhead for binarizing sparse convolutional networks.

Quantum-Inspired Spectral-Spatial Pyramid Network for Hyperspectral Image Classification
Zhang, JieandZhang, YongshanandZhou, Yicong



Research question: This paper designs a new quantum-theory-inspired deep learning model for hyperspectral image (HSI) feature extraction and classification.
Motivation: Existing deep learning models for HSIs follow a traditional learning paradigm. Although quantum computers, as emerging machines, remain limited in the noisy intermediate-scale quantum (NISQ) era, quantum theory offers a new paradigm for designing deep learning models.
Method: Motivated by the quantum circuit (QC) model, a quantum-inspired spectral-spatial network (QSSN) is proposed for HSI feature extraction. The network consists of a phase-prediction module (PPM) and a measurement-like fusion module (MFM), both inspired by quantum theory, to dynamically fuse spectral and spatial information. Specifically, QSSN represents an HSI cuboid with a quantum representation and extracts joint spectral-spatial features with the MFM.
Results: Using QSSN as the building block, an end-to-end quantum-inspired spectral-spatial pyramid network (QSSPN) is proposed for HSI feature extraction and classification. In this pyramid framework, QSSPN progressively learns feature representations by cascading QSSN blocks and classifies with a softmax classifier. This is the first attempt to introduce quantum theory into HSI processing model design. Substantial experiments on three HSI datasets verify the superiority of the proposed QSSPN framework over state-of-the-art methods.

Hyperspectral image (HSI) classification aims at assigning a unique label for every pixel to identify categories of different land covers. Existing deep learning models for HSIs are usually performed in a traditional learning paradigm. Being emerging machines, quantum computers are limited in the noisy intermediate-scale quantum (NISQ) era. The quantum theory offers a new paradigm for designing deep learning models. Motivated by the quantum circuit (QC) model, we propose a quantum-inspired spectral-spatial network (QSSN) for HSI feature extraction. The proposed QSSN consists of a phase-prediction module (PPM) and a measurement-like fusion module (MFM) inspired from quantum theory to dynamically fuse spectral and spatial information. Specifically, QSSN uses a quantum representation to represent an HSI cuboid and extracts joint spectral-spatial features using MFM. An HSI cuboid and its phases predicted by PPM are used in the quantum representation. Using QSSN as the building block, we propose an end-to-end quantum-inspired spectral-spatial pyramid network (QSSPN) for HSI feature extraction and classification. In this pyramid framework, QSSPN progressively learns feature representations by cascading QSSN blocks and performs classification with a softmax classifier. It is the first attempt to introduce quantum theory in HSI processing model design. Substantial experiments are conducted on three HSI datasets to verify the superiority of the proposed QSSPN framework over the state-of-the-art methods.

DETRs With Hybrid Matching
Jia, DingandYuan, YuhuiandHe, HaodiandWu, XiaopeiandYu, HaojunandLin, WeihongandSun, LeiandZhang, ChaoandHu, Han



Research question: This paper addresses the problem that object detection requires a hand-crafted non-maximum suppression (NMS) step to remove duplicate detections.
Motivation: In existing DETR training, few queries are assigned as positive samples, and one-to-one matching significantly reduces the training efficacy of positive samples.
Method: A simple yet effective hybrid matching scheme is proposed that combines the original one-to-one matching branch with an auxiliary one-to-many matching branch during training.
Results: Experiments show this hybrid strategy significantly improves accuracy. At inference, only the original one-to-one branch is used, preserving DETR's end-to-end merit and inference efficiency. The method, named H-DETR, consistently improves a range of representative DETR methods across various vision tasks.

One-to-one set matching is a key design for DETR to establish its end-to-end capability, so that object detection does not require a hand-crafted NMS (non-maximum suppression) to remove duplicate detections. This end-to-end signature is important for the versatility of DETR, and it has been generalized to broader vision tasks. However, we note that there are few queries assigned as positive samples and the one-to-one set matching significantly reduces the training efficacy of positive samples. We propose a simple yet effective method based on a hybrid matching scheme that combines the original one-to-one matching branch with an auxiliary one-to-many matching branch during training. Our hybrid strategy has been shown to significantly improve accuracy. In inference, only the original one-to-one match branch is used, thus maintaining the end-to-end merit and the same inference efficiency of DETR. The method is named H-DETR, and it shows that a wide range of representative DETR methods can be consistently improved across a wide range of visual tasks, including Deformable-DETR, PETRv2, PETR, and TransTrack, among others.
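The hybrid scheme can be sketched on a toy cost matrix. The brute-force one-to-one assignment below stands in for the Hungarian matcher DETR actually uses, and taking `k` cheapest queries per ground truth is a hypothetical choice for the one-to-many branch:

```python
import itertools
import numpy as np

def hybrid_match(cost, k=2):
    """cost: (num_queries, num_gt) matching cost.
    Branch 1: optimal one-to-one assignment (brute force here; DETR uses
    the Hungarian algorithm). Branch 2: auxiliary one-to-many matching,
    each ground truth grabbing its k cheapest queries as extra positives
    (training-only)."""
    nq, ng = cost.shape
    best, best_cost = None, float("inf")
    for qs in itertools.permutations(range(nq), ng):
        c = sum(cost[q, g] for g, q in enumerate(qs))
        if c < best_cost:
            best, best_cost = qs, c
    one_to_one = [(q, g) for g, q in enumerate(best)]
    one_to_many = [(int(q), g) for g in range(ng)
                   for q in np.argsort(cost[:, g])[:k]]
    return one_to_one, one_to_many

cost = np.array([[0.1, 0.9],
                 [0.8, 0.2],
                 [0.3, 0.4],
                 [0.7, 0.6]])
o2o, o2m = hybrid_match(cost)
print(o2o)       # [(0, 0), (1, 1)]
print(len(o2m))  # 4 positive pairs instead of 2
```

The point of the auxiliary branch is visible in the counts: with four queries and two ground truths, one-to-one matching yields two positives, while the one-to-many branch doubles the positive supervision during training; only the one-to-one branch survives at inference.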

A Rotation-Translation-Decoupled Solution for Robust and Efficient Visual-Inertial Initialization
He, YijiaandXu, BoandOuyang, ZhanpengandLi, Hongdong



Research question: A new visual-inertial odometry (VIO) initialization method is proposed that decouples rotation and translation estimation to achieve higher efficiency and better robustness.
Motivation: Existing loosely coupled VIO initialization methods suffer from poor stability of visual structure-from-motion (SfM), while tightly coupled ones often ignore the gyroscope bias in the closed-form solution, limiting accuracy. Moreover, both classes of methods are computationally expensive because 3D point clouds must be reconstructed simultaneously.
Method: The new method fully combines inertial and visual measurements for both rotational and translational initialization. First, a rotation-only solution is designed for gyroscope bias estimation, tightly coupling the gyroscope and camera observations. Second, the initial velocity and gravity vector are solved with linear translation constraints in a globally optimal fashion, without reconstructing 3D point clouds.
Results: Extensive experiments show the method is 8 to 72 times faster (on a 10-frame set) than state-of-the-art methods, with significantly higher robustness and accuracy. The source code is available at https://github.com/boxuLibrary/drt-vio-init.

We propose a novel visual-inertial odometry (VIO) initialization method, which decouples rotation and translation estimation, and achieves higher efficiency and better robustness. Existing loosely-coupled VIO-initialization methods suffer from poor stability of visual structure-from-motion (SfM), whereas those tightly-coupled methods often ignore the gyroscope bias in the closed-form solution, resulting in limited accuracy. Moreover, the aforementioned two classes of methods are computationally expensive, because 3D point clouds need to be reconstructed simultaneously. In contrast, our new method fully combines inertial and visual measurements for both rotational and translational initialization. First, a rotation-only solution is designed for gyroscope bias estimation, which tightly couples the gyroscope and camera observations. Second, the initial velocity and gravity vector are solved with linear translation constraints in a globally optimal fashion and without reconstructing 3D point clouds. Extensive experiments have demonstrated that our method is 8~72 times faster (w.r.t. a 10-frame set) than the state-of-the-art methods, and also presents significantly higher robustness and accuracy. The source code is available at https://github.com/boxuLibrary/drt-vio-init.

Multi-Modal Gait Recognition via Effective Spatial-Temporal Feature Fusion
Cui, YufengandKang, Yimei



Research question: How to obtain a more robust and comprehensive gait representation by fusing and aggregating the spatial-temporal information of skeletons and silhouettes.
Motivation: Existing silhouette-based gait recognition methods are affected by clothing occlusion, and skeleton-based methods lack body-shape information.
Method: A transformer-based gait recognition framework, MMGaitFormer, is proposed, comprising a Spatial Fusion Module (SFM) and a Temporal Fusion Module (TFM) to effectively fuse and aggregate the spatial-temporal information of the two modalities.
Results: Experiments show MMGaitFormer achieves state-of-the-art performance on popular gait datasets. On the most challenging "CL" condition, the method reaches 94.8% rank-1 accuracy, outperforming state-of-the-art single-modal methods by a large margin.

Gait recognition is a biometric technology that identifies people by their walking patterns. The silhouettes-based method and the skeletons-based method are the two most popular approaches. However, the silhouette data are easily affected by clothing occlusion, and the skeleton data lack body shape information. To obtain a more robust and comprehensive gait representation for recognition, we propose a transformer-based gait recognition framework called MMGaitFormer, which effectively fuses and aggregates the spatial-temporal information from the skeletons and silhouettes. Specifically, a Spatial Fusion Module (SFM) and a Temporal Fusion Module (TFM) are proposed for effective spatial-level and temporal-level feature fusion, respectively. The SFM performs fine-grained body parts spatial fusion and guides the alignment of each part of the silhouette and each joint of the skeleton through the attention mechanism. The TFM performs temporal modeling through Cycle Position Embedding (CPE) and fuses temporal information of two modalities. Experiments demonstrate that our MMGaitFormer achieves state-of-the-art performance on popular gait datasets. For the most challenging "CL" (i.e., walking in different clothes) condition in CASIA-B, our method achieves a rank-1 accuracy of 94.8%, which outperforms the state-of-the-art single-modal methods by a large margin.

Cascaded Local Implicit Transformer for Arbitrary-Scale Super-Resolution
Chen, Hao-WeiandXu, Yu-SyuanandHong, Min-FongandTsai, Yi-MinandKuo, Hsien-KaiandLee, Chun-Yi



Research question: How to represent images at arbitrary resolutions with a local implicit image function that incorporates attention mechanisms and frequency encoding.
Motivation: Implicit neural representations have recently shown strong potential for representing images at arbitrary resolutions.
Method: A Local Implicit Transformer (LIT) is proposed that integrates an attention mechanism and frequency encoding into the local implicit image function, with a cross-scale local attention block to effectively aggregate local features and a local frequency encoding block that combines positional encoding with Fourier-domain information to construct high-resolution images. To further improve representative power, a Cascaded LIT (CLIT) is proposed that exploits multi-scale features together with a cumulative training strategy that gradually increases the upsampling factor during training.
Results: Extensive experiments validate the effectiveness of these components and analyze variants of the training strategy. Qualitative and quantitative results show that LIT and CLIT achieve favorable results on arbitrary-scale super-resolution tasks and outperform previous works.

Implicit neural representations have recently demonstrated promising ability in representing images with arbitrary resolutions. In this paper, we present Local Implicit Transformer (LIT) that integrates attention mechanism and frequency encoding technique into local implicit image function. We design a cross-scale local attention block to effectively aggregate local features and a local frequency encoding block to combine positional encoding with Fourier domain information for constructing high-resolution (HR) images. To further improve representative power, we propose Cascaded LIT (CLIT) exploiting multi-scale features along with a cumulative training strategy that gradually increases the upsampling factors for training. We have performed extensive experiments to validate the effectiveness of these components and analyze the variants of the training strategy. The qualitative and quantitative results demonstrate that LIT and CLIT achieve favorable results and outperform the previous works within arbitrary super-resolution tasks.
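A minimal sketch of the frequency-encoding ingredient, assuming generic log-spaced Fourier features rather than LIT's exact encoding:

```python
import numpy as np

def fourier_encode(coords, num_bands=4):
    """Map normalized coordinates in [-1, 1] to Fourier features, a common
    ingredient of local implicit image functions. The log-spaced
    frequencies here are a generic choice, not LIT's exact encoding."""
    coords = np.asarray(coords, dtype=np.float64)[..., None]   # (..., d, 1)
    freqs = (2.0 ** np.arange(num_bands)) * np.pi              # (num_bands,)
    angles = coords * freqs                                    # (..., d, num_bands)
    feat = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feat.reshape(*coords.shape[:-2], -1)

# one query point with a (x, y) offset relative to its nearest latent code
enc = fourier_encode(np.array([[0.25, -0.5]]))
print(enc.shape)  # (1, 16): 2 coords x 4 bands x (sin, cos)
```

Feeding such multi-frequency features of the query offset into the implicit function lets it resolve detail finer than the latent grid, which is what makes querying at arbitrary output resolutions possible.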

Transformer Scale Gate for Semantic Segmentation
Shi, HengcanandHayat, MunawarandCai, Jianfei



Research question: How to effectively encode multi-scale contextual information to improve semantic segmentation accuracy.
Motivation: Existing transformer-based segmentation models combine features across scales without any selection, so features at sub-optimal scales may degrade segmentation results.
Method: A simple yet effective module, the Transformer Scale Gate (TSG), is proposed to optimally combine multi-scale features. TSG exploits cues from self- and cross-attention in Vision Transformers for scale selection. It is a highly flexible plug-and-play module that can easily be incorporated into any encoder-decoder-based hierarchical vision transformer architecture.
Results: Extensive experiments on the Pascal Context, ADE20K, and Cityscapes datasets show that the feature selection strategy achieves consistent gains.

Effectively encoding multi-scale contextual information is crucial for accurate semantic segmentation. Most of the existing transformer-based segmentation models combine features across scales without any selection, where features on sub-optimal scales may degrade segmentation outcomes. Leveraging from the inherent properties of Vision Transformers, we propose a simple yet effective module, Transformer Scale Gate (TSG), to optimally combine multi-scale features. TSG exploits cues in self and cross attentions in Vision Transformers for the scale selection. TSG is a highly flexible plug-and-play module, and can easily be incorporated with any encoder-decoder-based hierarchical vision Transformer architecture. Extensive experiments on the Pascal Context, ADE20K and Cityscapes datasets demonstrate that our feature selection strategy achieves consistent gains.
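At its core, the gating idea reduces to a per-token softmax over scales. A minimal sketch follows; in TSG the gate logits would be derived from self-/cross-attention cues, whereas here they are simply given as input:

```python
import numpy as np

def scale_gate(features, gate_logits):
    """features: list of S arrays, each (num_tokens, dim), already resampled
    to a common token grid; gate_logits: (num_tokens, S) per-token scale
    scores. Returns the softmax-weighted combination across scales."""
    stacked = np.stack(features, axis=-1)            # (tokens, dim, S)
    logits = gate_logits - gate_logits.max(-1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(-1, keepdims=True)                    # (tokens, S) gate weights
    return (stacked * w[:, None, :]).sum(-1)         # (tokens, dim)

f1 = np.ones((3, 2))            # "fine-scale" features
f2 = np.full((3, 2), 3.0)       # "coarse-scale" features
logits = np.array([[10.0, -10.0], [-10.0, 10.0], [0.0, 0.0]])
out = scale_gate([f1, f2], logits)
print(out[:, 0])  # ~[1, 3, 2]: per-token selection between scales
```

Unlike unselective summation, each token can commit to whichever scale suits it (or blend them), which is the behavior the module's gains come from.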

PMatch: Paired Masked Image Modeling for Dense Geometric Matching
Zhu, ShengjieandLiu, Xiaoming



Research question: How to pretrain the cross-frame modules used in dense geometric matching, which monocular pretraining tasks such as image classification and masked image modeling (MIM) cannot cover.
Motivation: Dense geometric matching determines pixel-wise correspondence between a source and a support image of the same 3D structure; prior works correlate two-frame features with transformer blocks that existing single-image pretraining leaves untrained, yielding less optimal performance.
Method: MIM is reformulated from reconstructing a single masked image to reconstructing a pair of masked images, enabling pretraining of the cross-frame transformer module; a decoder is also incorporated into pretraining for improved upsampling. To be robust in textureless areas, a novel cross-frame global matching module (CFGM) is proposed, with a homography loss to further regularize learning on planar surfaces.
Results: Combined together, these designs achieve state-of-the-art performance on geometric matching.

Dense geometric matching determines the dense pixel-wise correspondence between a source and support image corresponding to the same 3D structure. Prior works employ an encoder of transformer blocks to correlate the two-frame features. However, existing monocular pretraining tasks, e.g., image classification and masked image modeling (MIM), cannot pretrain the cross-frame module, yielding less optimal performance. To resolve this, we reformulate the MIM from reconstructing a single masked image to reconstructing a pair of masked images, enabling the pretraining of the transformer module. Additionally, we incorporate a decoder into pretraining for improved upsampling results. Further, to be robust to the textureless area, we propose a novel cross-frame global matching module (CFGM). Since the most textureless area is planar surfaces, we propose a homography loss to further regularize its learning. Combined together, we achieve the State-of-The-Art (SoTA) performance on geometric matching. Codes and models are available at https://github.com/ShngJZ/PMatch.

Teaching Matters: Investigating the Role of Supervision in Vision Transformers
Walmer, MatthewandSuri, SakshamandGupta, KamalandShrivastava, Abhinav



Research question: This study explores the behavior of Vision Transformers (ViTs) under different learning paradigms.
Motivation: ViTs have gained significant popularity and proliferated into many applications in recent years, but their behavior under different methods of supervision is not well explored.
Method: ViTs trained with different supervision methods are compared, analyzing differences in their attention, representations, and downstream performance.
Results: The study finds that ViTs are highly flexible and process local and global information in different orders depending on the training method. Features learned by contrastive self-supervised methods are competitive with explicitly supervised features and can even be superior for part-level tasks. In addition, the representations of reconstruction-based models show non-trivial similarity to those of contrastive self-supervised models.

Vision Transformers (ViTs) have gained significant popularity in recent years and have proliferated into many applications. However, their behavior under different learning paradigms is not well explored. We compare ViTs trained through different methods of supervision, and show that they learn a diverse range of behaviors in terms of their attention, representations, and downstream performance. We also discover ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads. These are self-attention heads that attend to a token adjacent to the current token with a fixed directional offset, a phenomenon that to the best of our knowledge has not been highlighted in any prior work. Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method. We find that contrastive self-supervised methods learn features that are competitive with explicitly supervised features, and they can even be superior for part-level tasks. We also find that the representations of reconstruction-based models show non-trivial similarity to contrastive self-supervised models.

Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers
Long, SifanandZhao, ZhenandPi, JiminandWang, ShengshengandWang, Jingdong



Research question: Vision transformers have achieved significant improvements on various vision tasks, but their quadratic token interactions greatly reduce computational efficiency.
Motivation: Existing pruning methods mainly focus on token importance to preserve locally attentive tokens, but completely ignore global token diversity.
Method: An efficient token decoupling and merging method is proposed that jointly considers token importance and diversity for token pruning. According to the class-token attention, attentive and inattentive tokens are decoupled; besides preserving the most discriminative local tokens, similar inattentive tokens are merged and homogeneous attentive tokens are matched to maximize token diversity.
Results: Despite its simplicity, the method obtains a promising trade-off between model complexity and classification accuracy. On DeiT-S, it reduces FLOPs by 35% with only a 0.2% accuracy drop. Notably, thanks to maintaining token diversity, the method can even improve the accuracy of DeiT-T by 0.1% after reducing its FLOPs by 40%.

Vision transformers have achieved significant improvements on various vision tasks but their quadratic interactions between tokens significantly reduce computational efficiency. Many pruning methods have been proposed to remove redundant tokens for efficient vision transformers recently. However, existing studies mainly focus on the token importance to preserve local attentive tokens but completely ignore the global token diversity. In this paper, we emphasize the cruciality of diverse global semantics and propose an efficient token decoupling and merging method that can jointly consider the token importance and diversity for token pruning. According to the class token attention, we decouple the attentive and inattentive tokens. In addition to preserving the most discriminative local tokens, we merge similar inattentive tokens and match homogeneous attentive tokens to maximize the token diversity. Despite its simplicity, our method obtains a promising trade-off between model complexity and classification accuracy. On DeiT-S, our method reduces the FLOPs by 35% with only a 0.2% accuracy drop. Notably, benefiting from maintaining the token diversity, our method can even improve the accuracy of DeiT-T by 0.1% after reducing its FLOPs by 40%.
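A simplified numpy sketch of decoupling by class-token attention and compressing the inattentive remainder; plain averaging here stands in for the paper's similarity-based merging and attentive-token matching:

```python
import numpy as np

def decouple_and_merge(tokens, cls_attn, keep=2):
    """tokens: (N, dim); cls_attn: (N,) class-token attention scores.
    Keep the `keep` most attentive tokens, and compress the inattentive
    remainder into one averaged token (a stand-in for similarity-based
    merging), so inattentive content is summarized rather than discarded."""
    order = np.argsort(cls_attn)[::-1]
    attentive = tokens[order[:keep]]
    rest = tokens[order[keep:]]
    merged = (rest.mean(axis=0, keepdims=True)
              if len(rest) else np.empty((0, tokens.shape[1])))
    return np.concatenate([attentive, merged], axis=0)

tokens = np.arange(10, dtype=np.float64).reshape(5, 2)
cls_attn = np.array([0.05, 0.4, 0.1, 0.35, 0.1])
out = decouple_and_merge(tokens, cls_attn)
print(out.shape)  # (3, 2): 2 attentive tokens + 1 merged token
```

The sequence length drops from 5 to 3 while the merged token keeps a summary of the pruned content, which is the mechanism behind the FLOPs/accuracy trade-off quoted above.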

AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation
Li, ZhenandZhu, Zuo-LiangandHan, Ling-HaoandHou, QibinandGuo, Chun-LeandCheng, Ming-Ming



Research question: This paper presents All-Pairs Multi-Field Transforms (AMT), a new network architecture for video frame interpolation.
Motivation: Existing video frame interpolation methods struggle with large motions and occluded regions, and the proposed convolution-based design competes favorably with transformer-based methods in both accuracy and efficiency.
Method: First, bidirectional correlation volumes are built for all pairs of pixels, and the predicted bilateral flows retrieve correlations to update both the flows and the interpolated content feature. Second, multiple groups of fine-grained flow fields are derived from one pair of updated coarse flows to backward-warp the input frames separately. Combining these two designs enables the model to generate task-oriented flows and reduces the difficulty of modeling large motions and handling occluded areas during frame interpolation.
Results: Experiments show the model achieves state-of-the-art performance on various benchmarks with high efficiency and competes favorably with transformer-based models in accuracy and efficiency.

We present All-Pairs Multi-Field Transforms (AMT), a new network architecture for video frame interpolation. It is based on two essential designs. First, we build bidirectional correlation volumes for all pairs of pixels and use the predicted bilateral flows to retrieve correlations for updating both flows and the interpolated content feature. Second, we derive multiple groups of fine-grained flow fields from one pair of updated coarse flows for performing backward warping on the input frames separately. Combining these two designs enables us to generate promising task-oriented flows and reduce the difficulties in modeling large motions and handling occluded areas during frame interpolation. These qualities promote our model to achieve state-of-the-art performance on various benchmarks with high efficiency. Moreover, our convolution-based model competes favorably compared to Transformer-based models in terms of accuracy and efficiency. Our code is available at https://github.com/MCG-NKU/AMT.
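Backward warping with a flow field, the operation applied to the input frames above, can be sketched with nearest-neighbor sampling; real interpolators use bilinear sampling, and the border clamping is an illustrative choice:

```python
import numpy as np

def backward_warp(img, flow):
    """Backward-warp img (H, W) with a per-pixel flow (H, W, 2): each output
    pixel samples img at (y + flow_y, x + flow_x). Nearest-neighbor for
    brevity; out-of-range samples are clamped to the border."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    sy = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, h - 1)
    sx = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, w - 1)
    return img[sy, sx]

img = np.arange(9, dtype=np.float64).reshape(3, 3)
flow = np.zeros((3, 3, 2))
flow[..., 1] = 1.0  # every output pixel samples one pixel to its right
print(backward_warp(img, flow))  # columns shift left; right edge clamps
```

Deriving several fine-grained flow fields instead of a single one means each output pixel gets multiple warped candidates, which is how AMT eases occlusion handling.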

Deep Discriminative Spatial and Temporal Network for Efficient Video Deblurring
Pan, JinshanandXu, BomingandDong, JiangxinandGe, JianjunandTang, Jinhui



Research question: How to effectively explore spatial and temporal information for video deblurring.
Motivation: In contrast to existing methods that directly align adjacent frames without discrimination, a deep discriminative spatial and temporal network is developed to facilitate spatial and temporal feature exploration for better video deblurring.
Method: A channel-wise gated dynamic network is first developed to adaptively explore spatial information. Then, to obtain useful temporal features for restoring the latent clear frame, a simple yet effective discriminative temporal feature fusion module is developed. Furthermore, to exploit information from long-range frames, a wavelet-based feature propagation method is developed that takes the discriminative temporal feature fusion module as its basic unit, effectively propagating main structures from long-range frames for better video deblurring.
Results: Experiments show the proposed method requires no additional alignment and outperforms state-of-the-art methods in both accuracy and model complexity.

How to effectively explore spatial and temporal information is important for video deblurring. In contrast to existing methods that directly align adjacent frames without discrimination, we develop a deep discriminative spatial and temporal network to facilitate the spatial and temporal feature exploration for better video deblurring. We first develop a channel-wise gated dynamic network to adaptively explore the spatial information. As adjacent frames usually contain different contents, directly stacking features of adjacent frames without discrimination may affect the latent clear frame restoration. Therefore, we develop a simple yet effective discriminative temporal feature fusion module to obtain useful temporal features for latent frame restoration. Moreover, to utilize the information from long-range frames, we develop a wavelet-based feature propagation method that takes the discriminative temporal feature fusion module as the basic unit to effectively propagate main structures from long-range frames for better video deblurring. We show that the proposed method does not require additional alignment methods and performs favorably against state-of-the-art ones on benchmark datasets in terms of accuracy and model complexity.

Deep Arbitrary-Scale Image Super-Resolution via Scale-Equivariance Pursuit
Wang, XiaohangandChen, XuanhongandNi, BingbingandWang, HangandTong, ZhengyanandLiu, Yutian



Research question: How to exploit scale-equivariant modules within a transformer-style framework to improve arbitrary-scale image super-resolution (ASISR), especially in high-upsampling-rate extrapolation.
Motivation: Inspired by the key observation that scale-equivariant processing blocks play a central role in arbitrary-scale super-resolution, two novel scale-equivariant modules are proposed.
Method: A plug-in module called the Adaptive Feature Extractor is designed, which injects explicit scale information into a frequency-expanded encoding, achieving scale adaptation in representation learning. In the upsampling phase, a learnable Neural Kriging upsampling operator is introduced that bilaterally encodes both relative-distance (i.e., scale-aware) information and feature similarity (i.e., priors learned from training data).
Results: Experimental results demonstrate the outstanding scale-equivariance capability of the proposed operators and learning framework, with better results than previous SOTAs at arbitrary scales for SR.

The ability of scale-equivariance processing blocks plays a central role in arbitrary-scale image super-resolution tasks. Inspired by this crucial observation, this work proposes two novel scale-equivariant modules within a transformer-style framework to enhance arbitrary-scale image super-resolution (ASISR) performance, especially in high upsampling rate image extrapolation. In the feature extraction phase, we design a plug-in module called Adaptive Feature Extractor, which injects explicit scale information in frequency-expanded encoding, thus achieving scale-adaption in representation learning. In the upsampling phase, a learnable Neural Kriging upsampling operator is introduced, which simultaneously encodes both relative distance (i.e., scale-aware) information as well as feature similarity (i.e., with priori learned from training data) in a bilateral manner, providing scale-encoded spatial feature fusion. The above operators are easily plugged into multiple stages of a SR network, and a recent emerging pre-training strategy is also adopted to further boost the model's performance. Extensive experimental results have demonstrated the outstanding scale-equivariance capability offered by the proposed operators and our learning framework, with much better results than previous SOTAs at arbitrary scales for SR. Our code is available at https://github.com/neuralchen/EQSR.

OmniAL: A Unified CNN Framework for Unsupervised Anomaly Localization
Zhao, Ying



Research question: How to perform unsupervised anomaly localization and detection effectively, especially in industrial manufacturing where anomalous samples are lacking.
Motivation: Existing unsupervised industrial anomaly detection methods achieve high performance by training separate models for many different categories, but this paradigm incurs high model storage and training-time costs, and the one-model-N-classes setting severely degrades existing methods.
Method: This paper proposes OmniAL, a unified CNN framework for unsupervised anomaly localization that addresses these problems by improving anomaly synthesis, reconstruction, and localization. To prevent the model from learning an identity reconstruction, it is trained with the proposed panel-guided synthetic anomaly data rather than normal data directly. The anomaly reconstruction error for multi-class distributions is increased by a network equipped with the proposed Dilated Channel and Spatial Attention (DCSA) blocks. To better localize anomaly regions, the proposed DiffNeck between the reconstruction and localization sub-networks explores multi-level differences.
Results: Experiments on the 15-class MVTecAD and 12-class VisA datasets verify the advantage of OmniAL, which surpasses the state of the art among unified models. On 15-class MVTecAD / 12-class VisA, its single unified model achieves 97.2/87.8 image-AUROC, 98.3/96.6 pixel-AUROC, and 73.4/41.7 pixel-AP for anomaly detection and localization, respectively. In addition, the first comprehensive study is conducted on the robustness of unsupervised anomaly localization and detection methods under different levels of adversarial attack. Experimental results show OmniAL has superior performance and good application prospects.

Unsupervised anomaly localization and detection is crucial for industrial manufacturing processes due to the lack of anomalous samples. Recent unsupervised advances on industrial anomaly detection achieve high performance by training separate models for many different categories. The model storage and training time cost of this paradigm is high. Moreover, the setting of one-model-N-classes leads to severe degradation of existing methods. In this paper, we propose a unified CNN framework for unsupervised anomaly localization, named OmniAL. This method conquers the aforementioned problems by improving anomaly synthesis, reconstruction and localization. To prevent the model learning identical reconstruction, it trains the model with proposed panel-guided synthetic anomaly data rather than directly using normal data. It increases anomaly reconstruction error for multi-class distribution by using a network that is equipped with proposed Dilated Channel and Spatial Attention (DCSA) blocks. To better localize the anomaly regions, it employs proposed DiffNeck between reconstruction and localization sub-networks to explore multi-level differences. Experiments on 15-class MVTecAD and 12-class VisA datasets verify the advantage of proposed OmniAL that surpasses the state-of-the-art of unified models. On 15-class-MVTecAD/12-class-VisA, its single unified model achieves 97.2/87.8 image-AUROC, 98.3/96.6 pixel-AUROC and 73.4/41.7 pixel-AP for anomaly detection and localization respectively. Besides that, we make the first attempt to conduct a comprehensive study on the robustness of unsupervised anomaly localization and detection methods against different level adversarial attacks. Experimental results show OmniAL has good application prospects for its superior performance.

Recurrent Homography Estimation Using Homography-Guided Image Warping and Focus Transformer
Cao, Si-YuanandZhang, RunminandLuo, LunandYu, BeinanandSheng, ZehuaandLi, JunweiandShen, Hui-Liang



Research question: How to improve feature consistency and attention focusing with a recurrent homography estimation framework that uses homography-guided image warping and a focus transformer.
Motivation: To overcome the limited accuracy of prior methods on challenging cross-resolution and cross-modal datasets while remaining parameter-efficient.
Method: A recurrent homography estimation framework named RHWF is proposed, which appropriately absorbs homography-guided image warping and a focus transformer (FocusFormer) into the recurrent framework, progressively enhancing feature consistency and aggregating intra-inter correspondence in a global->nonlocal->local manner.
Results: Experiments show RHWF ranks top in accuracy on a variety of datasets, including the challenging cross-resolution and cross-modal ones. Compared with the previous state-of-the-art LocalTrans and IHN, RHWF reduces the mean average corner error by about 70% and 38.1% on the MSCOCO dataset, while saving 86.5% and 24.6% of the parameter cost.

We propose the Recurrent homography estimation framework using Homography-guided image Warping and Focus transformer (FocusFormer), named RHWF. Both being appropriately absorbed into the recurrent framework, the homography-guided image warping progressively enhances the feature consistency and the attention-focusing mechanism in FocusFormer aggregates the intra-inter correspondence in a global->nonlocal->local manner. Thanks to the above strategies, RHWF ranks top in accuracy on a variety of datasets, including the challenging cross-resolution and cross-modal ones. Meanwhile, benefiting from the recurrent framework, RHWF achieves parameter efficiency despite the transformer architecture. Compared to previous state-of-the-art approaches LocalTrans and IHN, RHWF reduces the mean average corner error (MACE) by about 70% and 38.1% on the MSCOCO dataset, while saving the parameter costs by 86.5% and 24.6%. Similar to the previous works, RHWF can also be arranged in 1-scale for efficiency and 2-scale for accuracy, with the 1-scale RHWF already outperforming most of the previous methods. Source code is available at https://github.com/imdumpl78/RHWF.

DLBD: A Self-Supervised Direct-Learned Binary Descriptor
Xiao, BinandHu, YangandLiu, BoandBi, XiuliandLi, WeishengandGao, Xinbo



Research question: The binarization process of learning-based binary descriptors has not been well addressed, because binarization blocks gradient back-propagation.
Motivation: Existing learning-based binary descriptors first learn real-valued output, which is then converted to binary descriptors by a separately proposed binarization process. Since that process is not part of the network, such descriptors cannot fully utilize the advances of deep learning.
Method: A model-agnostic plugin binary transformation layer (BTL) is proposed so that the network directly generates binary descriptors. On top of it, the first self-supervised, direct-learned binary descriptor, dubbed DLBD, is presented. Furthermore, an ultra-wide temperature-scaled cross-entropy loss is proposed to adjust the distribution of learned descriptors over a larger range.
Results: Experiments demonstrate that the proposed BTL can substitute for previous binarization processes, and DLBD outperforms the state of the art on tasks such as image retrieval and classification.

For learning-based binary descriptors, the binarization process has not been well addressed. The reason is that the binarization blocks gradient back-propagation. Existing learning-based binary descriptors learn real-valued output, and then it is converted to binary descriptors by their proposed binarization processes. Since their binarization processes are not a component of the network, the learning-based binary descriptor cannot fully utilize the advances of deep learning. To solve this issue, we propose a model-agnostic plugin binary transformation layer (BTL), making the network directly generate binary descriptors. Then, we present the first self-supervised, direct-learned binary descriptor, dubbed DLBD. Furthermore, we propose ultra-wide temperature-scaled cross-entropy loss to adjust the distribution of learned descriptors in a larger range. Experiments demonstrate that the proposed BTL can substitute the previous binarization process. Our proposed DLBD outperforms SOTA on different tasks such as image retrieval and classification.
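Since the abstract's central point is training through a binarization layer, here is a minimal sign-plus-straight-through sketch; the clipped straight-through gradient rule is a standard stand-in for making sign() trainable, not the paper's BTL:

```python
import numpy as np

def binarize_forward(x):
    """Forward pass of a sign-based binary transformation: outputs +-1."""
    return np.where(x >= 0, 1.0, -1.0)

def binarize_backward(x, grad_out, clip=1.0):
    """Straight-through estimator: sign() has zero gradient almost
    everywhere, so pass the incoming gradient through unchanged where
    |x| <= clip and zero it elsewhere."""
    return grad_out * (np.abs(x) <= clip)

x = np.array([-1.5, -0.2, 0.0, 0.7, 2.0])
b = binarize_forward(x)
g = binarize_backward(x, np.ones_like(x))
print(b)  # [-1. -1.  1.  1.  1.]
print(g)  # [0. 1. 1. 1. 0.]
```

Putting such a layer inside the network is what lets the descriptor be learned end-to-end in binary form, instead of binarizing a real-valued output after training.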

AutoFocusFormer: Image Segmentation off the Grid
Ziwen, ChenandPatnaik, KaushikandZhai, ShuangfeiandWan, AlvinandRen, ZhileandSchwing, AlexanderG.andColburn, AlexandFuxin, Li



Research question: How to mitigate the loss of small-object information when deep networks process real-world images with highly imbalanced content density.
Motivation: The commonly used successive grid-downsampling strategy treats all areas equally, discarding small-object information and degrading tasks such as segmentation.
Method: AutoFocusFormer (AFF), a local-attention transformer image recognition backbone, is proposed; it performs adaptive downsampling by learning to retain the pixels most important for the task, thereby preserving small-object information.
Results: Experiments show that AFF improves significantly over baseline models of similar size.

Real world images often have highly imbalanced content density. Some areas are very uniform, e.g., large patches of blue sky, while other areas are scattered with many small objects. Yet, the commonly used successive grid downsampling strategy in convolutional deep networks treats all areas equally. Hence, small objects are represented in very few spatial locations, leading to worse results in tasks such as segmentation. Intuitively, retaining more pixels representing small objects during downsampling helps to preserve important information. To achieve this, we propose AutoFocusFormer (AFF), a local-attention transformer image recognition backbone, which performs adaptive downsampling by learning to retain the most important pixels for the task. Since adaptive downsampling generates a set of pixels irregularly distributed on the image plane, we abandon the classic grid structure. Instead, we develop a novel point-based local attention block, facilitated by a balanced clustering module and a learnable neighborhood merging module, which yields representations for our point-based versions of state-of-the-art segmentation heads. Experiments show that our AutoFocusFormer (AFF) improves significantly over baseline models of similar sizes.
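Adaptive downsampling by learned importance reduces to top-k token retention. A sketch follows, assuming the importance scores are given (learning them, and the point-based attention over the surviving irregular set, are the model's job):

```python
import numpy as np

def adaptive_downsample(feats, importance, keep_ratio=0.5):
    """Keep the top keep_ratio fraction of tokens by importance score,
    instead of pooling on a regular grid. The survivors form an irregular
    set, which is why AFF replaces the grid with point-based attention."""
    n = feats.shape[0]
    k = max(1, int(n * keep_ratio))
    idx = np.argsort(importance)[::-1][:k]
    return feats[np.sort(idx)]  # preserve the original ordering of survivors

feats = np.arange(12, dtype=np.float64).reshape(6, 2)
imp = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.7])
out = adaptive_downsample(feats, imp)
print(out.shape)  # (3, 2): tokens 1, 3, 5 survive
```

Pixels on small objects can receive high scores and survive every stage, whereas grid pooling would dilute them regardless of content.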

CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion
Zhao, ZixiangandBai, HaowenandZhang, JiangsheandZhang, YulunandXu, ShuangandLin, ZudiandTimofte, RaduandVanGool, Luc



Research question: Multi-modality image fusion aims to render fused images that maintain the merits of different modalities, such as functional highlights and detailed textures.
Motivation: To tackle the challenges of modeling cross-modality features and decomposing desirable modality-specific and modality-shared features, a novel Correlation-Driven feature Decomposition Fusion (CDDFuse) network is proposed.
Method: First, CDDFuse uses Restormer blocks to extract cross-modality shallow features. A dual-branch Transformer-CNN feature extractor is then introduced, with Lite Transformer (LT) blocks leveraging long-range attention to handle low-frequency global features and Invertible Neural Network (INN) blocks focusing on extracting high-frequency local information. A correlation-driven loss is further proposed to make the low-frequency features correlated and the high-frequency features uncorrelated, based on the embedded information. Finally, the LT-based global fusion and INN-based local fusion layers output the fused image.
Results: Extensive experiments demonstrate that CDDFuse achieves promising results on multiple fusion tasks, including infrared-visible image fusion and medical image fusion, and can also boost downstream infrared-visible semantic segmentation and object detection in a unified benchmark. The code is available at https://github.com/Zhaozixiang1228/MMIF-CDDFuse.

Multi-modality (MM) image fusion aims to render fused images that maintain the merits of different modalities, e.g., functional highlight and detailed textures. To tackle the challenge in modeling cross-modality features and decomposing desirable modality-specific and modality-shared features, we propose a novel Correlation-Driven feature Decomposition Fusion (CDDFuse) network. Firstly, CDDFuse uses Restormer blocks to extract cross-modality shallow features. We then introduce a dual-branch Transformer-CNN feature extractor with Lite Transformer (LT) blocks leveraging long-range attention to handle low-frequency global features and Invertible Neural Networks (INN) blocks focusing on extracting high-frequency local information. A correlation-driven loss is further proposed to make the low-frequency features correlated while the high-frequency features uncorrelated based on the embedded information. Then, the LT-based global fusion and INN-based local fusion layers output the fused image. Extensive experiments demonstrate that our CDDFuse achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion. We also show that CDDFuse can boost the performance in downstream infrared-visible semantic segmentation and object detection in a unified benchmark. The code is available at https://github.com/Zhaozixiang1228/MMIF-CDDFuse.
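The correlation-driven objective can be illustrated with Pearson correlation on flattened feature maps; the combination below is a guess at the spirit of the loss (penalize high-frequency correlation, reward low-frequency correlation), not the paper's formulation:

```python
import numpy as np

def pearson_corr(a, b):
    """Pearson correlation coefficient of two flattened feature maps."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def correlation_driven_loss(low_a, low_b, high_a, high_b, eps=1e-6):
    """Small when low-frequency (modality-shared) features of the two
    modalities are correlated AND high-frequency (modality-specific)
    features are decorrelated. Hypothetical form, for illustration only."""
    cc_low = pearson_corr(low_a, low_b)    # want this close to 1
    cc_high = pearson_corr(high_a, high_b)  # want this close to 0
    return (cc_high ** 2) / (cc_low + 1.0 + eps)

rng = np.random.default_rng(0)
shared = rng.standard_normal(64)
loss = correlation_driven_loss(shared, shared + 0.01,
                               rng.standard_normal(64),
                               rng.standard_normal(64))
print(loss)  # small: shared parts correlate, specific parts do not
```

Driving the decomposition with such a correlation signal is what separates modality-shared structure (low frequency) from modality-specific detail (high frequency) before fusion.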

HGNet: Learning Hierarchical Geometry From Points, Edges, and Surfaces
Yao, Ting and Li, Yehao and Pan, Yingwei and Mei, Tao



Research question: How to parse an unstructured point set into local geometric structures for understanding and representing point clouds.
Motivation: A thorough analysis of point clouds calls for a deep architecture that models the hierarchical geometry from points, edges, and surfaces (triangles) up to super-surfaces (adjacent surfaces).
Method: This paper presents a novel Hierarchical Geometry Network (HGNet) that integrates the hierarchical geometry structures from super-surfaces, surfaces, and edges down to points in a top-down manner for learning point cloud representations. Specifically, edges are first constructed between every two neighboring points. A point-level representation is then learned via edge-to-point aggregation, i.e., aggregating all connected edges into the anchor point. Next, since every two neighboring edges compose a surface, the edge-level representation of each anchor edge is obtained via surface-to-edge aggregation over all neighboring surfaces. Furthermore, the surface-level representation is achieved through super-surface-to-surface aggregation, transforming all super-surfaces into the anchor surface. Finally, a Transformer structure unifies the point-level, edge-level, and surface-level features into a holistic point cloud representation.
Results: Extensive experiments on four point cloud analysis datasets demonstrate the superiority of HGNet on 3D object classification and part/semantic segmentation. More remarkably, HGNet achieves an overall accuracy of 89.2% on ScanObjectNN, improving over PointNeXt-S by 1.5%.

Parsing an unstructured point set into constituent local geometry structures (e.g., edges or surfaces) would be helpful for understanding and representing point clouds. This motivates us to design a deep architecture to model the hierarchical geometry from points, edges, surfaces (triangles), to super-surfaces (adjacent surfaces) for the thorough analysis of point clouds. In this paper, we present a novel Hierarchical Geometry Network (HGNet) that integrates such hierarchical geometry structures from super-surfaces, surfaces, edges, to points in a top-down manner for learning point cloud representations. Technically, we first construct the edges between every two neighbor points. A point-level representation is learnt with edge-to-point aggregation, i.e., aggregating all connected edges into the anchor point. Next, as every two neighbor edges compose a surface, we obtain the edge-level representation of each anchor edge via surface-to-edge aggregation over all neighbor surfaces. Furthermore, the surface-level representation is achieved through super-surface-to-surface aggregation by transforming all super-surfaces into the anchor surface. A Transformer structure is finally devised to unify all the point-level, edge-level, and surface-level features into the holistic point cloud representations. Extensive experiments on four point cloud analysis datasets demonstrate the superiority of HGNet for 3D object classification and part/semantic segmentation tasks. More remarkably, HGNet achieves the overall accuracy of 89.2% on ScanObjectNN, improving PointNeXt-S by 1.5%.
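The bottom of the hierarchy, edge construction plus edge-to-point aggregation, can be sketched as follows (max aggregation over endpoint-difference edge features is an assumption for illustration, not HGNet's exact operator):

```python
import numpy as np

def edge_to_point_aggregation(points, feats, k=4):
    """Connect each anchor point to its k nearest neighbors (the edges) and
    aggregate per-edge features (here, the endpoint feature difference) back
    into the anchor via max pooling. Illustrative only; the full HGNet
    hierarchy continues upward to surfaces and super-surfaces."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # no self-edges
    knn = np.argsort(d, axis=1)[:, :k]          # (N, k) neighbor indices
    edge_feats = feats[knn] - feats[:, None, :] # (N, k, C) per-edge features
    return edge_feats.max(axis=1)               # (N, C) point-level representation

pts = np.arange(6.0)[:, None]                   # six points on a line
agg = edge_to_point_aggregation(pts, pts, k=4)
```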

PointVector: A Vector Representation in Point Cloud Analysis
Deng, Xin and Zhang, WenYu and Ding, Qing and Zhang, XinMing



Research question: How to effectively extract local features and improve point cloud analysis.
Motivation: Although point-based methods and concise MLP structures such as PointNeXt have demonstrated competitiveness with convolutional and Transformer structures, standard MLPs are limited in their ability to extract local features.
Method: We propose a vector-oriented point set abstraction that aggregates neighboring features through higher-dimensional vectors. To facilitate network optimization, we construct a scalar-to-vector transformation using independent angles based on 3D vector rotations. Finally, we develop a PointVector model that follows the PointNeXt structure.
Results: Our experiments show that PointVector achieves state-of-the-art performance of 72.3% mIoU on S3DIS Area 5 and 78.4% mIoU on S3DIS (6-fold cross-validation) with only 58% of PointNeXt's model parameters. We hope this work aids the exploration of concise and effective feature representations.

In point cloud analysis, point-based methods have rapidly developed in recent years. These methods have recently focused on concise MLP structures, such as PointNeXt, which have demonstrated competitiveness with Convolutional and Transformer structures. However, standard MLPs are limited in their ability to extract local features effectively. To address this limitation, we propose a Vector-oriented Point Set Abstraction that can aggregate neighboring features through higher-dimensional vectors. To facilitate network optimization, we construct a transformation from scalar to vector using independent angles based on 3D vector rotations. Finally, we develop a PointVector model that follows the structure of PointNeXt. Our experimental results demonstrate that PointVector achieves state-of-the-art performance of 72.3% mIoU on S3DIS Area 5 and 78.4% mIoU on S3DIS (6-fold cross-validation) with only 58% of the model parameters of PointNeXt. We hope our work will help the exploration of concise and effective feature representations. The code will be released soon.

BASiS: Batch Aligned Spectral Embedding Space
Streicher, Or and Cohen, Ido and Gilboa, Guy



Research question: How to design deep network building blocks with spectral-graph characteristics.
Motivation: Spectral graph theory provides powerful algorithms for designing optimal graph structures or obtaining canonical orthogonal low-dimensional embeddings of data.
Method: We propose directly learning the graph's eigenspace and design a stable alignment mechanism that handles both batch changes and graph-metric changes.
Results: Experiments show that the method outperforms the state of the art in NMI, ACC, Grassmann distance, orthogonality, and classification accuracy, with a more stable learning process.

Graph is a highly generic and diverse representation, suitable for almost any data processing problem. Spectral graph theory has been shown to provide powerful algorithms, backed by solid linear algebra theory. It thus can be extremely instrumental to design deep network building blocks with spectral graph characteristics. For instance, such a network allows the design of optimal graphs for certain tasks or obtaining a canonical orthogonal low-dimensional embedding of the data. Recent attempts to solve this problem were based on minimizing Rayleigh-quotient type losses. We propose a different approach of directly learning the graph's eigenspace. A severe problem of the direct approach, applied in batch-learning, is the inconsistent mapping of features to eigenspace coordinates in different batches. We analyze the degrees of freedom of learning this task using batches and propose a stable alignment mechanism that can work both with batch changes and with graph-metric changes. We show that our learnt spectral embedding is better in terms of NMI, ACC, Grassmann distance, orthogonality and classification accuracy, compared to SOTA. In addition, the learning is more stable.
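The alignment idea, mapping each batch's embedding onto a consistent frame via shared anchor points, can be sketched with an orthogonal Procrustes solve (a minimal sketch of the concept only; the paper's mechanism also handles graph-metric changes):

```python
import numpy as np

def align_embeddings(anchor_src, anchor_tgt, batch_emb):
    """Align one batch's spectral embedding to a reference frame using the
    orthogonal Procrustes solution on shared anchor points:
    R = argmin_R ||anchor_src @ R - anchor_tgt||_F over orthogonal R."""
    u, _, vt = np.linalg.svd(anchor_src.T @ anchor_tgt)
    return batch_emb @ (u @ vt)

rng = np.random.default_rng(1)
reference = rng.normal(size=(10, 3))          # anchor embeddings, reference frame
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a new batch arrives arbitrarily rotated
anchors_in_batch = reference @ q              # same anchors seen in the rotated frame
aligned = align_embeddings(anchors_in_batch, reference, anchors_in_batch)
```

Because eigenvectors are only defined up to an orthogonal transform, aligning on shared anchors makes coordinates consistent across batches.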

Recognizing Rigid Patterns of Unlabeled Point Clouds by Complete and Continuous Isometry Invariants With No False Negatives and No False Positives
Widdowson, Daniel and Kurlin, Vitaliy



Research question: How to effectively represent and compare point clouds of rigid structures such as cars or other solid objects.
Motivation: Noise and motion in data cause existing rigid-pattern comparisons to produce false positives and false negatives, motivating a search for invariants that are continuous under perturbations of the data.
Method: We propose the first continuous and complete isometry invariant of unlabeled point clouds in any Euclidean space.
Results: For a fixed dimension, the new metric for this invariant is computable in polynomial time in the number of points, resolving the shortcomings of existing methods.

Rigid structures such as cars or any other solid objects are often represented by finite clouds of unlabeled points. The most natural equivalence on these point clouds is rigid motion or isometry maintaining all inter-point distances. Rigid patterns of point clouds can be reliably compared only by complete isometry invariants that can also be called equivariant descriptors without false negatives (isometric clouds having different descriptions) and without false positives (non-isometric clouds with the same description). Noise and motion in data motivate a search for invariants that are continuous under perturbations of points in a suitable metric. We propose the first continuous and complete invariant of unlabeled clouds in any Euclidean space. For a fixed dimension, the new metric for this invariant is computable in a polynomial time in the number of points.
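For intuition, a classical isometry invariant is the sorted multiset of pairwise distances; the sketch below verifies its invariance under a rigid motion. This is background illustration only, not the paper's new invariant, and unlike that invariant it is not complete:

```python
import numpy as np

def sorted_distance_profile(cloud):
    """Sorted multiset of pairwise distances: a continuous isometry invariant
    of an unlabeled cloud. Rare non-isometric clouds can share it (false
    positives), which is exactly the gap a *complete* invariant closes."""
    diff = cloud[:, None, :] - cloud[None, :, :]
    d = np.linalg.norm(diff, axis=-1)
    return np.sort(d[np.triu_indices(len(cloud), k=1)])

rng = np.random.default_rng(2)
cloud = rng.normal(size=(12, 3))
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))    # random orthogonal map
moved = cloud @ q + np.array([5.0, -2.0, 1.0])  # an isometry of the cloud
```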

N-Gram in Swin Transformers for Efficient Lightweight Image Super-Resolution
Choi, Haram and Lee, Jeongmin and Yang, Jihoon



Research question: Swin Transformer with plain window self-attention ignores broad regions when reconstructing high-resolution images due to its limited receptive field, and many deep-learning super-resolution methods are computationally intensive.
Motivation: To address these problems, this paper introduces the N-Gram context to low-level vision with Transformers and uses sliding window self-attention to expand the regions seen when restoring degraded pixels.
Method: We define N-Gram as neighboring local windows in Swin and combine it with sliding window self-attention, expanding the visible regions for restoring degraded pixels. We further propose NGswin, an efficient SR network with an SCDP bottleneck that takes multi-scale outputs of the hierarchical encoder.
Results: Experiments show that NGswin achieves competitive performance while maintaining an efficient structure compared with previous methods. We also improve other Swin-based SR methods with the N-Gram context, building an enhanced model, SwinIR-NG, which outperforms the current best lightweight SR methods and sets state-of-the-art results.

While some studies have proven that Swin Transformer (Swin) with window self-attention (WSA) is suitable for single image super-resolution (SR), the plain WSA ignores the broad regions when reconstructing high-resolution images due to a limited receptive field. In addition, many deep learning SR methods suffer from intensive computations. To address these problems, we introduce the N-Gram context to the low-level vision with Transformers for the first time. We define N-Gram as neighboring local windows in Swin, which differs from text analysis that views N-Gram as consecutive characters or words. N-Grams interact with each other by sliding-WSA, expanding the regions seen to restore degraded pixels. Using the N-Gram context, we propose NGswin, an efficient SR network with SCDP bottleneck taking multi-scale outputs of the hierarchical encoder. Experimental results show that NGswin achieves competitive performance while maintaining an efficient structure when compared with previous leading methods. Moreover, we also improve other Swin-based SR methods with the N-Gram context, thereby building an enhanced model: SwinIR-NG. Our improved SwinIR-NG outperforms the current best lightweight SR approaches and establishes state-of-the-art results. Codes are available at https://github.com/rami0205/NGramSwin.
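The N-Gram context, each window drawing on its neighboring windows, can be approximated with plain mean aggregation on the window grid (sliding-WSA is replaced by a mean for brevity; `win` and `n` are illustrative parameters, not the paper's settings):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def ngram_window_context(feat, win=4, n=2):
    """For each non-overlapping Swin-style window, aggregate information over
    an n-neighborhood of windows (the 'N-Gram' context). Mean aggregation
    stands in for sliding window self-attention; a toy sketch only."""
    h, w = feat.shape
    gh, gw = h // win, w // win
    # per-window means on the (gh, gw) window grid
    grid = feat[:gh * win, :gw * win].reshape(gh, win, gw, win).mean(axis=(1, 3))
    padded = np.pad(grid, n - 1, mode='edge')
    views = sliding_window_view(padded, (2 * n - 1, 2 * n - 1))
    return views.mean(axis=(-1, -2))  # (gh, gw): context per window

ctx = ngram_window_context(np.ones((8, 8)), win=4, n=2)
```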

Virtual Sparse Convolution for Multimodal 3D Object Detection
Wu, Hai and Wen, Chenglu and Shi, Shaoshuai and Li, Xin and Wang, Cheng



Research question: How to effectively fuse RGB images and LiDAR data for 3D object detection.
Motivation: In current virtual/pseudo-point-based 3D detectors, the generated virtual points are very dense, introducing a huge amount of redundant computation during detection, and noise from inaccurate depth completion significantly degrades detection precision.
Method: We propose a new backbone, VirConvNet, built on a new operator, VirConv (Virtual Sparse Convolution). VirConv consists of two key designs: (1) StVD (Stochastic Voxel Discard) and (2) NRConv (Noise-Resistant Submanifold Convolution). StVD alleviates the computation problem by discarding large numbers of nearby redundant voxels; NRConv tackles the noise problem by encoding voxel features in both 2D image and 3D LiDAR space.
Results: On the KITTI car 3D detection test leaderboard, our VirConv-L achieves 85% AP with a fast running speed of 56 ms. Our VirConv-T and VirConv-S attain a high precision of 86.3% and 87.2% AP, currently ranking 2nd and 1st, respectively. The code is available at https://github.com/hailanyi/VirConv.

Recently, virtual/pseudo-point-based 3D object detection that seamlessly fuses RGB images and LiDAR data by depth completion has gained great attention. However, virtual points generated from an image are very dense, introducing a huge amount of redundant computation during detection. Meanwhile, noises brought by inaccurate depth completion significantly degrade detection precision. This paper proposes a fast yet effective backbone, termed VirConvNet, based on a new operator VirConv (Virtual Sparse Convolution), for virtual-point-based 3D object detection. The VirConv consists of two key designs: (1) StVD (Stochastic Voxel Discard) and (2) NRConv (Noise-Resistant Submanifold Convolution). The StVD alleviates the computation problem by discarding large amounts of nearby redundant voxels. The NRConv tackles the noise problem by encoding voxel features in both 2D image and 3D LiDAR space. By integrating our VirConv, we first develop an efficient pipeline VirConv-L based on an early fusion design. Then, we build a high-precision pipeline VirConv-T based on a transformed refinement scheme. Finally, we develop a semi-supervised pipeline VirConv-S based on a pseudo-label framework. On the KITTI car 3D detection test leaderboard, our VirConv-L achieves 85% AP with a fast running speed of 56ms. Our VirConv-T and VirConv-S attain a high precision of 86.3% and 87.2% AP, and currently rank 2nd and 1st, respectively. The code is available at https://github.com/hailanyi/VirConv.
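The StVD idea, discarding redundant nearby voxels while keeping all far ones, can be sketched as follows (a uniform keep-ratio variant; VirConv's actual scheme is bin-based, so the function and its parameters are illustrative):

```python
import numpy as np

def stochastic_voxel_discard(voxels, keep_ratio=0.1, near=20.0, rng=None):
    """Virtual points densely cover nearby surfaces, so randomly keep only
    `keep_ratio` of voxels within `near` meters (ground-plane range) and
    keep all farther voxels, where points are sparse and precious."""
    if rng is None:
        rng = np.random.default_rng()
    dist = np.linalg.norm(voxels[:, :2], axis=1)   # range in the ground plane
    far = dist >= near
    keep_near = ~far & (rng.random(len(voxels)) < keep_ratio)
    return voxels[far | keep_near]

rng = np.random.default_rng(3)
voxels = rng.uniform(0.0, 60.0, size=(10000, 3))
kept = stochastic_voxel_discard(voxels, keep_ratio=0.1, near=20.0, rng=rng)
```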

ALTO: Alternating Latent Topologies for Implicit 3D Reconstruction
Wang, Zhen and Zhou, Shijie and Park, Jeong Joon and Paschalidou, Despoina and You, Suya and Wetzstein, Gordon and Guibas, Leonidas and Kadambi, Achuta



Research question: How to reconstruct implicit 3D surfaces with high fidelity from noisy point clouds.
Motivation: Existing methods struggle to recover detail; point latents and grid latents each have their own trade-offs.
Method: We propose alternating latent topologies (ALTO), which alternates between geometric representations before converging to an easy-to-decode latent.
Results: Experiments show that ALTO not only outperforms the state of the art but also improves runtime by 3-10x.

This work introduces alternating latent topologies (ALTO) for high-fidelity reconstruction of implicit 3D surfaces from noisy point clouds. Previous work identifies that the spatial arrangement of latent encodings is important to recover detail. One school of thought is to encode a latent vector for each point (point latents). Another school of thought is to project point latents into a grid (grid latents) which could be a voxel grid or triplane grid. Each school of thought has tradeoffs. Grid latents are coarse and lose high-frequency detail. In contrast, point latents preserve detail. However, point latents are more difficult to decode into a surface, and quality and runtime suffer. In this paper, we propose ALTO to sequentially alternate between geometric representations, before converging to an easy-to-decode latent. We find that this preserves spatial expressiveness and makes decoding lightweight. We validate ALTO on implicit 3D recovery and observe not only a performance improvement over the state-of-the-art, but a runtime improvement of 3-10x. Anonymized source code at https://visual.ee.ucla.edu/alto.htm/.

MSMDFusion: Fusing LiDAR and Camera at Multiple Scales With Multi-Depth Seeds for 3D Object Detection
Jiao, Yang and Jie, Zequn and Chen, Shaoxiang and Chen, Jingjing and Ma, Lin and Jiang, Yu-Gang



Research question: How to effectively fuse LiDAR and camera information for accurate and reliable 3D object detection in autonomous driving systems.
Motivation: Fusing the two drastically different modalities (LiDAR and camera) is essential yet challenging, because their multi-granularity geometric and semantic features are hard to combine.
Method: We propose a novel framework that better exploits depth information and enables fine-grained cross-modal interaction between LiDAR and camera in voxel space. It has two key components: a Multi-Depth Unprojection (MDU) method that improves the depth quality of the lifted points at each interaction level, and a Gated Modality-Aware Convolution (GMA-Conv) block that modulates voxels involved with the camera modality in a fine-grained manner and then aggregates multi-modal features into a unified space.
Results: On the nuScenes test benchmark, the method (abbreviated MSMDFusion) achieves state-of-the-art results on both 3D object detection and tracking without using test-time augmentation or ensemble techniques.

Fusing LiDAR and camera information is essential for accurate and reliable 3D object detection in autonomous driving systems. This is challenging due to the difficulty of combining multi-granularity geometric and semantic features from two drastically different modalities. Recent approaches aim at exploring the semantic densities of camera features through lifting points in 2D camera images (referred to as "seeds") into 3D space, and then incorporate 2D semantics via cross-modal interaction or fusion techniques. However, depth information is under-investigated in these approaches when lifting points into 3D space, thus 2D semantics can not be reliably fused with 3D points. Moreover, their multi-modal fusion strategy, which is implemented as concatenation or attention, either can not effectively fuse 2D and 3D information or is unable to perform fine-grained interactions in the voxel space. To this end, we propose a novel framework with better utilization of the depth information and fine-grained cross-modal interaction between LiDAR and camera, which consists of two important components. First, a Multi-Depth Unprojection (MDU) method is used to enhance the depth quality of the lifted points at each interaction level. Second, a Gated Modality-Aware Convolution (GMA-Conv) block is applied to modulate voxels involved with the camera modality in a fine-grained manner and then aggregate multi-modal features into a unified space. Together they provide the detection head with more comprehensive features from LiDAR and camera. On the nuScenes test benchmark, our proposed method, abbreviated as MSMDFusion, achieves state-of-the-art results on both 3D object detection and tracking tasks without using test-time-augmentation and ensemble techniques. The code is available at https://github.com/SxJyJay/MSMDFusion.
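The geometric core of multi-depth seeds, lifting one pixel into 3D under several depth hypotheses, is ordinary pinhole unprojection (function name and the single-pixel framing are illustrative, not MSMDFusion's MDU module itself):

```python
import numpy as np

def unproject_multi_depth(u, v, depths, K):
    """Lift one 2D seed pixel (u, v) to several 3D candidates, one per depth
    hypothesis, with the pinhole model X = d * K^-1 [u, v, 1]^T. Keeping
    multiple hypotheses hedges against inaccurate depth completion."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return np.asarray(depths, dtype=float)[:, None] * ray[None, :]

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
points = unproject_multi_depth(320.0, 240.0, [4.0, 4.5, 5.0], K)  # principal point
```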

Toward Stable, Interpretable, and Lightweight Hyperspectral Super-Resolution
Guo, Wen-jin and Xie, Weiying and Jiang, Kai and Li, Yunsong and Lei, Jie and Fang, Leyuan



Research question: Existing hyperspectral image super-resolution (HSI-SR) methods are unstable under unknown scenarios and computationally expensive.
Motivation: To develop a new coordination optimization framework for stable, interpretable, and lightweight HSI-SR.
Method: We create a positive cycle between fusion and degradation estimation under a new probabilistic framework, and use the estimated degradation as guidance for degradation-aware HSI-SR.
Results: Experiments demonstrate superiority over the state of the art, e.g., a 2.3 dB PSNR gain on the CAVE dataset with 120x model size reduction and 4300x fewer FLOPs.

For real applications, existing HSI-SR methods are mostly not only limited to unstable performance under unknown scenarios but also suffer from high computation consumption. In this paper, we develop a new coordination optimization framework for stable, interpretable, and lightweight HSI-SR. Specifically, we create a positive cycle between fusion and degradation estimation under a new probabilistic framework. The estimated degradation is applied to fusion as guidance for a degradation-aware HSI-SR. Under the framework, we establish an explicit degradation estimation method to tackle the indeterminacy and unstable performance driven by black-box simulation in previous methods. Considering the interpretability in fusion, we integrate spectral mixing prior to the fusion process, which can be easily realized by a tiny autoencoder, leading to a dramatic release of the computation burden. We then develop a partial fine-tune strategy in inference to reduce the computation cost further. Comprehensive experiments demonstrate the superiority of our method against state-of-the-art under synthetic and real datasets. For instance, we achieve a 2.3 dB promotion on PSNR with 120x model size reduction and 4300x FLOPs reduction under the CAVE dataset. Code is available in https://github.com/WenjinGuo/DAEM.
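The linear spectral-mixing prior, each pixel spectrum as a combination of a few endmember spectra, can be sketched with plain least squares (the paper realizes the prior with a tiny autoencoder; this stand-in only shows the model being imposed):

```python
import numpy as np

def unmix(pixel, endmembers):
    """Least-squares abundances under the linear spectral-mixing prior
    pixel = endmembers @ abundances. Constraining the reconstruction to this
    low-dimensional mixture is what keeps the fusion module tiny."""
    abundances, *_ = np.linalg.lstsq(endmembers, pixel, rcond=None)
    return abundances

rng = np.random.default_rng(4)
E = rng.uniform(size=(31, 4))             # 31 spectral bands, 4 endmember spectra
true_a = np.array([0.5, 0.2, 0.2, 0.1])   # ground-truth abundances
a_hat = unmix(E @ true_a, E)
```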

R2Former: Unified Retrieval and Reranking Transformer for Place Recognition
Zhu, Sijie and Yang, Linjie and Chen, Chen and Shah, Mubarak and Shen, Xiaohui and Wang, Heng



Research question: This paper addresses visual place recognition (VPR): estimating the location of a query image by matching it against images in a reference database.
Motivation: Conventional VPR methods typically adopt aggregated CNN features for global retrieval and RANSAC-based geometric verification for reranking. However, RANSAC uses only geometric information and ignores other cues that could be useful for reranking, such as local feature correlations and attention values.
Method: We propose a unified place recognition framework that handles both retrieval and reranking with a novel Transformer model, R2Former. The proposed reranking module takes feature correlation, attention values, and xy coordinates into account, and learns to determine whether an image pair comes from the same location. The whole pipeline is end-to-end trainable, and the reranking module alone can also be adopted on other CNN or Transformer backbones as a generic component.
Results: Experiments show that R2Former significantly outperforms state-of-the-art methods on major VPR datasets with much less inference time and memory consumption. It also achieves state of the art on the hold-out MSLS challenge set and can serve as a simple yet strong solution for real-world large-scale applications. Experiments also show that vision transformer tokens are comparable to, and sometimes better than, CNN local features. The code is released at https://github.com/Jeff-Zilence/R2Former.

Visual Place Recognition (VPR) estimates the location of query images by matching them with images in a reference database. Conventional methods generally adopt aggregated CNN features for global retrieval and RANSAC-based geometric verification for reranking. However, RANSAC only employs geometric information but ignores other possible information that could be useful for reranking, e.g. local feature correlations, and attention values. In this paper, we propose a unified place recognition framework that handles both retrieval and reranking with a novel transformer model, named R2Former. The proposed reranking module takes feature correlation, attention value, and xy coordinates into account, and learns to determine whether the image pair is from the same location. The whole pipeline is end-to-end trainable and the reranking module alone can also be adopted on other CNN or transformer backbones as a generic component. Remarkably, R2Former significantly outperforms state-of-the-art methods on major VPR datasets with much less inference time and memory consumption. It also achieves the state-of-the-art on the hold-out MSLS challenge set and could serve as a simple yet strong solution for real-world large-scale applications. Experiments also show vision transformer tokens are comparable and sometimes better than CNN local features on local matching. The code is released at https://github.com/Jeff-Zilence/R2Former.

CompletionFormer: Depth Completion With Convolutions and Vision Transformers
Zhang, Youmin and Guo, Xianda and Poggi, Matteo and Zhu, Zheng and Huang, Guan and Mattoccia, Stefano



Research question: Given sparse depths and corresponding RGB images, how to propagate the sparse measurements across the whole image to obtain a dense depth prediction.
Motivation: The locality of convolutional layers or graph models makes it hard for networks to model long-range relationships between pixels, while fully Transformer-based architectures still trail well-developed CNN models in performance and efficiency because of deteriorated local feature details.
Method: We propose a joint convolutional attention and Transformer block (JCAT) that deeply couples a convolutional attention layer and a Vision Transformer into one block, used as the basic unit to construct a depth completion model in a pyramidal structure. This hybrid architecture combines the local connectivity of convolutions with the global context of the Transformer in a single model.
Results: CompletionFormer outperforms state-of-the-art CNN-based methods on the outdoor KITTI Depth Completion benchmark and the indoor NYUv2 dataset, with significantly higher efficiency (nearly 1/3 the FLOPs) than pure Transformer-based methods; the performance gap widens when the captured depth is highly sparse.

Given sparse depths and the corresponding RGB images, depth completion aims at spatially propagating the sparse measurements throughout the whole image to get a dense depth prediction. Despite the tremendous progress of deep-learning-based depth completion methods, the locality of the convolutional layer or graph model makes it hard for the network to model the long-range relationship between pixels. While recent fully Transformer-based architecture has reported encouraging results with the global receptive field, the performance and efficiency gaps to the well-developed CNN models still exist because of its deteriorative local feature details. This paper proposes a joint convolutional attention and Transformer block (JCAT), which deeply couples the convolutional attention layer and Vision Transformer into one block, as the basic unit to construct our depth completion model in a pyramidal structure. This hybrid architecture naturally benefits both the local connectivity of convolutions and the global context of the Transformer in one single model. As a result, our CompletionFormer outperforms state-of-the-art CNNs-based methods on the outdoor KITTI Depth Completion benchmark and indoor NYUv2 dataset, achieving significantly higher efficiency (nearly 1/3 FLOPs) compared to pure Transformer-based methods. Especially when the captured depth is highly sparse, the performance gap with other methods gets much larger.

Comprehensive and Delicate: An Efficient Transformer for Image Restoration
Zhao, Haiyu and Gou, Yuanbiao and Li, Boyun and Peng, Dezhong and Lv, Jiancheng and Peng, Xi



Research question: This paper proposes a novel efficient image restoration Transformer that overcomes existing methods' limited ability to capture global dependency among pixels.
Motivation: Although existing image restoration Transformers have achieved some success, their local (window- or channel-based) attention cannot fully capture global dependency among pixels.
Method: We propose a coarse-to-fine framework that first captures superpixel-wise global dependency and then transfers it to each pixel, implemented through two neural blocks: a condensed attention neural block (CA) and a dual adaptive neural block (DA). CA employs feature aggregation, attention computation, and feature recovery to efficiently capture global dependency at the superpixel level; DA takes a novel dual-way structure to adaptively encapsulate the globality from superpixels into pixels.
Results: Thanks to the two neural blocks, our method achieves performance comparable to SwinIR while taking only 6% of its FLOPs.

Vision Transformers have shown promising performance in image restoration, which usually conduct window- or channel-based attention to avoid intensive computations. Although the promising performance has been achieved, they go against the biggest success factor of Transformers to a certain extent by capturing the local instead of global dependency among pixels. In this paper, we propose a novel efficient image restoration Transformer that first captures the superpixel-wise global dependency, and then transfers it into each pixel. Such a coarse-to-fine paradigm is implemented through two neural blocks, i.e., condensed attention neural block (CA) and dual adaptive neural block (DA). In brief, CA employs feature aggregation, attention computation, and feature recovery to efficiently capture the global dependency at the superpixel level. To embrace the pixel-wise global dependency, DA takes a novel dual-way structure to adaptively encapsulate the globality from superpixels into pixels. Thanks to the two neural blocks, our method achieves comparable performance while taking only 6% FLOPs compared with SwinIR.

Camouflaged Object Detection With Feature Decomposition and Edge Reconstruction
He, Chunming and Li, Kai and Zhang, Yachao and Tang, Longxiang and Zhang, Yulun and Guo, Zhenhua and Li, Xiu



Research question: This paper addresses the identification of camouflaged objects in complex backgrounds.
Motivation: The intrinsic similarity of camouflaged objects to their backgrounds, together with their ambiguous boundaries, makes camouflaged object detection challenging.
Method: We propose the FEDER model, which decomposes features into different frequency bands using learnable wavelets and then focuses on the most informative bands to mine subtle cues that differentiate foreground from background. We also design an ordinary differential equation-inspired edge reconstruction module that generates exact edges.
Results: Experiments show that FEDER significantly outperforms existing methods with lower computational and memory costs.

Camouflaged object detection (COD) aims to address the tough issue of identifying camouflaged objects visually blended into the surrounding backgrounds. COD is a challenging task due to the intrinsic similarity of camouflaged objects with the background, as well as their ambiguous boundaries. Existing approaches to this problem have developed various techniques to mimic the human visual system. Albeit effective in many cases, these methods still struggle when camouflaged objects are so deceptive to the vision system. In this paper, we propose the FEature Decomposition and Edge Reconstruction (FEDER) model for COD. The FEDER model addresses the intrinsic similarity of foreground and background by decomposing the features into different frequency bands using learnable wavelets. It then focuses on the most informative bands to mine subtle cues that differentiate foreground and background. To achieve this, a frequency attention module and a guidance-based feature aggregation module are developed. To combat the ambiguous boundary problem, we propose to learn an auxiliary edge reconstruction task alongside the COD task. We design an ordinary differential equation-inspired edge reconstruction module that generates exact edges. By learning the auxiliary task in conjunction with the COD task, the FEDER model can generate precise prediction maps with accurate object boundaries. Experiments show that our FEDER model significantly outperforms state-of-the-art methods with cheaper computational and memory costs.
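Decomposing features into frequency bands can be illustrated with a fixed one-level Haar transform (FEDER's wavelets are learnable; the function and band names here are conventional, not the paper's):

```python
import numpy as np

def haar_bands(img):
    """One-level 2D Haar split of an image into four frequency bands.
    LL carries the low-frequency content; LH/HL/HH carry horizontal,
    vertical, and diagonal detail where subtle foreground cues live."""
    a = img[0::2, 0::2]
    b = img[0::2, 1::2]
    c = img[1::2, 0::2]
    d = img[1::2, 1::2]
    ll = (a + b + c + d) / 4.0   # low-frequency average
    lh = (a - b + c - d) / 4.0   # horizontal detail
    hl = (a + b - c - d) / 4.0   # vertical detail
    hh = (a - b - c + d) / 4.0   # diagonal detail
    return ll, lh, hl, hh

ll, lh, hl, hh = haar_bands(np.ones((4, 4)))  # a flat image has no detail
```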

ALOFT: A Lightweight MLP-Like Architecture With Dynamic Low-Frequency Transform for Domain Generalization
Guo, Jintao and Wang, Na and Qi, Lei and Shi, Yinghuan



Research question: How to train a model that generalizes to unseen target domains using data from multiple source domains.
Motivation: Most existing domain generalization (DG) methods are based on convolutional neural networks (CNNs), but the local operation of the convolution kernel makes the model focus too much on local representations (e.g., texture), which makes it more prone to overfit the source domains and hampers its generalization ability.
Method: Inspired by recent lightweight multi-layer perceptron (MLP) methods, we first analyze the difference between CNN and MLP methods in DG and find that MLP methods generalize better because they better capture global representations (e.g., structure). Based on a recent lightweight MLP method, we then obtain a strong baseline that outperforms most state-of-the-art CNN methods; the baseline uses a filter to suppress structure-irrelevant information in the frequency space. We further propose a dynamic low-frequency spectrum transform (ALOFT) that perturbs local texture features while preserving global structure features, enabling the filter to remove structure-irrelevant information sufficiently.
Results: Extensive experiments on four benchmarks show that our method achieves significant performance gains over state-of-the-art CNN-based DG methods with a small number of parameters. Our code is available at https://github.com/lingeringlight/ALOFT/.

Domain generalization (DG) aims to learn a model that generalizes well to unseen target domains utilizing multiple source domains without re-training. Most existing DG works are based on convolutional neural networks (CNNs). However, the local operation of the convolution kernel makes the model focus too much on local representations (e.g., texture), which inherently causes the model more prone to overfit to the source domains and hampers its generalization ability. Recently, several MLP-based methods have achieved promising results in supervised learning tasks by learning global interactions among different patches of the image. Inspired by this, in this paper, we first analyze the difference between CNN and MLP methods in DG and find that MLP methods exhibit a better generalization ability because they can better capture the global representations (e.g., structure) than CNN methods. Then, based on a recent lightweight MLP method, we obtain a strong baseline that outperforms most state-of-the-art CNN-based methods. The baseline can learn global structure representations with a filter to suppress structure-irrelevant information in the frequency space. Moreover, we propose a dynAmic LOw-Frequency spectrum Transform (ALOFT) that can perturb local texture features while preserving global structure features, thus enabling the filter to remove structure-irrelevant information sufficiently. Extensive experiments on four benchmarks have demonstrated that our method can achieve great performance improvement with a small number of parameters compared to SOTA CNN-based DG methods. Our code is available at https://github.com/lingeringlight/ALOFT/.
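The low-frequency spectrum transform can be sketched as perturbing only the low-frequency amplitude of an image while keeping its phase (parameter names, the circular low-pass region, and the noise model are assumptions for illustration, not the paper's exact transform):

```python
import numpy as np

def low_freq_amplitude_perturb(img, radius=3, sigma=0.5, rng=None):
    """Perturb the low-frequency amplitude spectrum (texture statistics)
    while keeping the phase (structure), so augmented samples vary in style
    but preserve global structure."""
    if rng is None:
        rng = np.random.default_rng()
    f = np.fft.fftshift(np.fft.fft2(img))
    amp, phase = np.abs(f), np.angle(f)
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    low = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    noise = 1.0 + sigma * rng.standard_normal((h, w))
    amp = np.where(low, amp * noise, amp)
    return np.real(np.fft.ifft2(np.fft.ifftshift(amp * np.exp(1j * phase))))

rng = np.random.default_rng(5)
img = rng.uniform(size=(16, 16))
same = low_freq_amplitude_perturb(img, sigma=0.0)  # zero strength: identity
```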

NLOST: Non-Line-of-Sight Imaging With Transformer
Li, Yue and Peng, Jiayong and Ye, Juntian and Zhang, Yueyi and Xu, Feihu and Xiong, Zhiwei



Research question: How to reconstruct complex 3D scenes from non-line-of-sight (NLOS) measurements.
Motivation: Existing methods struggle with NLOS reconstruction of complicated scenes, so performance needs to be improved.
Method: We present NLOST, a transformer-based network that extracts shallow features with the assistance of physics-based priors, designs two spatial-temporal self-attention encoders and a spatial-temporal cross-attention decoder to explore local and global correlations, and finally fuses deep and shallow features to reconstruct the 3D volume of hidden scenes.
Results: Experimental results show that the method outperforms existing solutions on both synthetic data and real-world data captured by different NLOS imaging systems.

Time-resolved non-line-of-sight (NLOS) imaging is based on the multi-bounce indirect reflections from the hidden objects for 3D sensing. Reconstruction from NLOS measurements remains challenging especially for complicated scenes. To boost the performance, we present NLOST, the first transformer-based neural network for NLOS reconstruction. Specifically, after extracting the shallow features with the assistance of physics-based priors, we design two spatial-temporal self attention encoders to explore both local and global correlations within 3D NLOS data by splitting or downsampling the features into different scales, respectively. Then, we design a spatial-temporal cross attention decoder to integrate local and global features in the token space of transformer, resulting in deep features with high representation capabilities. Finally, deep and shallow features are fused to reconstruct the 3D volume of hidden scenes. Extensive experimental results demonstrate the superior performance of the proposed method over existing solutions on both synthetic data and real-world data captured by different NLOS imaging systems.

MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling With Informative-Preserved Reconstruction and Self-Distilled Consistency
Xu, Mingye and Xu, Mutian and He, Tong and Ouyang, Wanli and Wang, Yali and Han, Xiaoguang and Qiao, Yu



Research question: How to apply masked modeling (MM) to large-scale 3D scenes despite data sparsity and scene complexity.
Motivation: The conventional random masking paradigm incurs high ambiguity when recovering masked regions of 3D scenes, so new strategies are needed.
Method: We propose a novel informative-preserved reconstruction that explores local statistics to discover and preserve representative structured points, effectively enhancing the pretext masking task for 3D scene understanding.
Results: Combining informative-preserved reconstruction with consistency self-distillation yields consistent improvements across a range of downstream tasks, demonstrating the superiority of the approach.

Masked Modeling (MM) has demonstrated widespread success in various vision challenges, by reconstructing masked visual patches. Yet, applying MM for large-scale 3D scenes remains an open problem due to the data sparsity and scene complexity. The conventional random masking paradigm used in 2D images often causes a high risk of ambiguity when recovering the masked region of 3D scenes. To this end, we propose a novel informative-preserved reconstruction, which explores local statistics to discover and preserve the representative structured points, effectively enhancing the pretext masking task for 3D scene understanding. Integrated with a progressive reconstruction manner, our method can concentrate on modeling regional geometry and enjoy less ambiguity for masked reconstruction. Besides, such scenes with progressive masking ratios can also serve to self-distill their intrinsic spatial consistency, requiring to learn the consistent representations from unmasked areas. By elegantly combining informative-preserved reconstruction on masked areas and consistency self-distillation from unmasked areas, a unified framework called MM-3DScene is yielded. We conduct comprehensive experiments on a host of downstream tasks. The consistent improvement (e.g., +6.1% mAP@0.5 on object detection and +2.2% mIoU on semantic segmentation) demonstrates the superiority of our approach.

PointClustering: Unsupervised Point Cloud Pre-Training Using Transformation Invariance in Clustering
Long, Fuchen and Yao, Ting and Qiu, Zhaofan and Li, Lusong and Mei, Tao



Research question: How to exploit invariance under different data transformations for unsupervised representation learning.
Motivation: Existing pre-trained models underexploit point cloud data, whose geometric properties and semantics do not change under common transformations.
Method: We present PointClustering, a new unsupervised representation learning scheme that leverages transformation invariance for point cloud pre-training. PointClustering formulates the pretext task as deep clustering and employs transformation invariance as an inductive bias, following the philosophy that common point cloud transformations do not change geometric properties or semantics.
Results: Experiments show that PointClustering performs strongly on six benchmarks across downstream classification and segmentation tasks. More remarkably, it achieves 94.5% accuracy on ModelNet40 with a Transformer backbone.

Feature invariance under different data transformations, i.e., transformation invariance, can be regarded as a type of self-supervision for representation learning. In this paper, we present PointClustering, a new unsupervised representation learning scheme that leverages transformation invariance for point cloud pre-training. PointClustering formulates the pretext task as deep clustering and employs transformation invariance as an inductive bias, following the philosophy that common point cloud transformation will not change the geometric properties and semantics. Technically, PointClustering iteratively optimizes the feature clusters and backbone, and delves into the transformation invariance as learning regularization from two perspectives: point level and instance level. Point-level invariance learning maintains local geometric properties through gathering point features of one instance across transformations, while instance-level invariance learning further measures clusters over the entire dataset to explore semantics of instances. Our PointClustering is architecture-agnostic and readily applicable to MLP-based, CNN-based and Transformer-based backbones. We empirically demonstrate that the models pre-learnt on the ScanNet dataset by PointClustering provide superior performances on six benchmarks, across downstream tasks of classification and segmentation. More remarkably, PointClustering achieves an accuracy of 94.5% on ModelNet40 with Transformer backbone. Source code is available at https://github.com/FuchenUSTC/PointClustering.

CiaoSR: Continuous Implicit Attention-in-Attention Network for Arbitrary-Scale Image Super-Resolution
Cao, Jiezhang and Wang, Qin and Xian, Yongqin and Li, Yawei and Ni, Bingbing and Pi, Zhiming and Zhang, Kai and Zhang, Yulun and Timofte, Radu and Van Gool, Luc



Research question: How to improve image super-resolution (SR) by learning continuous image representations.
Motivation: Existing methods mostly rely on ensembling local features, neglecting the similarity of visual features, and their limited receptive field cannot ensemble important information over a large field.
Method: We propose a continuous implicit attention-in-attention network, CiaoSR, designing an implicit attention network that learns the ensemble weights for nearby local features and embedding a scale-aware attention in it to exploit additional non-local information.
Results: Extensive experiments on benchmark datasets show that CiaoSR significantly outperforms existing single-image SR methods with the same backbone and achieves state-of-the-art performance on the arbitrary-scale SR task. The method is also effective in the real-world SR setting, and, more importantly, CiaoSR can be flexibly integrated into any backbone to improve SR performance.

Learning continuous image representations is recently gaining popularity for image super-resolution (SR) because of its ability to reconstruct high-resolution images with arbitrary scales from low-resolution inputs. Existing methods mostly ensemble nearby features to predict the new pixel at any queried coordinate in the SR image. Such a local ensemble suffers from some limitations: i) it has no learnable parameters and it neglects the similarity of the visual features; ii) it has a limited receptive field and cannot ensemble relevant features in a large field which are important in an image. To address these issues, this paper proposes a continuous implicit attention-in-attention network, called CiaoSR. We explicitly design an implicit attention network to learn the ensemble weights for the nearby local features. Furthermore, we embed a scale-aware attention in this implicit attention network to exploit additional non-local information. Extensive experiments on benchmark datasets demonstrate CiaoSR significantly outperforms the existing single image SR methods with the same backbone. In addition, CiaoSR also achieves the state-of-the-art performance on the arbitrary-scale SR task. The effectiveness of the method is also demonstrated on the real-world SR setting. More importantly, CiaoSR can be flexibly integrated into any backbone to improve the SR performance.
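A similarity-aware local ensemble differs from fixed bilinear weighting in that the weights come from feature similarity; a parameter-free scaled-dot-product variant sketches the idea (CiaoSR's actual attention is learned, so all names here are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_ensemble(query_feat, neighbor_feats, neighbor_vals, tau=1.0):
    """Weight each nearby feature's value by its scaled-dot-product
    similarity to a query feature, instead of using fixed bilinear weights
    that ignore feature content."""
    scores = neighbor_feats @ query_feat / (tau * np.sqrt(len(query_feat)))
    weights = softmax(scores)
    return weights @ neighbor_vals

query = np.array([1.0, 0.0])
feats = np.array([[10.0, 0.0], [-10.0, 0.0]])  # one similar, one dissimilar neighbor
vals = np.array([1.0, 0.0])
out = attention_ensemble(query, feats, vals)   # dominated by the similar neighbor
```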

Directional Connectivity-Based Segmentation of Medical Images
Yang, Ziyun and Farsiu, Sina



Research question: How to achieve anatomically consistent biomarker segmentation with deep networks.
Motivation: Anatomical consistency is crucial for many medical image analysis tasks, yet existing connectivity-modeling approaches ignore the rich channel-wise directional information in the latent space.
Method: Proposes a directional connectivity modeling scheme that decouples, tracks, and utilizes directional information across the network to enhance the feature representation.
Results: Experiments show that the method outperforms state-of-the-art approaches on various public medical image segmentation benchmarks.

Anatomical consistency in biomarker segmentation is crucial for many medical image analysis tasks. A promising paradigm for achieving anatomically consistent segmentation via deep networks is incorporating pixel connectivity, a basic concept in digital topology, to model inter-pixel relationships. However, previous works on connectivity modeling have ignored the rich channel-wise directional information in the latent space. In this work, we demonstrate that effective disentanglement of directional sub-space from the shared latent space can significantly enhance the feature representation in the connectivity-based network. To this end, we propose a directional connectivity modeling scheme for segmentation that decouples, tracks, and utilizes the directional information across the network. Experiments on various public medical image segmentation benchmarks show the effectiveness of our model as compared to the state-of-the-art methods. Code is available at https://github.com/Zyun-Y/DconnNet.

Implicit Identity Leakage: The Stumbling Block to Improving Deepfake Detection Generalization
Dong, Shichao and Wang, Jin and Ji, Renhe and Liang, Jiajun and Fan, Haoqiang and Ge, Zheng



Research question: This paper analyzes the generalization ability of binary classifiers for deepfake detection.
Motivation: The generalization of deepfake detection is found to be hindered by identity representations unintentionally learned from images, a phenomenon termed Implicit Identity Leakage.
Method: Proposes an ID-unaware Deepfake Detection Model to reduce the influence of this phenomenon.
Results: Experimental results show that the method outperforms the state of the art in both in-dataset and cross-dataset evaluation.

In this paper, we analyse the generalization ability of binary classifiers for the task of deepfake detection. We find that the stumbling block to their generalization is caused by the unexpected learned identity representation on images. Termed as the Implicit Identity Leakage, this phenomenon has been qualitatively and quantitatively verified among various DNNs. Furthermore, based on such understanding, we propose a simple yet effective method named the ID-unaware Deepfake Detection Model to reduce the influence of this phenomenon. Extensive experimental results demonstrate that our method outperforms the state-of-the-art in both in-dataset and cross-dataset evaluation. The code is available at https://github.com/megvii-research/CADDM.

DNF: Decouple and Feedback Network for Seeing in the Dark
Jin, Xin and Han, Ling-Hao and Li, Zhen and Guo, Chun-Le and Chai, Zhi and Li, Chongyi



Research question: How to fully exploit the properties of RAW data for low-light image enhancement and overcome the limitations of existing single-stage and multi-stage architectures.
Motivation: Although RAW data holds great potential for low-light image enhancement, existing architectures bottleneck its performance.
Method: Proposes a Decouple aNd Feedback (DNF) framework that decouples domain-specific subtasks, fully exploits the unique properties of the RAW and sRGB domains, and propagates features across stages via a feedback mechanism to avoid the information loss caused by image-level dataflow.
Results: The method satisfactorily resolves the inherent limitations of RAW-based low-light image enhancement, achieving PSNR improvements of 0.97 dB and 1.30 dB on the Sony and Fuji subsets of SID, outperforming the previous state of the art by a large margin with only 19% of the parameters.

The exclusive properties of RAW data have shown great potential for low-light image enhancement. Nevertheless, the performance is bottlenecked by the inherent limitations of existing architectures in both single-stage and multi-stage methods. Mixed mapping across two different domains, noise-to-clean and RAW-to-sRGB, misleads the single-stage methods due to the domain ambiguity. The multi-stage methods propagate the information merely through the resulting image of each stage, neglecting the abundant features in the lossy image-level dataflow. In this paper, we probe a generalized solution to these bottlenecks and propose a Decouple aNd Feedback framework, abbreviated as DNF. To mitigate the domain ambiguity, domainspecific subtasks are decoupled, along with fully utilizing the unique properties in RAW and sRGB domains. The feature propagation across stages with a feedback mechanism avoids the information loss caused by image-level dataflow. The two key insights of our method resolve the inherent limitations of RAW data-based low-light image enhancement satisfactorily, empowering our method to outperform the previous state-of-the-art method by a large margin with only 19% parameters, achieving 0.97dB and 1.30dB PSNR improvements on the Sony and Fuji subsets of SID.

Deformable Mesh Transformer for 3D Human Mesh Recovery
Yoshiyasu, Yusuke



Research question: This paper presents Deformable mesh transFormer (DeFormer), a novel vertex-based approach to monocular 3D human mesh recovery.
Motivation: Previous techniques incur high computational cost when handling high-resolution image feature maps and dense mesh models.
Method: Iteratively fits a body mesh model to the input image via a mesh alignment feedback loop formed within a transformer decoder equipped with efficient body-mesh-driven attention modules: 1) body sparse self-attention and 2) deformable mesh cross attention.
Results: Experimental results show that DeFormer achieves state-of-the-art performance on the Human3.6M and 3DPW benchmarks. Ablation studies further show that the DeFormer design effectively leverages multi-scale feature maps. Code is available at https://github.com/yusukey03012/DeFormer.

We present Deformable mesh transFormer (DeFormer), a novel vertex-based approach to monocular 3D human mesh recovery. DeFormer iteratively fits a body mesh model to an input image via a mesh alignment feedback loop formed within a transformer decoder that is equipped with efficient body mesh driven attention modules: 1) body sparse self-attention and 2) deformable mesh cross attention. As a result, DeFormer can effectively exploit high-resolution image feature maps and a dense mesh model which were computationally expensive to deal with in previous approaches using the standard transformer attention. Experimental results show that DeFormer achieves state-of-the-art performances on the Human3.6M and 3DPW benchmarks. Ablation study is also conducted to show the effectiveness of the DeFormer model designs for leveraging multi-scale feature maps. Code is available at https://github.com/yusukey03012/DeFormer.

HS-Pose: Hybrid Scope Feature Extraction for Category-Level Object Pose Estimation
Zheng, Linfang and Wang, Chen and Sun, Yinghan and Dasgupta, Esha and Chen, Hua and Leonardis, Aleš



Research question: This paper addresses category-level object pose estimation, which is challenging due to large intra-category shape variation.
Motivation: 3D graph convolution (3D-GC) is widely used to extract local geometric features, but it struggles with complex object shapes and is sensitive to noise; moreover, the scale and translation invariance of 3D-GC limits the perception of object size and translation information.
Method: Proposes a simple network structure, the HS-layer, which extends 3D-GC to extract hybrid-scope latent features from point cloud data for category-level object pose estimation. The proposed HS-layer: 1) perceives local-global geometric structure and global information; 2) is robust to noise; 3) can encode size and translation information.
Results: Experiments show that simply replacing the 3D-GC layer with the proposed HS-layer in the baseline (GPV-Pose) yields a significant improvement: +14.5% on the 5d2cm metric and +10.3% on IoU75. The method outperforms the state of the art by a large margin (8.3% on 5d2cm, 6.9% on IoU75) on the REAL275 dataset and runs in real time (50 FPS).

In this paper, we focus on the problem of category-level object pose estimation, which is challenging due to the large intra-category shape variation. 3D graph convolution (3D-GC) based methods have been widely used to extract local geometric features, but they have limitations for complex shaped objects and are sensitive to noise. Moreover, the scale and translation invariant properties of 3D-GC restrict the perception of an object's size and translation information. In this paper, we propose a simple network structure, the HS-layer, which extends 3D-GC to extract hybrid scope latent features from point cloud data for category-level object pose estimation tasks. The proposed HS-layer: 1) is able to perceive local-global geometric structure and global information, 2) is robust to noise, and 3) can encode size and translation information. Our experiments show that the simple replacement of the 3D-GC layer with the proposed HS-layer on the baseline method (GPV-Pose) achieves a significant improvement, with the performance increased by 14.5% on 5d2cm metric and 10.3% on IoU75. Our method outperforms the state-of-the-art methods by a large margin (8.3% on 5d2cm, 6.9% on IoU75) on REAL275 dataset and runs in real-time (50 FPS).

Parts2Words: Learning Joint Embedding of Point Clouds and Texts by Bidirectional Matching Between Parts and Words
Tang, Chuan and Yang, Xi and Wu, Bojian and Han, Zhizhong and Chang, Yi



Research question: This paper addresses the shape-text matching problem, i.e., how to better understand 3D shapes.
Motivation: Existing methods mainly represent a 3D shape as multiple 2D rendered views, which works poorly due to the structural ambiguity caused by self-occlusion in the limited number of views.
Method: Proposes to represent 3D shapes directly as point clouds and to learn a joint embedding of point clouds and texts through bidirectional matching between shape parts and words in an optimized feature space. Specifically, the point cloud is first segmented into parts, and optimal transport is then used to match parts and words, where each part is represented by aggregating the features of all points within it and each word is abstracted from its contextual information.
Results: Experiments show that the method significantly outperforms the state of the art on multi-modal retrieval tasks on the Text2Shape dataset.

Shape-Text matching is an important task of high-level shape understanding. Current methods mainly represent a 3D shape as multiple 2D rendered views, which obviously can not be understood well due to the structural ambiguity caused by self-occlusion in the limited number of views. To resolve this issue, we directly represent 3D shapes as point clouds, and propose to learn joint embedding of point clouds and texts by bidirectional matching between parts from shapes and words from texts. Specifically, we first segment the point clouds into parts, and then leverage optimal transport method to match parts and words in an optimized feature space, where each part is represented by aggregating features of all points within it and each word is abstracted by its contextual information. We optimize the feature space in order to enlarge the similarities between the paired training samples, while simultaneously maximizing the margin between the unpaired ones. Experiments demonstrate that our method achieves a significant improvement in accuracy over the SOTAs on multi-modal retrieval tasks under the Text2Shape dataset. Codes are available at https://github.com/JLUtangchuan/Parts2Words.
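The part-word matching via optimal transport can be sketched with a minimal entropic-regularized (Sinkhorn) solver. This is a generic illustration under uniform marginals, not the paper's exact formulation; all names are illustrative.

```python
import math

def sinkhorn(cost, eps=0.1, n_iters=100):
    """Entropic-regularized optimal transport between uniform marginals.

    `cost[i][j]` is the matching cost between part i and word j; the
    returned plan puts more mass on low-cost (well-matched) pairs.
    Generic Sinkhorn sketch; the paper's matcher may differ in detail.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel of the cost matrix.
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    # Alternate scaling so marginals approach uniform (1/n, 1/m).
    for _ in range(n_iters):
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

For a cost matrix favouring the diagonal, the transport plan concentrates mass on the matched part-word pairs while its rows and columns still sum to the required marginals.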

How Can Objects Help Action Recognition?
Zhou, Xingyi and Arnab, Anurag and Sun, Chen and Schmid, Cordelia



Research question: How to use knowledge of objects to design better video models, i.e., to process fewer tokens while improving recognition accuracy.
Motivation: Current state-of-the-art video models process all video tokens as a long sequence of spatio-temporal tokens without explicitly modeling objects and their interactions across the video.
Method: Proposes an object-guided token sampling strategy and an object-aware attention module. The former retains a small fraction of the input tokens with minimal impact on accuracy; the latter enriches the feature representation with object information and improves overall accuracy.
Results: The resulting framework outperforms strong baselines while using fewer tokens: it matches the baselines with 30%, 40%, and 60% of the input tokens on SomethingElse, Something-something v2, and Epic-Kitchens, respectively, and improves by 0.6 to 4.2 points on these datasets when processing the same number of tokens as the baselines.

Current state-of-the-art video models process a video clip as a long sequence of spatio-temporal tokens. However, they do not explicitly model objects, their interactions across the video, and instead process all the tokens in the video. In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy. This is in contrast to prior works which either drop tokens at the cost of accuracy, or increase accuracy whilst also increasing the computation required. First, we propose an object-guided token sampling strategy that enables us to retain a small fraction of the input tokens with minimal impact on accuracy. And second, we propose an object-aware attention module that enriches our feature representation with object information and improves overall accuracy. Our resulting framework achieves better performance when using fewer tokens than strong baselines. In particular, we match our baseline with 30%, 40%, and 60% of the input tokens on SomethingElse, Something-something v2, and Epic-Kitchens, respectively. When we use our model to process the same number of tokens as our baseline, we improve by 0.6 to 4.2 points on these datasets.
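A toy version of object-guided token sampling (hypothetical code; the paper operates on spatio-temporal video tokens and detected object boxes): score each token by how many object regions it overlaps and keep only the top-scoring tokens.

```python
def overlaps(a, b):
    """True if 1-D intervals a=(lo, hi) and b=(lo, hi) intersect."""
    return max(a[0], b[0]) < min(a[1], b[1])

def object_guided_sampling(token_regions, object_regions, budget):
    """Keep the `budget` tokens that overlap the most object regions.

    1-D toy stand-in for the 2D/3D token and object boxes used in the
    paper; illustrative names, not the authors' implementation.
    """
    def score(i):
        return sum(overlaps(token_regions[i], o) for o in object_regions)
    # Rank tokens by object overlap, highest first (stable sort).
    ranked = sorted(range(len(token_regions)), key=score, reverse=True)
    # Return the kept token indices in their original order.
    return sorted(ranked[:budget])
```

Tokens far from any detected object score zero and are dropped first, which is how the sampler can discard most tokens with little accuracy loss.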

Efficient Hierarchical Entropy Model for Learned Point Cloud Compression
Song, Rui and Fu, Chunyang and Liu, Shan and Li, Ge



Research question: How to effectively remove redundant information in point cloud compression.
Motivation: Point cloud compression contains substantial redundancy, and learning an accurate entropy model is key to removing it. Existing octree-based auto-regressive entropy models are effective but computationally expensive and impractical for real applications.
Method: Proposes a hierarchical attention structure and a grouped context structure to improve the efficiency of the attention model and to resolve the serial decoding issue caused by auto-regression.
Results: Experiments show that the proposed entropy model achieves superior rate-distortion performance and significantly lower decoding latency than the state-of-the-art large-scale auto-regressive entropy model.

Learning an accurate entropy model is a fundamental way to remove the redundancy in point cloud compression. Recently, the octree-based auto-regressive entropy model which adopts the self-attention mechanism to explore dependencies in a large-scale context is proved to be promising. However, heavy global attention computations and auto-regressive contexts are inefficient for practical applications. To improve the efficiency of the attention model, we propose a hierarchical attention structure that has a linear complexity to the context scale and maintains the global receptive field. Furthermore, we present a grouped context structure to address the serial decoding issue caused by the auto-regression while preserving the compression performance. Experiments demonstrate that the proposed entropy model achieves superior rate-distortion performance and significant decoding latency reduction compared with the state-of-the-art large-scale auto-regressive entropy model.

DKM: Dense Kernelized Feature Matching for Geometry Estimation
Edstedt, Johan and Athanasiadis, Ioannis and Wadenbäck, Mårten



Research question: This paper addresses feature matching in computer vision, i.e., finding correspondences between two images of a 3D scene.
Motivation: Although sparse methods have generally outperformed dense ones for two-view geometry estimation, the authors propose a new dense method that surpasses both sparse and semi-sparse methods on all geometry estimation tasks.
Method: First, a kernel regression global matcher is proposed; second, warp refinement is performed through stacked feature maps and depthwise convolution kernels; third, dense confidence is learned via consistent depth and a balanced sampling approach for dense confidence maps.
Results: Experiments confirm that the proposed dense method, Dense Kernelized Feature Matching, sets a new state of the art on multiple geometry estimation benchmarks; in particular, it improves MegaDepth-1500 by +4.9 and +8.9 AUC@5 over the best previous sparse and dense methods, respectively.

Feature matching is a challenging computer vision task that involves finding correspondences between two images of a 3D scene. In this paper we consider the dense approach instead of the more common sparse paradigm, thus striving to find all correspondences. Perhaps counter-intuitively, dense methods have previously shown inferior performance to their sparse and semi-sparse counterparts for estimation of two-view geometry. This changes with our novel dense method, which outperforms both dense and sparse methods on geometry estimation. The novelty is threefold: First, we propose a kernel regression global matcher. Secondly, we propose warp refinement through stacked feature maps and depthwise convolution kernels. Thirdly, we propose learning dense confidence through consistent depth and a balanced sampling approach for dense confidence maps. Through extensive experiments we confirm that our proposed dense method, Dense Kernelized Feature Matching, sets a new state-of-the-art on multiple geometry estimation benchmarks. In particular, we achieve an improvement on MegaDepth-1500 of +4.9 and +8.9 AUC@5 compared to the best previous sparse method and dense method respectively. Our code is provided at the following repository: https://github.com/Parskatt/DKM
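The "kernel regression global matcher" builds on classical kernel (Nadaraya-Watson) regression, sketched below in its generic form. The Gaussian kernel and function names are illustrative; DKM embeds a learned variant of this idea, not this exact formula.

```python
import math

def kernel_regression(query, keys, values, bandwidth=1.0):
    """Nadaraya-Watson kernel regression with a Gaussian kernel.

    Predicts a value at `query` as a kernel-weighted average of the
    `values` attached to the `keys`. Classical textbook form only,
    used here to illustrate the idea behind DKM's global matcher.
    """
    weights = [
        math.exp(-sum((q - k) ** 2 for q, k in zip(query, key))
                 / (2.0 * bandwidth ** 2))
        for key in keys
    ]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]
```

Nearby keys dominate the prediction while distant ones contribute almost nothing, giving a smooth, dense estimate at any query location.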

Image Cropping With Spatial-Aware Feature and Rank Consistency
Wang, Chao and Niu, Li and Zhang, Bo and Zhang, Liqing



Research question: How to find visually appealing crops in an image while capturing the spatial relationship between crops and aesthetic elements (e.g., salient objects, semantic edges).
Motivation: Despite great progress, previous methods are weak at capturing the spatial relationship between crops and aesthetic elements; moreover, due to the high annotation cost of labeled data, the potential of unlabeled data remains untapped.
Method: Proposes a spatial-aware feature that encodes the spatial relationship between candidate crops and aesthetic elements by feeding the concatenation of the crop mask and selectively aggregated feature maps into a lightweight encoder. For the second issue, a pairwise ranking classifier is trained on labeled images and its knowledge is transferred to unlabeled images to enforce rank consistency.
Results: Experimental results on benchmark datasets show that the proposed method performs favorably against state-of-the-art methods.

Image cropping aims to find visually appealing crops in an image. Despite the great progress made by previous methods, they are weak in capturing the spatial relationship between crops and aesthetic elements (e.g., salient objects, semantic edges). Besides, due to the high annotation cost of labeled data, the potential of unlabeled data awaits to be excavated. To address the first issue, we propose spatial-aware feature to encode the spatial relationship between candidate crops and aesthetic elements, by feeding the concatenation of crop mask and selectively aggregated feature maps to a light-weighted encoder. To address the second issue, we train a pair-wise ranking classifier on labeled images and transfer such knowledge to unlabeled images to enforce rank consistency. Experimental results on the benchmark datasets show that our proposed method performs favorably against state-of-the-art methods.
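The rank-consistency idea relies on a standard pairwise margin ranking loss, sketched here in a generic form (the paper's classifier and knowledge-transfer scheme are more involved; this is only the building block):

```python
def margin_ranking_loss(score_better, score_worse, margin=1.0):
    """Pairwise margin ranking loss.

    Zero when the preferred crop already out-scores the other by at
    least `margin`; otherwise penalizes the violation linearly. A
    generic building block, not the paper's exact classifier.
    """
    return max(0.0, margin - (score_better - score_worse))
```

Training on such pairs on labeled images, and enforcing the same pairwise orderings on unlabeled images, is what "rank consistency" amounts to.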

SVGformer: Representation Learning for Continuous Vector Graphics Using Transformers
Cao, Defu and Wang, Zhaowen and Echevarria, Jose and Liu, Yan



Research question: How to better understand and generate data, particularly vector graphics data.
Motivation: Existing deep learning methods often require quantizing SVG parameters and cannot directly exploit their geometric properties, leading to unsatisfactory performance on downstream tasks.
Method: Proposes SVGformer, a transformer-based representation learning model that operates directly on continuous input values and manipulates the geometric information of SVG to encode outline details and long-distance dependencies.
Results: Extensive experiments on vector font and icon datasets show that the model captures high-quality representation information and significantly outperforms the previous state of the art on downstream tasks.

Advances in representation learning have led to great success in understanding and generating data in various domains. However, in modeling vector graphics data, the pure data-driven approach often yields unsatisfactory results in downstream tasks as existing deep learning methods often require the quantization of SVG parameters and cannot exploit the geometric properties explicitly. In this paper, we propose a transformer-based representation learning model (SVGformer) that directly operates on continuous input values and manipulates the geometric information of SVG to encode outline details and long-distance dependencies. SVGfomer can be used for various downstream tasks: reconstruction, classification, interpolation, retrieval, etc. We have conducted extensive experiments on vector font and icon datasets to show that our model can capture high-quality representation information and outperform the previous state-of-the-art on downstream tasks significantly.

Pixels, Regions, and Objects: Multiple Enhancement for Salient Object Detection
Wang, Yi and Wang, Ruili and Fan, Xin and Wang, Tianzhu and He, Xiangjian



Research question: How to improve the accuracy and robustness of salient object detection, especially in complex scenes with multiple objects and background clutter.
Motivation: Current salient object detection methods fall short in complex scenes; accuracy and robustness need further improvement.
Method: Proposes MENet, which adopts the boundary sensibility, content integrity, iterative refinement, and frequency decomposition mechanisms of the human visual system (HVS). A multi-level hybrid loss guides the network to learn pixel-level, region-level, and object-level features; a flexible multiscale feature enhancement module (ME-Module) gradually aggregates and refines global or detailed features; and an iterative training strategy enhances boundary features and adaptive features in MENet's dual-branch decoder.
Results: Comprehensive evaluations on six challenging benchmark datasets show that MENet achieves state-of-the-art results.

Salient object detection (SOD) aims to mimic the human visual system (HVS) and cognition mechanisms to identify and segment salient objects. However, due to the complexity of these mechanisms, current methods are not perfect. Accuracy and robustness need to be further improved, particularly in complex scenes with multiple objects and background clutter. To address this issue, we propose a novel approach called Multiple Enhancement Network (MENet) that adopts the boundary sensibility, content integrity, iterative refinement, and frequency decomposition mechanisms of HVS. A multi-level hybrid loss is firstly designed to guide the network to learn pixel-level, region-level, and object-level features. A flexible multiscale feature enhancement module (ME-Module) is then designed to gradually aggregate and refine global or detailed features by changing the size order of the input feature sequence. An iterative training strategy is used to enhance boundary features and adaptive features in the dual-branch decoder of MENet. Comprehensive evaluations on six challenging benchmark datasets show that MENet achieves state-of-the-art results. Both the codes and results are publicly available at https://github.com/yiwangtz/MENet.

ToThePoint: Efficient Contrastive Learning of 3D Point Clouds via Recycling
Li, Xinglin and Chen, Jiajing and Ouyang, Jinhui and Deng, Hanhui and Velipasalar, Senem and Wu, Di



Research question: Point cloud processing has advanced rapidly in recent years, but supervised learning requires large amounts of labeled data, and annotation is labor- and time-intensive.
Motivation: To address this, the paper proposes ToThePoint, a novel contrastive learning method that pre-trains a backbone on unlabeled data to extract latent representations for downstream tasks.
Method: Unlike traditional contrastive learning methods, ToThePoint maximizes not only the agreement between features of the same point cloud under different types of augmentation, but also the agreement between the permutation-invariant features and the features discarded after max pooling.
Results: After self-supervised pre-training on ShapeNet, ToThePoint achieves competitive or better results than state-of-the-art baselines on downstream tasks on ModelNet40, ModelNet40C, ScanobjectNN, and ShapeNet-Part, while training 200 times faster than the baselines.

Recent years have witnessed significant developments in point cloud processing, including classification and segmentation. However, supervised learning approaches need a lot of well-labeled data for training, and annotation is labor- and time-intensive. Self-supervised learning, on the other hand, uses unlabeled data, and pre-trains a backbone with a pretext task to extract latent representations to be used with the downstream tasks. Compared to 2D images, self-supervised learning of 3D point clouds is under-explored. Existing models, for self-supervised learning of 3D point clouds, rely on a large number of data samples, and require significant amount of computational resources and training time. To address this issue, we propose a novel contrastive learning approach, referred to as ToThePoint. Different from traditional contrastive learning methods, which maximize agreement between features obtained from a pair of point clouds formed only with different types of augmentation, ToThePoint also maximizes the agreement between the permutation invariant features and features discarded after max pooling. We first perform self-supervised learning on the ShapeNet dataset, and then evaluate the performance of the network on different downstream tasks. In the downstream task experiments, performed on the ModelNet40, ModelNet40C, ScanobjectNN and ShapeNet-Part datasets, our proposed ToThePoint achieves competitive, if not better results compared to the state-of-the-art baselines, and does so with significantly less training time (200 times faster than baselines).
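The "recycling" of features discarded by max pooling can be illustrated with a small sketch (hypothetical helper, not the authors' code): per channel, max pooling keeps one value across all points, and the runner-up values that would normally be thrown away form a second view for the contrastive objective.

```python
def max_pool_with_recycling(point_features):
    """Per-channel max pooling that also returns the runner-up values.

    `point_features` is a list of per-point feature vectors. The first
    return value is the usual max-pooled global feature; the second is
    the 'discarded' second-best value per channel, which ToThePoint
    recycles as an extra contrastive view. Illustrative sketch only.
    """
    n_channels = len(point_features[0])
    pooled, recycled = [], []
    for c in range(n_channels):
        # Sort this channel's values across points, highest first.
        column = sorted((p[c] for p in point_features), reverse=True)
        pooled.append(column[0])    # the max-pooled value
        recycled.append(column[1])  # the runner-up, normally discarded
    return pooled, recycled
```

Maximizing agreement between the pooled and recycled vectors gives an extra positive pair per cloud at essentially no cost, since the runner-up values are already computed.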

EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
Wu, Yanmin and Cheng, Xinhua and Zhang, Renrui and Cheng, Zesen and Zhang, Jian



Research question: How to locate objects mentioned by natural language descriptions with rich semantic cues within point clouds.
Motivation: Existing methods either extract sentence-level features that couple all words or focus primarily on object names, losing word-level information or neglecting other attributes.
Method: Proposes EDA, which explicitly decouples the textual attributes in a sentence and conducts dense alignment between such fine-grained language and point cloud objects. Specifically, a text decoupling module first produces textual features for each semantic component; two losses then supervise the dense matching between the two modalities: a position alignment loss and a semantic alignment loss. A new visual grounding task, locating objects without object names, is also introduced to thoroughly evaluate the model's dense alignment capacity.
Results: Experiments show that EDA achieves state-of-the-art performance on two widely adopted 3D visual grounding datasets, ScanRefer and SR3D/NR3D, and leads by a clear margin on the newly proposed task.

3D visual grounding aims to find the object within point clouds mentioned by free-form natural language descriptions with rich semantic cues. However, existing methods either extract the sentence-level features coupling all words or focus more on object names, which would lose the word-level information or neglect other attributes. To alleviate these issues, we present EDA that Explicitly Decouples the textual attributes in a sentence and conducts Dense Alignment between such fine-grained language and point cloud objects. Specifically, we first propose a text decoupling module to produce textual features for every semantic component. Then, we design two losses to supervise the dense matching between two modalities: position alignment loss and semantic alignment loss. On top of that, we further introduce a new visual grounding task, locating objects without object names, which can thoroughly evaluate the model's dense alignment capacity. Through experiments, we achieve state-of-the-art performance on two widely-adopted 3D visual grounding datasets, ScanRefer and SR3D/NR3D, and obtain absolute leadership on our newly-proposed task. The source code is available at https://github.com/yanmin-wu/EDA.

A2J-Transformer: Anchor-to-Joint Transformer Network for 3D Interacting Hand Pose Estimation From a Single RGB Image
Jiang, Changlong and Xiao, Yang and Wu, Cunlin and Zhang, Mingyang and Zheng, Jinghong and Cao, Zhiguo and Zhou, Joey Tianyi



Research question: How to estimate 3D interacting hand pose from a single RGB image, addressing severe self-occlusion and inter-occlusion of the hands, confusion from the similar appearance patterns of the two hands, and the ill-posed mapping from 2D to 3D joint positions.
Motivation: To address these issues, the authors extend A2J, a state-of-the-art depth-based 3D single-hand pose estimation method, to the RGB domain under interacting-hand conditions.
Method: A2J is evolved under the Transformer's non-local encoding-decoding framework into A2J-Transformer, with three main advantages: 1) self-attention across local anchor points makes them aware of global spatial context, better capturing joint articulation cues to resist occlusion; 2) each anchor point is treated as a learnable query with adaptive feature learning to improve pattern-fitting capacity; 3) anchor points reside in 3D space rather than 2D as in A2J, to facilitate 3D pose prediction.
Results: Experiments on the challenging InterHand 2.6M dataset show that A2J-Transformer achieves state-of-the-art model-free performance (a 3.38 mm MPJPE improvement in the two-hand case) and generalizes well to the depth domain.

3D interacting hand pose estimation from a single RGB image is a challenging task, due to serious self-occlusion and inter-occlusion towards hands, confusing similar appearance patterns between 2 hands, ill-posed joint position mapping from 2D to 3D, etc. To address these, we propose to extend A2J-the state-of-the-art depth-based 3D single hand pose estimation method-to RGB domain under interacting hand condition. Our key idea is to equip A2J with strong local-global aware ability to well capture interacting hands' local fine details and global articulated clues among joints jointly. To this end, A2J is evolved under Transformer's non-local encoding-decoding framework to build A2J-Transformer. It holds 3 main advantages over A2J. First, self-attention across local anchor points is built to make them global spatial context aware to better capture joints' articulation clues for resisting occlusion. Secondly, each anchor point is regarded as learnable query with adaptive feature learning for facilitating pattern fitting capacity, instead of having the same local representation with the others. Last but not least, anchor point locates in 3D space instead of 2D as in A2J, to leverage 3D pose prediction. Experiments on challenging InterHand 2.6M demonstrate that, A2J-Transformer can achieve state-of-the-art model-free performance (3.38mm MPJPE advancement in 2-hand case) and can also be applied to depth domain with strong generalization.

E2PN: Efficient SE(3)-Equivariant Point Network
Zhu, Minghan and Ghaffari, Maani and Clark, William A. and Peng, Huei



Research question: This paper proposes a convolution structure for learning SE(3)-equivariant features from 3D point clouds.
Motivation: Existing networks struggle to extract rotation-equivariant features from point cloud data efficiently for tasks such as classification.
Method: Combines group convolutions and quotient representations: SO(3) is discretized to a finite group, SO(2) is used as the stabilizer subgroup to form spherical quotient feature fields that save computation, and a permutation layer recovers SO(3) features from spherical features to preserve the capacity to distinguish rotations.
Results: Experiments show that the method achieves comparable or superior performance on various tasks, including object classification, pose estimation, and keypoint matching, while consuming much less memory and running faster than existing work.

This paper proposes a convolution structure for learning SE(3)-equivariant features from 3D point clouds. It can be viewed as an equivariant version of kernel point convolutions (KPConv), a widely used convolution form to process point cloud data. Compared with existing equivariant networks, our design is simple, lightweight, fast, and easy to be integrated with existing task-specific point cloud learning pipelines. We achieve these desirable properties by combining group convolutions and quotient representations. Specifically, we discretize SO(3) to finite groups for their simplicity while using SO(2) as the stabilizer subgroup to form spherical quotient feature fields to save computations. We also propose a permutation layer to recover SO(3) features from spherical features to preserve the capacity to distinguish rotations. Experiments show that our method achieves comparable or superior performance in various tasks, including object classification, pose estimation, and keypoint-matching, while consuming much less memory and running faster than existing work. The proposed method can foster the development of equivariant models for real-world applications based on point clouds.

SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation
Huang, Huimin and Xie, Shiao and Lin, Lanfen and Tong, Ruofeng and Chen, Yen-Wei and Li, Yuexiang and Wang, Hong and Huang, Yawen and Zheng, Yefeng



Research question: This paper aims to combine the strengths of convolutional neural networks (CNNs) and Transformers to address global learning capability and class-level features in semi-supervised learning.
Motivation: Existing semi-supervised methods focus mainly on pixel-wise consistency, neglecting intra-model local-global interaction and inter-model class-level consistency.
Method: Proposes SemiCVT, a new algorithm that absorbs the advantages of CNNs and Transformers: a parallel CNN-Transformer architecture with an intra-model local-global interaction schema in the Fourier domain, plus an inter-model class-wise consistency that complements the class-level statistics of CNNs and Transformers.
Results: Experimental results show that SemiCVT outperforms the best existing methods on two public benchmarks.

Semi-supervised learning improves data efficiency of deep models by leveraging unlabeled samples to alleviate the reliance on a large set of labeled samples. These successes concentrate on the pixel-wise consistency by using convolutional neural networks (CNNs) but fail to address both global learning capability and class-level features for unlabeled data. Recent works raise a new trend that Transformer achieves superior performance on the entire feature map in various tasks. In this paper, we unify the current dominant Mean-Teacher approaches by reconciling intra-model and inter-model properties for semi-supervised segmentation to produce a novel algorithm, SemiCVT, that absorbs the quintessence of CNNs and Transformer in a comprehensive way. Specifically, we first design a parallel CNN-Transformer architecture (CVT) with introducing an intra-model local-global interaction schema (LGI) in Fourier domain for full integration. The inter-model class-wise consistency is further presented to complement the class-level statistics of CNNs and Transformer in a cross-teaching manner. Extensive empirical evidence shows that SemiCVT yields consistent improvements over the state-of-the-art methods in two public benchmarks.

DejaVu: Conditional Regenerative Learning To Enhance Dense Prediction
Borse, Shubhankar and Das, Debasmit and Park, Hyojin and Cai, Hong and Garrepalli, Risheek and Porikli, Fatih



Research question: How to use conditional image regeneration as additional supervision to improve deep networks on dense prediction tasks such as segmentation, depth estimation, and surface normal prediction.
Motivation: Existing methods often ignore the structural information of the image in dense prediction, yielding predictions with unclear boundaries and poor spatial consistency.
Method: Proposes the DejaVu framework: redaction techniques such as sparse sampling or selective frequency removal strip structural information from the input image, and a conditional generator then reconstructs the original image from the redacted image and the dense predictions, encouraging the base network to embed accurate scene structure in its dense predictions.
Results: Experimental results show that DejaVu outperforms existing methods on multiple dense prediction benchmarks at no added computation cost.

We present DejaVu, a novel framework which leverages conditional image regeneration as additional supervision during training to improve deep networks for dense prediction tasks such as segmentation, depth estimation, and surface normal prediction. First, we apply redaction to the input image, which removes certain structural information by sparse sampling or selective frequency removal. Next, we use a conditional regenerator, which takes the redacted image and the dense predictions as inputs, and reconstructs the original image by filling in the missing structural information. In the redacted image, structural attributes like boundaries are broken while semantic context is largely preserved. In order to make the regeneration feasible, the conditional generator will then require the structure information from the other input source, i.e., the dense predictions. As such, by including this conditional regeneration objective during training, DejaVu encourages the base network to learn to embed accurate scene structure in its dense prediction. This leads to more accurate predictions with clearer boundaries and better spatial consistency. When it is feasible to leverage additional computation, DejaVu can be extended to incorporate an attention-based regeneration module within the dense prediction network, which further improves accuracy. Through extensive experiments on multiple dense prediction benchmarks such as Cityscapes, COCO, ADE20K, NYUD-v2, and KITTI, we demonstrate the efficacy of employing DejaVu during training, as it outperforms SOTA methods at no added computation cost.
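Redaction by sparse sampling can be sketched on a 1-D signal (toy code; the paper redacts 2-D images and also has a frequency-domain variant): most values are zeroed, leaving semantic context but breaking the local structure that the regenerator must recover from the dense predictions.

```python
def redact_sparse(signal, keep_every=4):
    """Keep one sample out of every `keep_every`, zero the rest.

    1-D toy version of DejaVu's sparse-sampling redaction; the real
    framework applies this to images. Name and signature are
    illustrative, not the authors' API.
    """
    return [v if i % keep_every == 0 else 0 for i, v in enumerate(signal)]
```

Because the redacted input alone cannot be reconstructed, the conditional generator is forced to pull the missing structure (e.g., boundaries) from the dense prediction, which is what makes the regeneration objective a useful training signal.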

Gated Multi-Resolution Transfer Network for Burst Restoration and Enhancement
Mehta, Nancy and Dudhane, Akshay and Murala, Subrahmanyam and Zamir, Syed Waqas and Khan, Salman and Khan, Fahad Shahbaz



Research question: Burst image processing has become increasingly popular in recent years, but it is a challenging task because individual burst images undergo multiple degradations and often have mutual misalignments, causing ghosting and zipper artifacts.
Motivation: Existing burst restoration methods usually do not consider the mutual correlation and non-local contextual information among burst frames, which limits them in challenging cases. Another key challenge is the robust up-sampling of burst frames: existing up-sampling methods cannot simultaneously exploit the advantages of single-stage and progressive up-sampling strategies with conventional and/or recent up-samplers.
Method: To address these challenges, proposes a novel Gated Multi-Resolution Transfer Network (GMTNet) to reconstruct a spatially precise high-quality image from a burst of low-quality RAW images. GMTNet consists of three modules optimized for burst processing: Multi-scale Burst Feature Alignment (MBFA) for feature denoising and alignment, Transposed-Attention Feature Merging (TAFM) for multi-frame feature aggregation, and a Resolution Transfer Feature Up-sampler (RTFU) to up-scale merged features and construct the high-quality output image.
Results: Detailed experimental analysis on five datasets validates the approach and sets a new state of the art for burst super-resolution, burst denoising, and low-light burst enhancement. Code and models are available at https://github.com/nanmehta/GMTNet.

Burst image processing is becoming increasingly popular in recent years. However, it is a challenging task since individual burst images undergo multiple degradations and often have mutual misalignments resulting in ghosting and zipper artifacts. Existing burst restoration methods usually do not consider the mutual correlation and non-local contextual information among burst frames, which tends to limit these approaches in challenging cases. Another key challenge lies in the robust up-sampling of burst frames. The existing up-sampling methods cannot effectively utilize the advantages of single-stage and progressive up-sampling strategies with conventional and/or recent up-samplers at the same time. To address these challenges, we propose a novel Gated Multi-Resolution Transfer Network (GMTNet) to reconstruct a spatially precise high-quality image from a burst of low-quality raw images. GMTNet consists of three modules optimized for burst processing tasks: Multi-scale Burst Feature Alignment (MBFA) for feature denoising and alignment, Transposed-Attention Feature Merging (TAFM) for multi-frame feature aggregation, and Resolution Transfer Feature Up-sampler (RTFU) to up-scale merged features and construct a high-quality output image. Detailed experimental analysis on five datasets validate our approach and sets a new state-of-the-art for burst super-resolution, burst denoising, and low-light burst enhancement. Our codes and models are available at https://github.com/nanmehta/GMTNet.

PointDistiller: Structured Knowledge Distillation Towards Efficient and Compact 3D Detection
Zhang, Linfeng and Dong, Runpei and Tai, Hung-Shuo and Ma, Kaisheng



Research question: How to learn point cloud representations effectively and achieve accurate yet efficient 3D object detection.
Motivation: While point cloud representation learning has achieved remarkable breakthroughs in applications such as self-driving and virtual reality, these applications urgently require 3D object detection that is both accurate and efficient.
Method: Proposes PointDistiller, a structured knowledge distillation framework that extracts and distills the local geometric structure of point clouds with dynamic graph convolution and a reweighted learning strategy to improve distillation efficiency.
Results: Experiments show that the method outperforms seven previous knowledge distillation methods on both voxel-based and raw-point-based detectors; for example, a 4x-compressed PointPillars student gains 2.8 and 3.4 mAP on BEV and 3D object detection, outperforming its teacher by 0.9 and 1.8 mAP, respectively.

The remarkable breakthroughs in point cloud representation learning have boosted their usage in real-world applications such as self-driving cars and virtual reality. However, these applications usually have an urgent requirement for not only accurate but also efficient 3D object detection. Recently, knowledge distillation has been proposed as an effective model compression technique, which transfers the knowledge from an over-parameterized teacher to a lightweight student and achieves consistent effectiveness in 2D vision. However, due to point clouds' sparsity and irregularity, directly applying previous image-based knowledge distillation methods to point cloud detectors usually leads to unsatisfactory performance. To fill the gap, this paper proposes PointDistiller, a structured knowledge distillation framework for point clouds-based 3D detection. Concretely, PointDistiller includes local distillation which extracts and distills the local geometric structure of point clouds with dynamic graph convolution and reweighted learning strategy, which highlights student learning on the critical points or voxels to improve knowledge distillation efficiency. Extensive experiments on both voxels-based and raw points-based detectors have demonstrated the effectiveness of our method over seven previous knowledge distillation methods. For instance, our 4X compressed PointPillars student achieves 2.8 and 3.4 mAP improvements on BEV and 3D object detection, outperforming its teacher by 0.9 and 1.8 mAP, respectively. Codes are available in the supplementary material and will be released on Github.
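The reweighted learning strategy amounts to weighting the per-point (or per-voxel) distillation error by an importance score, sketched generically below. Names are illustrative; the paper derives the weights from its dynamic graph convolution, whereas here they are simply given.

```python
def reweighted_distillation_loss(student_feats, teacher_feats, weights):
    """Importance-weighted mean squared error between student and
    teacher features.

    Points or voxels with higher `weights` dominate the loss, focusing
    the student on critical regions. Generic sketch of the reweighting
    idea, not the authors' implementation.
    """
    assert len(student_feats) == len(teacher_feats) == len(weights)
    weighted = sum(w * (s - t) ** 2
                   for s, t, w in zip(student_feats, teacher_feats, weights))
    return weighted / sum(weights)
```

Setting a weight to zero removes that element from the distillation signal entirely, which is how the strategy highlights student learning on the critical points or voxels.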

TopDiG: Class-Agnostic Topological Directional Graph Extraction From Remote Sensing Images
Yang, Bingnan and Zhang, Mi and Zhang, Zhan and Zhang, Zhili and Hu, Xiangyun



Research question: Automatic vector extraction from remote sensing images has developed rapidly in recent years, but most existing works focus on a specific target, are sensitive to category variation, and can hardly achieve stable performance across categories.
Motivation: To address these issues, we propose TopDiG, an innovative class-agnostic model that directly extracts topological directional graphs from remote sensing images.
Method: First, TopDiG employs a topology-concentrated node detector (TCND) to detect nodes and obtain a compact perception of topological components. Second, we propose a dynamic graph supervision (DGS) strategy that dynamically generates adjacency graph labels from unordered nodes. Finally, a directional graph (DiG) generator module is designed to construct topological directional graphs from the predicted nodes.
Results: Experiments on the Inria, CrowdAI, GID, GF2, and Massachusetts datasets demonstrate that TopDiG is class-agnostic and achieves competitive performance on all datasets.

Rapid development in automatic vector extraction from remote sensing images has been witnessed in recent years. However, the vast majority of existing works concentrate on a specific target, are fragile to category variation, and hardly achieve stable performance across different categories. In this work, we propose an innovative class-agnostic model, namely TopDiG, to directly extract topological directional graphs from remote sensing images and solve these issues. Firstly, TopDiG employs a topology-concentrated node detector (TCND) to detect nodes and obtain compact perception of topological components. Secondly, we propose a dynamic graph supervision (DGS) strategy to dynamically generate adjacency graph labels from unordered nodes. Finally, the directional graph (DiG) generator module is designed to construct topological directional graphs from predicted nodes. Experiments on the Inria, CrowdAI, GID, GF2 and Massachusetts datasets empirically demonstrate that TopDiG is class-agnostic and achieves competitive performance on all datasets.

LinK: Linear Kernel for LiDAR-Based 3D Perception
Lu, Tao and Ding, Xiang and Liu, Haisong and Wu, Gangshan and Wang, Limin



Research question: How to extend the success of 2D large kernels to 3D perception, addressing the cubically increasing overhead of processing 3D data and the optimization difficulties caused by data scarcity and sparsity.
Motivation: Previous work scaled the kernel size from 3x3x3 to 7x7x7 by introducing block-shared weights, but to reduce feature variation within a block it only uses a modest block size and cannot reach large kernels such as 21x21x21.
Method: We propose LinK, a new method that achieves a wider receptive field in a convolution-like manner with two core designs: replacing the static kernel matrix with a linear kernel generator that adaptively provides weights only for non-empty voxels, and reusing pre-computed aggregation results in overlapped blocks to reduce computational complexity.
Results: The method enables each voxel to perceive context within a 21x21x21 range. Extensive experiments on two basic perception tasks, 3D object detection and 3D semantic segmentation, demonstrate its effectiveness. Notably, a CenterPoint-based detector equipped with a LinK-based backbone ranks 1st on the nuScenes 3D detection benchmark, and the method also improves a strong segmentation baseline's mIoU by 2.7% on the SemanticKITTI test set. Code is available at https://github.com/MCG-NJU/LinK.

Extending the success of 2D Large Kernel to 3D perception is challenging due to: 1. the cubically-increasing overhead in processing 3D data; 2. the optimization difficulties from data scarcity and sparsity. Previous work has taken the first step to scale up the kernel size from 3x3x3 to 7x7x7 by introducing block-shared weights. However, to reduce the feature variations within a block, it only employs a modest block size and fails to achieve larger kernels such as 21x21x21. To address this issue, we propose a new method, called LinK, to achieve a wider-range perception receptive field in a convolution-like manner with two core designs. The first is to replace the static kernel matrix with a linear kernel generator, which adaptively provides weights only for non-empty voxels. The second is to reuse the pre-computed aggregation results in the overlapped blocks to reduce computation complexity. The proposed method successfully enables each voxel to perceive context within a range of 21x21x21. Extensive experiments on two basic perception tasks, 3D object detection and 3D semantic segmentation, demonstrate the effectiveness of our method. Notably, we rank 1st on the public leaderboard of the 3D detection benchmark of nuScenes (LiDAR track), by simply incorporating a LinK-based backbone into the basic detector, CenterPoint. We also boost a strong segmentation baseline's mIoU by 2.7% on the SemanticKITTI test set. Code is available at https://github.com/MCG-NJU/LinK.
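Why a linear kernel permits reusing pre-computed aggregates can be seen in a tiny numeric sketch (an illustration of the algebra, not the paper's code; single-channel scalar features are an assumption). If the kernel weight for offset u - q is affine, a·(u - q) + b, then the response at any query q can be recovered from two block-level aggregates, S0 = Σ f(u) and S1 = Σ f(u)·u, which are computed once and shared by every query that overlaps the block:

```python
def direct_response(points, feats, query, a, b):
    # Naive per-query sum: f(u) * (a · (u - q) + b) over all neighbors u.
    return sum(f * (sum(ai * (ui - qi) for ai, ui, qi in zip(a, u, query)) + b)
               for u, f in zip(points, feats))

def block_aggregates(points, feats):
    # Computed once per block: S0 = sum_u f(u), S1 = sum_u f(u) * u.
    dim = len(points[0])
    s0 = sum(feats)
    s1 = [sum(f * u[d] for u, f in zip(points, feats)) for d in range(dim)]
    return s0, s1

def response_from_aggregates(s0, s1, query, a, b):
    # Same response, reusing the shared aggregates instead of re-scanning points:
    # a · S1  -  S0 * (a · q)  +  b * S0
    return (sum(ai * s1i for ai, s1i in zip(a, s1))
            - s0 * sum(ai * qi for ai, qi in zip(a, query))
            + b * s0)
```

The two computations agree exactly; the aggregate form turns a per-voxel neighborhood scan into a constant-time lookup, which is what makes 21x21x21-scale receptive fields affordable.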

Modeling Entities As Semantic Points for Visual Information Extraction in the Wild
Yang, Zhibo and Long, Rujiao and Wang, Pengfei and Song, Sibo and Zhong, Humen and Cheng, Wenqing and Bai, Xiang and Yao, Cong



Research question: Visual information extraction (VIE) is increasingly important in both academia and industry, but the benchmarks used to evaluate these methods are relatively plain and do not fully represent real-world complexity.
Motivation: We curate a new, more challenging dataset and explore a method that precisely and robustly extracts key information from document images under difficult conditions.
Method: Unlike previous methods, which either incorporate visual information into a multi-modal architecture or train text spotting and information extraction end-to-end, we explicitly model entities as semantic points, i.e., entity center points enriched with semantic information describing the attributes and relationships of different entities.
Results: Experiments show that, compared with previous state-of-the-art models, the method achieves significantly better performance on entity labeling and linking.

Recently, Visual Information Extraction (VIE) has been becoming increasingly important in both academia and industry, due to the wide range of real-world applications. Previously, numerous works have been proposed to tackle this problem. However, the benchmarks used to assess these methods are relatively plain, i.e., scenarios with real-world complexity are not fully represented in these benchmarks. As the first contribution of this work, we curate and release a new dataset for VIE, in which the document images are much more challenging in that they are taken from real applications, and difficulties such as blur, partial occlusion, and printing shift are quite common. All these factors may lead to failures in information extraction. Therefore, as the second contribution, we explore an alternative approach to precisely and robustly extract key information from document images under such tough conditions. Specifically, in contrast to previous methods, which usually either incorporate visual information into a multi-modal architecture or train text spotting and information extraction in an end-to-end fashion, we explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities, which could largely benefit entity labeling and linking. Extensive experiments on standard benchmarks in this field as well as the proposed dataset demonstrate that the proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.

Learned Image Compression With Mixed Transformer-CNN Architectures
Liu, Jinming and Sun, Heming and Katto, Jiro



Research question: How to effectively fuse the strengths of convolutional neural networks (CNNs) and Transformers for image compression while achieving high performance at a suitable complexity.
Motivation: Existing learned image compression (LIC) methods are mainly CNN-based or Transformer-based, each with its own advantages; how to exploit both is worth exploring.
Method: This paper proposes an efficient parallel Transformer-CNN Mixture (TCM) block that combines the local modeling ability of CNNs with the non-local modeling ability of Transformers to improve the overall architecture of image compression models. In addition, inspired by recent progress in entropy estimation models and attention modules, a channel-wise entropy model with parameter-efficient swin-transformer-based attention (SWAtten) modules is proposed using channel squeezing.
Results: Experimental results show that, compared with existing LIC methods, the proposed method achieves state-of-the-art rate-distortion performance on three datasets of different resolutions (Kodak, Tecnick, and CLIC Professional Validation).

Learned image compression (LIC) methods have exhibited promising progress and superior rate-distortion performance compared with classical image compression standards. Most existing LIC methods are Convolutional Neural Networks-based (CNN-based) or Transformer-based, which have different advantages. Exploiting both advantages is a point worth exploring, which has two challenges: 1) how to effectively fuse the two methods? 2) how to achieve higher performance with a suitable complexity? In this paper, we propose an efficient parallel Transformer-CNN Mixture (TCM) block with a controllable complexity to incorporate the local modeling ability of CNN and the non-local modeling ability of transformers to improve the overall architecture of image compression models. Besides, inspired by the recent progress of entropy estimation models and attention modules, we propose a channel-wise entropy model with parameter-efficient swin-transformer-based attention (SWAtten) modules by using channel squeezing. Experimental results demonstrate that our proposed method achieves state-of-the-art rate-distortion performance on three different resolution datasets (i.e., Kodak, Tecnick, CLIC Professional Validation) compared to existing LIC methods. The code is at https://github.com/jmliu206/LIC_TCM.

PanoSwin: A Pano-Style Swin Transformer for Panorama Understanding
Ling, Zhixin and Xing, Zhen and Zhou, Xiangdong and Cao, Manliang and Zhou, Guichun



Research question: How to alleviate the boundary discontinuity and spatial distortion caused by equirectangular projection (ERP) in panorama understanding.
Motivation: Existing CNNs and vision Transformers deteriorate severely on panoramic images because of ERP artifacts.
Method: We propose PanoSwin, a new architecture that addresses boundary discontinuity and spatial distortion through a pano-style shift windowing scheme and a novel pitch attention. Based on spherical distance and Cartesian coordinates, absolute positional encodings and relative positional biases are adapted for panoramas to enhance panoramic geometry information. A novel two-stage learning framework is also devised to transfer knowledge from planar images to panoramas.
Results: We compare against the state of the art on various panoramic tasks, including panoramic object detection, panoramic classification, and panoramic layout estimation; the results demonstrate that PanoSwin is highly effective for panorama understanding.

In panorama understanding, the widely used equirectangular projection (ERP) entails boundary discontinuity and spatial distortion. It severely deteriorates the conventional CNNs and vision Transformers on panoramas. In this paper, we propose a simple yet effective architecture named PanoSwin to learn panorama representations with ERP. To deal with the challenges brought by equirectangular projection, we explore a pano-style shift windowing scheme and novel pitch attention to address the boundary discontinuity and the spatial distortion, respectively. Besides, based on spherical distance and Cartesian coordinates, we adapt absolute positional encodings and relative positional biases for panoramas to enhance panoramic geometry information. Realizing that planar image understanding might share some common knowledge with panorama understanding, we devise a novel two-stage learning framework to facilitate knowledge transfer from the planar images to panoramas. We conduct experiments against the state-of-the-art on various panoramic tasks, i.e., panoramic object detection, panoramic classification, and panoramic layout estimation. The experimental results demonstrate the effectiveness of PanoSwin in panorama understanding.

Adaptive Sparse Convolutional Networks With Global Context Enhancement for Faster Object Detection on Drone Images
Du, Bowei and Huang, Yecheng and Chen, Jiaxin and Huang, Di



Research question: How to perform low-latency object detection on resource-constrained drone platforms, an important but challenging task.
Motivation: Sparse-convolution-based detection heads are effective at balancing accuracy and efficiency, but they struggle to integrate contextual information of tiny objects and to control the mask ratio across foregrounds of varying scales.
Method: We propose a novel global context-enhanced adaptive sparse convolutional network (CEASC). It first develops a context-enhanced group normalization (CE-GN) layer that replaces statistics based on sparsely sampled features with global contextual ones, and then designs an adaptive multi-layer masking strategy that generates optimal mask ratios at distinct scales for compact foreground coverage, improving both accuracy and efficiency.
Results: Extensive experiments on two major benchmarks, VisDrone and UAVDT, show that when plugged into typical state-of-the-art detection frameworks (e.g., RetinaNet and GFL V1), CEASC remarkably reduces GFLOPs and accelerates inference while maintaining competitive performance.

Object detection on drone images with low-latency is an important but challenging task on the resource-constrained unmanned aerial vehicle (UAV) platform. This paper investigates optimizing the detection head based on the sparse convolution, which proves effective in balancing the accuracy and efficiency. Nevertheless, it suffers from inadequate integration of contextual information of tiny objects as well as clumsy control of the mask ratio in the presence of foreground with varying scales. To address the issues above, we propose a novel global context-enhanced adaptive sparse convolutional network (CEASC). It first develops a context-enhanced group normalization (CE-GN) layer, by replacing the statistics based on sparsely sampled features with the global contextual ones, and then designs an adaptive multi-layer masking strategy to generate optimal mask ratios at distinct scales for compact foreground coverage, promoting both the accuracy and efficiency. Extensive experimental results on two major benchmarks, i.e. VisDrone and UAVDT, demonstrate that CEASC remarkably reduces the GFLOPs and accelerates the inference procedure when plugging into the typical state-of-the-art detection frameworks (e.g. RetinaNet and GFL V1) with competitive performance. Code is available at https://github.com/Cuogeihong/CEASC.

LidarGait: Benchmarking 3D Gait Recognition With Point Clouds
Shen, Chuanfu and Fan, Chao and Wu, Wei and Wang, Rui and Huang, George Q. and Yu, Shiqi



Research question: Video-based gait recognition achieves impressive results in constrained scenarios, but its feasibility in the 3D wild world is limited because 3D structural information of humans is neglected.
Motivation: This work explores precise 3D gait features from point clouds and proposes a simple yet efficient 3D gait recognition framework, termed LidarGait.
Method: Our approach projects sparse point clouds into depth maps to learn representations with 3D geometric information, which significantly outperforms existing point-wise and camera-based methods.
Results: We build SUSTech1K, the first large-scale LiDAR-based gait recognition dataset, and show through experiments that (1) 3D structural information is a significant feature for gait recognition; (2) LidarGait significantly outperforms existing point-based and silhouette-based methods; and (3) the LiDAR sensor is superior to the RGB camera for gait recognition in outdoor environments.

Video-based gait recognition has achieved impressive results in constrained scenarios. However, visual cameras neglect human 3D structure information, which limits the feasibility of gait recognition in the 3D wild world. Instead of extracting gait features from images, this work explores precise 3D gait features from point clouds and proposes a simple yet efficient 3D gait recognition framework, termed LidarGait. Our proposed approach projects sparse point clouds into depth maps to learn the representations with 3D geometry information, which outperforms existing point-wise and camera-based methods by a significant margin. Due to the lack of point cloud datasets, we build the first large-scale LiDAR-based gait recognition dataset, SUSTech1K, collected by a LiDAR sensor and an RGB camera. The dataset contains 25,239 sequences from 1,050 subjects and covers many variations, including visibility, views, occlusions, clothing, carrying, and scenes. Extensive experiments show that (1) 3D structure information serves as a significant feature for gait recognition. (2) LidarGait outperforms existing point-based and silhouette-based methods by a significant margin, while it also offers stable cross-view results. (3) The LiDAR sensor is superior to the RGB camera for gait recognition in the outdoor environment. The source code and dataset have been made available at https://lidargait.github.io.

D2Former: Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agent-Based Transformers
He, Jianfeng and Gao, Yuan and Zhang, Tianzhu and Zhang, Zhe and Wu, Feng



Research question: How to improve the robustness of image matching in computer vision.
Motivation: Descriptors extracted by CNNs lack discriminative ability in texture-less regions, and keypoint detectors only identify keypoints at a specific level of structure, which leads to poor image matching.
Method: A novel image matching method, D2Former, jointly learns hierarchical detectors and contextual descriptors via agent-based Transformers, comprising a contextual feature descriptor learning (CFDL) module and a hierarchical keypoint detector learning (HKDL) module.
Results: Experimental results show that D2Former significantly outperforms state-of-the-art image matching methods on four challenging benchmarks.

Establishing pixel-level matches between image pairs is vital for a variety of computer vision applications. However, achieving robust image matching remains challenging because CNN extracted descriptors usually lack discriminative ability in texture-less regions and keypoint detectors are only good at identifying keypoints with a specific level of structure. To deal with these issues, a novel image matching method is proposed by Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agent-based Transformers (D2Former), including a contextual feature descriptor learning (CFDL) module and a hierarchical keypoint detector learning (HKDL) module. The proposed D2Former enjoys several merits. First, the proposed CFDL module can model long-range contexts efficiently and effectively with the aid of designed descriptor agents. Second, the HKDL module can generate keypoint detectors in a hierarchical way, which is helpful for detecting keypoints with diverse levels of structures. Extensive experimental results on four challenging benchmarks show that our proposed method significantly outperforms state-of-the-art image matching methods.

Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR
Li, Feng and Zeng, Ailing and Liu, Shilong and Zhang, Hao and Li, Hongyang and Zhang, Lei and Ni, Lionel M.



Research question: How to improve the efficiency of DEtection TRansformer-based (DETR) object detection models.
Motivation: Multi-scale feature fusion in current DETR models introduces a large number of computationally inefficient tokens in the encoder, which hinders practical applications of DETR models.
Method: We propose Lite DETR, a simple yet efficient end-to-end object detection framework. An efficient encoder block updates high-level and low-level features in an interleaved way, and a key-aware deformable attention is developed to predict more reliable attention weights.
Results: Experiments show that Lite DETR effectively reduces the GFLOPs of the detection head by 60% while keeping 99% of the original performance, and the efficient encoder strategy generalizes well across existing DETR-based models.

Recent DEtection TRansformer-based (DETR) models have obtained remarkable performance. This success cannot be achieved without the re-introduction of multi-scale feature fusion in the encoder. However, the excessively increased tokens in multi-scale features, about 75% of which are low-level features, are quite computationally inefficient, which hinders real applications of DETR models. In this paper, we present Lite DETR, a simple yet efficient end-to-end object detection framework that can effectively reduce the GFLOPs of the detection head by 60% while keeping 99% of the original performance. Specifically, we design an efficient encoder block to update high-level features (corresponding to small-resolution feature maps) and low-level features (corresponding to large-resolution feature maps) in an interleaved way. In addition, to better fuse cross-scale features, we develop a key-aware deformable attention to predict more reliable attention weights. Comprehensive experiments validate the effectiveness and efficiency of the proposed Lite DETR, and the efficient encoder strategy can generalize well across existing DETR-based models. The code will be released after the blind review.

Attention-Based Point Cloud Edge Sampling
Wu, Chengzhi and Zheng, Junwei and Pfrommer, Julius and Beyerer, Jürgen



Research question: How to sample point clouds effectively for better data representation.
Motivation: The most commonly used point cloud sampling methods are still random sampling and farthest point sampling; with the development of neural networks, many task-based learning methods have emerged, but most are generative rather than selecting points directly using mathematical statistics.
Method: Inspired by the Canny edge detection algorithm for images and aided by the attention mechanism, this paper proposes a non-generative attention-based point cloud edge sampling method (APES) that captures salient points in the point cloud outline.
Results: Both qualitative and quantitative experimental results show the superior performance of our sampling method on common benchmark tasks.

Point cloud sampling is a less explored research topic for this data representation. The most commonly used sampling methods are still classical random sampling and farthest point sampling. With the development of neural networks, various methods have been proposed to sample point clouds in a task-based learning manner. However, these methods are mostly generative-based, rather than selecting points directly using mathematical statistics. Inspired by the Canny edge detection algorithm for images and with the help of the attention mechanism, this paper proposes a non-generative Attention-based Point cloud Edge Sampling method (APES), which captures salient points in the point cloud outline. Both qualitative and quantitative experimental results show the superior performance of our sampling method on common benchmark tasks.
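A Canny-flavoured selection rule can be sketched in a few lines. This is a geometric simplification, not APES itself (APES scores points with a learned attention map, and the helper names below are hypothetical): a point whose nearest neighbours all lie to one side, i.e. whose neighbourhood centroid sits far from the point, lies on the outline of the shape and gets a high score.

```python
def edge_scores(points, k=4):
    """Score each point by the offset between it and the centroid of its
    k nearest neighbours; outline points get large scores."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    scores = []
    for i, p in enumerate(points):
        neighbours = sorted((q for j, q in enumerate(points) if j != i),
                            key=lambda q: dist2(p, q))[:k]
        centroid = [sum(c) / k for c in zip(*neighbours)]
        scores.append(dist2(p, centroid) ** 0.5)
    return scores

def edge_sample(points, n, k=4):
    # Keep the n highest-scoring (most outline-like) points.
    scores = edge_scores(points, k)
    order = sorted(range(len(points)), key=lambda i: -scores[i])
    return [points[i] for i in order[:n]]
```

On points spaced along a segment, the two endpoints score highest, mirroring how edge-based sampling concentrates the budget on the contour rather than the interior.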

BUFFER: Balancing Accuracy, Efficiency, and Generalizability in Point Cloud Registration
Ao, Sheng and Hu, Qingyong and Wang, Hanyun and Xu, Kai and Guo, Yulan



Research question: How to achieve a good balance among accuracy, efficiency, and generalizability in a point cloud registration framework.
Motivation: Existing registration techniques are either inaccurate, inefficient, or generalize poorly, so balancing all three is highly challenging.
Method: We propose BUFFER, a point cloud registration method that balances accuracy, efficiency, and generalizability by combining point-wise and patch-wise techniques while overcoming their inherent drawbacks.
Results: Extensive experiments on real-world scenarios show that the method achieves the best results in accuracy, efficiency, and generalization: it not only reaches the highest success rate on unseen domains but also runs almost 30 times faster than strong baselines specializing in generalization.

An ideal point cloud registration framework should have superior accuracy, acceptable efficiency, and strong generalizability. However, this is highly challenging since existing registration techniques are either not accurate enough, far from efficient, or generalize poorly. It remains an open question how to achieve a satisfying balance among these three key elements. In this paper, we propose BUFFER, a point cloud registration method for balancing accuracy, efficiency, and generalizability. The key to our approach is to take advantage of both point-wise and patch-wise techniques, while overcoming the inherent drawbacks simultaneously. Different from a simple combination of existing methods, each component of our network has been carefully crafted to tackle specific issues. Specifically, a Point-wise Learner is first introduced to enhance computational efficiency by predicting keypoints and improving the representation capacity of features by estimating point orientations, a Patch-wise Embedder which leverages a lightweight local feature learner is then deployed to extract efficient and general patch features. Additionally, an Inliers Generator which combines simple neural layers and general features is presented to search inlier correspondences. Extensive experiments on real-world scenarios demonstrate that our method achieves the best balance of accuracy, efficiency, and generalization. In particular, our method not only reaches the highest success rate on unseen domains, but also is almost 30 times faster than the strong baselines specializing in generalization. Code is available at https://github.com/aosheng1996/BUFFER.

FeatureBooster: Boosting Feature Descriptors With a Lightweight Neural Network
Wang, Xinjiang and Liu, Zeyu and Hu, Yu and Xi, Wei and Yu, Wenxian and Zou, Danping



Research question: This paper aims to improve keypoint descriptors within the same image using a lightweight network.
Motivation: Existing descriptors have limited performance in challenging cases such as large illumination changes or repetitive patterns.
Method: A network takes the original descriptors and the geometric properties of keypoints as input, and enhances the descriptors with an MLP-based self-boosting stage and a Transformer-based cross-boosting stage.
Results: Experimental results show that the method significantly improves performance on each task, particularly in challenging cases such as large illumination changes or repetitive patterns. The method is also fast enough for practical systems.

We introduce a lightweight network to improve descriptors of keypoints within the same image. The network takes the original descriptors and the geometric properties of keypoints as the input, and uses an MLP-based self-boosting stage and a Transformer-based cross-boosting stage to enhance the descriptors. The boosted descriptors can be either real-valued or binary ones. We use the proposed network to boost both hand-crafted (ORB, SIFT) and the state-of-the-art learning-based descriptors (SuperPoint, ALIKE) and evaluate them on image matching, visual localization, and structure-from-motion tasks. The results show that our method significantly improves the performance of each task, particularly in challenging cases such as large illumination changes or repetitive patterns. Our method requires only 3.2ms on desktop GPU and 27ms on embedded GPU to process 2000 features, which is fast enough to be applied to a practical system. The code and trained weights are publicly available at github.com/SJTU-ViSYS/FeatureBooster.

Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors
Zhang, Gongjie and Luo, Zhipeng and Tian, Zichen and Zhang, Jingyi and Zhang, Xiaoqin and Lu, Shijian



Research question: How to use multi-scale features effectively in Transformer-based object detectors while limiting the extra computational cost.
Motivation: Multi-scale features have proven highly effective for object detection, but they often come with huge or even prohibitive extra computation costs, especially for recent Transformer-based detectors.
Method: We propose Iterative Multi-scale Feature Aggregation (IMFA), a generic paradigm that enables efficient use of multi-scale features in Transformer-based object detectors. The core idea is to exploit sparse multi-scale features sampled from just a few crucial locations, achieved through two novel designs: first, IMFA rearranges the Transformer encoder-decoder pipeline so that the encoded features can be iteratively updated based on the detection predictions; second, IMFA sparsely samples scale-adaptive features for refined detection from a few keypoint locations under the guidance of prior detection predictions. As a result, the sampled multi-scale features are sparse yet still highly beneficial for object detection.
Results: Extensive experiments show that IMFA significantly boosts the performance of multiple Transformer-based object detectors with only slight computational overhead.

Multi-scale features have been proven highly effective for object detection but often come with huge and even prohibitive extra computation costs, especially for the recent Transformer-based detectors. In this paper, we propose Iterative Multi-scale Feature Aggregation (IMFA) - a generic paradigm that enables efficient use of multi-scale features in Transformer-based object detectors. The core idea is to exploit sparse multi-scale features from just a few crucial locations, and it is achieved with two novel designs. First, IMFA rearranges the Transformer encoder-decoder pipeline so that the encoded features can be iteratively updated based on the detection predictions. Second, IMFA sparsely samples scale-adaptive features for refined detection from just a few keypoint locations under the guidance of prior detection predictions. As a result, the sampled multi-scale features are sparse yet still highly beneficial for object detection. Extensive experiments show that the proposed IMFA boosts the performance of multiple Transformer-based object detectors significantly yet with only slight computational overhead.
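The "sample scale-adaptive features from a few keypoint locations" step boils down to interpolating a feature map at fractional coordinates rather than processing every token of every scale. A minimal bilinear sampler (a generic building block, not IMFA's code; the scale index per keypoint is an assumption for illustration):

```python
def bilinear_sample(fmap, x, y):
    """Bilinearly interpolate a 2D feature map (list of rows) at fractional (x, y)."""
    h, w = len(fmap), len(fmap[0])
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bot = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bot * dy

def sample_at_keypoints(fmaps, keypoints):
    # Each keypoint is (scale_index, x, y): only the chosen scale's map is
    # touched, so the cost scales with the number of keypoints, not tokens.
    return [bilinear_sample(fmaps[s], x, y) for s, x, y in keypoints]
```

Sampling a handful of such locations per detection query is what keeps the multi-scale overhead slight.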

GeoMVSNet: Learning Multi-View Stereo With Geometry Perception
Zhang, Zhe and Peng, Rui and Hu, Yuxi and Wang, Ronggang



Research question: Existing cascade multi-view stereo (MVS) methods ignore the important geometric information embedded in coarse stages when estimating high-resolution depth maps, leading to vulnerable cost matching and sub-optimal reconstruction results.
Motivation: To address this, we propose GeoMVSNet, a geometry-aware model that explicitly integrates geometric clues from coarse stages for delicate depth estimation.
Method: We design a two-branch geometry fusion network that extracts geometric priors from coarse estimations to enhance structural feature extraction at finer stages. We also embed the coarse probability volumes, which encode valuable depth distribution attributes, into a lightweight regularization network to further strengthen depth-wise geometric intuition. Meanwhile, frequency-domain filtering mitigates the negative impact of high-frequency regions, and a curriculum learning strategy progressively boosts the model's geometry integration. To intensify full-scene geometry perception, we present a depth distribution similarity loss based on a Gaussian Mixture Model assumption.
Results: Extensive experiments on the DTU and Tanks and Temples (T&T) datasets show that GeoMVSNet achieves state-of-the-art results and ranks first on the T&T-Advanced set. Code is available at https://github.com/doubleZ0108/GeoMVSNet.

Recent cascade Multi-View Stereo (MVS) methods can efficiently estimate high-resolution depth maps through narrowing hypothesis ranges. However, previous methods ignored the vital geometric information embedded in coarse stages, leading to vulnerable cost matching and sub-optimal reconstruction results. In this paper, we propose a geometry awareness model, termed GeoMVSNet, to explicitly integrate geometric clues implied in coarse stages for delicate depth estimation. In particular, we design a two-branch geometry fusion network to extract geometric priors from coarse estimations to enhance structural feature extraction at finer stages. Besides, we embed the coarse probability volumes, which encode valuable depth distribution attributes, into the lightweight regularization network to further strengthen depth-wise geometry intuition. Meanwhile, we apply the frequency domain filtering to mitigate the negative impact of the high-frequency regions and adopt the curriculum learning strategy to progressively boost the geometry integration of the model. To intensify the full-scene geometry perception of our model, we present the depth distribution similarity loss based on the Gaussian-Mixture Model assumption. Extensive experiments on DTU and Tanks and Temples (T&T) datasets demonstrate that our GeoMVSNet achieves state-of-the-art results and ranks first on the T&T-Advanced set. Code is available at https://github.com/doubleZ0108/GeoMVSNet.

DNeRV: Modeling Inherent Dynamics via Difference Neural Representation for Videos
Zhao, Qi and Asif, M. Salman and Ma, Zhan



Research question: Existing implicit neural representation methods for videos fail to fully exploit spatiotemporal redundancy.
Motivation: Index-based implicit neural representations ignore content-specific spatial features, and hybrid ones ignore the contextual dependency between adjacent frames, leading to poor modeling of scenes with large motion or dynamics.
Method: We analyze this limitation from the perspective of function fitting and reveal the importance of frame differences. We propose Difference Neural Representation for Videos (DNeRV), which consists of two streams for content and frame difference, and introduce a collaborative content unit for effective feature fusion.
Results: DNeRV is tested on video compression, inpainting, and interpolation; it achieves results competitive with state-of-the-art neural compression methods and outperforms existing implicit methods on downstream inpainting and interpolation for 960 x 1920 videos.

Existing implicit neural representation (INR) methods do not fully exploit spatiotemporal redundancies in videos. Index-based INRs ignore the content-specific spatial features and hybrid INRs ignore the contextual dependency on adjacent frames, leading to poor modeling capability for scenes with large motion or dynamics. We analyze this limitation from the perspective of function fitting and reveal the importance of frame difference. To use explicit motion information, we propose Difference Neural Representation for Videos (DNeRV), which consists of two streams for content and frame difference. We also introduce a collaborative content unit for effective feature fusion. We test DNeRV for video compression, inpainting, and interpolation. DNeRV achieves competitive results against the state-of-the-art neural compression approaches and outperforms existing implicit methods on downstream inpainting and interpolation for 960 x 1920 videos.
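The input to the difference stream can be sketched in a few lines (an input-preparation sketch under the assumption of flattened frames; in the paper both streams feed convolutional networks). The content stream sees the frame itself, while the difference stream sees forward and backward frame differences, which carry the explicit motion signal:

```python
def frame_differences(frames):
    """Given a list of flattened frames, return (forward, backward) differences.

    forward[t]  = frames[t]   - frames[t-1]  (zeros for the first frame)
    backward[t] = frames[t+1] - frames[t]    (zeros for the last frame)
    """
    zero = [0.0] * len(frames[0])
    sub = lambda a, b: [x - y for x, y in zip(a, b)]
    forward = [zero] + [sub(cur, prev) for prev, cur in zip(frames, frames[1:])]
    backward = [sub(nxt, cur) for cur, nxt in zip(frames, frames[1:])] + [zero]
    return forward, backward
```

For a static scene both difference streams are all-zero, so any non-zero signal the stream carries is, by construction, motion.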

FFF: Fragment-Guided Flexible Fitting for Building Complete Protein Structures
Chen, Weijie and Wang, Xinyan and Wang, Yuhang



Research question: How to build complete protein structures with cryo-EM by combining fragment recognition and structure prediction, overcoming the low signal-to-noise ratio problem.
Motivation: Cryo-EM can reconstruct the 3D structure of biomolecules, but building protein structures de novo directly from its maps is difficult; the advent of AlphaFold offers new ideas for protein structure prediction.
Method: We propose FFF, a new method that bridges protein structure prediction and protein structure recognition via flexible fitting. First, a multi-level recognition network extracts various structural features from the input 3D cryo-EM map; next, protein structure fragments are generated from the extracted features; finally, a complete structural model is built from the predicted fragments via flexible fitting.
Results: Benchmark tests show that FFF outperforms the baseline methods for building complete protein structures.

Cryo-electron microscopy (cryo-EM) is a technique for reconstructing the 3-dimensional (3D) structure of biomolecules (especially large protein complexes and molecular assemblies). As the resolution increases to the near-atomic scale, building protein structures de novo from cryo-EM maps becomes possible. Recently, recognition-based de novo building methods have shown the potential to streamline this process. However, they cannot build a complete structure due to the low signal-to-noise ratio (SNR) problem. At the same time, AlphaFold has led to a great breakthrough in predicting protein structures. This has inspired us to combine fragment recognition and structure prediction methods to build a complete structure. In this paper, we propose a new method named FFF that bridges protein structure prediction and protein structure recognition with flexible fitting. First, a multi-level recognition network is used to capture various structural features from the input 3D cryo-EM map. Next, protein structural fragments are generated using pseudo peptide vectors and a protein sequence alignment method based on these extracted features. Finally, a complete structural model is constructed using the predicted protein fragments via flexible fitting. Based on our benchmark tests, FFF outperforms the baseline methods for building complete protein structures.

topic-5

Topic words :  image,  text,  language,  tasks,  visual,  pre,  training,  vision

Picture That Sketch: Photorealistic Image Generation From Abstract Sketches
Koley, Subhadeep and Bhunia, Ayan Kumar and Sain, Aneeshan and Chowdhury, Pinaki Nath and Xiang, Tao and Song, Yi-Zhe



Research question: How to turn non-expert free-hand sketches into photorealistic images.
Motivation: Existing techniques require an edgemap-like sketch as a starting point, whereas this paper aims to handle abstract free-hand human sketches.
Method: A decoupled encoder-decoder training paradigm is proposed, in which the decoder is a StyleGAN trained on photos only. An autoregressive sketch mapper is further proposed to map sketches into the StyleGAN latent space.
Results: Experimental results show that the method generates consistently photorealistic results, and several downstream tasks are showcased, such as fine-grained sketch-based image retrieval.

Given an abstract, deformed, ordinary sketch from untrained amateurs like you and me, this paper turns it into a photorealistic image - just like those shown in Fig. 1(a), all non-cherry-picked. We differ significantly from prior art in that we do not dictate an edgemap-like sketch to start with, but aim to work with abstract free-hand human sketches. In doing so, we essentially democratise the sketch-to-photo pipeline, "picturing" a sketch regardless of how good you sketch. Our contribution at the outset is a decoupled encoder-decoder training paradigm, where the decoder is a StyleGAN trained on photos only. This importantly ensures that generated results are always photorealistic. The rest is then all centred around how best to deal with the abstraction gap between sketch and photo. For that, we propose an autoregressive sketch mapper trained on sketch-photo pairs that maps a sketch to the StyleGAN latent space. We further introduce specific designs to tackle the abstract nature of human sketches, including a fine-grained discriminative loss on the back of a trained sketch-photo retrieval model, and a partial-aware sketch augmentation strategy. Finally, we showcase a few downstream tasks our generation model enables, amongst them is showing how fine-grained sketch-based image retrieval, a well-studied problem in the sketch community, can be reduced to an image (generated) to image retrieval task, surpassing state-of-the-arts. We put forward generated results in the supplementary for everyone to scrutinise. Project page: https://subhadeepkoley.github.io/PictureThatSketch

Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information
Su, Weijie and Zhu, Xizhou and Tao, Chenxin and Lu, Lewei and Li, Bin and Huang, Gao and Qiao, Yu and Wang, Xiaogang and Zhou, Jie and Dai, Jifeng



Research question: How to effectively exploit the potential of large-scale models through pre-training strategies that support massive data from different sources.
Motivation: Current approaches adopt multi-stage pre-training systems whose complex pipelines may increase the uncertainty and instability of pre-training; it is therefore desirable to integrate these strategies in a single-stage manner.
Method: We propose a general multi-modal mutual information formula as a unified optimization target and show that all mainstream approaches are special cases of this framework. Under this unified perspective, we propose an all-in-one single-stage pre-training approach named Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training).
Results: The method outperforms previous pre-training methods on various vision benchmarks, including ImageNet classification, COCO object detection, LVIS long-tailed object detection, and ADE20k semantic segmentation. Notably, a billion-parameter image backbone is successfully pre-trained and achieves state-of-the-art performance on various benchmarks under the public data setting.

To effectively exploit the potential of large-scale models, various pre-training strategies supported by massive data from different sources are proposed, including supervised pre-training, weakly-supervised pre-training, and self-supervised pre-training. It has been proved that combining multiple pre-training strategies and data from various modalities/sources can greatly boost the training of large-scale models. However, current works adopt a multi-stage pre-training system, where the complex pipeline may increase the uncertainty and instability of the pre-training. It is thus desirable that these strategies can be integrated in a single-stage manner. In this paper, we first propose a general multi-modal mutual information formula as a unified optimization target and demonstrate that all mainstream approaches are special cases of our framework. Under this unified perspective, we propose an all-in-one single-stage pre-training approach, named Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training). Our approach achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification, COCO object detection, LVIS long-tailed object detection, and ADE20k semantic segmentation. Notably, we successfully pre-train a billion-level parameter image backbone and achieve state-of-the-art performance on various benchmarks under public data setting. Code shall be released at https://github.com/OpenGVLab/M3I-Pretraining.

Aligning Bag of Regions for Open-Vocabulary Object Detection
Wu, Size and Zhang, Wenwei and Jin, Sheng and Liu, Wentao and Loy, Chen Change



Research question: Existing vision-language models and object detectors under-exploit the compositional structure of semantic concepts in an image.
Motivation: To address this, we propose to group contextually interrelated regions into a "bag" and treat the embeddings of the regions as words in a sentence.
Method: Contextually interrelated regions are grouped into a bag; the embeddings of the regions are treated as embeddings of words in a sentence and sent to the text encoder of a vision-language model to obtain a bag-of-regions embedding, which is learned to align with the features extracted by a frozen vision-language model.
Results: Applied to the commonly used Faster R-CNN, our approach surpasses the previous best results by 4.6 box AP50 and 2.8 mask AP on novel categories of the open-vocabulary COCO and LVIS benchmarks, respectively.

Pre-trained vision-language models (VLMs) learn to align vision and language representations on large-scale datasets, where each image-text pair usually contains a bag of semantic concepts. However, existing open-vocabulary object detectors only align region embeddings individually with the corresponding features extracted from the VLMs. Such a design leaves the compositional structure of semantic concepts in a scene under-exploited, although the structure may be implicitly learned by the VLMs. In this work, we propose to align the embedding of bag of regions beyond individual regions. The proposed method groups contextually interrelated regions as a bag. The embeddings of regions in a bag are treated as embeddings of words in a sentence, and they are sent to the text encoder of a VLM to obtain the bag-of-regions embedding, which is learned to be aligned to the corresponding features extracted by a frozen VLM. Applied to the commonly used Faster R-CNN, our approach surpasses the previous best results by 4.6 box AP 50 and 2.8 mask AP on novel categories of open-vocabulary COCO and LVIS benchmarks, respectively. Code and models are available at https://github.com/wusize/ovdet.

CLIP2Scene: Towards Label-Efficient 3D Scene Understanding by CLIP
Chen, Runnan and Liu, Youquan and Kong, Lingdong and Zhu, Xinge and Ma, Yuexin and Li, Yikang and Hou, Yuenan and Qiao, Yu and Wang, Wenping



Research question: How to apply CLIP knowledge to 3D scene understanding.
Motivation: Although CLIP has achieved remarkable results as a 2D image-text pre-trained model, its application to 3D scene understanding has yet to be explored.
Method: The CLIP2Scene framework transfers knowledge from the 2D image-text pre-trained CLIP model to a 3D point cloud network; a semantic-driven cross-modal contrastive learning framework is designed to pre-train the 3D network.
Results: Experiments on SemanticKITTI, nuScenes, and ScanNet achieve annotation-free 3D semantic segmentation for the first time, and after fine-tuning with labelled data the method significantly outperforms other self-supervised methods.

Contrastive Language-Image Pre-training (CLIP) achieves promising results in 2D zero-shot and few-shot learning. Despite the impressive performance in 2D, applying CLIP to help the learning in 3D scene understanding has yet to be explored. In this paper, we make the first attempt to investigate how CLIP knowledge benefits 3D scene understanding. We propose CLIP2Scene, a simple yet effective framework that transfers CLIP knowledge from 2D image-text pre-trained models to a 3D point cloud network. We show that the pre-trained 3D network yields impressive performance on various downstream tasks, i.e., annotation-free and fine-tuning with labelled data for semantic segmentation. Specifically, built upon CLIP, we design a Semantic-driven Cross-modal Contrastive Learning framework that pre-trains a 3D network via semantic and spatial-temporal consistency regularization. For the former, we first leverage CLIP's text semantics to select the positive and negative point samples and then employ the contrastive loss to train the 3D network. In terms of the latter, we force the consistency between the temporally coherent point cloud features and their corresponding image features. We conduct experiments on SemanticKITTI, nuScenes, and ScanNet. For the first time, our pre-trained network achieves annotation-free 3D semantic segmentation with 20.8% and 25.08% mIoU on nuScenes and ScanNet, respectively. When fine-tuned with 1% or 100% labelled data, our method significantly outperforms other self-supervised methods, with improvements of 8% and 1% mIoU, respectively. Furthermore, we demonstrate the generalizability for handling cross-domain datasets. Code is publicly available.
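The semantic-driven contrastive step can be caricatured as scoring each 3D point feature against frozen text embeddings of the class names, so that points sharing a text-derived semantic become positives of the same class prototype. All arrays and names below are toy stand-ins for the real CLIP features, not the paper's implementation:

```python
import numpy as np

def semantic_contrastive_loss(point_feats, text_embeds, labels, temperature=0.07):
    """Pull each point feature toward the text embedding of its class.

    Points with the same text-derived label share a positive (their class
    prototype); every other class embedding acts as a negative.
    """
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = p @ t.T / temperature            # (num_points, num_classes)
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(1)
text = rng.normal(size=(3, 16))               # 3 class-name embeddings (frozen)
labels = np.array([0, 1, 2, 0])
# points clustered near their class prototype vs. points placed randomly
near = text[labels] + 0.05 * rng.normal(size=(4, 16))
far = rng.normal(size=(4, 16))
loss_near = semantic_contrastive_loss(near, text, labels)
loss_far = semantic_contrastive_loss(far, text, labels)
```

Points whose features already cluster around their class-name embedding incur a much smaller loss, which is what the pre-training drives the 3D network toward.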

Prefix Conditioning Unifies Language and Label Supervision
Saito, Kuniaki and Sohn, Kihyuk and Zhang, Xiang and Li, Chun-Liang and Lee, Chen-Yu and Saenko, Kate and Pfister, Tomas



Research question: How to combine large-scale classification datasets and image-caption datasets for pre-training so as to exploit their complementary benefits.
Motivation: Image-caption datasets are more open-domain, covering broader scene types and vocabulary, while large-scale classification datasets provide fine-grained categories with a balanced label distribution.
Method: A new pre-training strategy jointly trains on classification and caption datasets, introducing prefix tokens to address dataset bias so that the language encoder can learn the characteristics of both datasets.
Results: Experiments show that the method improves zero-shot image recognition performance and strengthens robustness to image-level distribution shift.

Pretraining visual models on web-scale image-caption datasets has recently emerged as a powerful alternative to traditional pretraining on image classification data. Image-caption datasets are more "open-domain", containing broader scene types and vocabulary words, and result in models that have strong performance in few- and zero-shot recognition tasks. However, large-scale classification datasets can provide fine-grained categories with a balanced label distribution. In this work, we study a pretraining strategy that uses both classification and caption datasets to unite their complementary benefits. First, we show that naively unifying the datasets results in sub-optimal performance in downstream zero-shot recognition tasks, as the model is affected by dataset bias: the coverage of image domains and vocabulary words is different in each dataset. We address this problem with novel Prefix Conditioning, a simple yet effective method that helps disentangle dataset biases from visual concepts. This is done by introducing prefix tokens that inform the language encoder of the input data type (e.g., classification vs caption) at training time. Our approach allows the language encoder to learn from both datasets while also tailoring feature extraction to each dataset. Prefix conditioning is generic and can be easily integrated into existing VL pretraining objectives, such as CLIP or UniCL. In experiments, we show that it improves zero-shot image recognition and robustness to image-level distribution shift.
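Mechanically, prefix conditioning just prepends a data-type token to the text token sequence before it enters the language encoder. A minimal sketch, where the token ids and the `PREFIX` table are made-up values for illustration:

```python
import numpy as np

# Hypothetical special ids reserved for the two data types.
PREFIX = {"classification": 1, "caption": 2}

def add_prefix(token_ids, source):
    """Prepend the data-type prefix token so the language encoder can
    disentangle dataset-specific bias from the shared visual-concept tokens."""
    return np.concatenate(([PREFIX[source]], token_ids))

tokens = np.array([101, 7592, 102])           # some tokenized text
cls_seq = add_prefix(tokens, "classification")
cap_seq = add_prefix(tokens, "caption")
```

The same text thus yields two different encoder inputs depending on which dataset it came from, letting the encoder attribute dataset-specific statistics to the prefix rather than to the content tokens.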

GD-MAE: Generative Decoder for MAE Pre-Training on LiDAR Point Clouds
Yang, Honghui and He, Tong and Liu, Jiaheng and Chen, Hua and Wu, Boxi and Lin, Binbin and He, Xiaofei and Ouyang, Wanli



Research question: Exploring Masked Autoencoders (MAE) on large-scale 3D point clouds.
Motivation: Existing 3D MAE frameworks rely on complex decoder designs or sophisticated masking strategies; a much simpler approach is proposed.
Method: A Generative Decoder for MAE (GD-MAE) automatically merges the surrounding context to recover masked geometric knowledge in a hierarchical fusion manner.
Results: Strong performance on large-scale benchmarks such as Waymo, KITTI, and ONCE; the method not only achieves state-of-the-art results but also reaches comparable accuracy on Waymo with only 20% of the labeled data.

Despite the tremendous progress of Masked Autoencoders (MAE) in developing vision tasks such as image and video, exploring MAE in large-scale 3D point clouds remains challenging due to the inherent irregularity. In contrast to previous 3D MAE frameworks, which either design a complex decoder to infer masked information from maintained regions or adopt sophisticated masking strategies, we instead propose a much simpler paradigm. The core idea is to apply a Generative Decoder for MAE (GD-MAE) that automatically merges the surrounding context to restore the masked geometric knowledge in a hierarchical fusion manner. In doing so, our approach is free from introducing the heuristic design of decoders and enjoys the flexibility of exploring various masking strategies. The corresponding part costs less than 12% latency compared with conventional methods, while achieving better performance. We demonstrate the efficacy of the proposed method on several large-scale benchmarks: Waymo, KITTI, and ONCE. Consistent improvement on downstream detection tasks illustrates strong robustness and generalization capability. Not only does our method achieve state-of-the-art results, but remarkably, we achieve comparable accuracy even with 20% of the labeled data on the Waymo dataset. Code will be released.

Towards Robust Tampered Text Detection in Document Image: New Dataset and New Solution
Qu, Chenfan and Liu, Chongyu and Liu, Yuliang and Chen, Xinhong and Peng, Dezhi and Guo, Fengjun and Jin, Lianwen



Research question: How to effectively detect tampered text in document images.
Motivation: Tampered text detection in document images has attracted increasing attention due to its essential role in information security.
Method: A new framework, the Document Tampering Detector (DTD), consists of a Frequency Perception Head (FPH) and a Multi-view Iterative Decoder (MID) to capture finer-grained clues in complex scenarios; a new training paradigm, Curriculum Learning for Tampering Detection (CLTD), is also designed.
Results: Experiments show that the proposed DTD outperforms the previous state of the art by 9.2%, 26.3%, and 12.3% in F-measure across various types of document image test sets.

Recently, tampered text detection in document images has attracted increasing attention due to its essential role in information security. However, detecting visually consistent tampered text in photographed document images is still a main challenge. In this paper, we propose a novel framework to capture more fine-grained clues in complex scenarios for tampered text detection, termed the Document Tampering Detector (DTD), which consists of a Frequency Perception Head (FPH) to compensate for the deficiencies caused by inconspicuous visual features, and a Multi-view Iterative Decoder (MID) to fully utilize the information of features at different scales. In addition, we design a new training paradigm, termed Curriculum Learning for Tampering Detection (CLTD), which addresses the confusion during the training procedure and thus improves robustness to image compression and the ability to generalize. To further facilitate tampered text detection in document images, we construct a large-scale document image dataset, termed DocTamper, which contains 170,000 document images of various types. Experiments demonstrate that our proposed DTD outperforms the previous state-of-the-art by 9.2%, 26.3% and 12.3% in terms of F-measure on the DocTamper testing set, and the cross-domain testing sets of DocTamper-FCD and DocTamper-SCD, respectively. Codes and dataset will be available at https://github.com/qcf-568/DocTamper.

VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval
Huang, Siteng and Gong, Biao and Pan, Yulin and Jiang, Jianwen and Lv, Yiliang and Li, Yuyuan and Wang, Donglin



Research question: How to perform text-video cross-modal retrieval effectively while avoiding large parameter overhead and knowledge forgetting.
Motivation: Most current studies tune the pre-trained CLIP model by adding heavy modules, which imposes a huge computational burden and causes forgetting of knowledge from upstream models.
Method: The VoP (Text-Video Co-operative Prompt Tuning) framework introduces video and text prompts and can be regarded as a strong baseline with only 0.1% trainable parameters. Further, based on the spatio-temporal characteristics of videos, three novel video prompt mechanisms are developed to improve performance at different scales of trainable parameters.
Results: Experiments show that, compared with full fine-tuning, the enhanced VoP achieves a 1.4% average R@1 gain across five text-video retrieval benchmarks with 6x less parameter overhead.

Many recent studies leverage the pre-trained CLIP for text-video cross-modal retrieval by tuning the backbone with additional heavy modules, which not only brings huge computational burdens with many more parameters, but also leads to knowledge forgetting from upstream models. In this work, we propose VoP: Text-Video Co-operative Prompt Tuning for efficient tuning on the text-video retrieval task. The proposed VoP is an end-to-end framework that introduces both video and text prompts, and can be regarded as a powerful baseline with only 0.1% trainable parameters. Further, based on the spatio-temporal characteristics of videos, we develop three novel video prompt mechanisms to improve the performance with different scales of trainable parameters. The basic idea of the VoP enhancement is to model the frame position, frame context, and layer function with specific trainable prompts, respectively. Extensive experiments show that compared to full fine-tuning, the enhanced VoP achieves a 1.4% average R@1 gain across five text-video retrieval benchmarks with 6x less parameter overhead. The code will be available at https://github.com/bighuang624/VoP.

Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR
Sain, Aneeshan and Bhunia, Ayan Kumar and Koley, Subhadeep and Chowdhury, Pinaki Nath and Chattopadhyay, Soumitri and Xiang, Tao and Song, Yi-Zhe



Research question: This paper addresses two key issues in fine-grained sketch-based image retrieval (FG-SBIR): the gold-standard triplet loss does not enforce holistic latent-space geometry, and there are never enough sketches to train a high-accuracy model.
Motivation: To address these issues, the authors propose a modified triplet loss and a new knowledge distillation module, combined within a novel plug-and-play training paradigm.
Method: First, the standard triplet loss is modified to explicitly enforce separation amongst photo/sketch instances. Second, a new knowledge distillation module leverages photo data for model training. Finally, both modules are plugged into a new plug-and-play training paradigm that allows for more stable training.
Results: Experimental results show that the method not only significantly outperforms prior art but also generalises satisfactorily to new classes.

This paper advances the fine-grained sketch-based image retrieval (FG-SBIR) literature by putting forward a strong baseline that overshoots prior state-of-the-art by 11%. This is not via complicated design though, but by addressing two critical issues facing the community: (i) the gold standard triplet loss does not enforce holistic latent space geometry, and (ii) there are never enough sketches to train a high accuracy model. For the former, we propose a simple modification to the standard triplet loss that explicitly enforces separation amongst photo/sketch instances. For the latter, we put forward a novel knowledge distillation module that can leverage photo data for model training. Both modules are then plugged into a novel plug-n-playable training paradigm that allows for more stable training. More specifically, for (i) we employ an intra-modal triplet loss amongst sketches to bring sketches of the same instance closer while separating them from others, and one more amongst photos to push away different photo instances while bringing closer a structurally augmented version of the same photo (offering a gain of 4-6%). To tackle (ii), we first pre-train a teacher on the large set of unlabelled photos with the aforementioned intra-modal photo triplet loss. Then we distill the contextual similarity present amongst the instances in the teacher's embedding space to the student's embedding space, by matching the distribution over inter-feature distances of respective samples in both embedding spaces (delivering a further gain of 4-5%). Apart from outperforming prior art significantly, our model also yields satisfactory results on generalising to new classes. Project page: https://aneeshan95.github.io/Sketch_PVT/
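The intra-modal losses above build on the standard margin-based triplet loss. As a reference point, a minimal sketch of the vanilla form, applied intra-modally to photos with toy 2-D embeddings (the embeddings and margin value are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard margin triplet loss on embedding vectors:
    penalize cases where the positive is not closer than the negative
    by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Intra-modal photo usage: anchor photo, its structurally augmented
# version as positive, a different photo instance as negative.
photo = np.array([1.0, 0.0])
augmented = np.array([0.9, 0.1])
other = np.array([-1.0, 0.2])
loss = triplet_loss(photo, augmented, other)
```

When the augmented copy is already much closer than the other instance, the hinge is inactive and the loss is zero; swapping positive and negative activates it.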

Seeing What You Miss: Vision-Language Pre-Training With Semantic Completion Learning
Ji, Yatai and Tu, Rongcheng and Jiang, Jie and Kong, Weijie and Cai, Chengfei and Zhao, Wenzhe and Wang, Hongfa and Yang, Yujiu and Liu, Wei



Research question: This paper addresses cross-modal alignment in vision-language pre-training models, so that the correct corresponding information is learned across modalities.
Motivation: Existing vision-language pre-training models are limited in cross-modal alignment: they mainly focus on local alignment based on visible context and neglect learning global semantic features.
Method: Inspired by the success of masked language modeling in NLP pre-training, a novel Semantic Completion Learning (SCL) task is proposed to facilitate global-to-local alignment. Specifically, the SCL task complements the missing semantics of masked data by capturing the corresponding information from the other modality, thereby learning more representative global features.
Results: Experimental results show that the method achieves state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.

Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correct corresponding information across different modalities. For this purpose, inspired by the success of masked language modeling (MLM) tasks in the NLP pre-training area, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of previous masked modeling tasks is to focus on reconstructing the masked tokens based on visible context for learning local-to-local alignment. However, most of them pay little attention to the global semantic features generated for the masked data, resulting in a limited cross-modal alignment ability of global representations. Therefore, in this paper, we propose a novel Semantic Completion Learning (SCL) task, complementary to existing masked modeling tasks, to facilitate global-to-local alignment. Specifically, the SCL task complements the missing semantics of masked data by capturing the corresponding information from the other modality, promoting learning more representative global features which have a great impact on the performance of downstream tasks. Moreover, we present a flexible vision encoder, which enables our model to perform image-text and video-text multimodal tasks simultaneously. Experimental results show that our proposed method obtains state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.

Understanding and Constructing Latent Modality Structures in Multi-Modal Representation Learning
Jiang, Qian and Chen, Changyou and Zhao, Han and Chen, Liqun and Ping, Qing and Tran, Son Dinh and Xu, Yi and Zeng, Belinda and Chilimbi, Trishul



Research question: Contrastive loss is widely used in multi-modal representation learning, but its effect on downstream task performance remains unclear.
Motivation: Although the contrastive loss encourages modalities to exactly match in the latent space, exact modality alignment does not necessarily yield the best performance on downstream prediction tasks.
Method: Three approaches to construct latent modality structures are proposed: a deep feature separation loss for intra-modality regularization, a Brownian-bridge loss for inter-modality regularization, and a geometric consistency loss for both intra- and inter-modality regularization.
Results: Experiments on two popular multi-modal representation learning frameworks demonstrate the effectiveness and generality of the method on zero/few-shot image classification, image-text retrieval, visual question answering, visual reasoning, and visual entailment.

Contrastive loss has been increasingly used in learning representations from multiple modalities. In the limit, the nature of the contrastive loss encourages modalities to exactly match each other in the latent space. Yet it remains an open question how the modality alignment affects the downstream task performance. In this paper, based on an information-theoretic argument, we first prove that exact modality alignment is sub-optimal in general for downstream prediction tasks. Hence we advocate that the key of better performance lies in meaningful latent modality structures instead of perfect modality alignment. To this end, we propose three general approaches to construct latent modality structures. Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization. Extensive experiments are conducted on two popular multi-modal representation learning frameworks: the CLIP-based two-tower model and the ALBEF-based fusion model. We test our model on a variety of tasks including zero/few-shot image classification, image-text retrieval, visual question answering, visual reasoning, and visual entailment. Our method achieves consistent improvements over existing methods, demonstrating the effectiveness and generalizability of our proposed approach on latent modality structure regularization.
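Of the three regularizers, the Brownian-bridge loss has a simple core: an intermediate representation between two modality endpoints is constrained to stay near the time-t mean of a bridge connecting them. A minimal sketch of that mean-matching term only (a Brownian bridge also has a time-dependent variance, which this toy version omits):

```python
import numpy as np

def brownian_bridge_loss(z0, zt, z1, t):
    """Penalize deviation of an intermediate representation `zt` from the
    bridge mean between endpoints `z0` and `z1` at time t in [0, 1]."""
    expected = (1 - t) * z0 + t * z1          # bridge mean at time t
    return np.mean((zt - expected) ** 2)

z0 = np.array([0.0, 0.0])                     # e.g., image-side embedding
z1 = np.array([1.0, 1.0])                     # e.g., text-side embedding
on_bridge = brownian_bridge_loss(z0, np.array([0.5, 0.5]), z1, 0.5)
off_bridge = brownian_bridge_loss(z0, np.array([2.0, -1.0]), z1, 0.5)
```

A representation lying on the straight-line mean incurs zero penalty, while one far off the bridge is pushed back toward it, imposing structure between (rather than exact alignment of) the two modalities.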

Variational Distribution Learning for Unsupervised Text-to-Image Generation
Kang, Minsoo and Lee, Doyup and Kim, Jiseob and Kim, Saehoon and Han, Bohyung



Research question: How to devise a deep-neural-network-based text-to-image generation algorithm when text captions for images are unavailable during training.
Motivation: Instead of simply generating pseudo-ground-truth sentences for training images as existing captioning methods do, a pre-trained CLIP model is employed, which properly aligns embeddings of images and corresponding texts in a joint space and therefore performs well on zero-shot recognition tasks.
Method: The text-to-image generation model is optimized by maximizing the data log-likelihood conditioned on pairs of image-text CLIP embeddings. To better align data in the two domains, a principled variational-inference approach is adopted that efficiently estimates an approximate posterior of the hidden text embedding given an image and its CLIP feature.
Results: Experimental results show that the proposed framework outperforms existing approaches by large margins under unsupervised and semi-supervised text-to-image generation settings.

We propose a text-to-image generation algorithm based on deep neural networks when text captions for images are unavailable during training. In this work, instead of simply generating pseudo-ground-truth sentences of training images using existing image captioning methods, we employ a pretrained CLIP model, which is capable of properly aligning embeddings of images and corresponding texts in a joint space and, consequently, works well on zero-shot recognition tasks. We optimize a text-to-image generation model by maximizing the data log-likelihood conditioned on pairs of image-text CLIP embeddings. To better align data in the two domains, we employ a principled way based on a variational inference, which efficiently estimates an approximate posterior of the hidden text embedding given an image and its CLIP feature. Experimental results validate that the proposed framework outperforms existing approaches by large margins under unsupervised and semi-supervised text-to-image generation settings.

Cross-Domain Image Captioning With Discriminative Finetuning
Dessì



Research question: Neural captioners trained to mimic human-generated references do not optimize for any specific communication goal, leading to vague captions.
Motivation: Fine-tuning an off-the-shelf neural captioner with a self-supervised discriminative communication objective helps recover plain, visually descriptive language that is more informative about image content.
Method: Given a target image, the system must learn to produce a description that enables an off-the-shelf text-conditioned image retriever to identify that image among a set of candidates. Experiments use the popular ClipCap captioner, with the main results replicated using BLIP.
Results: In terms of similarity to ground-truth human descriptions, captions from discriminative finetuning lag slightly behind those generated by the non-finetuned model when the latter is trained and tested on the same caption dataset. However, when the model generates captions for out-of-domain datasets, the discriminatively finetuned captioner produces descriptions closer to human references than the non-finetuned one. Furthermore, on the Conceptual Captions dataset, discriminatively finetuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions for human annotators performing an image discrimination task.

Neural captioners are typically trained to mimic human-generated references without optimizing for any specific communication goal, leading to problems such as the generation of vague captions. In this paper, we show that fine-tuning an out-of-the-box neural captioner with a self-supervised discriminative communication objective helps to recover a plain, visually descriptive language that is more informative about image contents. Given a target image, the system must learn to produce a description that enables an out-of-the-box text-conditioned image retriever to identify such image among a set of candidates. We experiment with the popular ClipCap captioner, also replicating the main results with BLIP. In terms of similarity to ground-truth human descriptions, the captions emerging from discriminative finetuning lag slightly behind those generated by the non-finetuned model, when the latter is trained and tested on the same caption dataset. However, when the model is used without further tuning to generate captions for out-of-domain datasets, our discriminatively-finetuned captioner generates descriptions that resemble human references more than those produced by the same captioner without finetuning. We further show that, on the Conceptual Captions dataset, discriminatively finetuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions for human annotators tasked with an image discrimination task.

Accelerating Vision-Language Pretraining With Free Language Modeling
Wang, Teng and Ge, Yixiao and Zheng, Feng and Cheng, Ran and Shan, Ying and Qie, Xiaohu and Luo, Ping



Research question: Existing vision-language pre-training (VLP) models perform well but are costly to train, especially on large-scale web datasets, due to slow convergence and long training time.
Motivation: A key obstacle to VLP training efficiency is that the prediction rate and corruption rate in masked language modeling (MLM) are entangled: achieving a proper corruption rate requires excluding a large portion of output tokens from the prediction loss.
Method: To accelerate VLP convergence, a new pre-training task, free language modeling (FLM), is proposed, which enables a 100% prediction rate with arbitrary corruption rates. By allowing each to-be-predicted token to customize its corruption span, FLM successfully decouples the prediction rate from the corruption rate.
Results: Experiments show that, compared with MLM-based methods, FLM achieves a 2.5x pre-training time reduction while keeping competitive performance.

The state of the arts in vision-language pretraining (VLP) achieves exemplary performance but suffers from high training costs resulting from slow convergence and long training time, especially on large-scale web datasets. An essential obstacle to training efficiency lies in the entangled prediction rate (percentage of tokens for reconstruction) and corruption rate (percentage of corrupted tokens) in masked language modeling (MLM), that is, a proper corruption rate is achieved at the cost of a large portion of output tokens being excluded from prediction loss. To accelerate the convergence of VLP, we propose a new pretraining task, namely, free language modeling (FLM), that enables a 100% prediction rate with arbitrary corruption rates. FLM successfully frees the prediction rate from the tie-up with the corruption rate while allowing the corruption spans to be customized for each token to be predicted. FLM-trained models are encouraged to learn better and faster given the same GPU time by exploiting bidirectional contexts more flexibly. Extensive experiments show FLM could achieve an impressive 2.5x pretraining time reduction in comparison to the MLM-based methods, while keeping competitive performance on both vision-language understanding and generation tasks.
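The entanglement the authors describe can be seen with simple bookkeeping: in MLM, the loss is computed only on corrupted positions, so the prediction rate is tied to the corruption rate, whereas FLM scores every token while corrupting an arbitrary fraction of context. A toy accounting of the two rates (an illustration of the bookkeeping, not the paper's implementation):

```python
import numpy as np

def mlm_rates(corruption_mask):
    """In MLM, the loss covers only corrupted tokens, so the
    prediction rate is tied to the corruption rate."""
    corruption_rate = corruption_mask.mean()
    prediction_rate = corruption_mask.mean()  # same positions enter the loss
    return prediction_rate, corruption_rate

def flm_rates(corruption_mask):
    """In FLM, every token is predicted (100% prediction rate), while a
    per-token corrupted context span is chosen independently."""
    return 1.0, corruption_mask.mean()

mask = np.array([1, 0, 0, 1, 0, 0, 0, 0], dtype=float)  # 25% corrupted
mlm_pred, mlm_corr = mlm_rates(mask)
flm_pred, flm_corr = flm_rates(mask)
```

With the same 25% corruption, MLM supervises only 25% of the output tokens, while FLM supervises all of them, which is the source of the claimed faster convergence.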

Masked Autoencoders Enable Efficient Knowledge Distillers
Bai, Yutong and Wang, Zeyu and Xiao, Junfei and Wei, Chen and Wang, Huiyu and Yuille, Alan L. and Zhou, Yuyin and Xie, Cihang



Research question: How to distill knowledge from pre-trained models, especially Masked Autoencoders.
Motivation: By optimizing the pixel reconstruction loss and minimizing the distance between the intermediate feature maps of the teacher and student models, a computationally efficient knowledge distillation framework can be designed.
Method: Knowledge distillation is performed using only a small visible subset of patches and a partially executed teacher model (i.e., forward-propagating inputs through only the first few layers) to obtain intermediate feature maps.
Results: Compared with directly distilling fine-tuned models, distilling pre-trained models substantially improves downstream performance. For example, by distilling a pre-trained ViT-L into a ViT-B, the method achieves 84.0% ImageNet top-1 accuracy, 1.2% higher than the baseline of directly distilling a fine-tuned ViT-L. More intriguingly, the method robustly distills knowledge from the teacher even at extremely high masking ratios.

This paper studies the potential of distilling knowledge from pre-trained models, especially Masked Autoencoders. Our approach is simple: in addition to optimizing the pixel reconstruction loss on masked inputs, we minimize the distance between the intermediate feature map of the teacher model and that of the student model. This design leads to a computationally efficient knowledge distillation framework, given 1) only a small visible subset of patches is used, and 2) the (cumbersome) teacher model only needs to be partially executed, i.e., forward propagate inputs through the first few layers, for obtaining intermediate feature maps. Compared to directly distilling fine-tuned models, distilling pre-trained models substantially improves downstream performance. For example, by distilling the knowledge from an MAE pre-trained ViT-L into a ViT-B, our method achieves 84.0% ImageNet top-1 accuracy, outperforming the baseline of directly distilling a fine-tuned ViT-L by 1.2%. More intriguingly, our method can robustly distill knowledge from teacher models even with extremely high masking ratios: e.g., with 95% masking ratio where merely TEN patches are visible during distillation, our ViT-B competitively attains a top-1 ImageNet accuracy of 83.6%; surprisingly, it can still secure 82.4% top-1 ImageNet accuracy by aggressively training with just FOUR visible patches (98% masking ratio). The code will be made publicly available.
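The distillation target described above reduces to a feature distance between teacher and student intermediate features, computed only on the small visible subset of patches, with the teacher run only partway. A schematic version, with random tanh "layers" standing in for the real ViT blocks (all sizes and the 2-layer teacher cut are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(layers, x):
    """Forward through a stack of toy linear+tanh 'layers'
    (a stand-in for transformer blocks)."""
    for w in layers:
        x = np.tanh(x @ w)
    return x

dim = 16
teacher = [rng.normal(scale=0.5, size=(dim, dim)) for _ in range(6)]
student = [rng.normal(scale=0.5, size=(dim, dim)) for _ in range(3)]

patches = rng.normal(size=(64, dim))          # 64 patch tokens of one image
visible = rng.permutation(64)[:8]             # keep ~12% visible, mask the rest
x = patches[visible]

# Teacher is only partially executed: the first few layers suffice
# to produce the intermediate feature map used as the target.
t_feat = forward(teacher[:2], x)
s_feat = forward(student, x)
distill_loss = np.mean((t_feat - s_feat) ** 2)
```

Because only 8 of 64 tokens are processed and the teacher stops after two of its six layers, the per-step cost is a small fraction of full distillation, mirroring the efficiency argument in the abstract.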

TinyMIM: An Empirical Study of Distilling MIM Pre-Trained Models
Ren, Sucheng and Wei, Fangyun and Zhang, Zheng and Hu, Han



Research question: How to transfer the pre-training success of large vision Transformers to small models.
Motivation: While large vision Transformers pre-trained on large-scale image corpora perform well, small models, which are critical for real-world applications, benefit only marginally, if at all, from this pre-training approach.
Method: Knowledge distillation is used to transfer the success of large masked-image-modeling (MIM) pre-trained models to smaller ones. Different options in the distillation framework are studied systematically, including distillation targets, losses, inputs, network regularization, and sequential distillation.
Results: Experimental results show fine-tuning accuracy gains of +4.2%/+2.4%/+1.4% on ImageNet-1K classification with the ViT-Tiny, ViT-Small, and ViT-Base models, respectively. The base-size TinyMIM model achieves 52.2 mIoU on ADE20K semantic segmentation, +4.1 over the MAE baseline, and the tiny-size model reaches 79.6% top-1 accuracy on ImageNet-1K, a new record for small vision models of the same size and computation budget. This strong performance suggests that an alternative way to develop small vision Transformer models is to explore better training methods rather than introducing inductive biases into architectures, as most previous works do.

Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc., revealing that: 1) distilling token relations is more effective than CLS-token- and feature-based distillation; 2) an intermediate layer of the teacher network as the target performs better than the last layer when the depth of the student mismatches that of the teacher; 3) weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over scratch MIM pre-training on ImageNet-1K classification, using all of the ViT-Tiny, ViT-Small, and ViT-Base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU in ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way for developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
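The finding that distilling token relations beats distilling raw features can be sketched as matching the teacher's and student's token-to-token similarity matrices instead of the token embeddings themselves. The normalized dot-product relation and the MSE matching loss below are illustrative choices, not necessarily the paper's exact formulation:

```python
import numpy as np

def token_relations(tokens):
    """Pairwise token similarity matrix (here: cosine similarities).
    Relations are width-free, so teacher and student dims may differ."""
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    return t @ t.T

def relation_distill_loss(teacher_tokens, student_tokens):
    """Match the student's token relations to the teacher's."""
    rt = token_relations(teacher_tokens)
    rs = token_relations(student_tokens)
    return np.mean((rt - rs) ** 2)

rng = np.random.default_rng(0)
teacher_tokens = rng.normal(size=(16, 64))    # wide teacher embedding
student_tokens = rng.normal(size=(16, 32))    # narrower student embedding:
loss = relation_distill_loss(teacher_tokens, student_tokens)  # still well-defined
```

A practical convenience of this target is visible in the shapes: the relation matrix is (tokens x tokens), so no projection layer is needed to bridge teacher and student widths.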

OneFormer: One Transformer To Rule Universal Image Segmentation
Jain, Jitesh and Li, Jiachen and Chiu, Mang Tik and Hassani, Ali and Orlov, Nikita and Shi, Humphrey



Research question: How to build a unified image segmentation framework that improves semantic, instance, and panoptic segmentation simultaneously.
Motivation: Existing image segmentation methods must be trained individually for semantic, instance, or panoptic segmentation and lack unification. Ideally, a truly universal framework would be trained only once and achieve state-of-the-art performance across all three image segmentation tasks.
Method: OneFormer is a universal image segmentation framework that unifies segmentation with a multi-task train-once design. First, a task-conditioned joint training strategy enables training on the ground truths of each domain (semantic, instance, and panoptic segmentation) within a single multi-task training process. Second, a task token conditions the model on the task at hand, making it task-dynamic to support multi-task training and inference. Third, a query-text contrastive loss during training establishes better inter-task and inter-class distinctions.
Results: Experimental results show that the single OneFormer model outperforms specialized Mask2Former models across all three segmentation tasks, even though the latter are trained individually on each task.

Universal Image Segmentation is not a new concept. Past attempts to unify image segmentation include scene parsing, panoptic segmentation, and, more recently, new panoptic architectures. However, such panoptic architectures do not truly unify image segmentation because they need to be trained individually on the semantic, instance, or panoptic segmentation to achieve the best performance. Ideally, a truly universal framework should be trained only once and achieve SOTA performance across all three image segmentation tasks. To that end, we propose OneFormer, a universal image segmentation framework that unifies segmentation with a multi-task train-once design. We first propose a task-conditioned joint training strategy that enables training on ground truths of each domain (semantic, instance, and panoptic segmentation) within a single multi-task training process. Secondly, we introduce a task token to condition our model on the task at hand, making our model task-dynamic to support multi-task training and inference. Thirdly, we propose using a query-text contrastive loss during training to establish better inter-task and inter-class distinctions. Notably, our single OneFormer model outperforms specialized Mask2Former models across all three segmentation tasks on ADE20k, Cityscapes, and COCO, despite the latter being trained on each task individually. We believe OneFormer is a significant step towards making image segmentation more universal and accessible.

GeoLayoutLM: Geometric Pre-Training for Visual Information Extraction
Luo, Chuwei and Cheng, Changxu and Zheng, Qi and Yao, Cong



Research question: This paper addresses visual information extraction (VIE) from document images, in particular the relation extraction (RE) task.
Motivation: Current pre-trained document models learn geometric information only implicitly, which is insufficient for relation extraction, and the large objective gap between the pre-training and fine-tuning phases further limits RE performance.
Method: A multi-modal framework named GeoLayoutLM performs geometric pre-training through three specially designed geometry-related pre-training tasks, and novel relation heads, pre-trained by these tasks and fine-tuned for RE, are designed to enrich and enhance feature representations.
Results: Experimental results show that GeoLayoutLM achieves highly competitive scores on semantic entity recognition and significantly outperforms previous state-of-the-art models on relation extraction (e.g., the RE F1 score on FUNSD is boosted from 80.35% to 89.45%).

Visual information extraction (VIE) plays an important role in Document Intelligence. Generally, it is divided into two tasks: semantic entity recognition (SER) and relation extraction (RE). Recently, pre-trained models for documents have achieved substantial progress in VIE, particularly in SER. However, most of the existing models learn the geometric representation in an implicit way, which has been found insufficient for the RE task since geometric information is especially crucial for RE. Moreover, we reveal another factor that limits the performance of RE lies in the objective gap between the pre-training phase and the fine-tuning phase for RE. To tackle these issues, we propose in this paper a multi-modal framework, named GeoLayoutLM, for VIE. GeoLayoutLM explicitly models the geometric relations in pre-training, which we call geometric pre-training. Geometric pre-training is achieved by three specially designed geometry-related pre-training tasks. Additionally, novel relation heads, which are pre-trained by the geometric pre-training tasks and fine-tuned for RE, are elaborately designed to enrich and enhance the feature representation. According to extensive experiments on standard VIE benchmarks, GeoLayoutLM achieves highly competitive scores in the SER task and significantly outperforms the previous state of the art for RE (e.g., the F1 score of RE on FUNSD is boosted from 80.35% to 89.45%).

Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework
Zheng, Sipeng and Xu, Boshen and Jin, Qin



Research question: Existing human-object interaction (HOI) prediction methods are constrained by limited supervised data and the vast number of possible interaction combinations in real life, and do not scale well to open categories.
Motivation: To address this, OpenCat is proposed, a language modeling framework that reformulates HOI prediction as sequence generation.
Method: HOI triplets are converted into token sequences through a serialization scheme, enabling the model to exploit the open vocabulary of the language modeling framework to predict novel interaction categories. In addition, a large amount of weakly supervised data is collected from image-caption pairs, and several auxiliary proxy tasks are devised for pre-training.
Results: Experiments show that OpenCat significantly boosts HOI performance, particularly on a broad range of rare and unseen categories.

Human-object interaction (HOI) has long been plagued by the conflict between limited supervised data and a vast number of possible interaction combinations in real life. Current methods trained from closed-set data predict HOIs as fixed-dimension logits, which restricts their scalability to open-set categories. To address this issue, we introduce OpenCat, a language modeling framework that reformulates HOI prediction as sequence generation. By converting HOI triplets into a token sequence through a serialization scheme, our model is able to exploit the open-set vocabulary of the language modeling framework to predict novel interaction classes with a high degree of freedom. In addition, inspired by the great success of vision-language pre-training, we collect a large amount of weakly-supervised data related to HOI from image-caption pairs, and devise several auxiliary proxy tasks, including soft relational matching and human-object relation prediction, to pre-train our model. Extensive experiments show that our OpenCat significantly boosts HOI performance, particularly on a broad range of rare and unseen categories.

CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not
Sain, Aneeshan and Bhunia, Ayan Kumar and Chowdhury, Pinaki Nath and Koley, Subhadeep and Xiang, Tao and Song, Yi-Zhe



Research question: This paper leverages CLIP for zero-shot sketch-based image retrieval (ZS-SBIR).
Motivation: Inspired by recent advances in foundation models and the unparalleled generalisation ability they offer, CLIP is tailored for the first time to benefit the sketch community.
Method: Novel designs are put forward on how best to achieve this synergy, for both the category setting and the fine-grained setting ("all"); at the core of the solution is a prompt learning setup.
Results: Experimental results show that simply factoring in sketch-specific prompts already yields a category-level ZS-SBIR system that overshoots all prior art by a large margin of 24.8%. The fine-grained setting is trickier and requires a deeper dive into this synergy. Two specific designs tackle the fine-grained matching nature of the problem: (i) an additional regularisation loss ensuring that the relative separation between sketches and photos is uniform across categories, unlike the gold-standard standalone triplet loss; and (ii) a clever patch shuffling technique that helps establish instance-level structural correspondences between sketch-photo pairs. With these designs, a further gain of about 26.9% over the previous state of the art is observed.

In this paper, we leverage CLIP for zero-shot sketch based image retrieval (ZS-SBIR). We are largely inspired by recent advances on foundation models and the unparalleled generalisation ability they seem to offer, but for the first time tailor it to benefit the sketch community. We put forward novel designs on how best to achieve this synergy, for both the category setting and the fine-grained setting ("all"). At the very core of our solution is a prompt learning setup. First, we show that just by factoring in sketch-specific prompts, we already have a category-level ZS-SBIR system that overshoots all prior arts, by a large margin (24.8%) - a great testimony on studying the CLIP and ZS-SBIR synergy. Moving onto the fine-grained setup is however trickier, and requires a deeper dive into this synergy. For that, we come up with two specific designs to tackle the fine-grained matching nature of the problem: (i) an additional regularisation loss to ensure the relative separation between sketches and photos is uniform across categories, which is not the case for the gold standard standalone triplet loss, and (ii) a clever patch shuffling technique to help establish instance-level structural correspondences between sketch-photo pairs. With these designs, we again observe significant performance gains in the region of 26.9% over previous state-of-the-art. The take-home message, if any, is that the proposed CLIP and prompt learning paradigm carries great promise in tackling other sketch-related tasks (not limited to ZS-SBIR) where data scarcity remains a great challenge. Project page: https://aneeshan95.github.io/Sketch_LVM/
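The patch shuffling idea can be illustrated with a minimal sketch: applying the same permutation of spatial patches to a sketch-photo pair destroys absolute layout while preserving which sketch patch corresponds to which photo patch. This is an assumed, simplified reading of the technique, not the paper's exact training procedure.

```python
import numpy as np

# Illustrative sketch: shuffle a sketch-photo pair with a shared permutation,
# so instance-level structural correspondence between patches is preserved.
def shuffle_pair(sketch_patches, photo_patches, seed=0):
    assert len(sketch_patches) == len(photo_patches)
    perm = np.random.default_rng(seed).permutation(len(sketch_patches))
    return ([sketch_patches[i] for i in perm],
            [photo_patches[i] for i in perm])

# toy patch labels: "s0".."s3" for the sketch, "p0".."p3" for the photo
s, p = shuffle_pair(["s0", "s1", "s2", "s3"], ["p0", "p1", "p2", "p3"])
```

Because both views are permuted identically, a model matching shuffled pairs must rely on patch-level correspondence rather than global layout.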

Learning To Generate Language-Supervised and Open-Vocabulary Scene Graph Using Pre-Trained Visual-Semantic Space
Zhang, Yong and Pan, Yingwei and Yao, Ting and Huang, Rui and Mei, Tao and Chen, Chang-Wen



Research question: This paper tackles two obstacles that limit scene graph generation (SGG) in practice: 1) training SGG models requires time-consuming manual annotation, and 2) closed-set object categories limit SGG models' ability to recognize novel objects beyond the training corpora.
Motivation: To address these issues, the authors propose a novel approach that exploits a pre-trained visual-semantic space (VSS) to trigger language-supervised and open-vocabulary scene graph generation.
Method: First, cheap scene graph supervision is obtained by parsing image language descriptions into semantic graphs. Next, the noun phrases in these semantic graphs are grounded over image regions directly through region-word alignment in the pre-trained VSS. Finally, relation representations are naturally built on top of the visually grounded objects, enabling open-vocabulary scene graph generation.
Results: Extensive experiments on the Visual Genome benchmark show that the method outperforms existing approaches across various SGG scenarios (supervised/language-supervised, closed-set/open-vocabulary), demonstrating the potential of pre-trained VSS for SGG in more practical settings.

Scene graph generation (SGG) aims to abstract an image into a graph structure, by representing objects as graph nodes and their relations as labeled edges. However, two knotty obstacles limit the practicability of current SGG methods in real-world scenarios: 1) training SGG models requires time-consuming ground-truth annotations, and 2) the closed-set object categories make the SGG models limited in their ability to recognize novel objects outside of training corpora. To address these issues, we novelly exploit a powerful pre-trained visual-semantic space (VSS) to trigger language-supervised and open-vocabulary SGG in a simple yet effective manner. Specifically, cheap scene graph supervision data can be easily obtained by parsing image language descriptions into semantic graphs. Next, the noun phrases on such semantic graphs are directly grounded over image regions through region-word alignment in the pre-trained VSS. In this way, we enable open-vocabulary object detection by performing object category name grounding with a text prompt in this VSS. On the basis of visually-grounded objects, the relation representations are naturally built for relation recognition, pursuing open-vocabulary SGG. We validate our proposed approach with extensive experiments on the Visual Genome benchmark across various SGG scenarios (i.e., supervised / language-supervised, closed-set / open-vocabulary). Consistent superior performances are achieved compared with existing methods, demonstrating the potential of exploiting pre-trained VSS for SGG in more practical scenarios.

DeCo: Decomposition and Reconstruction for Compositional Temporal Grounding via Coarse-To-Fine Contrastive Ranking
Yang, Lijin and Kong, Quan and Yang, Hsuan-Kung and Kehl, Wadim and Sato, Yoichi and Kobori, Norimasa



Research question: Understanding dense action in videos is a fundamental challenge toward the generalization of vision models.
Motivation: Generalization can be achieved by composing known primitive elements, especially for handling novel composited structures.
Method: We propose a coarse-to-fine compositional representation learning method that decomposes the original query sentence into different granularity levels, and then learns the correct correspondences between the video and recombined queries through a contrastive ranking constraint.
Results: Experiments on the Charades-CG and ActivityNet-CG datasets demonstrate the superior compositional generalizability of the approach.

Understanding dense action in videos is a fundamental challenge towards the generalization of vision models. Several works show that compositionality is key to achieving generalization by combining known primitive elements, especially for handling novel composited structures. Compositional temporal grounding is the task of localizing dense action by using known words combined in novel ways in the form of novel query sentences for the actual grounding. In recent works, composition is assumed to be learned from pairs of whole videos and language embeddings through large scale self-supervised pre-training. Alternatively, one can process the video and language into word-level primitive elements, and then only learn fine-grained semantic correspondences. Both approaches do not consider the granularity of the compositions, where different query granularity corresponds to different video segments. Therefore, a good compositional representation should be sensitive to different video and query granularity. We propose a method to learn a coarse-to-fine compositional representation by decomposing the original query sentence into different granular levels, and then learning the correct correspondences between the video and recombined queries through a contrastive ranking constraint. Additionally, we run temporal boundary prediction in a coarse-to-fine manner for precise grounding boundary detection. Experiments are performed on two datasets, Charades-CG and ActivityNet-CG, showing the superior compositional generalizability of our approach.
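The coarse-to-fine decomposition of a query into granularity levels can be sketched as follows. The splitting rule here (whole sentence, fixed two-word phrases, single words) is a deliberately simple stand-in; DeCo's actual decomposition parses the sentence rather than chunking it.

```python
# Illustrative sketch of decomposing a query into granularity levels
# (sentence -> phrases -> words); the chunking rule is an assumption.
def decompose_query(sentence):
    words = sentence.split()
    return {
        "sentence": sentence,                                        # coarse
        "phrases": [" ".join(words[i:i + 2])                         # middle
                    for i in range(0, len(words), 2)],
        "words": words,                                              # fine
    }

levels = decompose_query("person opens the door")
```

Each level would then be matched against video segments of a corresponding temporal extent, with a ranking loss enforcing that finer queries align with shorter segments.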

Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners
Zhang, Renrui and Hu, Xiangfei and Li, Bohao and Huang, Siyuan and Deng, Hanqiu and Qiao, Yu and Gao, Peng and Li, Hongsheng



Research question: This paper explores how to exploit the diverse prior knowledge of various pre-training paradigms for better few-shot learning.
Motivation: In low-data regimes, visual recognition requires deep neural networks to learn generalized representations from limited training samples. Recent CLIP-based methods have shown promising few-shot gains from contrastive language-image pre-training.
Method: We propose CaFo, a Cascade of Foundation models that incorporates the diverse prior knowledge of various pre-training paradigms for better few-shot learning. CaFo integrates CLIP's language-contrastive knowledge, DINO's vision-contrastive knowledge, DALL-E's vision-generative knowledge, and GPT-3's language-generative knowledge. Specifically, CaFo works by "Prompt, Generate, then Cache".
Results: Experiments show that, through this collaboration, CaFo can fully unleash the potential of different pre-training methods and unify them to achieve state-of-the-art few-shot classification.

Visual recognition in low-data regimes requires deep neural networks to learn generalized representations from limited training samples. Recently, CLIP-based methods have shown promising few-shot performance benefiting from the contrastive language-image pre-training. We then question if the more diverse pre-training knowledge can be cascaded to further assist few-shot representation learning. In this paper, we propose CaFo, a Cascade of Foundation models that incorporates diverse prior knowledge of various pre-training paradigms for better few-shot learning. Our CaFo incorporates CLIP's language-contrastive knowledge, DINO's vision-contrastive knowledge, DALL-E's vision-generative knowledge, and GPT-3's language-generative knowledge. Specifically, CaFo works by 'Prompt, Generate, then Cache'. Firstly, we leverage GPT-3 to produce textual inputs for prompting CLIP with rich downstream linguistic semantics. Then, we generate synthetic images via DALL-E to expand the few-shot training data without any manpower. At last, we introduce a learnable cache model to adaptively blend the predictions from CLIP and DINO. By such collaboration, CaFo can fully unleash the potential of different pre-training methods and unify them to perform state-of-the-art for few-shot classification. Code is available at https://github.com/ZrrSkywalker/CaFo.
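The final "Cache" step blends predictions from CLIP and DINO. A minimal sketch of that blending is shown below; CaFo's cache model learns the blend adaptively, whereas this stand-in uses a fixed scalar weight, which is an assumption for illustration only.

```python
import numpy as np

# Hedged sketch of prediction blending: CaFo's actual cache model is
# learnable and adaptive; here a fixed scalar alpha stands in for it.
def blend_logits(clip_logits, dino_logits, alpha=0.6):
    clip_logits = np.asarray(clip_logits, dtype=float)
    dino_logits = np.asarray(dino_logits, dtype=float)
    return alpha * clip_logits + (1.0 - alpha) * dino_logits

# equal weighting of two toy 2-class logit vectors
fused = blend_logits([2.0, 0.0], [0.0, 1.0], alpha=0.5)
```

The fused logits combine language-contrastive and vision-contrastive evidence, which is the intuition behind cascading the two pre-trained models.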

NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations
Hsu, Joy and Mao, Jiayuan and Wu, Jiajun



Research question: How to ground object properties and relations in 3D scenes with language, in support of a range of artificial intelligence tasks.
Motivation: The variability of the 3D domain induces two major challenges: the expense of labeling and the complexity of 3D grounded language. Models therefore need to be data-efficient, generalize to different data distributions and tasks with unseen semantic forms, and handle complex language semantics (e.g., view-point anchoring and multi-object reference).
Method: We propose NS3D, a neuro-symbolic framework for 3D scenes. NS3D translates language into hierarchically structured programs by leveraging large language-to-code models, with the functional modules in the programs implemented as neural networks. Notably, NS3D extends prior neuro-symbolic visual reasoning methods by introducing functional modules that effectively handle high-arity relations (i.e., relations among more than two objects). Its modular and compositional architecture enables NS3D to achieve state-of-the-art results on the ReferIt3D view-dependence task.
Results: NS3D shows significantly improved performance in data-efficiency and generalization settings, and achieves zero-shot transfer to an unseen 3D question-answering task.

Grounding object properties and relations in 3D scenes is a prerequisite for a wide range of artificial intelligence tasks, such as visually grounded dialogues and embodied manipulation. However, the variability of the 3D domain induces two fundamental challenges: 1) the expense of labeling and 2) the complexity of 3D grounded language. Hence, essential desiderata for models are to be data-efficient, generalize to different data distributions and tasks with unseen semantic forms, as well as ground complex language semantics (e.g., view-point anchoring and multi-object reference). To address these challenges, we propose NS3D, a neuro-symbolic framework for 3D grounding. NS3D translates language into programs with hierarchical structures by leveraging large language-to-code models. Different functional modules in the programs are implemented as neural networks. Notably, NS3D extends prior neuro-symbolic visual reasoning methods by introducing functional modules that effectively reason about high-arity relations (i.e., relations among more than two objects), key in disambiguating objects in complex 3D scenes. Its modular and compositional architecture enables NS3D to achieve state-of-the-art results on the ReferIt3D view-dependence task, a 3D referring expression comprehension benchmark. Importantly, NS3D shows significantly improved performance on settings of data-efficiency and generalization, and demonstrates zero-shot transfer to an unseen 3D question-answering task.

VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking
Wang, Limin and Huang, Bingkun and Zhao, Zhiyu and Tong, Zhan and He, Yinan and Wang, Yi and Wang, Yali and Qiao, Yu



Research question: How to train a powerful foundation model that generalizes well to a variety of downstream tasks.
Motivation: Although scale is the primary factor in building powerful foundation models, training video foundation models with billions of parameters remains challenging.
Method: This paper presents the video masked autoencoder (VideoMAE), a scalable and general self-supervised pre-trainer for building video foundation models. The model and data are scaled with a core design, and a dual masking strategy is proposed for efficient pre-training.
Results: Experiments show that VideoMAE can efficiently pre-train billion-parameter models and achieves new state-of-the-art performance on the Kinetics and Something-Something datasets. The pre-trained video ViT models also prove effective on a variety of downstream tasks.

Scale is the primary factor for building a powerful foundation model that could well generalize to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale the VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is already very efficient due to the high masking ratio in the encoder, masking the decoder can still further reduce the overall computational cost. This enables the efficient pre-training of billion-level models in video. We also introduce a progressive training paradigm that involves initial pre-training on the diverse multi-sourced unlabeled dataset, followed by fine-tuning on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance on the datasets of Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2). In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating their effectiveness as general video representation learners.
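The dual masking strategy, as described, has the encoder operate on one small subset of video tokens while the decoder processes another subset. The sketch below illustrates that split over token indices; the keep ratios and the random selection rule are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

# Illustrative sketch of dual masking: the encoder sees only a small visible
# subset of video tokens, and the decoder reconstructs a disjoint subset,
# so neither branch ever processes all tokens.
def dual_mask(num_tokens, enc_keep=0.1, dec_keep=0.5, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_tokens)
    n_enc = int(num_tokens * enc_keep)
    enc_idx = perm[:n_enc]                        # tokens fed to the encoder
    rest = perm[n_enc:]
    dec_idx = rest[: int(num_tokens * dec_keep)]  # tokens the decoder targets
    return enc_idx, dec_idx

# e.g. a clip tokenized into 1568 spatio-temporal tube tokens
enc_idx, dec_idx = dual_mask(1568)
```

Restricting both branches to subsets is what makes billion-parameter pre-training computationally feasible.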

RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training
Xie, Chen-Wei and Sun, Siyang and Xiong, Xiong and Zheng, Yun and Zhao, Deli and Zhou, Jingren



Research question: This paper addresses the data-hungriness of CLIP training, which requires large numbers of image-text pairs.
Motivation: Existing CLIP models perform well on various downstream tasks, but training them requires massive image-text pairs to memorize various semantic concepts.
Method: We propose a novel and efficient framework, RA-CLIP, which augments embeddings by online retrieval. Specifically, we sample part of the image-text data as a reference set. Given an input image, relevant image-text pairs are retrieved from the reference set to enrich the representation of the input image. This process can be viewed as an open-book exam: with the reference set as a cheat sheet, the method does not need to memorize all visual concepts in the training data, but instead learns to recognize visual concepts by exploiting the correspondence between images and texts in the reference set.
Results: Comprehensive experiments on 10 image classification datasets and 2 object detection datasets show that RA-CLIP outperforms the vanilla CLIP baseline by large margins on zero-shot image classification (+12.7%), linear-probe image classification (+6.9%), and zero-shot ROI classification (+2.8%).

Contrastive Language-Image Pre-training (CLIP) is attracting increasing attention for its impressive zero-shot recognition performance on different down-stream tasks. However, training CLIP is data-hungry and requires lots of image-text pairs to memorize various semantic concepts. In this paper, we propose a novel and efficient framework: Retrieval Augmented Contrastive Language-Image Pre-training (RA-CLIP) to augment embeddings by online retrieval. Specifically, we sample part of image-text data as a hold-out reference set. Given an input image, relevant image-text pairs are retrieved from the reference set to enrich the representation of input image. This process can be considered as an open-book exam: with the reference set as a cheat sheet, the proposed method doesn't need to memorize all visual concepts in the training data. It explores how to recognize visual concepts by exploiting correspondence between images and texts in the cheat sheet. The proposed RA-CLIP implements this idea and comprehensive experiments are conducted to show how RA-CLIP works. Performances on 10 image classification datasets and 2 object detection datasets show that RA-CLIP outperforms vanilla CLIP baseline by a large margin on zero-shot image classification task (+12.7%), linear probe image classification task (+6.9%) and zero-shot ROI classification task (+2.8%).
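The retrieval-augmentation step can be sketched in a few lines: retrieve the reference pairs whose images are most similar to the input, then fold their text embeddings into the input representation. RA-CLIP's actual fusion is a learned module; the simple averaging and addition below are assumptions for illustration.

```python
import numpy as np

# Minimal sketch of retrieval augmentation: enrich an image embedding with
# the text embeddings of its top-k nearest reference image-text pairs.
# (RA-CLIP fuses retrieved pairs with a learned module, not a plain sum.)
def retrieval_augment(img_emb, ref_img_embs, ref_txt_embs, k=2):
    img_emb = img_emb / np.linalg.norm(img_emb)
    refs = ref_img_embs / np.linalg.norm(ref_img_embs, axis=1, keepdims=True)
    sims = refs @ img_emb                       # cosine similarity to reference images
    topk = np.argsort(-sims)[:k]                # indices of most relevant pairs
    retrieved = ref_txt_embs[topk].mean(axis=0)
    out = img_emb + retrieved                   # enrich the input representation
    return out / np.linalg.norm(out)

aug = retrieval_augment(np.array([1.0, 0.0]),
                        np.array([[1.0, 0.0], [0.0, 1.0]]),
                        np.array([[0.0, 2.0], [5.0, 5.0]]), k=1)
```

This is the "open-book" intuition: semantic knowledge lives in the reference set, so the backbone does not need to memorize every concept.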

Teacher-Generated Spatial-Attention Labels Boost Robustness and Accuracy of Contrastive Models
Yao, Yushi and Ye, Chang and He, Junfeng and Elsayed, Gamaleldin F.



Research question: Human spatial attention conveys information about important regions of visual scenes; can this be leveraged for self-supervised representation learning?
Motivation: Collecting large-scale human attention labels is expensive, so an auxiliary teacher model is constructed to predict human attention.
Method: A teacher model trained on a relatively small labeled dataset generates (pseudo) attention labels for ImageNet images. A simple output head is then added to a standard contrastive model configuration and trained to predict each image's attention map, guided by the teacher's pseudo labels.
Results: We find that the spatial attention maps predicted by the contrastive model trained with teacher guidance align better with human attention than those from vanilla contrastive models. Moreover, our approach improves the classification accuracy and robustness of contrastive models on ImageNet and ImageNet-C. Finally, we find that the model representations become more useful for image retrieval, as measured by precision-recall performance on the ImageNet, ImageNet-C, CIFAR10, and CIFAR10-C datasets.

Human spatial attention conveys information about the regions of visual scenes that are important for performing visual tasks. Prior work has shown that the information about human attention can be leveraged to benefit various supervised vision tasks. Might providing this weak form of supervision be useful for self-supervised representation learning? Addressing this question requires collecting large datasets with human attention labels. Yet, collecting such large scale data is very expensive. To address this challenge, we construct an auxiliary teacher model to predict human attention, trained on a relatively small labeled dataset. This teacher model allows us to generate image (pseudo) attention labels for ImageNet. We then train a model with a primary contrastive objective; to this standard configuration, we add a simple output head trained to predict the attentional map for each image, guided by the pseudo labels from the teacher model. We measure the quality of learned representations by evaluating classification performance from the frozen learned embeddings as well as performance on image retrieval tasks. We find that the spatial-attention maps predicted from the contrastive model trained with teacher guidance align better with human attention compared to vanilla contrastive models. Moreover, we find that our approach improves classification accuracy and robustness of the contrastive models on ImageNet and ImageNet-C. Further, we find that model representations become more useful for image retrieval tasks as measured by precision-recall performance on the ImageNet, ImageNet-C, CIFAR10, and CIFAR10-C datasets.
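The training setup pairs a primary contrastive objective with an auxiliary attention-prediction head supervised by the teacher's pseudo labels. A sketch of the combined objective is below; the MSE form of the attention loss and the weighting coefficient are assumptions, since the abstract does not specify them.

```python
import numpy as np

# Sketch of the combined objective: contrastive loss plus an auxiliary term
# matching the predicted attention map to the teacher's pseudo attention
# label. The MSE form and weight lam are illustrative assumptions.
def total_loss(contrastive_loss, pred_attn, pseudo_attn, lam=0.5):
    diff = np.asarray(pred_attn) - np.asarray(pseudo_attn)
    attn_loss = float(np.mean(diff ** 2))       # pixel-wise squared error
    return contrastive_loss + lam * attn_loss

# toy 1x2 attention maps
loss = total_loss(1.0, [[0.2, 0.8]], [[0.0, 1.0]], lam=0.5)
```

The auxiliary head only adds a lightweight term on top of the standard contrastive setup, which is why the method is cheap to bolt onto existing models.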

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Yang, Antoine and Nagrani, Arsha and Seo, Paul Hongsuck and Miech, Antoine and Pont-Tuset, Jordi and Laptev, Ivan and Sivic, Josef and Schmid, Cordelia



Research question: How to pre-train a multi-modal single-stage dense event captioning model on narrated videos that are readily available at scale.
Motivation: Existing annotated datasets cannot meet the training needs of such a unified model, whereas unlabeled narrated videos can serve as effective training data.
Method: Unlabeled narrated videos are exploited for dense video captioning by reformulating the sentence boundaries of transcribed speech as pseudo event boundaries and using the transcribed sentences as pseudo event captions.
Results: Pretrained on the YT-Temporal-1B dataset, Vid2Seq improves the state of the art on a variety of dense video captioning benchmarks (including YouCook2, ViTT, and ActivityNet Captions), and generalizes well to video paragraph captioning, video clip captioning, and few-shot settings.

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. Such a unified model requires large-scale training data, which is not available in current annotated datasets. We show that it is possible to leverage unlabeled narrated videos for dense video captioning, by reformulating sentence boundaries of transcribed speech as pseudo event boundaries, and using the transcribed speech sentences as pseudo event captions. The resulting Vid2Seq model pretrained on the YT-Temporal-1B dataset improves the state of the art on a variety of dense video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the tasks of video paragraph captioning and video clip captioning, and to few-shot settings. Our code is publicly available at https://antoyang.github.io/vid2seq.html.
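The pseudo-labeling step above maps directly to a small transformation: each timestamped ASR sentence becomes one pseudo event with its sentence as the caption. The field names in this sketch are illustrative, not Vid2Seq's actual data format.

```python
# Sketch of the pseudo-labeling idea: sentence boundaries of transcribed
# speech become pseudo event boundaries, and the transcribed sentences
# become pseudo event captions (field names are illustrative).
def speech_to_pseudo_events(asr_segments):
    """asr_segments: list of (start_sec, end_sec, sentence) tuples from ASR."""
    return [{"start": s, "end": e, "caption": text}
            for s, e, text in asr_segments]

events = speech_to_pseudo_events([(0.0, 4.2, "crack two eggs"),
                                  (4.2, 9.8, "whisk until smooth")])
```

These pseudo events can then be serialized with special time tokens so a language model predicts boundaries and captions in a single output sequence.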

Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering
Zang, Chuanqi and Wang, Hanqing and Pei, Mingtao and Liang, Wei



Research question: Video Question Answering (VideoQA) is challenging because it requires capturing accurate cross-modal correlations from redundant information.
Motivation: Existing methods focus on the task's explicit challenges, such as multimodal feature extraction, video-text alignment, and fusion, but rely on statistical evidence when reasoning about answers, ignoring potential bias in the multimodal data.
Method: We investigate the relational structure of multimodal data from a causal representation perspective and propose a novel inference framework. For visual data, question-irrelevant objects may establish simple matching associations with the answer; for textual data, the model prefers local phrase semantics, which may deviate from the global semantics of long sentences. To enhance the model's generalization, we therefore discover the real association by explicitly capturing visual features that are causally related to the question semantics and weakening the impact of local language semantics on question answering.
Results: Experimental results on two large causal VideoQA datasets verify that the proposed framework 1) improves the accuracy of existing VideoQA backbones and 2) is robust on complex scenes and questions.

Video Question Answering (VideoQA) is challenging as it requires capturing accurate correlations between modalities from redundant information. Recent methods focus on the explicit challenges of the task, e.g. multimodal feature extraction, video-text alignment and fusion. Their frameworks reason the answer relying on statistical evidence causes, which ignores potential bias in the multimodal data. In our work, we investigate relational structure from a causal representation perspective on multimodal data and propose a novel inference framework. For visual data, question-irrelevant objects may establish simple matching associations with the answer. For textual data, the model prefers the local phrase semantics which may deviate from the global semantics in long sentences. Therefore, to enhance the generalization of the model, we discover the real association by explicitly capturing visual features that are causally related to the question semantics and weakening the impact of local language semantics on question answering. The experimental results on two large causal VideoQA datasets verify that our proposed framework 1) improves the accuracy of the existing VideoQA backbone, 2) demonstrates robustness on complex scenes and questions.

A Simple Framework for Text-Supervised Semantic Segmentation
Yi, Muyang and Cui, Quan and Wu, Hao and Yang, Cheng and Yoshie, Osamu and Lu, Hongtao



Research question: This paper addresses text-supervised semantic segmentation via image-text contrasting, and explores how well a pre-trained model without a specially designed architecture performs on semantic segmentation.
Motivation: Text-supervised semantic segmentation is a novel research topic, but early pioneering methods were subject to specifically designed network architectures. This paper finds that a vanilla contrastive language-image pre-training (CLIP) model is by itself an effective text-supervised semantic segmentor.
Method: First, we reveal that a vanilla CLIP is inferior at localization and segmentation because its optimization is driven by densely aligning visual and language representations. We then propose locality-driven alignment (LoDA) to address this, where CLIP optimization is driven by sparsely aligning local representations. Finally, we propose a simple segmentation (SimSeg) framework. LoDA and SimSeg jointly ameliorate a vanilla CLIP to produce impressive semantic segmentation results.
Results: Our model outperforms previous state-of-the-art methods on the PASCAL VOC 2012, PASCAL Context, and COCO datasets by large margins. Code and models are available at github.com/muyangyi/SimSeg.

Text-supervised semantic segmentation is a novel research topic that allows semantic segments to emerge with image-text contrasting. However, pioneering methods could be subject to specifically designed network architectures. This paper shows that a vanilla contrastive language-image pre-training (CLIP) model is an effective text-supervised semantic segmentor by itself. First, we reveal that a vanilla CLIP is inferior to localization and segmentation due to its optimization being driven by densely aligning visual and language representations. Second, we propose the locality-driven alignment (LoDA) to address the problem, where CLIP optimization is driven by sparsely aligning local representations. Third, we propose a simple segmentation (SimSeg) framework. LoDA and SimSeg jointly ameliorate a vanilla CLIP to produce impressive semantic segmentation results. Our method outperforms previous state-of-the-art methods on PASCAL VOC 2012, PASCAL Context and COCO datasets by large margins. Code and models are available at github.com/muyangyi/SimSeg.

Dynamic Inference With Grounding Based Vision and Language Models
Uzkent, Burak and Garg, Amanmeet and Zhu, Wentao and Doshi, Keval and Yi, Jingru and Wang, Xiaolong and Omar, Mohamed



Research question: How to improve the run-time efficiency of large pre-trained vision-and-language models?
Motivation: Existing large pre-trained models contain substantial computational redundancy, which hurts their run-time efficiency.
Method: We propose a dynamic inference mechanism that optimizes run-time efficiency by dynamically skipping multi-head self-attention and feed-forward network layers, together with dynamic token pruning and fusion.
Results: Experiments show that the method significantly improves the run-time efficiency of the state-of-the-art models MDETR and GLIP on referring expression comprehension, segmentation, and VQA, with at most a 0.3% drop in accuracy.

Transformers have been recently utilized for vision and language tasks successfully. For example, recent image and language models with more than 200M parameters have been proposed to learn visual grounding in the pre-training step and show impressive results on downstream vision and language tasks. On the other hand, there exists a large amount of computational redundancy in these large models which limits their run-time efficiency. To address this problem, we propose dynamic inference for grounding based vision and language models conditioned on the input image-text pair. We first design an approach to dynamically skip multihead self-attention and feed forward network layers across two backbones and multimodal network. Additionally, we propose dynamic token pruning and fusion for two backbones. In particular, we remove redundant tokens at different levels of the backbones and fuse the image tokens with the language tokens in an adaptive manner. To learn policies for dynamic inference, we train agents using reinforcement learning. In this direction, we replace the CNN backbone in a recent grounding-based vision and language model, MDETR, with a vision transformer and call it ViTMDETR. Then, we apply our dynamic inference method to ViTMDETR, called D-ViTMDETR, and perform experiments on image-language tasks. Our results show that we can improve the run-time efficiency of the state-of-the-art models MDETR and GLIP by up to 50% on Referring Expression Comprehension and Segmentation, and VQA with only maximum 0.3% accuracy drop.

Visual-Language Prompt Tuning With Knowledge-Guided Context Optimization
Yao, Hantao and Zhang, Rui and Xu, Changsheng



Research question: How to improve the generalization of pre-trained vision-language models (VLMs) to tasks with unseen classes?
Motivation: Representative CoOp-based methods obtain specific textual knowledge by combining learnable textual tokens with class tokens, but this approach generalizes poorly to tasks with unseen classes.
Method: We propose a novel Knowledge-guided Context Optimization (KgCoOp) method, which constructs a regularization term to ensure that the essential general textual knowledge can be embedded into the special textual knowledge generated by the learnable prompt, thereby improving generalization to unseen classes.
Results: Extensive experiments show that the proposed Knowledge-guided Context Optimization performs well on prompt tuning tasks, i.e., achieves better performance with less training time.

Prompt tuning is an effective way to adapt the pretrained visual-language model (VLM) to the downstream task using task-related textual tokens. Representative CoOp-based works combine the learnable textual tokens with the class tokens to obtain specific textual knowledge. However, the specific textual knowledge generalizes worse to unseen classes because it forgets the essential general textual knowledge, which has a strong generalization ability. To tackle this issue, we introduce a novel Knowledge-guided Context Optimization (KgCoOp) to enhance the generalization ability of the learnable prompt for unseen classes. To remember the essential general knowledge, KgCoOp constructs a regularization term to ensure that the essential general textual knowledge can be embedded into the special textual knowledge generated by the learnable prompt. Specifically, KgCoOp minimizes the discrepancy between the textual embeddings generated by learned prompts and the hand-crafted prompts. Finally, adding the KgCoOp upon the contrastive loss can make a discriminative prompt for both seen and unseen tasks. Extensive evaluation of several benchmarks demonstrates that the proposed Knowledge-guided Context Optimization is an efficient method for prompt tuning, i.e., achieves better performance with less training time.
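The regularization term described above penalizes the discrepancy between text embeddings produced by the learned prompt and by hand-crafted prompts. The sketch below uses a mean squared distance as that discrepancy; the exact distance measure is an assumption for illustration.

```python
import numpy as np

# Sketch of a KgCoOp-style regularizer: penalize the distance between text
# embeddings from the learnable prompt and from hand-crafted prompts
# (e.g. "a photo of a {class}"), preserving general textual knowledge.
# The mean-squared-distance form is an illustrative assumption.
def kgcoop_reg(learned_txt_embs, handcrafted_txt_embs):
    diff = np.asarray(learned_txt_embs) - np.asarray(handcrafted_txt_embs)
    return float(np.mean(np.sum(diff ** 2, axis=1)))  # mean squared distance per class

# toy 2-class, 2-dim embeddings
reg = kgcoop_reg([[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 1.0]])
```

At training time this term would be added, with a weight, to the contrastive classification loss, anchoring the learned prompt near the hand-crafted one.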

Open-Vocabulary Panoptic Segmentation With Text-to-Image Diffusion Models
Xu, Jiarui and Liu, Sifei and Vahdat, Arash and Byeon, Wonmin and Wang, Xiaolong and De Mello, Shalini



Research question: This paper develops ODISE, an open-vocabulary diffusion-based panoptic segmentation model that unifies pre-trained text-image diffusion and discriminative models for open-vocabulary panoptic segmentation.
Motivation: Text-to-image diffusion models have a remarkable ability to generate high-quality images from diverse open-vocabulary language descriptions, while text-image discriminative models such as CLIP are good at classifying images into open-vocabulary labels. The authors aim to leverage the frozen internal representations of both models to perform panoptic segmentation of any category in the wild.
Method: The text-image diffusion and discriminative models are combined, and their frozen internal representations are exploited to perform panoptic segmentation.
Results: The method outperforms the previous state of the art by large margins on both open-vocabulary panoptic and semantic segmentation. In particular, with COCO training only, it achieves 23.4 PQ and 30.0 mIoU on the ADE20K dataset, an improvement of 8.3 PQ and 7.9 mIoU over the previous state of the art.

We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. Text-to-image diffusion models have the remarkable ability to generate high-quality images with diverse open-vocabulary language descriptions. This demonstrates that their internal representation space is highly correlated with open concepts in the real world. Text-image discriminative models like CLIP, on the other hand, are good at classifying images into open-vocabulary labels. We leverage the frozen internal representations of both these models to perform panoptic segmentation of any category in the wild. Our approach outperforms the previous state of the art by significant margins on both open-vocabulary panoptic and semantic segmentation tasks. In particular, with COCO training only, our method achieves 23.4 PQ and 30.0 mIoU on the ADE20K dataset, with 8.3 PQ and 7.9 mIoU absolute improvement over the previous state of the art. We open-source our code and models at https://github.com/NVlabs/ODISE.

Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision
Xu, Jilan and Hou, Junlin and Zhang, Yuejie and Feng, Rui and Wang, Yi and Qiao, Yu and Xie, Weidi



Research question: This paper addresses open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes rather than pre-defined, closed-set categories.
Motivation: Existing models require pre-training with mask annotations, whereas this paper proposes a transformer-based OVS model that needs no mask annotations.
Method: The proposed OVSegmentor assembles image pixels into a set of learnable group tokens via a slot-attention-based binding module and aligns the group tokens with the corresponding caption embedding. Two proxy tasks, masked entity completion and cross-image mask consistency, are proposed for training.
Results: With zero-shot transfer on three benchmark datasets (PASCAL VOC 2012, PASCAL Context, and COCO Object), the model achieves superior segmentation results over the state-of-the-art method while using only 3% of the data (4M vs. 134M) for pre-training.

In this paper, we consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes instead of pre-defined, closed-set categories. The main contributions are as follows: First, we propose a transformer-based model for OVS, termed as OVSegmentor, which only exploits web-crawled image-text pairs for pre-training without using any mask annotations. OVSegmentor assembles the image pixels into a set of learnable group tokens via a slot-attention based binding module, and aligns the group tokens to the corresponding caption embedding. Second, we propose two proxy tasks for training, namely masked entity completion and cross-image mask consistency. The former aims to infer all masked entities in the caption given the group tokens, that enables the model to learn fine-grained alignment between visual groups and text entities. The latter enforces consistent mask predictions between images that contain shared entities, which encourages the model to learn visual invariance. Third, we construct CC4M dataset for pre-training by filtering CC12M with frequently appeared entities, which significantly improves training efficiency. Fourth, we perform zero-shot transfer on three benchmark datasets, PASCAL VOC 2012, PASCAL Context, and COCO Object. Our model achieves superior segmentation results over the state-of-the-art method by using only 3% data (4M vs 134M) for pre-training. Code and pre-trained models will be released for future research.

Learning Conditional Attributes for Compositional Zero-Shot Learning
Wang, Qingsheng and Liu, Lingqiao and Jing, Chenchen and Chen, Hao and Liang, Guoqiang and Wang, Peng and Shen, Chunhua



Research question: This paper addresses the challenge of attributes interacting with different objects in Compositional Zero-Shot Learning (CZSL), i.e., how to make models recognize novel compositional concepts based on learned concepts such as attribute-object combinations.
Motivation: One challenge in CZSL is modeling attributes that interact with different objects; e.g., the attribute "wet" differs between "wet apple" and "wet cat".
Method: We propose an attribute learning framework containing an attribute hyper learner and an attribute base learner, which learns conditional attribute embeddings to address this problem.
Results: Experiments show that our model outperforms other state-of-the-art methods on CZSL benchmarks, validating the importance of learning conditional attributes.

Compositional Zero-Shot Learning (CZSL) aims to train models to recognize novel compositional concepts based on learned concepts such as attribute-object combinations. One of the challenges is to model attributes interacted with different objects, e.g., the attribute "wet" in "wet apple" and "wet cat" is different. As a solution, we provide analysis and argue that attributes are conditioned on the recognized object and input image and explore learning conditional attribute embeddings by a proposed attribute learning framework containing an attribute hyper learner and an attribute base learner. By encoding conditional attributes, our model enables to generate flexible attribute embeddings for generalization from seen to unseen compositions. Experiments on CZSL benchmarks, including the more challenging C-GQA dataset, demonstrate better performances compared with other state-of-the-art approaches and validate the importance of learning conditional attributes.

Prompting Large Language Models With Answer Heuristics for Knowledge-Based Visual Question Answering
Shao, Zhenwei and Yu, Zhou and Wang, Meng and Yu, Jun



Research question: How to acquire the external knowledge required for knowledge-based visual question answering from large language models.
Motivation: Existing methods retrieve the required knowledge from explicit knowledge bases, which often introduces question-irrelevant information and limits model performance.
Method: We present Prophet, a conceptually simple framework that prompts GPT-3 with answer heuristics to acquire the external knowledge needed for knowledge-based VQA.
Results: Prophet significantly outperforms all existing state-of-the-art methods on two challenging knowledge-based VQA datasets, achieving 61.1% and 55.7% accuracy on the test sets of OK-VQA and A-OKVQA, respectively.

Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Early studies retrieve required knowledge from explicit knowledge bases (KBs), which often introduces irrelevant information to the question, hence restricting the performance of their models. Recent works have sought to use a large language model (i.e., GPT-3) as an implicit knowledge engine to acquire the necessary knowledge for answering. Despite the encouraging results achieved by these methods, we argue that they have not fully activated the capacity of GPT-3 as the provided input information is insufficient. In this paper, we present Prophet---a conceptually simple framework designed to prompt GPT-3 with answer heuristics for knowledge-based VQA. Specifically, we first train a vanilla VQA model on a specific knowledge-based VQA dataset without external knowledge. After that, we extract two types of complementary answer heuristics from the model: answer candidates and answer-aware examples. Finally, the two types of answer heuristics are encoded into the prompts to enable GPT-3 to better comprehend the task thus enhancing its capacity. Prophet significantly outperforms all existing state-of-the-art methods on two challenging knowledge-based VQA datasets, OK-VQA and A-OKVQA, delivering 61.1% and 55.7% accuracies on their testing sets, respectively.
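The prompting step can be sketched as plain string construction: each in-context example carries an image description, the question, the vanilla VQA model's answer candidates, and the ground-truth answer, followed by the test case. The template wording and field layout here are assumptions, not Prophet's exact prompt format.

```python
# Hedged sketch of building a GPT-3 prompt with answer heuristics.
# The template and field names are illustrative assumptions.
def build_prophet_prompt(examples, test_caption, test_question, test_candidates):
    parts = ["Please answer the question according to the context and candidates.\n"]
    for ex in examples:  # answer-aware in-context examples
        parts.append(f"Context: {ex['caption']}\nQuestion: {ex['question']}\n"
                     f"Candidates: {', '.join(ex['candidates'])}\nAnswer: {ex['answer']}\n")
    # the test case ends with an open "Answer:" for the LLM to complete
    parts.append(f"Context: {test_caption}\nQuestion: {test_question}\n"
                 f"Candidates: {', '.join(test_candidates)}\nAnswer:")
    return "\n".join(parts)

prompt = build_prophet_prompt(
    [{"caption": "a man on a surfboard", "question": "what sport is this?",
      "candidates": ["surfing", "skating"], "answer": "surfing"}],
    "a bowl of red fruit", "what fruit is shown?", ["apple", "tomato"])
```

The answer candidates narrow the LLM's search space, which is the sense in which the heuristics "activate" GPT-3's capacity.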

IFSeg: Image-Free Semantic Segmentation via Vision-Language Model
Yun, Sukmin and Park, Seong Hyeon and Seo, Paul Hongsuck and Shin, Jinwoo



Research question: This paper addresses applying vision-language pre-training to semantic segmentation without any task-specific images or annotations.
Motivation: Existing vision-language pre-trained models need additional images or segmentation annotations to adapt to downstream segmentation tasks, which requires substantial extra data.
Method: We propose IFSeg, a novel method that generates vision-language-driven artificial image-segmentation pairs and updates a pre-trained vision-language model for the segmentation task.
Results: Experiments show that the method not only establishes an effective baseline for this new task but also demonstrates strong performance compared with existing methods that rely on stronger supervision, such as task-specific images and segmentation masks.

Vision-language (VL) pre-training has recently gained much attention for its transferability and flexibility in novel concepts (e.g., cross-modality transfer) across various visual tasks. However, VL-driven segmentation has been under-explored, and the existing approaches still have the burden of acquiring additional training images or even segmentation annotations to adapt a VL model to downstream segmentation tasks. In this paper, we introduce a novel image-free segmentation task where the goal is to perform semantic segmentation given only a set of the target semantic categories, but without any task-specific images and annotations. To tackle this challenging task, our proposed method, coined IFSeg, generates VL-driven artificial image-segmentation pairs and updates a pre-trained VL model to a segmentation task. We construct this artificial training data by creating a 2D map of random semantic categories and another map of their corresponding word tokens. Given that a pre-trained VL model projects visual and text tokens into a common space where tokens that share the semantics are located closely, this artificially generated word map can replace the real image inputs for such a VL model. Through an extensive set of experiments, our model not only establishes an effective baseline for this novel task but also demonstrates strong performances compared to existing methods that rely on stronger supervision, such as task-specific images and segmentation masks. Code is available at https://github.com/alinlab/ifseg.
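The artificial training data described above, a 2D map of random semantic categories plus the matching map of their word tokens, can be sketched directly. The map size and random construction below are illustrative stand-ins for IFSeg's actual data generation.

```python
import numpy as np

# Sketch of IFSeg-style artificial training data: a 2D map of random
# category indices is the segmentation target, and the matching map of
# category word tokens stands in for the image input (sizes illustrative).
def make_artificial_pair(categories, h=4, w=4, seed=0):
    rng = np.random.default_rng(seed)
    label_map = rng.integers(0, len(categories), size=(h, w))  # target mask
    word_map = np.array(categories, dtype=object)[label_map]   # pseudo "image"
    return label_map, word_map

labels, words = make_artificial_pair(["cat", "dog", "grass"])
```

Because the pre-trained VL model embeds word tokens and image patches into a shared space, the word map can substitute for real image inputs during this adaptation.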

Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding
Alper, Morris and Fiman, Michael and Averbuch-Elor, Hadar



Research question: Whether vision-and-language pretraining improves performance on text-only tasks that involve implicit visual reasoning, focusing primarily on zero-shot probing methods.
Motivation: Most humans use visual imagination to understand and reason about language, whereas models such as BERT reason about language using knowledge acquired during text-only pretraining. This work investigates whether vision-and-language pretraining can improve performance on text-only tasks involving implicit visual reasoning.
Method: Proposes a suite of visual language understanding (VLU) tasks for probing the visual reasoning abilities of text encoder models, along with various non-visual natural language understanding (NLU) tasks for comparison. Also contributes Stroop probing, a novel zero-shot knowledge probing method for applying models such as CLIP to text-only tasks without a prediction head (such as BERT's masked language modelling head).
Results: State-of-the-art multimodally trained text encoders outperform unimodally trained ones on the VLU tasks but underperform them on the NLU tasks, lending new context to previously mixed results on the NLU abilities of multimodal models. The conclusion is that exposure to images during pretraining affords inherent visual reasoning knowledge that surfaces in language tasks requiring implicit visual reasoning; these findings matter for the broader multimodal learning setting, offering principled guidance for the choice of text encoders.

Most humans use visual imagination to understand and reason about language, but models such as BERT reason about language using knowledge acquired during text-only pretraining. In this work, we investigate whether vision-and-language pretraining can improve performance on text-only tasks that involve implicit visual reasoning, focusing primarily on zero-shot probing methods. We propose a suite of visual language understanding (VLU) tasks for probing the visual reasoning abilities of text encoder models, as well as various non-visual natural language understanding (NLU) tasks for comparison. We also contribute a novel zero-shot knowledge probing method, Stroop probing, for applying models such as CLIP to text-only tasks without needing a prediction head such as the masked language modelling head of models like BERT. We show that SOTA multimodally trained text encoders outperform unimodally trained text encoders on the VLU tasks while being underperformed by them on the NLU tasks, lending new context to previously mixed results regarding the NLU capabilities of multimodal models. We conclude that exposure to images during pretraining affords inherent visual reasoning knowledge that is reflected in language-only tasks that require implicit visual reasoning. Our findings bear importance in the broader context of multimodal learning, providing principled guidelines for the choice of text encoders used in such contexts.

Generative Bias for Robust Visual Question Answering
Cho, Jae Won and Kim, Dong-Jin and Ryu, Hyeonggon and Kweon, In So



Research question: In visual question answering (VQA), models tend to exploit dataset biases when making predictions.
Motivation: To address this, the paper proposes a generative approach that trains a bias model directly from the target model and then uses it to remove the bias.
Method: A generative network learns the target model's bias through a combination of an adversarial objective and knowledge distillation; the resulting bias model is then used to debias the target model.
Results: Extensive experiments on several VQA bias datasets, including VQA-CP2, VQA-CP1, GQA-OOD and VQA-CE, show that the method achieves state-of-the-art results on VQA-CP2 with the LXMERT architecture.

The task of Visual Question Answering (VQA) is known to be plagued by the issue of VQA models exploiting biases within the dataset to make their final predictions. Various previous ensemble-based debiasing methods have been proposed where an additional model is purposefully trained to be biased in order to train a robust target model. However, these methods compute the bias for a model simply from the label statistics of the training data or from single modal branches. In this work, in order to better learn the bias a target VQA model suffers from, we propose a generative method to train the bias model directly from the target model, called GenB. In particular, GenB employs a generative network to learn the bias in the target model through a combination of the adversarial objective and knowledge distillation. We then debias our target model with GenB as a bias model, and show through extensive experiments the effects of our method on various VQA bias datasets including VQA-CP2, VQA-CP1, GQA-OOD, and VQA-CE, and show state-of-the-art results with the LXMERT architecture on VQA-CP2.

Data-Free Sketch-Based Image Retrieval
Chaudhuri, Abhra and Bhunia, Ayan Kumar and Song, Yi-Zhe and Dutta, Anjan



Research question: How to perform cross-modal retrieval with pre-trained classification models when no training data is available.
Motivation: Privacy and anonymity concerns around deep learning models have driven research into data-free learning, and the difficulty of acquiring paired photo-sketch datasets makes this setting practical.
Method: Proposes Data-Free Sketch-Based Image Retrieval (DF-SBIR), a cross-modal data-free learning setting in which teacher models trained for classification in a single modality must be leveraged by a student to learn a cross-modal metric space for retrieval.
Results: Evaluated on the Sketchy, TU-Berlin and QuickDraw benchmarks against a variety of baselines built from the existing data-free learning literature, the method surpasses all of them by significant margins. It also achieves mAPs competitive with data-dependent approaches, all without any training data.

Rising concerns about privacy and anonymity preservation of deep learning models have facilitated research in data-free learning. Primarily based on data-free knowledge distillation, models developed in this area so far have only been able to operate in a single modality, performing the same kind of task as that of the teacher. For the first time, we propose Data-Free Sketch-Based Image Retrieval (DF-SBIR), a cross-modal data-free learning setting, where teachers trained for classification in a single modality have to be leveraged by students to learn a cross-modal metric-space for retrieval. The widespread availability of pre-trained classification models, along with the difficulty in acquiring paired photo-sketch datasets for SBIR justify the practicality of this setting. We present a methodology for DF-SBIR, which can leverage knowledge from models independently trained to perform classification on photos and sketches. We evaluate our model on the Sketchy, TU-Berlin, and QuickDraw benchmarks, designing a variety of baselines based on existing data-free learning literature, and observe that our method surpasses all of them by significant margins. Our method also achieves mAPs competitive with data-dependent approaches, all the while requiring no training data. Implementation is available at https://github.com/abhrac/data-free-sbir.

Intrinsic Physical Concepts Discovery With Object-Centric Predictive Models
Tang, Qu and Zhu, Xiangyu and Lei, Zhen and Zhang, Zhaoxiang



Research question: How to discover and understand physical concepts by observing the world.
Motivation: At the core of human intelligence is the ability to perceive the environment compositionally, in terms of objects and relations, in an unsupervised manner, and thereby discover abstract physical concepts.
Method: The paper proposes PHYCINE, a system that infers physical concepts at different levels of abstraction without supervision.
Results: Empirical evaluation shows that the variables inferred by the system accord with the properties of the corresponding physical concepts, and that object representations containing the discovered concept variables help achieve better performance on causal reasoning tasks (i.e., COMPHY).

The ability to discover abstract physical concepts and understand how they work in the world through observing lies at the core of human intelligence. The acquisition of this ability is based on compositionally perceiving the environment in terms of objects and relations in an unsupervised manner. Recent approaches learn object-centric representations and capture visually observable concepts of objects, e.g., shape, size, and location. In this paper, we take a step forward and try to discover and represent intrinsic physical concepts such as mass and charge. We introduce the PHYsical Concepts Inference NEtwork (PHYCINE), a system that infers physical concepts in different abstract levels without supervision. The key insights underlying PHYCINE are two-fold: commonsense knowledge emerges with prediction, and physical concepts at different abstraction levels should be reasoned about in a bottom-up fashion. Empirical evaluation demonstrates that variables inferred by our system work in accordance with the properties of the corresponding physical concepts. We also show that object representations containing the discovered physical concepts variables could help achieve better performance in causal reasoning tasks, i.e., COMPHY.

Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training
Luo, Dezhao and Huang, Jiabo and Gong, Shaogang and Jin, Hailin and Liu, Yang



Research question: The correlation between vision and text is essential for video moment retrieval (VMR), yet existing methods rely heavily on separately pre-trained feature extractors for visual and textual understanding.
Motivation: Existing image-text pre-trained models are limited in capturing video changes, so a generic method is needed to strengthen a model's understanding of video moments.
Method: Proposes Visual-Dynamic Injection (VDI), a generic method that exploits multi-modal correlations from large-scale image-text data to facilitate generalisable VMR. Visual context and spatial dynamics are extracted from video frames and explicitly aligned with the phrases describing video changes (such as verbs), enabling more accurate video-text alignment.
Results: Extensive experiments on two VMR benchmark datasets (Charades-STA and ActivityNet-Captions) achieve state-of-the-art performance across all test samples, with especially notable advantages on out-of-distribution splits involving novel scenes and vocabulary.

The correlation between vision and text is essential for video moment retrieval (VMR); however, existing methods heavily rely on separate pre-training feature extractors for visual and textual understanding. Without sufficient temporal boundary annotations, it is non-trivial to learn universal video-text alignments. In this work, we explore multi-modal correlations derived from large-scale image-text data to facilitate generalisable VMR. To address the limitations of image-text pre-training models on capturing the video changes, we propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments. Whilst existing VMR methods are focusing on building temporal-aware video features, being aware of the text descriptions about the temporal changes is also critical but originally overlooked in pre-training by matching static images with sentences. Therefore, we extract visual context and spatial dynamic information from video frames and explicitly enforce their alignments with the phrases describing video changes (e.g. verb). By doing so, the potentially relevant visual and motion patterns in videos are encoded in the corresponding text embeddings (injected) so as to enable more accurate video-text alignments. We conduct extensive experiments on two VMR benchmark datasets (Charades-STA and ActivityNet-Captions) and achieve state-of-the-art performances. Especially, VDI yields notable advantages when being tested on the out-of-distribution splits where the testing samples involve novel scenes and vocabulary.

Reproducible Scaling Laws for Contrastive Language-Image Learning
Cherti, Mehdi and Beaumont, Romain and Wightman, Ross and Wortsman, Mitchell and Ilharco, Gabriel and Gordon, Cade and Schuhmann, Christoph and Schmidt, Ludwig and Jitsev, Jenia



Research question: This paper studies scaling laws for large-scale contrastive language-image pre-training (CLIP), experimenting with the public LAION dataset and the open-source OpenCLIP repository.
Motivation: Although scaling up neural networks has brought remarkable performance gains, prior work on scaling laws has mostly relied on private data and models, or focused on uni-modal language or vision learning; this study aims to fill that gap.
Method: By training on up to two billion image-text pairs from LAION, the study explores scaling laws for multiple downstream tasks, including zero-shot classification, retrieval, linear probing and end-to-end fine-tuning.
Results: The training distribution plays a key role in scaling laws: despite identical model architectures and similar training recipes, the OpenAI and OpenCLIP models exhibit different scaling behaviour. All evaluation workflows and models are open-sourced to ensure reproducibility and make scaling-law research more accessible.

Scaling up neural networks has led to remarkable performance across a wide range of tasks. Moreover, performance often follows reliable scaling laws as a function of training set size, model size, and compute, which offers valuable guidance as large-scale experiments are becoming increasingly expensive. However, previous work on scaling laws has primarily used private data & models or focused on uni-modal language or vision learning. To address these limitations, we investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository. Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks including zero-shot classification, retrieval, linear probing, and end-to-end fine-tuning. We find that the training distribution plays a key role in scaling laws as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures and similar training recipes. We open-source our evaluation workflow and all models, including the largest public CLIP models, to ensure reproducibility and make scaling laws research more accessible. Source code and instructions to reproduce this study are available at https://github.com/LAION-AI/scaling-laws-openclip.
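The power-law scaling the abstract refers to is typically read off by fitting a straight line in log-log space. The sketch below is an illustrative, generic fit (not the paper's exact procedure), shown on synthetic data where error falls as a power of compute.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x^b by least squares in log-log space: log y = log a + b log x.
    This is the standard way to read off a scaling-law exponent; it is an
    illustrative utility, not code from the OpenCLIP study."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - b * mx)
    return a, b

# Synthetic example: error falling as a power of compute.
compute = [1e6, 1e7, 1e8, 1e9]
error = [0.5 * c ** -0.1 for c in compute]
a, b = fit_power_law(compute, error)
```

On real benchmark curves the fitted exponent `b` is what differs between training distributions, which is the paper's central observation about LAION-trained versus OpenAI-trained CLIP models.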

Learning Customized Visual Models With Retrieval-Augmented Knowledge
Liu, Haotian and Son, Kilho and Yang, Jianwei and Liu, Ce and Gao, Jianfeng and Lee, Yong Jae and Li, Chunyuan



Research question: How to build customised visual models for a target domain on top of web-scale data collection and expensive pre-training, using relevant image-text pairs retrieved from the web.
Motivation: Image-text contrastive models such as CLIP have strong task transfer ability, but their high generality and usability come at the cost of large-scale data collection and expensive pre-training.
Method: Proposes the REACT framework, which retrieves the most relevant image-text pairs (about 3% of CLIP's pre-training data) from a web-scale database as external knowledge, then customises the model by training only new modularised blocks while freezing all the original weights.
Results: Experiments show clear gains on classification, retrieval, detection and segmentation; on the zero-shot classification task in particular, REACT improves over CLIP by up to 5.4% on ImageNet and 3.7% on the ELEVATER benchmark (20 datasets).

Image-text contrastive learning models such as CLIP have demonstrated strong task transfer ability. The high generality and usability of these visual models is achieved via a web-scale data collection process to ensure broad concept coverage, followed by expensive pre-training to feed all the knowledge into model weights. Alternatively, we propose REACT, REtrieval-Augmented CusTomization, a framework to acquire the relevant web knowledge to build customized visual models for target domains. We retrieve the most relevant image-text pairs (~3% of CLIP pre-training data) from the web-scale database as external knowledge and propose to customize the model by only training new modularized blocks while freezing all the original weights. The effectiveness of REACT is demonstrated via extensive experiments on classification, retrieval, detection and segmentation tasks, including zero, few, and full-shot settings. Particularly, on the zero-shot classification task, compared with CLIP, it achieves up to 5.4% improvement on ImageNet and 3.7% on the ELEVATER benchmark (20 datasets).

Open Vocabulary Semantic Segmentation With Patch Aligned Contrastive Learning
Mukhoti, Jishnu and Lin, Tsung-Yu and Poursaeed, Omid and Wang, Rui and Shah, Ashish and Torr, Philip H.S. and Lim, Ser-Nam



Research question: How to train a model to associate specific image regions with a given text input, so that it transfers seamlessly to open-vocabulary semantic segmentation without any segmentation annotations.
Motivation: CLIP's standard contrastive loss aligns only whole images with whole sentences, providing no patch-level alignment, which is exactly what open-vocabulary segmentation requires.
Method: Proposes Patch Aligned Contrastive Learning (PACL), a modified compatibility function for CLIP's contrastive loss that trains an alignment between the patch tokens of the vision encoder and the CLS token of the text encoder. With this alignment, the model can identify the regions of an image corresponding to a given text input and thus transfer to open-vocabulary semantic segmentation without any segmentation annotations.
Results: Using pre-trained CLIP encoders with PACL, the method sets the state of the art for open-vocabulary zero-shot segmentation on four benchmarks: Pascal VOC, Pascal Context, COCO Stuff and ADE20K. PACL also applies to image-level prediction and, when used with a CLIP backbone, improves zero-shot classification accuracy over CLIP across a suite of 12 image classification datasets.

We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility function for CLIP's contrastive loss, intending to train an alignment between the patch tokens of the vision encoder and the CLS token of the text encoder. With such an alignment, a model can identify regions of an image corresponding to a given text input, and therefore transfer seamlessly to the task of open vocabulary semantic segmentation without requiring any segmentation annotations during training. Using pre-trained CLIP encoders with PACL, we are able to set the state-of-the-art on the task of open vocabulary zero-shot segmentation on 4 different segmentation benchmarks: Pascal VOC, Pascal Context, COCO Stuff and ADE20K. Furthermore, we show that PACL is also applicable to image-level predictions and when used with a CLIP backbone, provides a general improvement in zero-shot classification accuracy compared to CLIP, across a suite of 12 image classification datasets.
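The compatibility function described above can be illustrated with plain vectors. The following is a simplified sketch of the idea (not PACL's exact formulation): each patch token is weighted by its softmax-normalised similarity to the text embedding, the weighted pool is compared with the text embedding, and the per-patch weights double as a rough localisation map.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def patch_aligned_score(patch_tokens, text_cls):
    """Illustrative PACL-style compatibility: weight each vision patch token
    by its softmax-normalised cosine similarity to the text CLS embedding,
    pool, then take cosine similarity of the pooled vector with the text.
    The weights can be read out as a coarse segmentation mask."""
    sims = [dot(p, text_cls) / (norm(p) * norm(text_cls)) for p in patch_tokens]
    exps = [math.exp(s) for s in sims]
    z = sum(exps)
    weights = [e / z for e in exps]
    pooled = [sum(w * p[i] for w, p in zip(weights, patch_tokens))
              for i in range(len(text_cls))]
    return dot(pooled, text_cls) / (norm(pooled) * norm(text_cls)), weights

# Two of three toy patches point in the text direction; they get more weight.
patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
score, weights = patch_aligned_score(patches, [1.0, 0.0])
```

Training with such a compatibility inside a contrastive loss is what pushes individual patches, not just the global image embedding, toward the text space.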

Co-Training 2L Submodels for Visual Recognition
Touvron, Hugo and Cord, Matthieu and Oquab, Maxime and Bojanowski, Piotr and Verbeek, Jakob and J\'egou, Herv\'e



Research question: How to improve neural network training with submodel co-training.
Motivation: During training, existing models tend to under-exploit the information carried at different depths of the network, which limits performance.
Method: Proposes submodel co-training: for each sample, two "submodels" are instantiated by stochastically activating a subset of the network's layers and skipping the others; the two submodels then act as soft teachers for each other, each providing a complementary cross-entropy loss.
Results: Experiments show the approach effectively improves training, is compatible with multiple recent architectures, and achieves new state-of-the-art results on tasks such as image classification and semantic segmentation.

This paper introduces submodel co-training, a regularization method related to co-training, self-distillation and stochastic depth. Given a neural network to be trained, for each sample we implicitly instantiate two altered networks, "submodels", with stochastic depth: i.e. activating only a subset of the layers and skipping others. Each network serves as a soft teacher to the other, by providing a cross-entropy loss that complements the regular softmax cross-entropy loss provided by the one-hot label. Our approach, dubbed "cosub", uses a single set of weights, and does not involve a pre-trained external model or temporal averaging. Experimentally, we show that submodel co-training is effective to train backbones for recognition tasks such as image classification and semantic segmentation, and that our approach is compatible with multiple recent architectures, including RegNet, PiT, and Swin. We report new state-of-the-art results for vision transformers trained on ImageNet only. For instance, a ViT-B pre-trained with cosub on Imagenet-21k achieves 87.4% top-1 acc. on Imagenet-val.
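The mutual soft-teaching objective can be written down in a few lines. The sketch below is an illustrative form of such a loss on raw logit vectors (it is not the paper's code, and the exact weighting and detaching of gradients in cosub may differ).

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def cosub_loss(logits_a, logits_b, label):
    """Illustrative cosub-style objective: each submodel receives the usual
    one-hot cross-entropy plus a cross-entropy against the other submodel's
    prediction (mutual soft teaching). Assumed form, not the paper's code."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    hard = -math.log(pa[label]) - math.log(pb[label])
    soft = cross_entropy(pb, pa) + cross_entropy(pa, pb)
    return hard + soft

# Agreeing submodels incur less loss than disagreeing ones.
loss_agree = cosub_loss([3.0, 0.0, 0.0], [3.0, 0.0, 0.0], label=0)
loss_disagree = cosub_loss([3.0, 0.0, 0.0], [0.0, 3.0, 0.0], label=0)
```

In the real method the two logit vectors come from the same weights run with two independent stochastic-depth layer subsets, so the soft term regularises a single set of weights rather than two separate networks.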

Understanding Masked Autoencoders via Hierarchical Latent Variable Models
Kong, Lingjing and Ma, Martin Q. and Chen, Guangyi and Xing, Eric P. and Chi, Yuejie and Morency, Louis-Philippe and Zhang, Kun



Research question: This paper seeks a theoretically principled understanding of masked autoencoders (MAE), the self-supervised learning framework based on reconstructing masked image regions, and provides theoretical guarantees for it.
Motivation: Despite MAE's prominent success across a variety of vision tasks, its theoretical foundations remain lacking.
Method: The underlying data-generating process is formalised as a hierarchical latent variable model, and it is shown that, under reasonable assumptions, MAE provably identifies a set of latent variables in that model, explaining why MAE can extract high-level information from pixels.
Results: The theory offers coherent explanations of existing empirical observations and provides insights into potential empirical improvements and fundamental limitations of the masked-reconstruction paradigm; experimental results validate the theoretical insights.

Masked autoencoder (MAE), a simple and effective self-supervised learning framework based on the reconstruction of masked image regions, has recently achieved prominent success in a variety of vision tasks. Despite the emergence of intriguing empirical observations on MAE, a theoretically principled understanding is still lacking. In this work, we formally characterize and justify existing empirical insights and provide theoretical guarantees of MAE. We formulate the underlying data-generating process as a hierarchical latent variable model, and show that under reasonable assumptions, MAE provably identifies a set of latent variables in the hierarchical model, explaining why MAE can extract high-level information from pixels. Further, we show how key hyperparameters in MAE (the masking ratio and the patch size) determine which true latent variables to be recovered, therefore influencing the level of semantic information in the representation. Specifically, extremely large or small masking ratios inevitably lead to low-level representations. Our theory offers coherent explanations of existing empirical observations and provides insights for potential empirical improvements and fundamental limitations of the masked-reconstruction paradigm. We conduct extensive experiments to validate our theoretical insights.

Photo Pre-Training, but for Sketch
Li, Ke and Pang, Kaiyue and Song, Yi-Zhe



Research question: Can photo-based pre-training be made to genuinely benefit sketch understanding?
Motivation: Because sketch data is scarce, the community has made some "peculiar" design choices, such as the coerced use of photo-based (i.e., sketch-free) pre-training; the question is whether such pre-training can truly benefit sketch understanding.
Method: The topology of photo data learned at pre-training is cultivated and used as a "free" source of supervision for downstream sketch tasks. Fine-grained sketch-based image retrieval (FG-SBIR) showcases this new perspective on pre-training: the topology-informed supervision learned from photos acts at every fine-tuning step, so that neighbouring photos in the pre-trained model remain neighbours under each FG-SBIR update. This neighbourhood-consistency constraint is cast as a photo-ranking problem and formulated as a concise cross-modal triplet loss, which is further shown to be better leveraged as a meta objective than optimised in parallel with the main FG-SBIR objective.
Results: With only this change to pre-training, the method wins by significant margins (sometimes >10%) on all five product-level FG-SBIR benchmarks. Best of all, this large leap requires only a few extra lines of code. The implementation is available at https://github.com/KeLi-SketchX/Photo-Pre-Training-But-for-Sketch.

The sketch community has faced up to its unique challenges over the years, that of data scarcity however still remains the most significant to date. This lack of sketch data has imposed on the community a few "peculiar" design choices -- the most representative of them all is perhaps the coerced utilisation of photo-based pre-training (i.e., no sketch), for many core tasks that otherwise dictates specific sketch understanding. In this paper, we ask just the one question -- can we make such photo-based pre-training, to actually benefit sketch? Our answer lies in cultivating the topology of photo data learned at pre-training, and use that as a "free" source of supervision for downstream sketch tasks. In particular, we use fine-grained sketch-based image retrieval (FG-SBIR), one of the most studied and data-hungry sketch tasks, to showcase our new perspective on pre-training. In this context, the topology-informed supervision learned from photos acts as a constraint that takes effect at every fine-tuning step -- neighbouring photos in the pre-trained model remain neighbours under each FG-SBIR update. We further portray this neighbourhood consistency constraint as a photo ranking problem and formulate it into a neat cross-modal triplet loss. We also show how this target is better leveraged as a meta objective rather than optimised in parallel with the main FG-SBIR objective. With just this change on pre-training, we beat all previously published results on all five product-level FG-SBIR benchmarks with significant margins (sometimes >10%). And the most beautiful thing, as we note, is such a gigantic leap is made possible with just a few extra lines of code! Our implementation is available at https://github.com/KeLi-SketchX/Photo-Pre-Training-But-for-Sketch.
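The cross-modal triplet loss mentioned above has a standard margin-based form, sketched here on toy embeddings. This is an illustrative instance only: in the paper, the anchor would be a sketch embedding and the positive/negative photo embeddings would be ranked by the frozen pre-trained photo model's topology, details this sketch does not model.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Generic margin-based triplet loss: pull the positive within `margin`
    closer to the anchor than the negative. Illustrative form of the loss
    the neighbourhood-consistency constraint is cast into."""
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)

# Positive near the anchor, negative far away: constraint satisfied, zero loss.
loss_good = triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 1.0])
# Roles swapped: the constraint is violated and the loss is positive.
loss_bad = triplet_loss([0.0, 0.0], [1.0, 1.0], [0.1, 0.0])
```

Treating this term as a meta objective, as the paper proposes, means it shapes how the main FG-SBIR loss updates the weights rather than simply being summed with it.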

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition With Pre-Trained Vision-Language Models
Wu, Wenhao and Wang, Xiaohan and Luo, Haipeng and Wang, Jingdong and Yang, Yi and Ouyang, Wanli



Research question: How to achieve effective knowledge transfer from pre-trained vision-language models (VLMs) for video recognition.
Motivation: VLMs pre-trained on large image-text pair datasets show strong transferability, and their greatest value lies in building a bridge between the visual and textual domains.
Method: Proposes BIKE, a novel framework that explores bidirectional knowledge transfer across this cross-modal bridge. A Video Attribute Association mechanism uses video-to-text knowledge to generate auxiliary textual attributes that complement video recognition, while a Temporal Concept Spotting mechanism uses text-to-video expertise to capture temporal saliency in a parameter-free manner, enhancing the video representation.
Results: Extensive studies on six popular video datasets, including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show that the method achieves state-of-the-art performance in various recognition scenarios, such as general, zero-shot and few-shot video recognition. The best model reaches a state-of-the-art 88.6% accuracy on the challenging Kinetics-400.

Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on various visual tasks. Transferring knowledge from such powerful VLMs is a promising direction for building effective video recognition models. However, current exploration in this field is still limited. We believe that the greatest value of pre-trained VLMs lies in building a bridge between visual and textual domains. In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition. ii) We also present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner, leading to enhanced video representation. Extensive studies on six popular video datasets, including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show that our method achieves state-of-the-art performance in various recognition scenarios, such as general, zero-shot, and few-shot video recognition. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model. The code is available at https://github.com/whwu95/BIKE.

Pic2Word: Mapping Pictures to Words for Zero-Shot Composed Image Retrieval
Saito, Kuniaki and Sohn, Kihyuk and Zhang, Xiang and Li, Chun-Liang and Lee, Chen-Yu and Saenko, Kate and Pfister, Tomas



Research question: This paper tackles the need for large numbers of labelled triplets in composed image retrieval (CIR) by proposing zero-shot composed image retrieval (ZS-CIR).
Motivation: Existing CIR methods rely on supervised learning from labelled triplets, whose high annotation cost limits the broad applicability of CIR.
Method: Proposes Pic2Word, a novel method that requires only weakly labelled image-caption pairs and unlabelled image datasets for training.
Results: Compared with existing supervised CIR models, the model generalises strongly across diverse ZS-CIR tasks and outperforms several supervised CIR methods on the common CIR benchmarks CIRR and Fashion-IQ.

In Composed Image Retrieval (CIR), a user combines a query image with text to describe their intended target. Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image. Labeling such triplets is expensive and hinders broad applicability of CIR. In this work, we propose to study an important task, Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to build a CIR model without requiring labeled triplets for training. To this end, we propose a novel method, called Pic2Word, that requires only weakly labeled image-caption pairs and unlabeled image datasets to train. Unlike existing supervised CIR models, our model trained on weakly labeled or unlabeled datasets shows strong generalization across diverse ZS-CIR tasks, e.g., attribute editing, object composition, and domain conversion. Our approach outperforms several supervised CIR methods on the common CIR benchmark, CIRR and Fashion-IQ.

Improving Image Recognition by Retrieving From Web-Scale Image-Text Data
Iscen, Ahmet and Fathi, Alireza and Schmid, Cordelia



Research question: This paper aims to improve recognition on computer vision tasks with retrieval-augmented models.
Motivation: Retrieval-augmented models strengthen recognition by retrieving similar examples for the visual input from an external memory, but existing approaches let irrelevant retrieved examples influence the prediction.
Method: Introduces an attention-based memory module that learns the importance of each retrieved example, and thoroughly studies various ways of constructing a large-scale memory dataset.
Results: Experiments show that the method achieves state-of-the-art accuracy on the ImageNet-LT, Places-LT and Webvision datasets.

Retrieval augmented models are becoming increasingly popular for computer vision tasks after their recent success in NLP problems. The goal is to enhance the recognition capabilities of the model by retrieving similar examples for the visual input from an external memory set. In this work, we introduce an attention-based memory module, which learns the importance of each retrieved example from the memory. Compared to existing approaches, our method removes the influence of the irrelevant retrieved examples, and retains those that are beneficial to the input query. We also thoroughly study various ways of constructing the memory dataset. Our experiments show the benefit of using a massive-scale memory dataset of 1B image-text pairs, and demonstrate the performance of different memory representations. We evaluate our method in three different classification tasks, namely long-tailed recognition, learning with noisy labels, and fine-grained classification, and show that it achieves state-of-the-art accuracies in ImageNet-LT, Places-LT and Webvision datasets.
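The attention-based memory read described above boils down to scoring retrieved examples against the query and pooling their values with softmax weights, so unhelpful retrievals are suppressed. The sketch below illustrates that mechanism on toy vectors; names and dimensions are assumptions, not the paper's implementation.

```python
import math

def attention_pool(query, keys, values):
    """Illustrative attention-based memory read: score each retrieved
    example's key against the query, softmax-normalise the scores, and
    return the weighted sum of values. Irrelevant retrievals get low weight."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)  # stabilise the softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    pooled = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return pooled, weights

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]      # first retrieved example matches the query
values = [[10.0, 0.0], [0.0, 10.0]]
pooled, weights = attention_pool(query, keys, values)
```

In the paper this weighting is learned end-to-end, so the module itself discovers which of the 1B-scale memory's retrieved neighbours are worth attending to.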

Hierarchical Semantic Correspondence Networks for Video Paragraph Grounding
Tan, Chaolei and Lin, Zihang and Hu, Jian-Fang and Zheng, Wei-Shi and Lai, Jianhuang



Research question: Video Paragraph Grounding (VPG) is an essential yet challenging task in vision-language understanding that jointly localises multiple events in an untrimmed video from a paragraph query description.
Motivation: A key challenge is comprehending the complex semantic relations between the visual and textual modalities. Previous models modelled the contextual information between video and text only at a single level (the sentence level), ignoring the rich visual-textual correspondences at other semantic levels, such as video-word and video-paragraph correspondence.
Method: Proposes a novel Hierarchical Semantic Correspondence Network (HSCNet) that explores multi-level visual-textual correspondence by learning hierarchical semantic alignment, with dense supervision from grounding queries at diverse levels. Specifically, a hierarchical encoder encodes the multi-modal inputs into semantics-aligned representations at different levels, and, to exploit this hierarchical correspondence for multi-level supervision, a hierarchical decoder progressively performs finer grounding conditioned on higher-level semantics.
Results: Extensive experiments show that HSCNet significantly outperforms the state of the art on two challenging benchmarks, ActivityNet-Captions and TACoS.

Video Paragraph Grounding (VPG) is an essential yet challenging task in vision-language understanding, which aims to jointly localize multiple events from an untrimmed video with a paragraph query description. One of the critical challenges in addressing this problem is to comprehend the complex semantic relations between visual and textual modalities. Previous methods focus on modeling the contextual information between the video and text from a single-level perspective (i.e., the sentence level), ignoring rich visual-textual correspondence relations at different semantic levels, e.g., the video-word and video-paragraph correspondence. To this end, we propose a novel Hierarchical Semantic Correspondence Network (HSCNet), which explores multi-level visual-textual correspondence by learning hierarchical semantic alignment and utilizes dense supervision by grounding diverse levels of queries. Specifically, we develop a hierarchical encoder that encodes the multi-modal inputs into semantics-aligned representations at different levels. To exploit the hierarchical semantic correspondence learned in the encoder for multi-level supervision, we further design a hierarchical decoder that progressively performs finer grounding for lower-level queries conditioned on higher-level semantics. Extensive experiments demonstrate the effectiveness of HSCNet and our method significantly outstrips the state of the art on two challenging benchmarks, i.e., ActivityNet-Captions and TACoS.

Dynamic Graph Enhanced Contrastive Learning for Chest X-Ray Report Generation
Li, Mingjie and Lin, Bingqian and Chen, Zicong and Lin, Haokun and Liang, Xiaodan and Chang, Xiaojun



Research question: How to generate chest X-ray reports using a dynamic knowledge graph and contrastive learning.
Motivation: Existing data-driven neural networks for automatic radiology reporting suffer from severe visual and textual bias, and a fixed medical knowledge graph cannot guarantee the most appropriate scope of knowledge, limiting effectiveness.
Method: Proposes DCL, a knowledge graph with dynamic structure and nodes: specific knowledge extracted from retrieved reports adds extra nodes or redefines relations in a bottom-up manner, and each image feature is integrated with its own updated graph before being fed into the decoder module for report generation.
Results: Evaluated on the IU-Xray and MIMIC-CXR datasets, DCL outperforms previous state-of-the-art models on both benchmarks.

Automatic radiology reporting has great clinical potential to relieve radiologists from heavy workloads and improve diagnosis interpretation. Recently, researchers have enhanced data-driven neural networks with medical knowledge graphs to eliminate the severe visual and textual bias in this task. The structures of such graphs exploit the clinical dependencies among disease topic tags drawn from general knowledge, and usually do not update during the training process. Consequently, the fixed graphs can not guarantee the most appropriate scope of knowledge and limit the effectiveness. To address the limitation, we propose a knowledge graph with Dynamic structure and nodes to facilitate chest X-ray report generation with Contrastive Learning, named DCL. In detail, the fundamental structure of our graph is pre-constructed from general knowledge. Then we explore specific knowledge extracted from the retrieved reports to add additional nodes or redefine their relations in a bottom-up manner. Each image feature is integrated with its very own updated graph before being fed into the decoder module for report generation. Finally, this paper introduces Image-Report Contrastive and Image-Report Matching losses to better represent visual features and textual information. Evaluated on IU-Xray and MIMIC-CXR datasets, our DCL outperforms previous state-of-the-art models on these two benchmarks.

BiCro: Noisy Correspondence Rectification for Multi-Modality Data via Bi-Directional Cross-Modal Similarity Consistency
Yang, Shuo and Xu, Zhaopan and Wang, Kai and You, Yang and Yao, Hongxun and Liu, Tongliang and Xu, Min



Research question: This paper addresses cross-modal matching, one of the most fundamental techniques in multimodal learning, which projects various sensory modalities into a shared feature space.
Motivation: Massive, correctly aligned data pairs are essential for model training, yet multimodal datasets are hard to collect and annotate precisely. Co-occurring data pairs gathered from the Internet (such as image-text pairs) are widely used instead, but these cheaply collected datasets inevitably contain many mismatched pairs, which have been shown to hurt model performance.
Method: Proposes BiCro (Bidirectional Cross-modal similarity consistency), a general framework that can be easily integrated into existing cross-modal matching models to improve their robustness to noisy data. Specifically, BiCro estimates soft labels for noisy data pairs that reflect their true degree of correspondence.
Results: Experiments on three popular cross-modal matching datasets show that the method significantly improves the noise robustness of various matching models and surpasses the state of the art by a clear margin.

As one of the most fundamental techniques in multimodal learning, cross-modal matching aims to project various sensory modalities into a shared feature space. To achieve this, massive and correctly aligned data pairs are required for model training. However, unlike unimodal datasets, multimodal datasets are far harder to collect and annotate precisely. As an alternative, the co-occurred data pairs (e.g., image-text pairs) collected from the Internet have been widely exploited in the area. Unfortunately, the cheaply collected dataset unavoidably contains many mismatched data pairs, which have been proven to be harmful to the model's performance. To address this, we propose a general framework called BiCro (Bidirectional Cross-modal similarity consistency), which can be easily integrated into existing cross-modal matching models and improve their robustness against noisy data. Specifically, BiCro aims to estimate soft labels for noisy data pairs to reflect their true correspondence degree. The basic idea of BiCro is motivated by that -- taking image-text matching as an example -- similar images should have similar textual descriptions and vice versa. Then the consistency of these two similarities can be recast as the estimated soft labels to train the matching model. The experiments on three popular cross-modal matching datasets demonstrate that our method significantly improves the noise-robustness of various matching models, and surpasses the state-of-the-art by a clear margin.
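The "similar images should have similar textual descriptions" intuition can be reduced to a tiny illustrative formula. This is an assumed, simplified form of the consistency signal, not BiCro's exact estimator: when a pair's image-side similarity to an anchor agrees with its text-side similarity to the anchor's caption, the pair is likely a true correspondence and gets a high soft label.

```python
def bicro_soft_label(sim_img, sim_txt):
    """Illustrative bidirectional-consistency soft label (assumed formula,
    not the paper's estimator): high when the image-side and text-side
    similarities to an anchor pair agree, low when they conflict."""
    return 1.0 - abs(sim_img - sim_txt)

# A matched pair: a similar image comes with a similar caption.
clean = bicro_soft_label(sim_img=0.9, sim_txt=0.85)
# A mismatched pair: the image resembles the anchor but the caption does not.
noisy = bicro_soft_label(sim_img=0.9, sim_txt=0.1)
```

Soft labels of this kind then replace the hard 0/1 match targets when training the matching model, so mismatched web pairs contribute weak rather than misleading supervision.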

Beyond Appearance: A Semantic Controllable Self-Supervised Learning Framework for Human-Centric Visual Tasks
Chen, Weihua and Xu, Xianzhe and Jia, Jian and Luo, Hao and Wang, Yaohua and Wang, Fan and Jin, Rong and Sun, Xiuyu



Research question: How to learn a general human representation from massive unlabelled human images that benefits downstream human-centric tasks to the maximum extent.
Motivation: Human-centric visual tasks are attracting increasing research attention due to their broad applications, yet existing self-supervised learning methods fail to exploit the prior knowledge in human images to build pseudo semantic labels and import more semantic information into the learned representation.
Method: Proposes SOLIDER, a self-supervised learning framework that uses prior knowledge from human images to build pseudo semantic labels and import more semantic information into the learned representation. Since different downstream tasks need different ratios of semantic and appearance information, SOLIDER also introduces a conditional network with a semantic controller, letting users adjust the controller to produce representations with the desired proportion of semantic information.
Results: SOLIDER is verified on six downstream human-centric visual tasks, outperforming the state of the art on each and establishing new baselines for these tasks.

Human-centric visual tasks have attracted increasing research attention due to their widespread applications. In this paper, we aim to learn a general human representation from massive unlabeled human images which can benefit downstream human-centric tasks to the maximum extent. We call this method SOLIDER, a Semantic cOntrollable seLf-supervIseD lEaRning framework. Unlike the existing self-supervised learning methods, prior knowledge from human images is utilized in SOLIDER to build pseudo semantic labels and import more semantic information into the learned representation. Meanwhile, we note that different downstream tasks always require different ratios of semantic information and appearance information. For example, human parsing requires more semantic information, while person re-identification needs more appearance information for identification purposes. So a single learned representation cannot fit all requirements. To solve this problem, SOLIDER introduces a conditional network with a semantic controller. After the model is trained, users can send values to the controller to produce representations with different ratios of semantic information, which can fit different needs of downstream tasks. Finally, SOLIDER is verified on six downstream human-centric visual tasks. It outperforms the state of the art and establishes new baselines for these tasks. The code is released at https://github.com/tinyvision/SOLIDER.

Position-Guided Text Prompt for Vision-Language Pre-Training
Wang, JinpengandZhou, PanandShou, MikeZhengandYan, Shuicheng



Research problem: This paper addresses the lack of visual grounding/localization capability in Vision-Language Pre-training (VLP) models on many downstream tasks.
Motivation: VLP models align image-text pairs well, but they often lack the visual grounding/localization ability that is critical for downstream tasks such as visual reasoning.
Method: Propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, PTP divides the image into NxN blocks and identifies the objects in each block with the object detector widely used in VLP. It then reformulates visual grounding as a fill-in-the-blank problem: given a PTP, the model is encouraged to predict the objects in a given block or regress the block of a given object, e.g., filling in "P" or "O" in the prompt "The block P has a O".
Results: Introducing PTP into several state-of-the-art VLP frameworks yields significant improvements across representative cross-modal architectures and benchmarks, e.g., +4.8 average recall@1 on zero-shot Flickr30K retrieval for the ViLT baseline and +5.3 CIDEr on COCO captioning for the BLIP baseline. Moreover, because PTP discards its object detector at inference time while object-detector-based methods cannot, it achieves comparable results with much faster inference.

Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability which is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into NxN blocks, and identifies the objects in each block through the widely used object detector in VLP. It then reformulates the visual grounding task into a fill-in-the-blank problem given a PTP by encouraging the model to predict the objects in the given blocks or regress the blocks of a given object, e.g. filling "P" or "O" in a PTP "The block P has a O". This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistently significant improvements across representative cross-modal learning model architectures and several benchmarks, e.g. zero-shot Flickr30K Retrieval (+4.8 in average recall@1) for the ViLT baseline, and COCO Captioning (+5.3 in CIDEr) for the SOTA BLIP baseline. Moreover, PTP achieves comparable results with object-detector based methods, and much faster inference speed since PTP discards its object detector for inference while the latter cannot. Our code and pre-trained weights will be released.
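The fill-in-the-blank prompt construction is easy to picture in code. The sketch below shows how position-guided prompts of the form "The block P has a O" could be assembled from detector output on an NxN grid; the grid size, toy detections, and helper names (`block_index`, `make_ptp`) are invented for illustration and are not the paper's implementation.

```python
def block_index(cx, cy, width, height, n=3):
    """Map an object's center (cx, cy) to an index in an n x n grid."""
    col = min(int(cx / width * n), n - 1)
    row = min(int(cy / height * n), n - 1)
    return row * n + col

def make_ptp(detections, width, height, n=3):
    """Build one fill-in-the-blank prompt per detected object.

    detections: list of (label, cx, cy) tuples from any object detector.
    """
    return [f"The block {block_index(cx, cy, width, height, n)} has a {label}"
            for label, cx, cy in detections]

# A toy 300x300 image with two invented detections.
prompts = make_ptp([("dog", 50, 50), ("ball", 250, 250)], 300, 300)
print(prompts)  # ['The block 0 has a dog', 'The block 8 has a ball']
```

During pre-training the model would then be asked to recover either the block index or the object word from such prompts.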

An Empirical Study of End-to-End Video-Language Transformers With Masked Visual Modeling
Fu, Tsu-JuiandLi, LinjieandGan, ZheandLin, KevinandWang, WilliamYangandWang, LijuanandLiu, Zicheng



Research problem: This paper explores how effective masked visual modeling (MVM) is as a pre-training objective for video-language learning.
Motivation: Although reconstructive objectives on video inputs, such as masked frame modeling, have been explored in video-language (VidL) pre-training, previous studies failed to find a truly effective MVM strategy that substantially benefits downstream performance.
Method: Building on the fully end-to-end VIdeO-LanguagE Transformer (VIOLET), systematically examine the potential of MVM for VidL learning, exploring eight reconstructive targets ranging from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features.
Results: Experiments show that VIOLETv2, pre-trained with the MVM objective, achieves notable improvements on 13 VidL benchmarks covering video question answering, video captioning, and text-to-video retrieval.

Masked visual modeling (MVM) has been recently proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, previous studies fail to find a truly effective MVM strategy that can largely benefit the downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where the supervision from MVM training can be backpropagated to the video pixel space. In total, eight different reconstructive targets of MVM are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features. We conduct comprehensive experiments and provide insights into the factors leading to effective MVM training, resulting in an enhanced model VIOLETv2. Empirically, we show VIOLETv2 pre-trained with MVM objective achieves notable improvements on 13 VidL benchmarks, ranging from video question answering, video captioning, to text-to-video retrieval.

Revisiting Temporal Modeling for CLIP-Based Image-to-Video Knowledge Transferring
Liu, RuyangandHuang, JingjiaandLi, GeandFeng, JiashiandWu, XinglongandLi, ThomasH.



Research problem: How to extend image-text pretrained models such as CLIP to the video domain, in particular how to perform effective temporal modeling for image-to-video knowledge transfer.
Motivation: Image-text pretrained models have learned impressive multimodal knowledge from large-scale image-text pairs, offering the potential to improve visual representation learning in the video domain. However, existing temporal modeling mechanisms are tailored to either high-level semantic-dominant tasks or low-level visual-pattern-dominant tasks, and cannot handle both simultaneously.
Method: Propose the Spatial-Temporal Auxiliary Network (STAN), a simple yet effective temporal modeling mechanism whose decomposed spatial-temporal modules spatial-temporally contextualize multi-level CLIP features, enabling both low-level and high-level knowledge transfer.
Results: Extensive experiments on two representative video tasks, video-text retrieval and video recognition, show that STAN outperforms state-of-the-art methods on various datasets, including MSR-VTT, DiDeMo, LSMDC, MSVD, Kinetics-400, and Something-Something-V2.

Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs, thus attracting increasing attention for their potential to improve visual representation learning in the video domain. In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transferring, which is the key point for extending image-text pretrained models to the video domain. We find that current temporal modeling mechanisms are tailored to either high-level semantic-dominant tasks (e.g., retrieval) or low-level visual pattern-dominant tasks (e.g., recognition), and fail to work on the two cases simultaneously. The key difficulty lies in modeling temporal dependency while taking advantage of both high-level and low-level knowledge in CLIP model. To tackle this problem, we present Spatial-Temporal Auxiliary Network (STAN) -- a simple and effective temporal modeling mechanism extending CLIP model to diverse video tasks. Specifically, to realize both low-level and high-level knowledge transferring, STAN adopts a branch structure with decomposed spatial-temporal modules that enable multi-level CLIP features to be spatial-temporally contextualized. We evaluate our method on two representative video tasks: Video-Text Retrieval and Video Recognition. Extensive experiments demonstrate the superiority of our model over the state-of-the-art methods on various datasets, including MSR-VTT, DiDeMo, LSMDC, MSVD, Kinetics-400, and Something-Something-V2. Codes will be available at https://github.com/farewellthree/STAN

Learning To Name Classes for Vision and Language Models
Parisot, SarahandYang, YongxinandMcDonagh, Steven



Research problem: How to achieve zero-shot recognition with large-scale vision-and-language models while addressing their sensitivity to handcrafted class names and the difficulty of adapting them to new, small datasets.
Motivation: Current models excel at zero-shot recognition but are overly sensitive to the choice of class names and hard to adapt to new, small datasets.
Method: Propose to leverage available data to learn, for each class, an optimal word embedding as a function of the visual content. Learning new word embeddings on an otherwise frozen model retains zero-shot capability for new classes, allows easy adaptation to new datasets, and permits adjusting potentially erroneous, non-descriptive, or ambiguous class names.
Results: The method integrates easily into image classification and object detection pipelines, yields significant performance gains in multiple scenarios, and provides insights into model biases and labeling errors.

Large scale vision and language models can achieve impressive zero-shot recognition performance by mapping class specific text queries to image content. Two distinct challenges that remain however, are high sensitivity to the choice of handcrafted class names that define queries, and the difficulty of adaptation to new, smaller datasets. Towards addressing these problems, we propose to leverage available data to learn, for each class, an optimal word embedding as a function of the visual content. By learning new word embeddings on an otherwise frozen model, we are able to retain zero-shot capabilities for new classes, easily adapt models to new datasets, and adjust potentially erroneous, non-descriptive or ambiguous class names. We show that our solution can easily be integrated in image classification and object detection pipelines, yields significant performance gains in multiple scenarios and provides insights into model biases and labelling errors.

Coreset Sampling From Open-Set for Fine-Grained Self-Supervised Learning
Kim, SungnyunandBae, SangminandYun, Se-Young



Research problem: How to pretrain on a large-scale unlabeled open-set for fine-grained tasks while handling the distribution mismatch between the open-set and the target dataset.
Motivation: Fine-grained tasks rely heavily on expert knowledge for annotation and require a versatile model for various downstream tasks within a specific domain. Recent self-supervised learning (SSL) pretrains models without annotations and serves as an effective initialization for any downstream task.
Method: Introduce a novel open-set self-supervised learning problem that assumes a large-scale unlabeled open-set is available alongside the fine-grained target dataset during pretraining. Propose the SimCore algorithm, which samples a coreset, i.e., the subset of the open-set with minimum distance to the target dataset in latent space, to address the distribution mismatch between the two.
Results: Across extensive experimental settings covering eleven fine-grained datasets and seven open-sets, SimCore significantly improves representation learning performance.

Deep learning in general domains has constantly been extended to domain-specific tasks requiring the recognition of fine-grained characteristics. However, real-world applications for fine-grained tasks suffer from two challenges: a high reliance on expert knowledge for annotation and necessity of a versatile model for various downstream tasks in a specific domain (e.g., prediction of categories, bounding boxes, or pixel-wise annotations). Fortunately, the recent self-supervised learning (SSL) is a promising approach to pretrain a model without annotations, serving as an effective initialization for any downstream tasks. Since SSL does not rely on the presence of annotation, in general, it utilizes the large-scale unlabeled dataset, referred to as an open-set. In this sense, we introduce a novel Open-Set Self-Supervised Learning problem under the assumption that a large-scale unlabeled open-set is available, as well as the fine-grained target dataset, during a pretraining phase. In our problem setup, it is crucial to consider the distribution mismatch between the open-set and target dataset. Hence, we propose SimCore algorithm to sample a coreset, the subset of an open-set that has a minimum distance to the target dataset in the latent space. We demonstrate that SimCore significantly improves representation learning performance through extensive experimental settings, including eleven fine-grained datasets and seven open-sets in various downstream tasks.
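The coreset idea, keeping the open-set samples closest to the target dataset in latent space, can be illustrated with a minimal pure-Python sketch. The 2-D "features", the min-distance-to-target criterion, and the function names are simplifications invented here; SimCore itself operates on learned SSL embeddings.

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sample_coreset(open_feats, target_feats, k):
    """Return indices of the k open-set features nearest to the target set.

    Each open-set sample is scored by its distance to the closest
    target sample (a simple min-distance criterion).
    """
    scores = [min(dist2(o, t) for t in target_feats) for o in open_feats]
    order = sorted(range(len(open_feats)), key=lambda i: scores[i])
    return sorted(order[:k])

# Target features cluster near the origin; the open-set mixes near/far samples.
target = [[0.0, 0.0], [0.1, 0.0]]
open_set = [[5.0, 5.0], [0.2, 0.1], [9.0, 1.0], [0.0, 0.3]]
print(sample_coreset(open_set, target, 2))  # [1, 3]
```

Pretraining would then proceed on the target dataset augmented with the selected coreset rather than the whole open-set.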

Divide and Conquer: Answering Questions With Object Factorization and Compositional Reasoning
Chen, ShiandZhao, Qi



Research problem: Existing visual reasoning methods cannot handle novel objects or spurious real-world biases, and cannot explain the rationales behind their decisions.
Motivation: Inspired by how humans reason about the visual world, we tackle these challenges from a compositional perspective.
Method: Propose an integral framework consisting of a principled object factorization method and a novel neural module network. The factorization method decomposes objects according to their key characteristics and automatically derives prototypes that represent a wide range of objects.
Results: The framework answers questions involving diverse objects regardless of whether they were available during training, and overcomes biased question-answer distributions. Beyond this enhanced generalizability, it also provides an interpretable interface for understanding the model's decision-making process.

Humans have the innate capability to answer diverse questions, which is rooted in the natural ability to correlate different concepts based on their semantic relationships and decompose difficult problems into sub-tasks. On the contrary, existing visual reasoning methods assume training samples that capture every possible object and reasoning problem, and rely on black-boxed models that commonly exploit statistical priors. They have yet to develop the capability to address novel objects or spurious biases in real-world scenarios, and also fall short of interpreting the rationales behind their decisions. Inspired by humans' reasoning of the visual world, we tackle the aforementioned challenges from a compositional perspective, and propose an integral framework consisting of a principled object factorization method and a novel neural module network. Our factorization method decomposes objects based on their key characteristics, and automatically derives prototypes that represent a wide range of objects. With these prototypes encoding important semantics, the proposed network then correlates objects by measuring their similarity on a common semantic space and makes decisions with a compositional reasoning process. It is capable of answering questions with diverse objects regardless of their availability during training, and overcoming the issues of biased question-answer distributions. In addition to the enhanced generalizability, our framework also provides an interpretable interface for understanding the decision-making process of models. Our code is available at https://github.com/szzexpoi/POEM.

SceneTrilogy: On Human Scene-Sketch and Its Complementarity With Photo and Text
Chowdhury, PinakiNathandBhunia, AyanKumarandSain, AneeshanandKoley, SubhadeepandXiang, TaoandSong, Yi-Zhe



Research problem: How to extend scene understanding to human sketch and obtain a complete scene representation from three diverse, complementary modalities: sketch, photo, and text.
Motivation: Rather than the rigid three-way embeddings learned by existing approaches, the goal is a flexible joint embedding that fully supports the "optionality" this complementarity brings.
Method: First, a combination of an information bottleneck and conditional invertible neural networks disentangles the modality-specific component from the modality-agnostic one in sketch, photo, and text; the modality-agnostic instances from the three modalities are then synergized with a modified cross-attention mechanism.
Results: Experiments show that the learned embedding accommodates a multitude of scene-related tasks, including some enabled for the first time by the inclusion of sketch, without any task-specific modifications.

In this paper, we extend scene understanding to include that of human sketch. The result is a complete trilogy of scene representation from three diverse and complementary modalities -- sketch, photo, and text. Instead of learning a rigid three-way embedding and being done with it, we focus on learning a flexible joint embedding that fully supports the "optionality" that this complementarity brings. Our embedding supports optionality on two axes: (i) optionality across modalities -- use any combination of modalities as query for downstream tasks like retrieval, (ii) optionality across tasks -- simultaneously utilising the embedding for either discriminative (e.g., retrieval) or generative tasks (e.g., captioning). This provides flexibility to end-users by exploiting the best of each modality, therefore serving the very purpose behind our proposal of a trilogy in the first place. First, a combination of information-bottleneck and conditional invertible neural networks disentangle the modality-specific component from modality-agnostic in sketch, photo, and text. Second, the modality-agnostic instances from sketch, photo, and text are synergised using a modified cross-attention. Once learned, we show our embedding can accommodate a multi-facet of scene-related tasks, including those enabled for the first time by the inclusion of sketch, all without any task-specific modifications. Project Page: http://www.pinakinathc.me/scenetrilogy

Mobile User Interface Element Detection via Adaptively Prompt Tuning
Gu, ZhangxuanandXu, ZhuoerandChen, HaoxingandLan, JunandMeng, ChanghuaandWang, Weiqiang



Research problem: Existing object detection methods struggle with Mobile User Interface (MUI) elements, which carry additional Optical Character Recognition (OCR) information.
Motivation: MUI elements contain extra OCR information that describes their content and function but is often ignored, so a new method is needed that exploits this information effectively.
Method: Present MUI-zh, a new MUI element detection dataset, and design an Adaptively Prompt Tuning (APT) module that takes advantage of the discriminative OCR information. APT is a lightweight yet effective module that jointly optimizes category prompts across different modalities.
Results: Experiments with several existing CLIP-based detectors show that the method achieves considerable improvements on both datasets.

Recent object detection approaches rely on pretrained vision-language models for image-text alignment. However, they fail to detect the Mobile User Interface (MUI) element since it contains additional OCR information, which describes its content and function but is often ignored. In this paper, we develop a new MUI element detection dataset named MUI-zh and propose an Adaptively Prompt Tuning (APT) module to take advantage of discriminating OCR information. APT is a lightweight and effective module to jointly optimize category prompts across different modalities. For every element, APT uniformly encodes its visual features and OCR descriptions to dynamically adjust the representation of frozen category prompts. We evaluate the effectiveness of our plug-and-play APT upon several existing CLIP-based detectors for both standard and open-vocabulary MUI element detection. Extensive experiments show that our method achieves considerable improvements on two datasets. The dataset is available at github.com/antmachineintelligence/MUI-zh.

Generating Human Motion From Textual Descriptions With Discrete Representations
Zhang, JianrongandZhang, YangsongandCun, XiaodongandZhang, YongandZhao, HongweiandLu, HongtaoandShen, XiandShan, Ying



Research problem: Generate human motion from textual descriptions using a conditional generative framework based on a Vector Quantised-Variational AutoEncoder (VQ-VAE) and a Generative Pre-trained Transformer (GPT).
Motivation: Despite progress in human motion generation, existing methods still suffer from training-testing discrepancy and dataset limitations.
Method: Combine a simple CNN-based VQ-VAE with GPT, using the common EMA and code-reset training recipes and introducing a simple corruption strategy to alleviate the training-testing discrepancy.
Results: Experiments show the method outperforms competitive approaches on human motion generation; e.g., on HumanML3D, currently the largest dataset, it achieves comparable text-to-motion consistency (R-Precision) while its FID of 0.116 largely outperforms MotionDiffuse's 0.630. Analysis, however, indicates that dataset size is a limitation of the approach.

In this work, we investigate a simple and widely known conditional generative framework based on Vector Quantised-Variational AutoEncoder (VQ-VAE) and Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations. For GPT, we incorporate a simple corruption strategy during training to alleviate training-testing discrepancy. Despite its simplicity, our T2M-GPT shows better performance than competitive approaches, including recent diffusion-based approaches. For example, on HumanML3D, which is currently the largest dataset, we achieve comparable performance on the consistency between text and generated motion (R-Precision), but with an FID of 0.116 that largely outperforms MotionDiffuse's 0.630. Additionally, we conduct analyses on HumanML3D and observe that the dataset size is a limitation of our approach. Our work suggests that VQ-VAE still remains a competitive approach for human motion generation. Our implementation is available on the project page: https://mael-zys.github.io/T2M-GPT/
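The corruption strategy can be pictured as randomly replacing some ground-truth motion tokens during GPT training, so the model learns to condition on imperfect prefixes like the ones it produces at test time. The sketch below is illustrative only; the corruption rate, vocabulary size, and function name are assumptions rather than the paper's implementation.

```python
import random

def corrupt_tokens(tokens, vocab_size, p=0.5, rng=None):
    """Replace each token with a random codebook id with probability p."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    return [rng.randrange(vocab_size) if rng.random() < p else t
            for t in tokens]

# A toy sequence of VQ-VAE codebook indices.
codes = [3, 1, 4, 1, 5, 9, 2, 6]
noisy = corrupt_tokens(codes, vocab_size=512, p=0.5)
assert len(noisy) == len(codes)  # length is preserved; some ids may differ
```

The GPT would then be trained to predict the clean next token given such corrupted prefixes.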

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
Li, HaoandZhu, JinguoandJiang, XiaohuandZhu, XizhouandLi, HongshengandYuan, ChunandWang, XiaohuaandQiao, YuandWang, XiaogangandWang, WenhaiandDai, Jifeng



Research problem: How to eliminate the inconsistency between the task-specific fine-tuning paradigm of pretrained models and the goal of general perception modeling.
Motivation: Existing generalist models are inadequate in both versatility and performance.
Method: Propose Uni-Perceiver v2, the first generalist model capable of handling major large-scale vision and vision-language tasks. Images are encoded as general region proposals, while texts are encoded via a Transformer-based language model; the encoded representations are then transformed by a task-agnostic decoder, and different tasks are formulated as a unified maximum likelihood estimation problem. An effective optimization technique, Task-Balanced Gradient Normalization, is further proposed to ensure stable multi-task learning.
Results: Experiments show that Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance, while achieving competitive results on a broad range of vision and vision-language tasks compared with well-recognized strong baselines that require task-specific fine-tuning.

Despite the remarkable success of foundation models, their task-specific fine-tuning paradigm makes them inconsistent with the goal of general perception modeling. The key to eliminating this inconsistency is to use generalist models for general task modeling. However, existing attempts at generalist models are inadequate in both versatility and performance. In this paper, we propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance. Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model. The encoded representations are transformed by a task-agnostic decoder. Different tasks are formulated as a unified maximum likelihood estimation problem. We further propose an effective optimization technique named Task-Balanced Gradient Normalization to ensure stable multi-task learning with an unmixed sampling strategy, which is helpful for tasks requiring large batch-size training. After being jointly trained on various tasks, Uni-Perceiver v2 is capable of directly handling downstream tasks without any task-specific adaptation. Results show that Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance. Meanwhile, compared with the commonly-recognized strong baselines that require task-specific fine-tuning, Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
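One plausible reading of "Task-Balanced Gradient Normalization" is to rescale each task's gradient to unit norm before averaging, so that no single task's loss scale dominates the shared update. The sketch below follows that reading with toy vectors; the paper's exact scheme may differ.

```python
def l2_norm(v):
    """Euclidean length of a gradient vector."""
    return sum(x * x for x in v) ** 0.5

def balanced_gradient(task_grads):
    """Average per-task gradients after normalizing each to unit length.

    Zero gradients are skipped to avoid division by zero.
    """
    unit = [[x / l2_norm(g) for x in g] for g in task_grads if l2_norm(g) > 0]
    n = len(unit)
    return [sum(g[i] for g in unit) / n for i in range(len(unit[0]))]

# A large-magnitude task gradient no longer drowns out a small one.
g = balanced_gradient([[3.0, 4.0], [0.0, 0.1]])
print(g)  # roughly [0.3, 0.9]
```

Without the normalization, the first task's gradient (norm 5) would contribute 50x more than the second (norm 0.1) to the shared parameters.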

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning With Multimodal Models
Lin, ZhiqiuandYu, SamuelandKuang, ZhiyiandPathak, DeepakandRamanan, Deva



Research problem: How to exploit cross-modal information for few-shot learning so that new concepts can be understood more efficiently.
Motivation: Classical few-shot benchmarks use few-shot samples from a single modality, which may not suffice to characterize an entire concept class; humans, by contrast, use cross-modal information to learn new concepts efficiently.
Method: Build a better visual dog classifier by reading about dogs and listening to them bark. Specifically, exploit the fact that recent multimodal foundation models such as CLIP are inherently cross-modal, mapping different modalities into the same representation space, and learn from few-shot examples spanning different modalities.
Results: By repurposing class names as additional one-shot training samples, the approach achieves state-of-the-art results for vision-language adaptation, and it also improves existing methods such as prefix tuning and classifier ensembling. Finally, to explore modalities beyond vision and language, the authors construct the first (to their knowledge) audiovisual few-shot benchmark and use cross-modal training to improve both image and audio classification.

The ability to quickly learn a new task with minimal instruction - known as few-shot learning - is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better visual dog classifier by reading about dogs and listening to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP are inherently cross-modal, mapping different modalities to the same representation space. Specifically, we propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities. By repurposing class names as additional one-shot training samples, we achieve SOTA results with an embarrassingly simple linear classifier for vision-language adaptation. Furthermore, we show that our approach can benefit existing methods such as prefix tuning and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification. We hope our success can inspire future works to embrace cross-modality for even broader domains and tasks.
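The core trick, repurposing a class name as an extra one-shot training sample, works because CLIP-style encoders place images and text in one shared space. The toy sketch below pools invented image-shot embeddings with an invented class-name embedding in a nearest-centroid classifier (a special case of a linear classifier); all vectors and names are stand-ins for illustration, not CLIP outputs.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def fit_centroids(image_shots, text_embeds):
    """image_shots: {class: [embedding, ...]}; text_embeds: {class: embedding}."""
    return {c: centroid(shots + [text_embeds[c]])  # text acts as one extra shot
            for c, shots in image_shots.items()}

def predict(x, centroids):
    """Assign x to the class with the nearest centroid."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda c: dist2(x, centroids[c]))

image_shots = {"dog": [[1.0, 0.1]], "cat": [[0.1, 1.0]]}
text_embeds = {"dog": [0.9, 0.0], "cat": [0.0, 0.9]}
model = fit_centroids(image_shots, text_embeds)
print(predict([0.8, 0.2], model))  # dog
```

In the paper's setting, the same idea is realized with a linear classifier trained on the mixed-modality shots rather than plain centroids.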

3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions
Decatur, DaleandLang, ItaiandHanocka, Rana



Research problem: How to localize semantic regions on a 3D mesh using text as input.
Motivation: Current systems cannot interpret "out-of-domain" localizations; the goal is a system that can reason about where to place non-obviously related concepts on a 3D shape.
Method: Propose 3D Highlighter, which contextualizes the text description with a neural field and colors the corresponding region of the shape using a probability-weighted blend. The neural optimization is guided by a pre-trained CLIP encoder, bypassing the need for any 3D datasets or 3D annotations.
Results: Experiments show that 3D Highlighter is highly flexible and general, producing localizations on a wide variety of input shapes.

We present 3D Highlighter, a technique for localizing semantic regions on a mesh using text as input. A key feature of our system is the ability to interpret "out-of-domain" localizations. Our system demonstrates the ability to reason about where to place non-obviously related concepts on an input 3D shape, such as adding clothing to a bare 3D animal model. Our method contextualizes the text description using a neural field and colors the corresponding region of the shape using a probability-weighted blend. Our neural optimization is guided by a pre-trained CLIP encoder, which bypasses the need for any 3D datasets or 3D annotations. Thus, 3D Highlighter is highly flexible, general, and capable of producing localizations on a myriad of input shapes.
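The probability-weighted blend can be made concrete: each mesh vertex receives a highlight probability from the neural field, and its color interpolates between the base color and the highlight color. The colors, probabilities, and function name below are invented for illustration.

```python
def blend_vertex_colors(base_colors, highlight, probs):
    """Blend each vertex color toward the highlight color.

    base_colors: per-vertex RGB triples; highlight: one RGB triple;
    probs: per-vertex highlight probabilities in [0, 1].
    """
    return [[(1.0 - p) * b + p * h for b, h in zip(base, highlight)]
            for base, p in zip(base_colors, probs)]

# Two gray vertices: the first is strongly highlighted, the second not at all.
gray = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
colored = blend_vertex_colors(gray, [1.0, 0.0, 0.0], [0.8, 0.0])
print(colored[0])  # roughly [0.9, 0.1, 0.1]
```

Because the probabilities are differentiable outputs of the neural field, the CLIP-guided loss can be backpropagated through this blend.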

PLA: Language-Driven Open-Vocabulary 3D Scene Understanding
Ding, RunyuandYang, JihanandXue, ChuhuiandZhang, WenqingandBai, SongandQi, Xiaojuan



Research problem: How to localize and recognize unseen categories beyond the annotated label space, i.e., open-vocabulary 3D scene understanding.
Motivation: The recent breakthrough in 2D open-vocabulary perception is largely driven by Internet-scale paired image-text data with rich vocabulary concepts, but this success cannot be transferred directly to 3D scenarios because large-scale 3D-text pairs are inaccessible.
Method: Distill the knowledge encoded in pre-trained vision-language foundation models by captioning multi-view images from 3D, explicitly associating 3D with semantic-rich captions; design hierarchical 3D-caption pairs that leverage geometric constraints between 3D scenes and multi-view images; and, via contrastive learning, learn language-aware embeddings that connect 3D and text for open-vocabulary tasks.
Results: The method remarkably outperforms baselines by 25.8%-44.7% hIoU and 14.5%-50.4% hAP_50 on open-vocabulary semantic and instance segmentation, and shows robust transferability on challenging zero-shot domain transfer tasks.

Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space. The recent breakthrough of 2D open-vocabulary perception is largely driven by Internet-scale paired image-text data with rich vocabulary concepts. However, this success cannot be directly transferred to 3D scenarios due to the inaccessibility of large-scale 3D-text pairs. To this end, we propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models through captioning multi-view images from 3D, which allows explicitly associating 3D and semantic-rich captions. Further, to foster coarse-to-fine visual-semantic representation learning from captions, we design hierarchical 3D-caption pairs, leveraging geometric constraints between 3D scenes and multi-view images. Finally, by employing contrastive learning, the model learns language-aware embeddings that connect 3D and text for open-vocabulary tasks. Our method not only remarkably outperforms baseline methods by 25.8% ~ 44.7% hIoU and 14.5% ~ 50.4% hAP_50 in open-vocabulary semantic and instance segmentation, but also shows robust transferability on challenging zero-shot domain transfer tasks. See the project website at https://dingry.github.io/projects/PLA.

Visual Programming: Compositional Visual Reasoning Without Training
Gupta, TanmayandKembhavi, Aniruddha



Research problem: This paper presents VISPROG, a neuro-symbolic approach to solving complex and compositional visual tasks.
Motivation: VISPROG avoids the need for any task-specific training while still producing both a solution and a comprehensive, interpretable rationale.
Method: VISPROG uses the in-context learning ability of large language models to generate Python-like modular programs, each line of which may invoke off-the-shelf computer vision models, image processing routines, or Python functions to produce intermediate outputs consumed by subsequent parts of the program.
Results: VISPROG demonstrates its flexibility on four diverse tasks: compositional visual question answering, zero-shot reasoning on image pairs, factual-knowledge object tagging, and language-guided image editing. The authors argue that neuro-symbolic approaches like VISPROG are an exciting avenue for easily and effectively expanding the scope of AI systems toward the complex tasks people may wish to perform.

We present VISPROG, a neuro-symbolic approach to solving complex and compositional visual tasks given natural language instructions. VISPROG avoids the need for any task-specific training. Instead, it uses the in-context learning ability of large language models to generate python-like modular programs, which are then executed to get both the solution and a comprehensive and interpretable rationale. Each line of the generated program may invoke one of several off-the-shelf computer vision models, image processing routines, or python functions to produce intermediate outputs that may be consumed by subsequent parts of the program. We demonstrate the flexibility of VISPROG on 4 diverse tasks - compositional visual question answering, zero-shot reasoning on image pairs, factual knowledge object tagging, and language-guided image editing. We believe neuro-symbolic approaches like VISPROG are an exciting avenue to easily and effectively expand the scope of AI systems to serve the long tail of complex tasks that people may wish to perform.

Freestyle Layout-to-Image Synthesis
Xue, HanandHuang, ZhiwuandSun, QianruandSong, LiandZhang, Wenjun



Research problem: This paper explores the freestyle capability of layout-to-image synthesis (LIS) models, i.e., their ability to generate unseen semantics (such as classes, attributes, and styles) onto a given layout.
Motivation: Thanks to large-scale pre-trained language-image models, several discriminative models trained on limited base classes (e.g., for image classification and object detection) have been empowered to predict unseen classes. Inspired by this, we leverage large-scale pre-trained text-to-image diffusion models to achieve the generation of unseen semantics.
Method: Introduce a new module called Rectified Cross-Attention (RCA) that can be conveniently plugged into a diffusion model to integrate semantic masks. This "plug-in" is applied in each cross-attention layer of the model to rectify the attention maps between image and text tokens.
Results: Extensive experiments show that the proposed diffusion network produces realistic, freestyle layout-to-image results with diverse text inputs, with high potential to spawn a range of interesting applications.

Typical layout-to-image synthesis (LIS) models generate images for a closed set of semantic classes, e.g., 182 common objects in COCO-Stuff. In this work, we explore the freestyle capability of the model, i.e., how far can it generate unseen semantics (e.g., classes, attributes, and styles) onto a given layout, and call the task Freestyle LIS (FLIS). Thanks to the development of large-scale pre-trained language-image models, a number of discriminative models (e.g., image classification and object detection) trained on limited base classes are empowered with the ability of unseen class prediction. Inspired by this, we opt to leverage large-scale pre-trained text-to-image diffusion models to achieve the generation of unseen semantics. The key challenge of FLIS is how to enable the diffusion model to synthesize images from a specific layout which very likely violates its pre-learned knowledge, e.g., the model never sees "a unicorn sitting on a bench" during its pre-training. To this end, we introduce a new module called Rectified Cross-Attention (RCA) that can be conveniently plugged in the diffusion model to integrate semantic masks. This "plug-in" is applied in each cross-attention layer of the model to rectify the attention maps between image and text tokens. The key idea of RCA is to enforce each text token to act on the pixels in a specified region, allowing us to freely put a wide variety of semantics from pre-trained knowledge (which is general) onto the given layout (which is specific). Extensive experiments show that the proposed diffusion network produces realistic and freestyle layout-to-image generation results with diverse text inputs, which has a high potential to spawn a bunch of interesting applications. Code is available at https://github.com/essunny310/FreestyleNet.
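The rectification idea, forcing each text token to act only on the pixels of its layout region, can be sketched as masking and re-normalizing a cross-attention map. The toy shapes and attention scores below are illustrative assumptions, not the paper's implementation.

```python
def rectify_attention(attn, region_masks):
    """Zero out attention outside each token's region, then re-normalize.

    attn[i][j]: attention of pixel i to text token j (each row sums to 1).
    region_masks[j][i]: 1 if pixel i lies inside token j's layout region.
    Rows whose mass is entirely masked out are left unchanged.
    """
    out = []
    for i, row in enumerate(attn):
        masked = [a if region_masks[j][i] else 0.0 for j, a in enumerate(row)]
        s = sum(masked)
        out.append([m / s for m in masked] if s > 0 else row)
    return out

# Two pixels, two tokens; token 0 owns pixel 0 and token 1 owns pixel 1.
attn = [[0.6, 0.4], [0.3, 0.7]]
masks = [[1, 0], [0, 1]]  # masks[token][pixel]
print(rectify_attention(attn, masks))  # [[1.0, 0.0], [0.0, 1.0]]
```

After rectification, each text token's influence is confined to its specified region, which is what lets arbitrary semantics be placed onto a given layout.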

Open-Set Fine-Grained Retrieval via Prompting Vision-Language Evaluator
Wang, ShijieandChang, JianlongandLi, HaojieandWang, ZhihuiandOuyang, WanliandTian, Qi



Research problem: How to handle unknown subcategories in fine-grained retrieval in the open world.
Motivation: Existing methods target the closed-set scenario in which all subcategories are pre-defined; they struggle to capture discriminative knowledge from unknown subcategories and therefore cannot handle the unknown subcategories that are inevitable in open-world settings.
Method: Propose PLEor, a novel Prompting vision-Language Evaluator framework built on the recently introduced contrastive language-image pretraining (CLIP) model for open-set fine-grained retrieval. PLEor leverages the pre-trained CLIP model to infer category-specific discrepancies covering both pre-defined and unknown subcategories and transfers them to a backbone network trained in the closed-set scenario. A dual-prompt scheme makes the pre-trained CLIP model sensitive to these category-specific discrepancies, which are then transferred into the backbone network via a knowledge distillation mechanism.
Results: Experiments show that PLEor achieves promising performance on open-set fine-grained retrieval datasets.

Open-set fine-grained retrieval is an emerging challenge that requires an extra capability to retrieve unknown subcategories during evaluation. However, current works are rooted in the close-set scenarios, where all the subcategories are pre-defined, and make it hard to capture discriminative knowledge from unknown subcategories, consequently failing to handle the inevitable unknown subcategories in open-world scenarios. In this work, we propose a novel Prompting vision-Language Evaluator (PLEor) framework based on the recently introduced contrastive language-image pretraining (CLIP) model, for open-set fine-grained retrieval. PLEor could leverage pre-trained CLIP model to infer the discrepancies encompassing both pre-defined and unknown subcategories, called category-specific discrepancies, and transfer them to the backbone network trained in the close-set scenarios. To make pre-trained CLIP model sensitive to category-specific discrepancies, we design a dual prompt scheme to learn a vision prompt specifying the category-specific discrepancies, and turn random vectors with category names in a text prompt into category-specific discrepancy descriptions. Moreover, a vision-language evaluator is proposed to semantically align the vision and text prompts based on CLIP model, and reinforce each other. In addition, we propose an open-set knowledge transfer to transfer the category-specific discrepancies into the backbone network using knowledge distillation mechanism. A variety of quantitative and qualitative experiments show that our PLEor achieves promising performance on open-set fine-grained retrieval datasets.

Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation
Sarto, SaraandBarraco, ManueleandCornia, MarcellaandBaraldi, LorenzoandCucchiara, Rita



Research problem: How to effectively evaluate captions generated by vision-and-language architectures.
Motivation: Existing evaluation metrics do not adequately reflect human judgments of image and video captions, so new evaluation methods are needed.
Method: Propose PAC-S, a contrastive-learning-based evaluation metric that unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data.
Results: Experiments show that PAC-S correlates highly with human judgments across several datasets, outperforming reference-based metrics such as CIDEr and SPICE as well as reference-free metrics such as CLIP-Score.

The CLIP model has been recently proven to be very effective for a variety of cross-modal tasks, including the evaluation of captions generated from vision-and-language architectures. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S), that in a novel way unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data. Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos, outperforming existing reference-based metrics like CIDEr and SPICE and reference-free metrics like CLIP-Score. Finally, we test the system-level correlation of the proposed metric when considering popular image captioning approaches, and assess the impact of employing different cross-modal features. Our source code and trained models are publicly available at: https://github.com/aimagelab/pacscore.

Improving Visual Representation Learning Through Perceptual Understanding
Tukra, SamyakhandHoffman, FrederickandChatfield, Ken



Research problem: How to improve the image representations learned by pre-trained models.
Motivation: Existing masked autoencoders (MAE) fall short in image representation learning and need to better capture higher-level details within images.
Method: Propose a perceptual-similarity-based extension to MAE that combines a perceptual similarity term between generated and real images with techniques from the adversarial training literature, including multi-scale training and adaptive discriminator augmentation.
Results: Experiments show that the method not only improves pixel reconstruction but also performs better on downstream tasks, reaching 78.1% top-1 accuracy with linear probing on ImageNet-1K and up to 88.1% after fine-tuning, without any additional pre-trained models or data.

We present an extension to masked autoencoders (MAE) which improves on the representations learnt by the model by explicitly encouraging the learning of higher scene-level features. We do this by: (i) the introduction of a perceptual similarity term between generated and real images (ii) incorporating several techniques from the adversarial training literature including multi-scale training and adaptive discriminator augmentation. The combination of these results in not only better pixel reconstruction but also representations which appear to capture better higher-level details within images. More consequentially, we show how our method, Perceptual MAE, leads to better performance when used for downstream tasks outperforming previous methods. We achieve 78.1% top-1 accuracy linear probing on ImageNet-1K and up to 88.1% when fine-tuning, with similar results for other downstream tasks, all without use of additional pre-trained models or data.

AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning
Wang, Runqi and Duan, Xiaoyue and Kang, Guoliang and Liu, Jianzhuang and Lin, Shaohui and Xu, Songcen and Lü, Jinhu and Zhang, Baochang



Research question: This paper addresses the problem that conventional continual-learning models, when handling new classes or tasks, must incrementally expand class-specific classifier weights and store historical data to mitigate classifier bias and catastrophic forgetting.
Motivation: Existing continual learners gradually grow their parameters and typically require a rehearsal memory of past data.
Method: A non-incremental learner, named AttriCLIP, that incrementally extracts knowledge of new classes or tasks. AttriCLIP is built on the pre-trained vision-language model CLIP, whose image and text encoders are fixed to extract features from images and text prompts. Each text prompt consists of a category name plus a fixed number of learnable parameters selected from a designed attribute bank, which serve as attributes. Because classification is performed by computing visual-textual similarity, AttriCLIP is a non-incremental learner; the attribute prompts effectively mitigate catastrophic forgetting and avoid building a replay memory.
Results: Experiments show the method outperforms CLIP-based and previous state-of-the-art continual-learning methods in realistic settings with domain shift and long-sequence learning.

Continual learning aims to enable a model to incrementally learn knowledge from sequentially arrived data. Previous works adopt the conventional classification architecture, which consists of a feature extractor and a classifier. The feature extractor is shared across sequentially arrived tasks or classes, but one specific group of weights of the classifier corresponding to one new class should be incrementally expanded. Consequently, the parameters of a continual learner gradually increase. Moreover, as the classifier contains all historical arrived classes, a certain size of the memory is usually required to store rehearsal data to mitigate classifier bias and catastrophic forgetting. In this paper, we propose a non-incremental learner, named AttriCLIP, to incrementally extract knowledge of new classes or tasks. Specifically, AttriCLIP is built upon the pre-trained visual-language model CLIP. Its image encoder and text encoder are fixed to extract features from both images and text prompts. Each text prompt consists of a category name and a fixed number of learnable parameters which are selected from our designed attribute bank and serve as attributes. As we compute the visual and textual similarity for classification, AttriCLIP is a non-incremental learner. The attribute prompts, which encode the common knowledge useful for classification, can effectively mitigate the catastrophic forgetting and avoid constructing a replay memory. We empirically evaluate our AttriCLIP and compare it with CLIP-based and previous state-of-the-art continual learning methods in realistic settings with domain-shift and long-sequence learning. The results show that our method performs favorably against previous state-of-the-arts.
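Each classification prompt is assembled from learnable attribute tokens drawn from the bank plus the category name; classification then compares image features against the text features of these prompts. A schematic sketch of the prompt assembly only (the selection mechanism and the frozen CLIP encoders are abstracted away; names are illustrative):

```python
def build_prompt(attribute_bank, selected_ids, class_name):
    # prompt = [chosen learnable attribute tokens] + [category name];
    # only the attribute tokens in the bank are trained, CLIP stays frozen
    return [attribute_bank[i] for i in selected_ids] + [class_name]

bank = ["striped", "furry", "metallic"]   # stand-ins for learnable vectors
prompt = build_prompt(bank, [1, 0], "cat")
```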

Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-Commerce
Jin, Yang and Li, Yongzhi and Yuan, Zehuan and Mu, Yadong



Research question: This paper aims to establish a generic multi-modal foundation model that scales to massive downstream applications in E-commerce.
Motivation: Because of significant differences between natural and product images, directly applying existing frameworks to product-level representation in E-commerce inevitably yields sub-optimal results.
Method: An instance-centric multi-modal pretraining paradigm named ECLIP, which crafts a decoder architecture that introduces a set of learnable instance queries to explicitly aggregate instance-level semantics.
Results: Pretrained on 100 million E-commerce-related data, ECLIP extracts more generic, semantic-rich, and robust representations. Extensive experiments show that, without further fine-tuning, ECLIP surpasses existing methods by a large margin on a broad range of downstream tasks, demonstrating strong transferability to real-world E-commerce applications.

This paper aims to establish a generic multi-modal foundation model that has the scalable capability to massive downstream applications in E-commerce. Recently, large-scale vision-language pretraining approaches have achieved remarkable advances in the general domain. However, due to the significant differences between natural and product images, directly applying these frameworks for modeling image-level representations to E-commerce will be inevitably sub-optimal. To this end, we propose an instance-centric multi-modal pretraining paradigm called ECLIP in this work. In detail, we craft a decoder architecture that introduces a set of learnable instance queries to explicitly aggregate instance-level semantics. Moreover, to enable the model to focus on the desired product instance without reliance on expensive manual annotations, two specially configured pretext tasks are further proposed. Pretrained on the 100 million E-commerce-related data, ECLIP successfully extracts more generic, semantic-rich, and robust representations. Extensive experimental results show that, without further fine-tuning, ECLIP surpasses existing methods by a large margin on a broad range of downstream tasks, demonstrating the strong transferability to real-world E-commerce applications.

Mask3D: Pre-Training 2D Vision Transformers by Learning Masked 3D Priors
Hou, Ji and Dai, Xiaoliang and He, Zijian and Dai, Angela and Nießner, Matthias



Research question: How can 2D backbones more effectively understand 3D structural priors?
Motivation: Current 2D backbones (such as ViT and ResNets) fall short at handling 3D structural information.
Method: Mask3D leverages existing large-scale RGB-D data in self-supervised pre-training to embed 3D priors into learned 2D feature representations, formulating a pretext reconstruction task by masking RGB and depth patches in individual RGB-D frames.
Results: Experiments show Mask3D effectively embeds 3D priors into a strong 2D ViT backbone, improving representation learning for scene understanding tasks such as semantic segmentation, instance segmentation, and object detection. Mask3D notably outperforms existing self-supervised 3D pre-training approaches on ScanNet, NYUv2, and Cityscapes image understanding tasks, improving over the state-of-the-art Pri3D by +6.5% mIoU on ScanNet image semantic segmentation.

Current popular backbones in computer vision, such as Vision Transformers (ViT) and ResNets, are trained to perceive the world from 2D images. However, to more effectively understand 3D structural priors in 2D backbones, we propose Mask3D to leverage existing large-scale RGB-D data in a self-supervised pre-training to embed these 3D priors into 2D learned feature representations. In contrast to traditional 3D contrastive learning paradigms requiring 3D reconstructions or multi-view correspondences, our approach is simple: we formulate a pre-text reconstruction task by masking RGB and depth patches in individual RGB-D frames. We demonstrate that Mask3D is particularly effective in embedding 3D priors into the powerful 2D ViT backbone, enabling improved representation learning for various scene understanding tasks, such as semantic segmentation, instance segmentation and object detection. Experiments show that Mask3D notably outperforms existing self-supervised 3D pre-training approaches on ScanNet, NYUv2, and Cityscapes image understanding tasks, with an improvement of +6.5% mIoU against the state-of-the-art Pri3D on ScanNet image semantic segmentation.

Multimodal Prompting With Missing Modalities for Visual Recognition
Lee, Yi-Lun and Tsai, Yi-Hsuan and Chiu, Wei-Chen and Lee, Chen-Yu



Research question: This paper tackles two challenges in multimodal learning for visual recognition: 1) missing modalities occurring during training or testing in real-world situations; and 2) insufficient computation resources to finetune heavy transformer models.
Motivation: Real-world deployments frequently encounter missing-modality cases, and re-training a large multimodal transformer for each case is prohibitively expensive.
Method: Prompt learning is used to mitigate both challenges together: modality-missing-aware prompts can be plugged into multimodal transformers to handle general missing-modality cases, while requiring less than 1% learnable parameters compared to training the entire model.
Results: Extensive experiments show the prompt-learning framework improves performance under various missing-modality cases while alleviating the requirement of heavy model re-training.

In this paper, we tackle two challenges in multimodal learning for visual recognition: 1) when missing-modality occurs either during training or testing in real-world situations; and 2) when the computation resources are not available to finetune on heavy transformer models. To this end, we propose to utilize prompt learning and mitigate the above two challenges together. Specifically, our modality-missing-aware prompts can be plugged into multimodal transformers to handle general missing-modality cases, while only requiring less than 1% learnable parameters compared to training the entire model. We further explore the effect of different prompt configurations and analyze the robustness to missing modality. Extensive experiments are conducted to show the effectiveness of our prompt learning framework that improves the performance under various missing-modality cases, while alleviating the requirement of heavy model re-training. Code is available.

A-Cap: Anticipation Captioning With Commonsense Knowledge
Vo, Duc Minh and Luong, Quoc-An and Sugimoto, Akihiro and Nakayama, Hideki



Research question: How to generate a caption for an unseen oracle image from a sparsely temporally-ordered collection of images.
Motivation: Humans can anticipate the future from a few visual cues collected over time; to emulate this ability, a new task called Anticipation Captioning is introduced.
Method: A model named A-CAP that incorporates commonsense knowledge into a pre-trained vision-language model, allowing it to anticipate the caption.
Results: Through qualitative and quantitative evaluations on a customized visual storytelling dataset, A-CAP outperforms other image captioning methods and establishes a strong baseline for anticipation captioning, while also addressing the challenges inherent in this task.

Humans possess the capacity to reason about the future based on a sparse collection of visual cues acquired over time. In order to emulate this ability, we introduce a novel task called Anticipation Captioning, which generates a caption for an unseen oracle image using a sparsely temporally-ordered set of images. To tackle this new task, we propose a model called A-CAP, which incorporates commonsense knowledge into a pre-trained vision-language model, allowing it to anticipate the caption. Through both qualitative and quantitative evaluations on a customized visual storytelling dataset, A-CAP outperforms other image captioning methods and establishes a strong baseline for anticipation captioning. We also address the challenges inherent in this task.

Learning 3D Representations From 2D Pre-Trained Models via Image-to-Point Masked Autoencoders
Zhang, Renrui and Wang, Liuhui and Qiao, Yu and Gao, Peng and Li, Hongsheng



Research question: Given the scarcity of 3D datasets, how to obtain high-quality 3D representations from 2D pre-trained models.
Motivation: Existing 2D pre-trained models already learn image features well; their knowledge can guide 3D masked autoencoding to yield better 3D features.
Method: A method named I2P-MAE performs self-supervised pre-training via image-to-point masked autoencoders. Off-the-shelf 2D models first extract multi-view visual features of the input point cloud, followed by two image-to-point learning schemes: a 2D-guided masking strategy that keeps semantically important point tokens visible, and enforcing these visible tokens to reconstruct multi-view 2D features after the decoder, so the network effectively inherits high-level 2D semantics for discriminative 3D modeling.
Results: Without any fine-tuning, the frozen I2P-MAE achieves 93.4% accuracy on ModelNet40, competitive with existing fully trained methods. With further fine-tuning on ScanObjectNN's hardest split, I2P-MAE attains a state-of-the-art 90.11% accuracy, +3.68% over the second best, demonstrating superior transferability.

Pre-training by numerous image data has become de-facto for robust 2D representations. In contrast, due to the expensive data processing, a paucity of 3D datasets severely hinders the learning for high-quality 3D features. In this paper, we propose an alternative to obtain superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named as I2P-MAE. By self-supervised pre-training, we leverage the well learned 2D knowledge to guide 3D masked autoencoding, which reconstructs the masked point tokens with an encoder-decoder architecture. Specifically, we first utilize off-the-shelf 2D models to extract the multi-view visual features of the input point cloud, and then conduct two types of image-to-point learning schemes. For one, we introduce a 2D-guided masking strategy that maintains semantically important point tokens to be visible. Compared to random masking, the network can better concentrate on significant 3D structures with key spatial cues. For another, we enforce these visible tokens to reconstruct multi-view 2D features after the decoder. This enables the network to effectively inherit high-level 2D semantics for discriminative 3D modeling. Aided by our image-to-point pre-training, the frozen I2P-MAE, without any fine-tuning, achieves 93.4% accuracy for linear SVM on ModelNet40, competitive to existing fully trained methods. By further fine-tuning on ScanObjectNN's hardest split, I2P-MAE attains the state-of-the-art 90.11% accuracy, +3.68% to the second-best, demonstrating superior transferable capacity. Code is available at https://github.com/ZrrSkywalker/I2P-MAE.
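The 2D-guided masking scheme can be sketched as ranking point tokens by a saliency score derived from the 2D features and keeping only the top-scoring tokens visible. A minimal sketch under that reading (function and parameter names are illustrative, not the paper's):

```python
def saliency_guided_mask(tokens, saliency, keep_ratio):
    # keep the highest-saliency tokens visible; the rest are masked
    # (contrast with random masking, which ignores semantic importance)
    order = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    n_keep = max(1, round(len(tokens) * keep_ratio))
    visible_idx = sorted(order[:n_keep])
    return [tokens[i] for i in visible_idx], visible_idx
```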

OVTrack: Open-Vocabulary Multiple Object Tracking
Li, Siyuan and Fischer, Tobias and Ke, Lei and Ding, Henghui and Danelljan, Martin and Yu, Fisher



Research question: Existing multiple object tracking (MOT) methods rely on only a few pre-defined object categories and cannot cope with the multitude of objects encountered in the real world.
Motivation: To address this limitation, the authors pose the open-vocabulary MOT task and develop OVTrack, an open-vocabulary tracker capable of tracking arbitrary object classes.
Results in design rest on two key ingredients: leveraging vision-language models for both classification and association via knowledge distillation, and a data hallucination strategy for learning robust appearance features from denoising diffusion probabilistic models.
Results: Experiments show OVTrack sets a new state of the art on the large-scale, large-vocabulary TAO benchmark, while being trained solely on static images.

The ability to recognize, localize and track dynamic objects in a scene is fundamental to many real-world applications, such as self-driving and robotic systems. Yet, traditional multiple object tracking (MOT) benchmarks rely only on a few object categories that hardly represent the multitude of possible objects that are encountered in the real world. This leaves contemporary MOT methods limited to a small set of pre-defined object categories. In this paper, we address this limitation by tackling a novel task, open-vocabulary MOT, that aims to evaluate tracking beyond pre-defined training categories. We further develop OVTrack, an open-vocabulary tracker that is capable of tracking arbitrary object classes. Its design is based on two key ingredients: First, leveraging vision-language models for both classification and association via knowledge distillation; second, a data hallucination strategy for robust appearance feature learning from denoising diffusion probabilistic models. The result is an extremely data-efficient open-vocabulary tracker that sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark, while being trained solely on static images. The project page is at https://www.vis.xyz/pub/ovtrack/.

ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders
Woo, Sanghyun and Debnath, Shoubhik and Hu, Ronghang and Chen, Xinlei and Liu, Zhuang and Kweon, In So and Xie, Saining



Research question: How to advance visual recognition, particularly under the self-supervised learning setting.
Motivation: Although modern ConvNets (such as ConvNeXt models) perform strongly across application scenarios, naively combining them with self-supervised frameworks like masked autoencoders yields subpar performance.
Method: An efficient, fully convolutional masked autoencoder framework, together with an upgraded ConvNeXt architecture featuring a new Global Response Normalization (GRN) layer that enhances inter-channel feature competition.
Results: The new model family, ConvNeXt V2, significantly advances pure ConvNets' performance across recognition benchmarks including ImageNet classification, ADE20K segmentation, and COCO detection.

Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt models, have demonstrated strong performance across different application scenarios. Like many other architectures, ConvNeXt models were designed under the supervised learning setting with ImageNet labels. It is natural to expect ConvNeXt can also benefit from state-of-the-art self-supervised learning frameworks such as masked autoencoders (MAE), which was originally designed with Transformers. However, we show that simply combining the two designs yields subpar performance. In this paper, we develop an efficient and fully-convolutional masked autoencoder framework. We then upgrade the ConvNeXt architecture with a new Global Response Normalization (GRN) layer. GRN enhances inter-channel feature competition and is crucial for pre-training with masked input. The new model family, dubbed ConvNeXt V2, is a complete training recipe that synergizes both the architectural improvement and the advancement in self-supervised learning. With ConvNeXt V2, we are able to significantly advance pure ConvNets' performance across different recognition benchmarks including ImageNet classification, ADE20K segmentation and COCO detection. To accommodate different use cases, we provide pre-trained ConvNeXt V2 models of a wide range of complexity: from an efficient 3.7M-parameter Atto model that achieves 76.8% top-1 accuracy on ImageNet, to a 650M Huge model that can reach a state-of-the-art 88.9% accuracy using public training data only.
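The GRN layer works in three steps: aggregate a global per-channel statistic, normalize it across channels, and use it to recalibrate features with a residual path. A minimal sketch on plain Python lists following that description (the exact parameterization in the released model may differ):

```python
import math

def global_response_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    """x: one feature map as a list of channels, each a flat list of
    spatial activations. Returns the recalibrated feature map."""
    # 1) global feature aggregation: L2 norm of each channel over space
    g = [math.sqrt(sum(v * v for v in ch)) for ch in x]
    # 2) divisive normalization: relative importance across channels,
    #    which is what induces inter-channel feature competition
    mean_g = sum(g) / len(g)
    n = [gi / (mean_g + eps) for gi in g]
    # 3) feature calibration with learnable gamma/beta and a residual
    return [[gamma * v * ni + beta + v for v in ch] for ch, ni in zip(x, n)]
```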

Evolved Part Masking for Self-Supervised Learning
Feng, Zhanzhou and Zhang, Shiliang



Research question: Existing masked image modeling methods apply fixed mask patterns to guide self-supervised training, which limits their capability to model visual cues.
Motivation: This paper proposes an evolved part-based masking method to pursue more general visual-cue modeling in self-supervised learning.
Method: The method builds on an adaptive part partition module that leverages the vision model being trained to construct a part graph and partitions parts with graph cut. The accuracy of the partitioned parts keeps pace with the capability of the model being pre-trained, yielding mask patterns that evolve across training stages.
Results: Experiments show substantial gains on various tasks, including image classification, object detection, and semantic segmentation; for example, with the same training epochs it outperforms the recent MAE by 0.69% on ImageNet-1K classification and 1.61% on ADE20K segmentation.

Existing Masked Image Modeling methods apply fixed mask patterns to guide the self-supervised training. As those patterns resort to different criteria to mask local regions, sticking to a fixed pattern leads to limited vision cues modeling capability. This paper proposes an evolved part-based masking to pursue more general visual cues modeling in self-supervised learning. Our method is based on an adaptive part partition module, which leverages the vision model being trained to construct a part graph, and partitions parts with graph cut. The accuracy of partitioned parts is on par with the capability of the pre-trained model, leading to evolved mask patterns at different training stages. It generates simple patterns at the initial training stage to learn low-level visual cues, which hence evolves to eliminate accurate object parts to reinforce the learning of object semantics and contexts. Our method does not require extra pre-trained models or annotations, and effectively ensures the training efficiency by evolving the training difficulty. Experiment results show that it substantially boosts the performance on various tasks including image classification, object detection, and semantic segmentation. For example, it outperforms the recent MAE by 0.69% on ImageNet-1K classification and 1.61% on ADE20K segmentation with the same training epochs.

Learning Attention As Disentangler for Compositional Zero-Shot Learning
Hao, Shaozhe and Han, Kai and Wong, Kwan-Yee K.



Research question: How to learn visual concepts (attributes and objects) from seen compositions and combine concept knowledge into unseen compositions for compositional zero-shot learning (CZSL).
Motivation: Existing CZSL methods must learn attribute-object disentanglement, but do so imperfectly.
Method: Cross-attention is exploited as a compositional disentangler to learn disentangled concept embeddings, with a regularization applied at the attention level to further constrain the disentangler to learn the concepts of interest.
Results: Experiments on three CZSL benchmark datasets show the method significantly outperforms previous works in both closed- and open-world settings, establishing a new state of the art.

Compositional zero-shot learning (CZSL) aims at learning visual concepts (i.e., attributes and objects) from seen compositions and combining concept knowledge into unseen compositions. The key to CZSL is learning the disentanglement of the attribute-object composition. To this end, we propose to exploit cross-attentions as compositional disentanglers to learn disentangled concept embeddings. For example, if we want to recognize an unseen composition "yellow flower", we can learn the attribute concept "yellow" and object concept "flower" from different yellow objects and different flowers respectively. To further constrain the disentanglers to learn the concept of interest, we employ a regularization at the attention level. Specifically, we adapt the earth mover's distance (EMD) as a feature similarity metric in the cross-attention module. Moreover, benefiting from concept disentanglement, we improve the inference process and tune the prediction score by combining multiple concept probabilities. Comprehensive experiments on three CZSL benchmark datasets demonstrate that our method significantly outperforms previous works in both closed- and open-world settings, establishing a new state-of-the-art. Project page: https://haoosz.github.io/ade-czsl/

GeneCIS: A Benchmark for General Conditional Image Similarity
Vaze, Sagar and Carion, Nicolas and Misra, Ishan



Research question: How to measure and improve models' ability to adapt to different notions of conditional image similarity.
Motivation: Most representation learning methods, supervised or self-supervised, learn a fixed embedding function and hence implicitly assume a single notion of similarity, whereas a user may care about colors, textures, or specific elements of the scene.
Method: The GeneCIS benchmark, designed for zero-shot evaluation over an open set of similarity conditions, together with a simple, scalable solution based on automatically mining information from existing image-caption datasets.
Results: Baselines from powerful CLIP models struggle on GeneCIS, and benchmark performance is only weakly correlated with ImageNet accuracy; the proposed method offers a substantial boost over the baselines, further improves zero-shot performance on related image retrieval benchmarks, and, though evaluated zero-shot, surpasses state-of-the-art supervised models on MIT-States.

We argue that there are many notions of 'similarity' and that models, like humans, should be able to adapt to these dynamically. This contrasts with most representation learning methods, supervised or self-supervised, which learn a fixed embedding function and hence implicitly assume a single notion of similarity. For instance, models trained on ImageNet are biased towards object categories, while a user might prefer the model to focus on colors, textures or specific elements in the scene. In this paper, we propose the GeneCIS ('genesis') benchmark, which measures models' ability to adapt to a range of similarity conditions. Extending prior work, our benchmark is designed for zero-shot evaluation only, and hence considers an open-set of similarity conditions. We find that baselines from powerful CLIP models struggle on GeneCIS and that performance on the benchmark is only weakly correlated with ImageNet accuracy, suggesting that simply scaling existing methods is not fruitful. We further propose a simple, scalable solution based on automatically mining information from existing image-caption datasets. We find our method offers a substantial boost over the baselines on GeneCIS, and further improves zero-shot performance on related image retrieval benchmarks. In fact, though evaluated zero-shot, our model surpasses state-of-the-art supervised models on MIT-States.

Learning Semantic Relationship Among Instances for Image-Text Matching
Fu, Zheren and Mao, Zhendong and Song, Yan and Zhang, Yongdong



Research question: How to capture instance-level interactions among samples and modalities in image-text matching, so as to obtain better holistic embeddings.
Motivation: Existing image-text matching methods focus mainly on fragment-level relations, such as salient image regions or text words, while paying less attention to instance-level interactions among samples and modalities.
Method: A novel hierarchical relation modeling framework (HREM) that explicitly captures both fragment- and instance-level relations to learn discriminative and robust cross-modal embeddings.
Results: Extensive experiments on Flickr30K and MS-COCO show the method outperforms the state of the art by 4%-10% in terms of rSum.

Image-text matching, a bridge connecting image and language, is an important task, which generally learns a holistic cross-modal embedding to achieve a high-quality semantic alignment between the two modalities. However, previous studies only focus on capturing fragment-level relation within a sample from a particular modality, e.g., salient regions in an image or text words in a sentence, where they usually pay less attention to capturing instance-level interactions among samples and modalities, e.g., multiple images and texts. In this paper, we argue that sample relations could help learn subtle differences for hard negative instances, and that transferring shared knowledge to infrequent samples should be promising for obtaining better holistic embeddings. Therefore, we propose a novel hierarchical relation modeling framework (HREM), which explicitly captures both fragment- and instance-level relations to learn discriminative and robust cross-modal embeddings. Extensive experiments on Flickr30K and MS-COCO show our proposed method outperforms the state-of-the-art ones by 4%-10% in terms of rSum.

Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models
Schramowski, Patrick and Brack, Manuel and Deiseroth, Björn and Kersting, Kristian



Research question: This paper addresses the degenerated and biased behavior that text-conditioned image generation models can acquire during training.
Motivation: Because these models rely on billion-sized datasets randomly scraped from the internet, they can exhibit degenerated and biased human behavior, and may even reinforce such biases.
Method: Safe latent diffusion (SLD) is proposed; to measure the inappropriate degeneration caused by unfiltered and imbalanced training sets, the authors establish a novel image generation test bed, inappropriate image prompts (I2P).
Results: Experiments show the introduced SLD removes and suppresses inappropriate image parts during the diffusion process, with no additional training required and no adverse effect on overall image quality or text alignment.

Text-conditioned image generation models have recently achieved astonishing results in image quality and text alignment and are consequently employed in a fast-growing number of applications. Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the internet, they also suffer, as we demonstrate, from degenerated and biased human behavior. In turn, they may even reinforce such biases. To help combat these undesired side effects, we present safe latent diffusion (SLD). Specifically, to measure the inappropriate degeneration due to unfiltered and imbalanced training sets, we establish a novel image generation test bed - inappropriate image prompts (I2P) - containing dedicated, real-world image-to-text prompts covering concepts such as nudity and violence. As our exhaustive empirical evaluation demonstrates, the introduced SLD removes and suppresses inappropriate image parts during the diffusion process, with no additional training required and no adverse effect on overall image quality or text alignment.
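At sampling time, the core idea is to modify classifier-free guidance with an extra term that steers the denoising direction away from an inappropriate-concept conditioning. A simplified vector sketch of that idea, assuming fixed scales (the actual SLD schedule adds warm-up and momentum terms not shown here):

```python
def safe_guidance(eps_uncond, eps_text, eps_unsafe, scale, safety_scale):
    # classifier-free guidance toward the text prompt, minus a term
    # pointing toward the unsafe-concept direction
    return [u + scale * (t - u) - safety_scale * (s - u)
            for u, t, s in zip(eps_uncond, eps_text, eps_unsafe)]
```

With `safety_scale = 0` this reduces to standard classifier-free guidance, which is why SLD needs no additional training.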

Visual Prompt Tuning for Generative Transfer Learning
Sohn, Kihyuk and Chang, Huiwen and Lezama, José and Polania, Luisa and Zhang, Han and Hao, Yuan and Essa, Irfan and Jiang, Lu



Research question: How to efficiently learn generative image models across different domains.
Motivation: Learning vision transformers for new domains calls for transferring knowledge from an image synthesis model trained on a large dataset.
Method: A framework based on generative vision transformers, which represent an image as a sequence of visual tokens with autoregressive or non-autoregressive transformers. To adapt to a new domain, prompt tuning prepends learnable tokens, called prompts, to the image token sequence, with a new prompt design introduced for the task.
Results: Studies across a variety of visual domains show effective knowledge transfer and significantly better image generation quality.

Learning generative image models from various domains efficiently needs transferring knowledge from an image synthesis model trained on a large dataset. We present a recipe for learning vision transformers by generative knowledge transfer. We base our framework on generative vision transformers representing an image as a sequence of visual tokens with the autoregressive or non-autoregressive transformers. To adapt to a new domain, we employ prompt tuning, which prepends learnable tokens called prompts to the image token sequence and introduces a new prompt design for our task. We study a variety of visual domains with varying amounts of training images. We show the effectiveness of knowledge transfer and a significantly better image generation quality. Code is available at https://github.com/google-research/generative_transfer.
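Prompt tuning keeps the transformer frozen and trains only a small set of prompt tokens prepended to the input sequence. A schematic sketch of the sequence construction (token vectors are plain lists here; the names are illustrative):

```python
def prepend_prompts(prompt_tokens, image_tokens):
    # the model consumes [prompts; image tokens]; during transfer only
    # the prompt vectors receive gradient updates, the backbone is frozen
    return list(prompt_tokens) + list(image_tokens)

prompts = [[0.1, 0.2], [0.3, 0.4]]   # 2 learnable prompt vectors
tokens = [[1.0, 0.0]] * 16           # 16 visual tokens of one image
sequence = prepend_prompts(prompts, tokens)
```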

OmniMAE: Single Model Masked Pretraining on Images and Videos
Girdhar, Rohit and El-Nouby, Alaaeldin and Singh, Mannat and Alwala, Kalyan Vasudev and Joulin, Armand and Misra, Ishan



Research question: How to train a single model that handles multiple visual modalities, such as images and videos?
Motivation: Prior approaches either design separate models per modality or use architectures tailored for vision tasks, obtaining worse performance than single-modality models.
Method: Masked autoencoding is used to train a simple Vision Transformer on images and videos without any labeled data; this single model learns visual representations comparable to or better than single-modality representations on both image and video benchmarks.
Results: Experiments show the model can be finetuned to 86.6% accuracy on ImageNet and 75.5% on the challenging Something Something-v2 video benchmark, setting a new state of the art.

Transformer-based architectures have become competitive across a variety of visual domains, most notably images and videos. While prior work studies these modalities in isolation, having a common architecture suggests that one can train a single unified model for multiple visual modalities. Prior attempts at unified modeling typically use architectures tailored for vision tasks, or obtain worse performance compared to single modality models. In this work, we show that masked autoencoding can be used to train a simple Vision Transformer on images and videos, without requiring any labeled data. This single model learns visual representations that are comparable to or better than single-modality representations on both image and video benchmarks, while using a much simpler architecture. Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training of huge model architectures. In particular, we show that our single ViT-Huge model can be finetuned to achieve 86.6% on ImageNet and 75.5% on the challenging Something Something-v2 video benchmark, setting a new state-of-the-art.
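Dropping 90% of the image patches (95% for video) means the encoder only processes the small visible subset, which is where the fast training of huge architectures comes from. A sketch of random patch masking under those ratios (the ratios are from the abstract; the implementation details are assumptions):

```python
import random

def random_mask(num_patches, mask_ratio, seed=0):
    # choose which patch indices stay visible to the encoder;
    # the masked indices are only reconstructed by the decoder
    rng = random.Random(seed)
    n_keep = max(1, round(num_patches * (1 - mask_ratio)))
    visible = sorted(rng.sample(range(num_patches), n_keep))
    masked = [i for i in range(num_patches) if i not in set(visible)]
    return visible, masked
```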

Visual Atoms: Pre-Training Vision Transformers With Sinusoidal Waves
Takashima, Sora and Hayamizu, Ryo and Inoue, Nakamasa and Kataoka, Hirokatsu and Yokota, Rio



Research question: Why can contour-oriented synthetic datasets achieve the same accuracy as real datasets when pre-training vision transformers?
Motivation: Although the contour-oriented synthetic dataset ExFractalDB-21k was shown to exceed the pre-training effect of ImageNet-21k, the lack of a systematic investigation of its design space leaves much room for skepticism.
Method: A novel methodology based on circular harmonics for systematically investigating the design space of contour-oriented synthetic datasets, enabling an efficient search of the optimal range of FDSL parameters and maximizing the variety of synthetic images in the dataset.
Results: When the resulting VisualAtom-21k dataset is used to pre-train ViT-Base, top-1 accuracy reaches 83.7% when fine-tuning on ImageNet-1k, only 0.5% below the 84.2% achieved by JFT-300M pre-training. Moreover, unlike the static JFT-300M dataset, the quality of synthetic datasets will continue to improve.

Formula-driven supervised learning (FDSL) has been shown to be an effective method for pre-training vision transformers, where ExFractalDB-21k was shown to exceed the pre-training effect of ImageNet-21k. These studies also indicate that contours mattered more than textures when pre-training vision transformers. However, the lack of a systematic investigation as to why these contour-oriented synthetic datasets can achieve the same accuracy as real datasets leaves much room for skepticism. In the present work, we develop a novel methodology based on circular harmonics for systematically investigating the design space of contour-oriented synthetic datasets. This allows us to efficiently search the optimal range of FDSL parameters and maximize the variety of synthetic images in the dataset, which we found to be a critical factor. When the resulting new dataset VisualAtom-21k is used for pre-training ViT-Base, the top-1 accuracy reached 83.7% when fine-tuning on ImageNet-1k. This is only 0.5% difference from the top-1 accuracy (84.2%) achieved by the JFT-300M pre-training, even though the scale of images is 1/14. Unlike JFT-300M which is a static dataset, the quality of synthetic datasets will continue to improve, and the current work is a testament to this possibility. FDSL is also free of the common issues associated with real images, e.g. privacy/copyright issues, labeling costs/errors, and ethical biases.
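A contour built from circular harmonics perturbs a base circle's radius with sinusoids of different frequencies; sweeping frequencies, amplitudes, and phases is one way to span the kind of design space the paper investigates. A minimal sketch of sampling one such contour (parameter names are illustrative, not the paper's generator):

```python
import math

def harmonic_contour(n_points, base_radius, harmonics):
    """Sample (x, y) points of a closed contour with radius
    r(t) = base_radius + sum_k a_k * sin(k * t + phi_k),
    where harmonics is a list of (k, a_k, phi_k) triples."""
    points = []
    for i in range(n_points):
        t = 2.0 * math.pi * i / n_points
        r = base_radius + sum(a * math.sin(k * t + phi)
                              for (k, a, phi) in harmonics)
        points.append((r * math.cos(t), r * math.sin(t)))
    return points
```

With an empty harmonic list this degenerates to a circle; adding terms yields the wavy, atom-like contours the dataset name alludes to.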

Masked Autoencoding Does Not Help Natural Language Supervision at Scale
Weers, Floris and Shankar, Vaishaal and Katharopoulos, Angelos and Yang, Yinfei and Gunter, Tom



Research question: This paper studies the effectiveness of combining self supervision and natural language supervision for large-scale image-text training.
Motivation: Although works such as M3AE and SLIP suggest the two approaches can be effectively combined, their results rest on small (<20M) pre-training datasets and do not reflect the commonly used large-scale regime (>100M samples).
Method: Joint training with two state-of-the-art approaches, masked autoencoders (MAE) and contrastive language-image pre-training (CLIP), evaluated on datasets of two different scales.
Results: The combination improves over CLIP alone when trained on 11.3M image-text pairs, but provides little to no benefit over CLIP when trained on 1.4B images, offering some much-needed clarity on self supervision for large-scale image-text training.

Self supervision and natural language supervision have emerged as two exciting ways to train general purpose image encoders which excel at a variety of downstream tasks. Recent works such as M3AE (Geng et al 2022) and SLIP (Mu et al 2022) have suggested that these approaches can be effectively combined, but most notably their results use small (<20M examples) pre-training datasets and don't effectively reflect the large-scale regime (>100M samples) that is commonly used for these approaches. Here we investigate whether a similar approach can be effective when trained with a much larger amount of data. We find that a combination of two state of the art approaches: masked auto-encoders, MAE (He et al 2021) and contrastive language image pre-training, CLIP (Radford et al 2021) provides a benefit over CLIP when trained on a corpus of 11.3M image-text pairs, but little to no benefit (as evaluated on a suite of common vision tasks) over CLIP when trained on a large corpus of 1.4B images. Our work provides some much needed clarity into the effectiveness (or lack thereof) of self supervision for large-scale image-text training.

Doubly Right Object Recognition: A Why Prompt for Visual Rationales
Mao, Chengzhi and Teotia, Revant and Sundar, Amrutha and Menon, Sachit and Yang, Junfeng and Wang, Xin and Vondrick, Carl



Research question: This paper investigates whether computer vision models can provide correct rationales for their predictions.
Motivation: Current visual recognition models are evaluated only on classification accuracy, ignoring whether the rationales behind their predictions are correct.
Method: A "doubly right" object recognition benchmark requiring models to produce both the right labels and the right rationales; by transferring rationales from language models into visual representations through a tailored dataset, a "why prompt" is learned that adapts large visual representations to produce correct rationales.
Results: Experiments show the prompts significantly improve performance on doubly right object recognition, and also transfer zero-shot to unseen tasks and datasets.

Many visual recognition models are evaluated only on their classification accuracy, a metric for which they obtain strong performance. In this paper, we investigate whether computer vision models can also provide correct rationales for their predictions. We propose a "doubly right" object recognition benchmark, where the metric requires the model to simultaneously produce both the right labels as well as the right rationales. We find that state-of-the-art visual models, such as CLIP, often provide incorrect rationales for their categorical predictions. However, by transferring the rationales from language models into visual representations through a tailored dataset, we show that we can learn a "why prompt," which adapts large visual representations to produce correct rationales. Visualizations and empirical experiments show that our prompts significantly improve performance on doubly right object recognition, in addition to zero-shot transfer to unseen tasks and datasets.

GLIGEN: Open-Set Grounded Text-to-Image Generation
Li, Yuheng and Liu, Haotian and Wu, Qingyang and Mu, Fangzhou and Yang, Jianwei and Gao, Jianfeng and Li, Chunyuan and Lee, Yong Jae



Research question: This paper addresses the limited controllability of existing text-to-image diffusion models, which accept text input alone.
Motivation: Pre-trained text-to-image diffusion models use only text input, which impedes controllability.
Method: GLIGEN builds upon and extends pre-trained text-to-image diffusion models by enabling them to be conditioned on grounding inputs, freezing the pre-trained weights and injecting the grounding information into new trainable layers via a gated mechanism.
Results: GLIGEN achieves open-world grounded text-to-image generation with caption and bounding box condition inputs, and its zero-shot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.

Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN: Open-Set Grounded Text-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.
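The gated mechanism lets the new trainable layers start as an identity: their output is scaled by the tanh of a learnable scalar initialized at zero, so the frozen model's behavior is preserved at the start of training. A schematic sketch of such a gated residual injection (a simplification of GLIGEN's gated self-attention layers):

```python
import math

def gated_injection(features, grounding_out, gate):
    # residual injection of the new layer's output through a tanh gate;
    # gate = 0 (its initial value) leaves the pre-trained features intact
    g = math.tanh(gate)
    return [x + g * f for x, f in zip(features, grounding_out)]
```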

Q: How To Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!
Khan, Zaid and BG, Vijay Kumar and Schulter, Samuel and Yu, Xiang and Fu, Yun and Chandraker, Manmohan



Research question: How to finetune large vision-language models for visual question answering on specialized tasks or non-natural-image domains.
Motivation: Datasets for such tasks or domains are orders of magnitude smaller than general-purpose VQA datasets, and collecting additional labels is challenging, yet unlabeled images are often plentiful.
Method: SelTDA (Self-Taught Data Augmentation) uses the VLM and target dataset to build a teacher model that generates question-answer pseudolabels conditioned directly on an image alone, allowing unlabeled images to be pseudolabeled; the initial VLM is then finetuned on the original dataset augmented with the freshly pseudolabeled images.
Results: Experiments show this self-taught data augmentation increases robustness to adversarially searched questions, counterfactual examples, and rephrasings, improves domain generalization, and better retains numerical reasoning skills. The strategy requires no additional annotations or architectural modifications and is compatible with any modern encoder-decoder multimodal transformer.

Finetuning a large vision language model (VLM) on a target dataset after large scale pretraining is a dominant paradigm in visual question answering (VQA). Datasets for specialized tasks such as knowledge-based VQA or VQA in non natural-image domains are orders of magnitude smaller than those for general-purpose VQA. While collecting additional labels for specialized tasks or domains can be challenging, unlabeled images are often available. We introduce SelTDA (Self-Taught Data Augmentation), a strategy for finetuning large VLMs on small-scale VQA datasets. SelTDA uses the VLM and target dataset to build a teacher model that can generate question-answer pseudolabels directly conditioned on an image alone, allowing us to pseudolabel unlabeled images. SelTDA then finetunes the initial VLM on the original dataset augmented with freshly pseudolabeled images. We describe a series of experiments showing that our self-taught data augmentation increases robustness to adversarially searched questions, counterfactual examples, and rephrasings, it improves domain generalization, and results in greater retention of numerical reasoning skills. The proposed strategy requires no additional annotations or architectural modifications, and is compatible with any modern encoder-decoder multimodal transformer. Code available at https://github.com/codezakh/SelTDA
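The two-stage recipe can be sketched as: generate question-answer pseudolabels for unlabeled images with the teacher, then finetune on the union of original and pseudolabeled data. Toy callables stand in for the real teacher model and finetuning routine:

```python
def self_taught_augmentation(teacher, finetune, labeled, unlabeled_images):
    # stage 1: pseudolabel each unlabeled image with a generated Q-A pair
    pseudo = [(img, teacher(img)) for img in unlabeled_images]
    # stage 2: finetune the initial VLM on original + pseudolabeled data
    return finetune(labeled + pseudo)

# toy stand-ins: the "teacher" invents a Q-A pair, "finetuning" just
# reports the size of the augmented training set
toy_teacher = lambda img: ("what is shown?", "label for " + img)
toy_finetune = len
augmented_size = self_taught_augmentation(
    toy_teacher, toy_finetune, [("a.jpg", ("q", "a"))], ["b.jpg", "c.jpg"])
```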

CNVid-3.5M: Build, Filter, and Pre-Train the Large-Scale Public Chinese Video-Text Dataset
Gan, Tian and Wang, Qing and Dong, Xingning and Ren, Xiangyuan and Nie, Liqiang and Guo, Qingpei



Research question: Research on Chinese video-text pre-training is constrained by the lack of large-scale public datasets and benchmarks.
Motivation: Existing large-scale English video-text datasets cannot serve Chinese video-text pre-training, while existing Chinese pre-trained models rely on private datasets, limiting research and practical applications.
Method: Build CNVid-3.5M, a public cross-modal dataset of over 3.5M Chinese video-text pairs, improving data quality by filtering out weakly-paired videos. Three mainstream pixel-level pre-training architectures are benchmarked, and a Hard Sample Curriculum Learning strategy is proposed to boost pre-training performance.
Results: CNVid-3.5M is the largest public Chinese video-text dataset to date and provides the first pixel-level benchmarks for Chinese video-text pre-training.

Owing to well-designed large-scale video-text datasets, recent years have witnessed tremendous progress in video-text pre-training. However, existing large-scale video-text datasets are mostly English-only. Though there are certain methods studying Chinese video-text pre-training, they pre-train their models on private datasets whose videos and text are unavailable. This lack of large-scale public datasets and benchmarks in Chinese hampers the research and downstream applications of Chinese video-text pre-training. Towards this end, we release and benchmark CNVid-3.5M, a large-scale public cross-modal dataset containing over 3.5M Chinese video-text pairs. We summarize our contributions by three verbs, i.e., "Build", "Filter", and "Pre-train": 1) To build a public Chinese video-text dataset, we collect over 4.5M videos from Chinese websites. 2) To improve the data quality, we propose a novel method to filter out 1M weakly-paired videos, resulting in the CNVid-3.5M dataset. 3) We benchmark CNVid-3.5M with three mainstream pixel-level pre-training architectures. Finally, we propose the Hard Sample Curriculum Learning strategy to promote the pre-training performance. To the best of our knowledge, CNVid-3.5M is the largest public video-text dataset in Chinese, and we provide the first pixel-level benchmarks for Chinese video-text pre-training. The dataset, codebase, and pre-trained models are available at https://github.com/CNVid/CNVid-3.5M.

Efficient Multimodal Fusion via Interactive Prompting
Li, Yaowei and Quan, Ruijie and Zhu, Linchao and Yang, Yi



Research question: How to reduce the computational cost of fine-tuning large-scale pre-trained multimodal models while improving their efficiency and flexibility.
Motivation: As multimodal models grow larger, the cost of fine-tuning them for downstream tasks grows too, creating an urgent need for an effective and flexible way to reduce it.
Method: Propose PMF, an efficient and flexible multimodal fusion method tailored for unimodally pre-trained transformers. It builds a modular multimodal fusion framework that enhances interaction between modalities, disentangles vanilla prompts into three types to learn different optimization objectives for multimodal learning, and adds prompt vectors only on the deep layers of the unimodal transformers, significantly reducing training memory usage.
Results: Experiments show performance comparable to several other multimodal fine-tuning methods with less than 3% trainable parameters and up to 66% savings in training memory usage.

Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era. Following this trend, the size of multimodal learning models constantly increases, leading to an urgent need to reduce the massive computational cost of fine-tuning these models for downstream tasks. In this paper, we propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pretrained transformers. Specifically, we first present a modular multimodal fusion framework that exhibits high flexibility and facilitates mutual interactions among different modalities. In addition, we disentangle vanilla prompts into three types in order to learn different optimizing objectives for multimodal learning. It is also worth noting that we propose to add prompt vectors only on the deep layers of the unimodal transformers, thus significantly reducing the training memory usage. Experiment results show that our proposed method achieves comparable performance to several other multimodal finetuning methods with less than 3% trainable parameters and up to 66% saving of training memory usage.
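The idea of attaching prompt vectors only to the deep layers can be sketched as below; the layer and prompt interfaces are invented for illustration, and real PMF operates on two unimodal transformers with three prompt types rather than a single stack:

```python
def forward_with_deep_prompts(tokens, layers, prompts, fusion_start):
    """Run a stack of transformer 'layers', prepending prompt vectors
    only from layer index `fusion_start` onward. Shallow layers run
    prompt-free, which is what saves training memory: no prompt
    gradients flow through the early layers.
    """
    for i, layer in enumerate(layers):
        if i >= fusion_start:
            tokens = prompts[i - fusion_start] + tokens  # prepend prompts
        tokens = layer(tokens)
    return tokens

identity = lambda t: t
layers = [identity] * 4
prompts = [["p2"], ["p3"]]          # prompts exist only for layers 2 and 3
out = forward_with_deep_prompts(["x"], layers, prompts, fusion_start=2)
assert out == ["p3", "p2", "x"]     # shallow layers saw no prompts
```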

Language Adaptive Weight Generation for Multi-Task Visual Grounding
Su, Wei and Miao, Peihan and Dou, Huanzhang and Wang, Gaoang and Qiao, Liang and Li, Zheyang and Li, Xi



Research question: How to improve the performance of visual backbones on visual grounding tasks.
Motivation: Current visual backbones extract features passively, which can cause missing and redundant feature matches and limits further performance gains.
Method: Propose VG-LAW, an active-perception visual grounding framework based on language-adaptive weights, which lets the visual backbone dynamically generate weights for different expressions and serve as an expression-specific feature extractor.
Results: Experiments show that VG-LAW extracts specific and relevant visual features without additional cross-modal interaction modules and achieves state-of-the-art performance on four representative datasets.

Despite the impressive performance in visual grounding, the prevailing approaches usually exploit the visual backbone in a passive way, i.e., the visual backbone extracts features with fixed weights without expression-related hints. The passive perception may lead to mismatches (e.g., redundant and missing features), limiting further performance improvement. Ideally, the visual backbone should actively extract visual features since the expressions already provide the blueprint of desired visual features. The active perception can take expressions as priors to extract relevant visual features, which can effectively alleviate the mismatches. Inspired by this, we propose an active perception Visual Grounding framework based on Language Adaptive Weights, called VG-LAW. The visual backbone serves as an expression-specific feature extractor through dynamic weights generated for various expressions. Benefiting from the specific and relevant visual features extracted from the language-aware visual backbone, VG-LAW does not require additional modules for cross-modal interaction. Along with a neat multi-task head, VG-LAW can be competent in referring expression comprehension and segmentation jointly. Extensive experiments on four representative datasets, i.e., RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame, validate the effectiveness of the proposed framework and demonstrate state-of-the-art performance.

Indescribable Multi-Modal Spatial Evaluator
Kong, Lingke and Qi, X. Sharon and Shen, Qijin and Wang, Jiacheng and Zhang, Jingyi and Hu, Yanle and Zhou, Qichao



Research question: A major challenge of multi-modal image registration is that images from different imaging machines have different imaging distributions, making it hard to focus solely on the spatial aspects of the images while ignoring distribution differences.
Motivation: To address this, a self-supervised method, the Indescribable Multi-Modal Spatial Evaluator (IMSE), is developed for multi-modal image registration.
Method: IMSE builds an accurate multi-modal spatial evaluator that measures spatial differences between two images and optimizes registration by minimizing the error predicted by the evaluator. To improve IMSE, a new style-augmentation method, Shuffle Remap, randomly divides the image distribution into multiple segments and then randomly shuffles and remaps them, changing the original image's distribution.
Results: Experiments show that IMSE outperforms existing registration methods on T1-T2 and CT-MRI datasets. IMSE also integrates easily into traditional registration pipelines, offers a convenient way to evaluate and visualize registration results, and has the potential to become a new image-to-image translation paradigm.

Multi-modal image registration spatially aligns two images with different distributions. One of its major challenges is that images acquired from different imaging machines have different imaging distributions, making it difficult to focus only on the spatial aspect of the images and ignore differences in distributions. In this study, we developed a self-supervised approach, Indescribable Multi-Modal Spatial Evaluator (IMSE), to address multi-modal image registration. IMSE creates an accurate multi-modal spatial evaluator to measure spatial differences between two images, and then optimizes registration by minimizing the error predicted by the evaluator. To optimize IMSE performance, we also proposed a new style enhancement method called Shuffle Remap which randomizes the image distribution into multiple segments, and then randomly disorders and remaps these segments, so that the distribution of the original image is changed. Shuffle Remap can help IMSE to predict the difference in spatial location from unseen target distributions. Our results show that IMSE outperformed the existing methods for registration using T1-T2 and CT-MRI datasets. IMSE also can be easily integrated into the traditional registration process, and can provide a convenient way to evaluate and visualize registration results. IMSE also has the potential to be used as a new paradigm for image-to-image translation. Our code is available at https://github.com/Kid-Liet/IMSE.
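Shuffle Remap can be illustrated as follows, under the simplifying assumptions (ours, not the paper's) of scalar pixel intensities in [0, 1) split into equal-width segments:

```python
import random

def shuffle_remap(pixels, n_segments=4, seed=0):
    """Shuffle Remap sketch: split the intensity range [0, 1) into
    equal segments, permute the segments, and move each pixel's value
    into its segment's new location. The distribution changes, but
    within-segment spatial structure is preserved.
    """
    rng = random.Random(seed)
    order = list(range(n_segments))
    rng.shuffle(order)                      # random segment permutation
    width = 1.0 / n_segments
    out = []
    for p in pixels:
        seg = min(int(p / width), n_segments - 1)
        offset = p - seg * width            # position inside the segment
        out.append(order[seg] * width + offset)
    return out

pixels = [0.05, 0.30, 0.55, 0.80]
remapped = shuffle_remap(pixels)
# Segments moved, but each pixel kept its offset within its segment.
assert sorted(round(r % 0.25, 2) for r in remapped) == [0.05] * 4
```

Training the evaluator against many such remapped styles is what lets it generalize to unseen target distributions.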

ImageBind: One Embedding Space To Bind Them All
Girdhar, Rohit and El-Nouby, Alaaeldin and Liu, Zhuang and Singh, Mannat and Alwala, Kalyan Vasudev and Joulin, Armand and Misra, Ishan



Research question: How to learn a joint embedding across six modalities (image, text, audio, depth, thermal, and IMU data)?
Motivation: Existing approaches require all pairwise combinations of data to train such a joint embedding, whereas this method needs only image-paired data.
Method: Propose ImageBind, which leverages recent large-scale vision-language models and extends their zero-shot capabilities to new modalities through each modality's natural pairing with images.
Results: Experiments show that ImageBind enables strong emergent applications such as cross-modal retrieval, composing modalities with arithmetic, and cross-modal detection and generation, and surpasses specialist supervised models on zero-shot recognition across modalities. Strong few-shot recognition results further show that ImageBind offers a new way to evaluate vision models on visual and non-visual tasks.

We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box' including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
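The "binding" effect, alignment between two modalities that were never paired directly but were each aligned to images, can be illustrated with toy embeddings (all vectors below are invented for the example, not ImageBind outputs):

```python
def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

# Toy shared space: audio and text were each aligned to IMAGES only,
# yet audio-text similarity emerges because both sit near the same image.
image_dog = [1.0, 0.1]
audio_bark = [0.9, 0.2]     # trained to match image_dog
text_dog = [0.95, 0.05]     # trained to match image_dog
text_car = [0.0, 1.0]

# Emergent zero-shot audio -> text retrieval, with no audio-text pairs.
best = max([text_dog, text_car], key=lambda t: cosine(audio_bark, t))
assert best == text_dog
```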

On Data Scaling in Masked Image Modeling
Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Wei, Yixuan and Dai, Qi and Hu, Han



Research question: This paper systematically studies the scaling behavior of masked image modeling (MIM) through extensive experiments, to dispel the preconception that MIM cannot benefit from large-scale data.
Motivation: Although the success of pre-trained language models has encouraged large-scale self-supervised pre-training and endowed models with significant capabilities, scaling properties seem to have been unintentionally neglected in recent studies.
Method: Systematically study MIM's scaling behavior through extensive experiments with data ranging from 10% of ImageNet-1K to full ImageNet-22K, model sizes from 49 million to one billion parameters, and training lengths from 125K to 500K iterations.
Results: The main findings are twofold: 1) MIM still demands large-scale data in order to scale up compute and model parameters; 2) MIM cannot benefit from more data under a non-overfitting scenario, diverging from previous observations on self-supervised language models and supervised vision models. Several intriguing properties of MIM are also revealed, such as the high sample efficiency of large MIM models and the strong correlation between pre-training validation loss and transfer performance.

Scaling properties have been one of the central issues in self-supervised pre-training, especially the data scalability, which has successfully motivated the large-scale self-supervised pre-trained language models and endowed them with significant modeling capabilities. However, scaling properties seem to be unintentionally neglected in the recent trending studies on masked image modeling (MIM), and some arguments even suggest that MIM cannot benefit from large-scale data. In this work, we try to break down these preconceptions and systematically study the scaling behaviors of MIM through extensive experiments, with data ranging from 10% of ImageNet-1K to full ImageNet-22K, model parameters ranging from 49-million to one-billion, and training length ranging from 125K to 500K iterations. Our main findings are twofold: 1) masked image modeling still demands large-scale data in order to scale up compute and model parameters; 2) masked image modeling cannot benefit from more data under a non-overfitting scenario, which diverges from the previous observations in self-supervised pre-trained language models or supervised pre-trained vision models. In addition, we reveal several intriguing properties in MIM, such as high sample efficiency in large MIM models and strong correlation between pre-training validation loss and transfer performance. We hope that our findings could deepen the understanding of masked image modeling and facilitate future developments on large-scale vision models. Code and models will be available at https://github.com/microsoft/SimMIM.

Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding
Shaharabany, Tal and Wolf, Lior



Research question: How to refine a phrase grounding model by considering self-similarity maps extracted from the latent representation of the model's image encoder.
Motivation: Existing phrase grounding models leave a sizeable gap to the state of the art on weakly supervised phrase grounding, with a similar gap on the text-free WWbL task.
Method: Propose an effective refinement approach that combines self-similarity maps extracted from the latent representation of the model's image encoder to obtain useful pseudo-labels for self-training.
Results: Experiments show the method surpasses the state of the art by a large margin on both weakly supervised phrase grounding and the text-free WWbL task.

A phrase grounding model receives an input image and a text phrase and outputs a suitable localization map. We present an effective way to refine a phrase grounding model by considering self-similarity maps extracted from the latent representation of the model's image encoder. Our main insights are that these maps resemble localization maps and that by combining such maps, one can obtain useful pseudo-labels for performing self-training. Our results surpass, by a large margin, the state-of-the-art in weakly supervised phrase grounding. A similar gap in performance is obtained for a recently proposed downstream task called WWbL, in which the input image is given without any text. Our code is available as supplementary.

Turning a CLIP Model Into a Scene Text Detector
Yu, Wenwen and Liu, Yuliang and Hua, Wei and Jiang, Deqiang and Ren, Bo and Bai, Xiang



Research question: This paper proposes TCM, a new method that turns a pre-trained CLIP model directly into a text detector without a pretraining process.
Motivation: Existing pretraining approaches based on vision-language models have made effective progress in text detection; in contrast, TCM focuses on directly leveraging the CLIP model for text detection.
Method: Turn the CLIP model into existing scene text detection methods, thereby improving existing detectors. The method enables few-shot training, e.g., using 10% of the labeled data significantly improves the performance of baseline methods.
Results: Experiments show an average F-measure improvement of 22% across 4 benchmarks. Turning the CLIP model into existing scene text detection methods also yields promising domain adaptation ability.

The recent large-scale Contrastive Language-Image Pretraining (CLIP) model has shown great potential in various downstream tasks via leveraging the pretrained vision and language knowledge. Scene text, which contains rich textual and visual information, has an inherent connection with a model like CLIP. Recently, pretraining approaches based on vision language models have made effective progress in the field of text detection. In contrast to these works, this paper proposes a new method, termed TCM, focusing on Turning the CLIP Model directly for text detection without a pretraining process. We demonstrate the advantages of the proposed TCM as follows: (1) The underlying principle of our framework can be applied to improve existing scene text detectors. (2) It facilitates the few-shot training capability of existing methods, e.g., by using 10% of labeled data, we significantly improve the performance of the baseline method with an average of 22% in terms of the F-measure on 4 benchmarks. (3) By turning the CLIP model into existing scene text detection methods, we further achieve promising domain adaptation ability. The code will be publicly released at https://github.com/wenwenyu/TCM.

Masked Video Distillation: Rethinking Masked Feature Modeling for Self-Supervised Video Representation Learning
Wang, Rui and Chen, Dongdong and Wu, Zuxuan and Chen, Yinpeng and Dai, Xiyang and Liu, Mengchen and Yuan, Lu and Jiang, Yu-Gang



Research question: This paper addresses the limitation that existing video representation learning methods learn representations from scratch mainly by reconstructing low-level features such as raw pixel values.
Motivation: Although self-supervised video representation learning has progressed remarkably thanks to masked visual modeling, existing methods focus on learning from scratch by reconstructing low-level features of masked patches.
Method: Propose Masked Video Distillation (MVD), a simple yet effective two-stage masked feature modeling framework: first pretrain an image (or video) model by recovering low-level features of masked patches, then use the resulting features as targets for masked feature modeling.
Results: Experiments show that video transformers pretrained with the spatial-temporal co-teaching method outperform students distilled from a single teacher across multiple video datasets. MVD with vanilla ViT achieves state-of-the-art performance on several challenging video downstream tasks; e.g., with ViT-Large, MVD reaches 86.4% and 76.7% Top-1 accuracy on Kinetics-400 and Something-Something-v2, outperforming VideoMAE by 1.2% and 2.4% respectively. With the larger ViT-Huge, MVD achieves state-of-the-art 77.3% Top-1 accuracy on Something-Something-v2. Code will be available at https://github.com/ruiwang2021/mvd.

Benefiting from masked visual modeling, self-supervised video representation learning has achieved remarkable progress. However, existing methods focus on learning representations from scratch through reconstructing low-level features like raw pixel values. In this paper, we propose masked video distillation (MVD), a simple yet effective two-stage masked feature modeling framework for video representation learning: firstly we pretrain an image (or video) model by recovering low-level features of masked patches, then we use the resulting features as targets for masked feature modeling. For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks, while image teachers transfer stronger spatial representations for spatially-heavy video tasks. Visualization analysis also indicates different teachers produce different learned patterns for students. To leverage the advantage of different teachers, we design a spatial-temporal co-teaching method for MVD. Specifically, we distill student models from both video teachers and image teachers by masked feature modeling. Extensive experimental results demonstrate that video transformers pretrained with spatial-temporal co-teaching outperform models distilled with a single teacher on a multitude of video datasets. Our MVD with vanilla ViT achieves state-of-the-art performance compared with previous methods on several challenging video downstream tasks. For example, with the ViT-Large model, our MVD achieves 86.4% and 76.7% Top-1 accuracy on Kinetics-400 and Something-Something-v2, outperforming VideoMAE by 1.2% and 2.4% respectively. When a larger ViT-Huge model is adopted, MVD achieves the state-of-the-art performance with 77.3% Top-1 accuracy on Something-Something-v2. Code will be available at https://github.com/ruiwang2021/mvd.
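The two-stage, two-teacher recipe can be sketched as below; the scalar "features" and plain squared-error losses are simplifications invented for the example, not MVD's actual objective:

```python
def masked_feature_targets(teacher_feats, mask):
    """Stage 2 of MVD (sketch): the student predicts the teacher's
    features at masked positions; here we just select those targets."""
    return [f for f, m in zip(teacher_feats, mask) if m]

def co_teaching_loss(student_pred_img, student_pred_vid,
                     img_targets, vid_targets):
    """Spatial-temporal co-teaching: sum of squared-error losses against
    an image teacher (spatial) and a video teacher (temporal)."""
    l_img = sum((p - t) ** 2 for p, t in zip(student_pred_img, img_targets))
    l_vid = sum((p - t) ** 2 for p, t in zip(student_pred_vid, vid_targets))
    return l_img + l_vid

feats = [0.2, 0.5, 0.9, 0.1]          # features from a stage-1 teacher
mask = [True, False, True, False]     # masked patch positions
targets = masked_feature_targets(feats, mask)
assert targets == [0.2, 0.9]
# A perfect student incurs zero loss from both teachers.
assert co_teaching_loss([0.2, 0.9], [0.2, 0.9], targets, targets) == 0.0
```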

RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving
Ando, Angelika and Gidaris, Spyros and Bursuc, Andrei and Puy, Gilles and Boulch, Alexandre and Marlet, Renaud



Research question: Can projection-based methods for 3D semantic segmentation benefit from recent vision transformers (ViTs)?
Motivation: Existing projection-based methods achieve good results on semantic segmentation of outdoor LiDAR point clouds, yet recent computer vision research shows that ViTs achieve state-of-the-art results on many image-based benchmarks.
Method: Propose RangeViT, which combines ViTs with three key ingredients: (a) preserving the same backbone architecture as for RGB images to exploit knowledge from large image collections; (b) substituting a tailored convolutional stem for the classical linear embedding layer to compensate for ViTs' lack of inductive bias; and (c) refining pixel-wise predictions with a convolutional decoder and a skip connection from the convolutional stem to the ViT encoder.
Results: Experiments show that RangeViT outperforms existing projection-based methods on nuScenes and SemanticKITTI.

Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, e.g., via range projection, is an effective and popular approach. These projection-based methods usually benefit from fast computations and, when combined with techniques which use other point cloud representations, achieve state-of-the-art results. Today, projection-based methods leverage 2D CNNs but recent advances in computer vision show that vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks. In this work, we question if projection-based methods for 3D semantic segmentation can benefit from these latest improvements on ViTs. We answer positively but only after combining them with three key ingredients: (a) ViTs are notoriously hard to train and require a lot of training data to learn powerful representations. By preserving the same backbone architecture as for RGB images, we can exploit the knowledge from long training on large image collections that are much cheaper to acquire and annotate than point clouds. We reach our best results with pre-trained ViTs on large image datasets. (b) We compensate ViTs' lack of inductive bias by substituting a tailored convolutional stem for the classical linear embedding layer. (c) We refine pixel-wise predictions with a convolutional decoder and a skip connection from the convolutional stem to combine low-level but fine-grained features of the convolutional stem with the high-level but coarse predictions of the ViT encoder. With these ingredients, we show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI. The code is available at https://github.com/valeoai/rangevit.

VQACL: A Novel Visual Question Answering Continual Learning Setting
Zhang, Xi and Zhang, Feifei and Xu, Changsheng



Research question: This paper addresses continual learning for multimodal tasks such as visual question answering (VQA).
Motivation: Continual learning has been studied extensively in unimodal settings, but multimodal tasks like VQA have received little attention.
Method: Propose a new VQA continual learning setting, VQACL, with two key components: a dual-level task sequence in which visual and linguistic data are nested, and a novel composition test containing new skill-concept combinations. Based on VQACL, five well-established continual learning methods are evaluated in depth and found to suffer from catastrophic forgetting and weak generalizability. To address these issues, a new representation learning method leverages sample-specific and sample-invariant features to learn representations that are both discriminative and generalizable.
Results: Extensive experiments show that the method significantly outperforms existing models, demonstrating its effectiveness and compositionality.

Research on continual learning has recently led to a variety of work in the unimodal community; however, little attention has been paid to multimodal tasks like visual question answering (VQA). In this paper, we establish a novel VQA Continual Learning setting named VQACL, which contains two key components: a dual-level task sequence where visual and linguistic data are nested, and a novel composition testing containing new skill-concept combinations. The former is devoted to simulating the ever-changing multimodal datastream in the real world and the latter aims at measuring models' generalizability for cognitive reasoning. Based on our VQACL, we perform in-depth evaluations of five well-established continual learning methods, and observe that they suffer from catastrophic forgetting and have weak generalizability. To address the above issues, we propose a novel representation learning method, which leverages a sample-specific and a sample-invariant feature to learn representations that are both discriminative and generalizable for VQA. Furthermore, by respectively extracting such representation for visual and textual input, our method can explicitly disentangle the skill and concept. Extensive experimental results illustrate that our method significantly outperforms existing models, demonstrating the effectiveness and compositionality of the proposed approach.

What Can Human Sketches Do for Object Detection?
Chowdhury, Pinaki Nath and Bhunia, Ayan Kumar and Sain, Aneeshan and Koley, Subhadeep and Xiang, Tao and Song, Yi-Zhe



Research question: How to exploit the innate expressiveness of human sketches for object detection?
Motivation: The expressiveness of human sketches has been explored for image retrieval, but not yet for the fundamental vision task of object detection.
Method: For the first time, the expressiveness of human sketches is harnessed for object detection, yielding a sketch-enabled object detection framework. The model works without knowing which categories to expect at test time (zero-shot) and without requiring additional bounding boxes (as in fully supervised detection) or class labels (as in weakly supervised detection).
Results: On standard object detection datasets such as PASCAL-VOC and MS-COCO, the framework outperforms both supervised (SOD) and weakly supervised (WSOD) object detectors in zero-shot setups.

Sketches are highly expressive, inherently capturing subjective and fine-grained visual cues. The exploration of such innate properties of human sketches has, however, been limited to that of image retrieval. In this paper, for the first time, we cultivate the expressiveness of sketches but for the fundamental vision task of object detection. The end result is a sketch-enabled object detection framework that detects based on what you sketch -- that "zebra" (e.g., one that is eating the grass) in a herd of zebras (instance-aware detection), and only the part (e.g., "head" of a "zebra") that you desire (part-aware detection). We further dictate that our model works (i) without knowing which category to expect at testing (zero-shot) and (ii) without requiring additional bounding boxes (as per fully supervised) and class labels (as per weakly supervised). Instead of devising a model from the ground up, we show an intuitive synergy between foundation models (e.g., CLIP) and existing sketch models built for sketch-based image retrieval (SBIR), which can already elegantly solve the task -- CLIP to provide model generalisation, and SBIR to bridge the (sketch->photo) gap. In particular, we first perform independent prompting on both sketch and photo branches of an SBIR model to build highly generalisable sketch and photo encoders on the back of the generalisation ability of CLIP. We then devise a training paradigm to adapt the learned encoders for object detection, such that the region embeddings of detected boxes are aligned with the sketch and photo embeddings from SBIR. Evaluated on standard object detection datasets like PASCAL-VOC and MS-COCO, our framework outperforms both supervised (SOD) and weakly-supervised object detectors (WSOD) on zero-shot setups. Project Page: https://pinakinathc.github.io/sketch-detect

All Are Worth Words: A ViT Backbone for Diffusion Models
Bao, Fan and Nie, Shen and Xue, Kaiwen and Cao, Yue and Li, Chongxuan and Su, Hang and Zhu, Jun



Research question: Design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models.
Motivation: While vision transformers have shown promise in various vision tasks, the CNN-based U-Net remains dominant in diffusion models.
Method: Treat all inputs, including the time, condition, and noisy image patches, as tokens, and employ long skip connections between shallow and deep layers, yielding a simple and general U-ViT architecture.
Results: Evaluated on unconditional and conditional image generation as well as text-to-image generation, U-ViT is comparable if not superior to a similarly sized CNN-based U-Net. In particular, latent diffusion models with U-ViT achieve record FID scores of 2.29 for class-conditional generation on ImageNet 256x256 and 5.48 for text-to-image generation on MS-COCO, among methods that do not access large external datasets while training the generative model. These results suggest that, for diffusion-based image modeling, the long skip connection is crucial, while the down-sampling and up-sampling operators in the CNN-based U-Net are not always necessary.

Vision transformers (ViT) have shown promise in various vision tasks while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. U-ViT is characterized by treating all inputs including the time, condition and noisy image patches as tokens and employing long skip connections between shallow and deep layers. We evaluate U-ViT in unconditional and class-conditional image generation, as well as text-to-image generation tasks, where U-ViT is comparable if not superior to a CNN-based U-Net of a similar size. In particular, latent diffusion models with U-ViT achieve record-breaking FID scores of 2.29 in class-conditional image generation on ImageNet 256x256, and 5.48 in text-to-image generation on MS-COCO, among methods without accessing large external datasets during the training of generative models. Our results suggest that, for diffusion-based image modeling, the long skip connection is crucial while the down-sampling and up-sampling operators in CNN-based U-Net are not always necessary. We believe that U-ViT can provide insights for future research on backbones in diffusion models and benefit generative modeling on large scale cross-modality datasets.
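The two architectural choices, everything-as-tokens and long skip connections pairing shallow and deep blocks, can be sketched as follows. This is a toy illustration with scalar tokens and identity blocks; elementwise addition stands in for U-ViT's actual concatenate-and-project skips:

```python
def u_vit_forward(time_tok, cond_tok, patch_toks, blocks):
    """U-ViT sketch: treat time, condition and noisy patches all as
    tokens; pair each shallow block with a deep block via a long skip
    connection. `blocks` must have even length; each block maps a
    token list to a token list.
    """
    tokens = [time_tok, cond_tok] + patch_toks   # everything is a token
    half = len(blocks) // 2
    saved = []
    for blk in blocks[:half]:                    # shallow half: push activations
        tokens = blk(tokens)
        saved.append(tokens)
    for blk in blocks[half:]:                    # deep half: pop and fuse
        skip = saved.pop()
        tokens = blk([t + s for t, s in zip(tokens, skip)])
    return tokens

identity = lambda ts: ts
out = u_vit_forward(1.0, 2.0, [3.0, 4.0], [identity] * 4)
assert out == [3.0, 6.0, 9.0, 12.0]   # each token got both skips added
```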

Sketch2Saliency: Learning To Detect Salient Objects From Human Drawings
Bhunia, Ayan Kumar and Koley, Subhadeep and Kumar, Amandeep and Sain, Aneeshan and Chowdhury, Pinaki Nath and Xiang, Tao and Song, Yi-Zhe



Research question: This paper explores the value of sketches for image understanding tasks, specifically their saliency.
Motivation: Sketching is a naturally attentive process and thus inherently salient; the authors study how sketches can serve as a weak label for detecting salient objects in an image.
Method: Propose a novel method that explains "salient objects" through hand-drawn sketches, generating sequential sketch coordinates via a 2D attention mechanism.
Results: Extensive quantitative and qualitative experiments confirm the hypothesis and show that the sketch-based saliency detection model performs competitively against the state of the art.

Human sketch has already proved its worth in various visual understanding tasks (e.g., retrieval, segmentation, image-captioning, etc). In this paper, we reveal a new trait of sketches -- that they are also salient. This is intuitive as sketching is a natural attentive process at its core. More specifically, we aim to study how sketches can be used as a weak label to detect salient objects present in an image. To this end, we propose a novel method that emphasises how "salient object" could be explained by hand-drawn sketches. To accomplish this, we introduce a photo-to-sketch generation model that aims to generate sequential sketch coordinates corresponding to a given visual photo through a 2D attention mechanism. Attention maps accumulated across the time steps give rise to salient regions in the process. Extensive quantitative and qualitative experiments prove our hypothesis and delineate how our sketch-based saliency detection model gives a competitive performance compared to the state-of-the-art.

ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding
Xue, Le and Gao, Mingfei and Xing, Chen and Mart{\'\i



Research question: How to improve the understanding capabilities of current state-of-the-art 3D models, which are limited by small amounts of annotated data and pre-defined category sets.
Motivation: In 2D, similar problems can be significantly alleviated by knowledge from other modalities such as language; inspired by this, leveraging multimodal information could improve 3D understanding under limited data, but this line of research is under-explored.
Method: Propose ULIP, which learns a unified representation of images, language, and 3D point clouds by pre-training on object triplets from the three modalities. To overcome the shortage of training triplets, ULIP leverages a pre-trained vision-language model that has already learned a common visual and textual space from massive image-text pairs, and then learns a 3D representation space aligned with that common image-text space using a small number of automatically synthesized triplets.
Results: Experiments show that simply pre-training multiple recent 3D backbones on ShapeNet55 with this framework effectively improves their performance, achieving state-of-the-art results in both standard and zero-shot 3D classification on ModelNet40 and ScanObjectNN. ULIP also improves PointMLP's 3D classification on ScanObjectNN by about 3% and outperforms PointCLIP by 28.8% in zero-shot 3D classification on ModelNet40.

The recognition capabilities of current state-of-the-art 3D models are limited by datasets with a small number of annotated data and a pre-defined set of categories. In its 2D counterpart, recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language. Inspired by this, leveraging multimodal information for 3D modality could be promising to improve 3D understanding under the restricted data regime, but this line of research is not well studied. Therefore, we introduce ULIP to learn a unified representation of images, language, and 3D point clouds by pre-training with object triplets from the three modalities. To overcome the shortage of training triplets, ULIP leverages a pre-trained vision-language model that has already learned a common visual and textual space by training with massive image-text pairs. Then, ULIP learns a 3D representation space aligned with the common image-text space, using a small number of automatically synthesized triplets. ULIP is agnostic to 3D backbone networks and can easily be integrated into any 3D architecture. Experiments show that ULIP effectively improves the performance of multiple recent 3D backbones by simply pre-training them on ShapeNet55 using our framework, achieving state-of-the-art performance in both standard 3D classification and zero-shot 3D classification on ModelNet40 and ScanObjectNN. ULIP also improves the performance of PointMLP by around 3% in 3D classification on ScanObjectNN, and outperforms PointCLIP by 28.8% on top-1 accuracy for zero-shot 3D classification on ModelNet40. Our code and pre-trained models are released at https://github.com/salesforce/ULIP.

Being Comes From Not-Being: Open-Vocabulary Text-to-Motion Generation With Wordless Training
Lin, Junfan and Chang, Jianlong and Liu, Lingbo and Li, Guanbin and Lin, Liang and Tian, Qi and Chen, Chang-Wen



Research question: This paper addresses the emerging and challenging problem of text-to-motion generation, which aims to synthesize motion with the same semantics as the input text.
Motivation: Due to the lack of diverse labeled training data, most approaches are either limited to specific types of text annotations or require online optimization to adapt to texts during inference, at the cost of efficiency and stability.
Method: Inspired by prompt learning in NLP, a motion generator is pre-trained to reconstruct full motion from masked motion. During inference, instead of changing the motion generator, the input text is reformulated into a masked motion that serves as a "reconstruction" prompt for the generator.
Results: Experiments show significant improvements over baseline methods. Code is available at https://github.com/junfanlin/oohmg.

Text-to-motion generation is an emerging and challenging problem, which aims to synthesize motion with the same semantics as the input text. However, due to the lack of diverse labeled training data, most approaches are either limited to specific types of text annotations or require online optimizations to cater to the texts during inference at the cost of efficiency and stability. In this paper, we investigate offline open-vocabulary text-to-motion generation in a zero-shot learning manner that neither requires paired training data nor extra online optimization to adapt for unseen texts. Inspired by the prompt learning in NLP, we pretrain a motion generator that learns to reconstruct the full motion from the masked motion. During inference, instead of changing the motion generator, our method reformulates the input text into a masked motion as the prompt for the motion generator to "reconstruct" the motion. In constructing the prompt, the unmasked poses of the prompt are synthesized by a text-to-pose generator. To supervise the optimization of the text-to-pose generator, we propose the first text-pose alignment model for measuring the alignment between texts and 3D poses. And to prevent the pose generator from overfitting to limited training texts, we further propose a novel wordless training mechanism that optimizes the text-to-pose generator without any training texts. The comprehensive experimental results show that our method obtains a significant improvement against the baseline methods. The code is available at https://github.com/junfanlin/oohmg.

MetaCLUE: Towards Comprehensive Visual Metaphors Research
Akula, Arjun R. and Driscoll, Brendan and Narayana, Pradyumna and Changpinyo, Soravit and Jia, Zhiwei and Damle, Suyash and Pruthi, Garima and Basu, Sugato and Guibas, Leonidas and Freeman, William T. and Li, Yuanzhen and Jampani, Varun



Research question: This paper addresses metaphor comprehension in computer vision, i.e., how to understand and generate creative images through nuanced relationships between abstract concepts.
Motivation: While computer vision benchmarks and methods predominantly target the understanding and generation of literal interpretations of images, metaphorical comprehension of images remains relatively unexplored. To address this, MetaCLUE, a set of vision tasks on visual metaphor, is introduced.
Method: Collect high-quality, rich metaphor annotations (abstract objects, concepts, relationships, and the corresponding object boxes), since no existing dataset facilitates evaluation of these tasks. A comprehensive analysis of state-of-the-art vision-and-language models on these annotations highlights the strengths and weaknesses of current approaches on visual metaphor classification, localization, understanding, and generation (text-to-image synthesis) tasks.
Results: MetaCLUE provides a concrete step toward systematically developing AI systems with human-like creative capabilities, and the authors hope the work advances AI systems with stronger creative and comprehension abilities.

Creativity is an indispensable part of human cognition and also an inherent part of how we make sense of the world. Metaphorical abstraction is fundamental in communicating creative ideas through nuanced relationships between abstract concepts such as feelings. While computer vision benchmarks and approaches predominantly focus on understanding and generating literal interpretations of images, metaphorical comprehension of images remains relatively unexplored. Towards this goal, we introduce MetaCLUE, a set of vision tasks on visual metaphor. We also collect high-quality and rich metaphor annotations (abstract objects, concepts, relationships along with their corresponding object boxes) as there do not exist any datasets that facilitate the evaluation of these tasks. We perform a comprehensive analysis of state-of-the-art models in vision and language based on our annotations, highlighting strengths and weaknesses of current approaches in visual metaphor Classification, Localization, Understanding (retrieval, question answering, captioning) and gEneration (text-to-image synthesis) tasks. We hope this work provides a concrete step towards systematically developing AI systems with human-like creative capabilities. Project page: https://metaclue.github.io

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
Fang, Yuxin and Wang, Wen and Xie, Binhui and Sun, Quan and Wu, Ledell and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue



Research question: This paper explores the limits of large-scale visual representation learning using only publicly accessible data.
Motivation: Existing pre-trained models hit limits when scaling up visual representations; stronger models are needed to improve performance.
Method: We propose EVA, a vision foundation model pre-trained to reconstruct masked image-text-aligned vision features, enabling visual representation learning at scale.
Results: Experiments show that EVA achieves significant improvements on a broad range of representative downstream vision tasks, such as image recognition, video action recognition, object detection, instance segmentation, and semantic segmentation, and makes breakthrough gains in transfer-learning performance. Beyond that, EVA can also serve as a multimodal foundation model connecting images and text.

We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVIS dataset, with over a thousand categories, as on the COCO dataset, with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform its train-from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models.

Gloss Attention for Gloss-Free Sign Language Translation
Yin, Aoxiong and Zhong, Tianyun and Tang, Li and Jin, Weike and Jin, Tao and Zhao, Zhou



Research question: Most sign language translation methods to date require gloss annotations to provide additional supervision, but glosses are not easy to acquire.
Motivation: To solve this problem, we first analyze existing models to confirm how gloss annotations make sign language translation easier. We find they provide two kinds of information: 1) they help the model implicitly learn the locations of semantic boundaries in continuous sign language videos, and 2) they help the model understand the sign language video globally.
Method: We propose a gloss attention mechanism that keeps the model's attention within video segments sharing the same local semantics, just as gloss annotations do for existing models. We further transfer sentence-to-sentence similarity knowledge from a natural language model into our gloss attention SLT network (GASLT) to help it understand sign language videos at the sentence level.
Results: Experiments on multiple large-scale sign language datasets show that our GASLT model significantly outperforms existing methods.

Most sign language translation (SLT) methods to date require the use of gloss annotations to provide additional supervision information, however, the acquisition of gloss is not easy. To solve this problem, we first perform an analysis of existing models to confirm how gloss annotations make SLT easier. We find that it can provide two aspects of information for the model, 1) it can help the model implicitly learn the location of semantic boundaries in continuous sign language videos, 2) it can help the model understand the sign language video globally. We then propose gloss attention, which enables the model to keep its attention within video segments that have the same semantics locally, just as gloss helps existing models do. Furthermore, we transfer the knowledge of sentence-to-sentence similarity from the natural language model to our gloss attention SLT network (GASLT) to help it understand sign language videos at the sentence level. Experimental results on multiple large-scale sign language datasets show that our proposed GASLT model significantly outperforms existing methods. Our code is provided in https://github.com/YinAoXiong/GASLT.
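The gloss-attention idea, keeping each frame's attention inside its local semantic segment, can be pictured as a banded attention mask. A minimal sketch follows; the window size and the hard binary mask are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def local_attention_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where True means attention is allowed.

    Each frame may attend only to frames within +/- `window` steps,
    mimicking the local semantic segments that gloss annotations induce.
    """
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

# Each frame attends to itself and its immediate neighbours only.
mask = local_attention_mask(6, window=1)
```

In a transformer, such a mask would be applied to the attention logits (disallowed positions set to -inf) before the softmax.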

Siamese Image Modeling for Self-Supervised Vision Representation Learning
Tao, Chenxin and Zhu, Xizhou and Su, Weijie and Huang, Gao and Li, Bin and Zhou, Jie and Qiao, Yu and Wang, Xiaogang and Dai, Jifeng



Research question: How to obtain, within a single self-supervised learning framework, both the semantic alignment of Instance Discrimination and the spatial sensitivity of Masked Image Modeling.
Motivation: Existing SSL frameworks are each deficient: Instance Discrimination lacks spatial sensitivity, while Masked Image Modeling lacks good semantic alignment.
Method: We propose Siamese Image Modeling (SiameseIM), which achieves semantic alignment by matching strongly augmented views of the same image and improves spatial sensitivity by predicting dense representations from masked images.
Results: Experiments show that SiameseIM surpasses both ID and MIM frameworks across various downstream tasks, with especially significant improvements in few-shot, long-tail, and robustness-concerned scenarios.

Self-supervised learning (SSL) has delivered superior performance on a variety of downstream vision tasks. Two main-stream SSL frameworks have been proposed, i.e., Instance Discrimination (ID) and Masked Image Modeling (MIM). ID pulls together representations from different views of the same image, while avoiding feature collapse. It lacks spatial sensitivity, which requires modeling the local structure within each image. On the other hand, MIM reconstructs the original content given a masked image. It instead does not have good semantic alignment, which requires projecting semantically similar views into nearby representations. To address this dilemma, we observe that (1) semantic alignment can be achieved by matching different image views with strong augmentations; (2) spatial sensitivity can benefit from predicting dense representations with masked images. Driven by these analyses, we propose Siamese Image Modeling (SiameseIM), which predicts the dense representations of an augmented view, based on another masked view from the same image but with different augmentations. SiameseIM uses a Siamese network with two branches. The online branch encodes the first view, and predicts the second view's representation according to the relative positions between these two views. The target branch produces the target by encoding the second view. SiameseIM can surpass both ID and MIM on a wide range of downstream tasks, including ImageNet finetuning and linear probing, COCO and LVIS detection, and ADE20k semantic segmentation. The improvement is more significant in few-shot, long-tail and robustness-concerned scenarios. Code shall be released.

Mixed Autoencoder for Self-Supervised Visual Representation Learning
Chen, Kai and Liu, Zhili and Hong, Lanqing and Xu, Hang and Li, Zhenguo and Yeung, Dit-Yan



Research question: This paper addresses the problem that, when augmenting Masked Autoencoders (MAE), naive mixing degrades model performance.
Motivation: Although MAE performs well across vision tasks, effective data augmentation for it remains an open question. Unlike in contrastive learning, where mixing is a key ingredient, naive mixing degenerates MAE due to the increase of mutual information (MI).
Method: We propose homologous recognition, an auxiliary pretext task that not only alleviates the MI increase by explicitly requiring each patch to recognize its homologous patches, but also performs object-aware self-supervised pre-training for better downstream dense perception.
Results: Experiments show that our Mixed Autoencoder (MixedAE) achieves state-of-the-art transfer results among masked image modeling (MIM) augmentations on different downstream tasks with significant efficiency. Concretely, MixedAE outperforms MAE by +0.3% accuracy, +1.7 mIoU, and +0.9 AP on ImageNet-1K, ADE20K, and COCO respectively with a standard ViT-Base, and surpasses iBOT, a strong MIM method combined with instance discrimination, while training 2x faster. To our knowledge, this is the first work to consider mixing for MIM from the perspective of pretext-task design.

Masked Autoencoder (MAE) has demonstrated superior performance on various vision tasks via randomly masking image patches and reconstruction. However, effective data augmentation strategies for MAE still remain an open question, unlike in contrastive learning, where augmentation serves as the most important part. This paper studies the prevailing mixing augmentation for MAE. We first demonstrate that naive mixing will in contrast degenerate model performance due to the increase of mutual information (MI). To address this, we propose homologous recognition, an auxiliary pretext task, not only to alleviate the MI increase by explicitly requiring each patch to recognize homologous patches, but also to perform object-aware self-supervised pre-training for better downstream dense perception performance. With extensive experiments, we demonstrate that our proposed Mixed Autoencoder (MixedAE) achieves the state-of-the-art transfer results among masked image modeling (MIM) augmentations on different downstream tasks with significant efficiency. Specifically, our MixedAE outperforms MAE by +0.3% accuracy, +1.7 mIoU and +0.9 AP on ImageNet-1K, ADE20K and COCO respectively with a standard ViT-Base. Moreover, MixedAE surpasses iBOT, a strong MIM method combined with instance discrimination, while accelerating training by 2x. To the best of our knowledge, this is the very first work to consider mixing for MIM from the perspective of pretext task design. Code will be made available.

MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers
Liu, Jihao and Huang, Xin and Zheng, Jinliang and Liu, Yu and Li, Hongsheng



Research question: This paper aims to propose a simple yet efficient pre-training method applicable to various hierarchical Vision Transformers.
Motivation: Existing masked image modeling (MIM) methods for hierarchical Vision Transformers replace a random subset of input tokens with a special [MASK] symbol and reconstruct the original image tokens from the corrupted image, but this slows training and causes pretraining-finetuning inconsistency.
Method: We replace the masked tokens of one image with visible tokens of another image, i.e., creating a mixed image, then perform dual reconstruction to recover both original images from the mixed input, which significantly improves efficiency.
Results: Experiments show that MixMAE learns high-quality visual representations effectively. Notably, MixMAE with Swin-B/W14 achieves 85.1% top-1 accuracy on ImageNet-1K with only 600 pre-training epochs. Moreover, its transfer performance on six other datasets shows that MixMAE has a better FLOPs/performance trade-off than previous popular MIM methods.

In this paper, we propose Mixed and Masked AutoEncoder (MixMAE), a simple but efficient pretraining method that is applicable to various hierarchical Vision Transformers. Existing masked image modeling (MIM) methods for hierarchical Vision Transformers replace a random subset of input tokens with a special [MASK] symbol and aim at reconstructing original image tokens from the corrupted image. However, we find that using the [MASK] symbol greatly slows down the training and causes pretraining-finetuning inconsistency, due to the large masking ratio (e.g., 60% in SimMIM). On the other hand, MAE does not introduce [MASK] tokens at its encoder at all but is not applicable for hierarchical Vision Transformers. To solve the issue and accelerate the pretraining of hierarchical models, we replace the masked tokens of one image with visible tokens of another image, i.e., creating a mixed image. We then conduct dual reconstruction to reconstruct the two original images from the mixed input, which significantly improves efficiency. While MixMAE can be applied to various hierarchical Transformers, this paper explores using Swin Transformer with a large window size and scales up to huge model size (to reach 600M parameters). Empirical results demonstrate that MixMAE can learn high-quality visual representations efficiently. Notably, MixMAE with Swin-B/W14 achieves 85.1% top-1 accuracy on ImageNet-1K by pretraining for 600 epochs. Besides, its transfer performances on the other 6 datasets show that MixMAE has better FLOPs / performance tradeoff than previous popular MIM methods.
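The core mixing step, filling one image's masked token slots with another image's visible tokens instead of a [MASK] symbol, is easy to sketch. A minimal, framework-free illustration (the token shapes and the particular mask below are demo assumptions):

```python
import numpy as np

def mix_tokens(tokens_a: np.ndarray, tokens_b: np.ndarray,
               mask: np.ndarray) -> np.ndarray:
    """Create a mixed image at the token level.

    tokens_a, tokens_b: (num_tokens, dim) token sequences of two images.
    mask: boolean (num_tokens,); True marks positions masked out in image A.
    Masked slots take image B's visible tokens, so the encoder never sees
    artificial [MASK] tokens.
    """
    return np.where(mask[:, None], tokens_b, tokens_a)

a = np.zeros((4, 3))          # stand-in tokens of image A
b = np.ones((4, 3))           # stand-in tokens of image B
mask = np.array([True, False, True, False])
mixed = mix_tokens(a, b, mask)
```

Dual reconstruction would then recover image A from the unmasked slots of `mixed` and image B from the masked slots.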

Video-Text As Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
Jin, Peng and Huang, Jinfa and Xiong, Pengfei and Tian, Shangxuan and Liu, Chang and Ji, Xiangyang and Yuan, Li and Chen, Jie



Research question: This paper tackles the challenges that contrastive-learning-based video-language representation learning faces in fine-grained cross-modal learning.
Motivation: Current models pursue semantic interaction over pre-defined video-text pairs, a coarse global interaction; the fine-grained interaction problem remains to be addressed.
Method: We creatively model video and text as game players under multivariate cooperative game theory, to handle the uncertainty in fine-grained semantic interaction with diverse granularity, flexible combination, and vague intensity. Concretely, we propose Hierarchical Banzhaf Interaction (HBI) to value the possible correspondences between video frames and text words, enabling sensitive and explainable cross-modal contrast.
Results: Extensive experiments on commonly used text-video retrieval and video question answering benchmarks demonstrate the effectiveness of the method. It can also serve as a visualization tool that promotes the understanding of cross-modal interaction, with potentially far-reaching impact on the community.

Contrastive learning-based video-language representation learning approaches, e.g., CLIP, have achieved outstanding performance, which pursue semantic interaction upon pre-defined video-text pairs. To clarify this coarse-grained global interaction and move a step further, we have to encounter challenging shell-breaking interactions for fine-grained cross-modal learning. In this paper, we creatively model video-text as game players with multivariate cooperative game theory to wisely handle the uncertainty during fine-grained semantic interaction with diverse granularity, flexible combination, and vague intensity. Concretely, we propose Hierarchical Banzhaf Interaction (HBI) to value possible correspondence between video frames and text words for sensitive and explainable cross-modal contrast. To efficiently realize the cooperative game of multiple video frames and multiple text words, the proposed method clusters the original video frames (text words) and computes the Banzhaf Interaction between the merged tokens. By stacking token merge modules, we achieve cooperative games at different semantic levels. Extensive experiments on commonly used text-video retrieval and video-question answering benchmarks with superior performances justify the efficacy of our HBI. More encouragingly, it can also serve as a visualization tool to promote the understanding of cross-modal interaction, which may have a far-reaching impact on the community. Project page is available at https://jpthu17.github.io/HBI/.
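The Banzhaf interaction the paper builds on has a standard game-theoretic definition: for players i and j, average the synergy v(S∪{i,j}) - v(S∪{i}) - v(S∪{j}) + v(S) over all coalitions S excluding both. Below is an exact, exponential-time sketch for small player sets; HBI itself sidesteps this cost by clustering tokens and computing the interaction between merged tokens:

```python
from itertools import combinations

def banzhaf_interaction(i, j, players, value):
    """Exact Banzhaf interaction index between players i and j.

    `value` maps a set of players (a coalition) to its payoff.
    Positive values mean i and j cooperate synergistically.
    """
    others = [p for p in players if p not in (i, j)]
    total, count = 0.0, 0
    for r in range(len(others) + 1):
        for s in combinations(others, r):
            s = frozenset(s)
            total += (value(s | {i, j}) - value(s | {i})
                      - value(s | {j}) + value(s))
            count += 1
    return total / count

# For a purely additive game, no player pair interacts:
additive = lambda s: float(len(s))
```

For an additive payoff the index is exactly zero; any super-additive payoff (e.g. `len(s) ** 2`) yields a positive interaction.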

All in One: Exploring Unified Video-Language Pre-Training
Wang, Jinpeng and Ge, Yixiao and Yan, Rui and Ge, Yuying and Lin, Kevin Qinghong and Tsutsui, Satoshi and Lin, Xudong and Cai, Guanyu and Wu, Jianping and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng



Research question: This paper addresses the inefficiency of mainstream video-language pre-training models in processing multimodal information.
Motivation: Existing video-language pre-training models suffer from excessive parameters and low efficiency, which hurts their performance on downstream tasks.
Method: We introduce, for the first time, an end-to-end video-language model, the all-in-one Transformer, which embeds raw video and text signals into joint representations with a unified backbone architecture. To overcome the challenge that the temporal information of video data poses to designing a modality-agnostic Transformer, we introduce a novel and effective token rolling operation that encodes temporal representations of video clips in a non-parametric manner.
Results: Our pre-trained all-in-one Transformer is transferred, after fine-tuning, to various downstream video-text tasks, including text-video retrieval, video question answering, multiple choice, and visual commonsense reasoning. On nine datasets, our method achieves state-of-the-art performance with minimal model FLOPs, demonstrating its superiority over competing models.

Mainstream Video-Language Pre-training models consist of three parts, a video encoder, a text encoder, and a video-text fusion Transformer. They pursue better performance via utilizing heavier unimodal encoders or multimodal fusion Transformers, resulting in increased parameters with lower efficiency in downstream tasks. In this work, we for the first time introduce an end-to-end video-language model, namely all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture. We argue that the unique temporal information of video data turns out to be a key barrier hindering the design of a modality-agnostic Transformer. To overcome the challenge, we introduce a novel and effective token rolling operation to encode temporal representations from video clips in a non-parametric manner. The careful design enables the representation learning of both video-text multimodal inputs and unimodal inputs using a unified backbone model. Our pre-trained all-in-one Transformer is transferred to various downstream video-text tasks after fine-tuning, including text-video retrieval, video-question answering, multiple choice and visual commonsense reasoning. State-of-the-art performances with the minimal model FLOPs on nine datasets demonstrate the superiority of our method compared to the competitive counterparts.
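A token rolling operation can be sketched with a plain array roll: shift part of each token's channels one step along the time axis, so every frame mixes in a slice of a neighbouring frame's features with no learned parameters. The channel fraction and shift below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def token_rolling(video_tokens: np.ndarray, shift: int = 1,
                  frac: float = 0.25) -> np.ndarray:
    """Non-parametric temporal mixing over (time, tokens, dim) features.

    The first `frac` of the channels is rolled `shift` steps along the
    time axis; the remaining channels stay in place, so each frame
    carries a slice of a neighbouring frame's representation.
    """
    t, n, d = video_tokens.shape
    k = max(1, int(d * frac))
    out = video_tokens.copy()
    out[..., :k] = np.roll(video_tokens[..., :k], shift, axis=0)
    return out

x = np.arange(3 * 2 * 4, dtype=float).reshape(3, 2, 4)
y = token_rolling(x)
```

Because the roll is parameter-free, the same backbone can process both rolled video tokens and plain text tokens, which is the point of a modality-agnostic design.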

VILA: Learning Image Aesthetics From User Comments With Vision-Language Pretraining
Ke, Junjie and Ye, Keren and Yu, Jiahui and Wu, Yonghui and Milanfar, Peyman and Yang, Feng



Research question: Assessing image aesthetics is challenging, as it is influenced by multiple factors including composition, color, style, and high-level semantics.
Motivation: Existing image aesthetic assessment (IAA) methods rely primarily on human-labeled rating scores, which oversimplify the visual aesthetic information humans perceive. User comments, by contrast, offer more comprehensive information and are a more natural way of expressing human opinions and preferences about image aesthetics.
Method: We propose learning image aesthetics from user comments and explore vision-language pretraining to learn multimodal aesthetic representations. Specifically, we pretrain an image-text encoder-decoder model on image-comment pairs with contrastive and generative objectives, learning rich and generic aesthetic semantics without human labels. To adapt efficiently to downstream IAA tasks, we further propose a lightweight rank-based adapter that uses text as an anchor to learn the aesthetic-ranking concept.
Results: Our pretrained aesthetic vision-language model outperforms prior work on image aesthetic captioning over the AVA-Captions dataset, and shows strong zero-shot capability on aesthetic tasks such as zero-shot style classification and zero-shot IAA, surpassing many supervised baselines. With only minimal parameter finetuning via the proposed adapter, our model achieves state-of-the-art IAA performance on the AVA dataset.

Assessing the aesthetics of an image is challenging, as it is influenced by multiple factors including composition, color, style, and high-level semantics. Existing image aesthetic assessment (IAA) methods primarily rely on human-labeled rating scores, which oversimplify the visual aesthetic information that humans perceive. Conversely, user comments offer more comprehensive information and are a more natural way to express human opinions and preferences regarding image aesthetics. In light of this, we propose learning image aesthetics from user comments, and exploring vision-language pretraining methods to learn multimodal aesthetic representations. Specifically, we pretrain an image-text encoder-decoder model with image-comment pairs, using contrastive and generative objectives to learn rich and generic aesthetic semantics without human labels. To efficiently adapt the pretrained model for downstream IAA tasks, we further propose a lightweight rank-based adapter that employs text as an anchor to learn the aesthetic ranking concept. Our results show that our pretrained aesthetic vision-language model outperforms prior works on image aesthetic captioning over the AVA-Captions dataset, and it has powerful zero-shot capability for aesthetic tasks such as zero-shot style classification and zero-shot IAA, surpassing many supervised baselines. With only minimal finetuning parameters using the proposed adapter module, our model achieves state-of-the-art IAA performance over the AVA dataset.

Fine-Grained Audible Video Description
Shen, Xuyang and Li, Dong and Zhou, Jinxing and Qin, Zhen and He, Bowen and Han, Xiaodong and Li, Aixuan and Dai, Yuchao and Kong, Lingpeng and Wang, Meng and Qiao, Yu and Zhong, Yiran



Research question: This paper proposes a new audio-visual-language modeling task, fine-grained audible video description (FAVD), which aims to provide detailed textual descriptions for a given audible video.
Motivation: Existing visual-language modeling tasks tend to focus on visual cues in videos while undervaluing the language and audio modalities. FAVD, in contrast, requires not only audio-visual-language modeling skills but also paragraph-level language generation.
Method: We construct the first fine-grained audible video description benchmark (FAVDBench): each video clip is given a one-sentence summary (the caption), 4-6 sentences describing visual details, and 1-2 audio-related sentences. We also create two new metrics: EntityScore, which gauges the completeness of entities in the visual descriptions, and AudioScore, which assesses the audio descriptions. As a preliminary approach to this task, we propose an audio-visual-language transformer that extends existing video captioning models with an additional audio branch.
Results: Evaluating against the proposed benchmark with both conventional captioning metrics and the proposed ones demonstrates the model's efficiency at audio-visual-language modeling. Applying the benchmark to video generation models further shows that fine-grained video descriptions yield more intricate videos than captions. Code and dataset are available at https://github.com/OpenNLPLab/FAVDBench; the online benchmark is at www.avlbench.opennlplab.cn.

We explore a new task for audio-visual-language modeling called fine-grained audible video description (FAVD). It aims to provide detailed textual descriptions for the given audible videos, including the appearance and spatial locations of each object, the actions of moving objects, and the sounds in videos. Existing visual-language modeling tasks often concentrate on visual cues in videos while undervaluing the language and audio modalities. On the other hand, FAVD requires not only audio-visual-language modeling skills but also paragraph-level language generation abilities. We construct the first fine-grained audible video description benchmark (FAVDBench) to facilitate this research. For each video clip, we first provide a one-sentence summary of the video, i.e., the caption, followed by 4-6 sentences describing the visual details and 1-2 audio-related descriptions at the end. The descriptions are provided in both English and Chinese. We create two new metrics for this task: an EntityScore to gauge the completeness of entities in the visual descriptions, and an AudioScore to assess the audio descriptions. As a preliminary approach to this task, we propose an audio-visual-language transformer that extends an existing video captioning model with an additional audio branch. We combine the masked language modeling and auto-regressive language modeling losses to optimize our model so that it can produce paragraph-level descriptions. We illustrate the efficiency of our model in audio-visual-language modeling by evaluating it against the proposed benchmark using both conventional captioning metrics and our proposed metrics. We further put our benchmark to the test in video generation models, demonstrating that employing fine-grained video descriptions can create more intricate videos than using captions. Code and dataset are available at https://github.com/OpenNLPLab/FAVDBench. Our online benchmark is available at www.avlbench.opennlplab.cn.

Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification
Yang, Yue and Panagopoulou, Artemis and Zhou, Shenghao and Jin, Daniel and Callison-Burch, Chris and Yatskar, Mark



Research question: How to build high-performance Concept Bottleneck Models (CBMs) that match black-box accuracy without manually specified concepts.
Motivation: Existing CBMs require manually specified concepts and often under-perform their black-box counterparts, preventing their broad adoption.
Method: We propose Language Guided Bottlenecks (LaBo), which leverages the GPT-3 language model to define a large space of possible bottlenecks: it generates factual sentences about categories to form candidate concepts, then efficiently searches the possible bottlenecks via a novel submodular utility.
Results: Experiments show that LaBo is a highly effective prior for concepts important to visual recognition. Evaluation on 11 diverse datasets demonstrates that inherently interpretable models can be widely applied with performance similar to, or better than, black-box approaches.

Concept Bottleneck Models (CBM) are inherently interpretable models that factor model decisions into human-readable concepts. They allow people to easily understand why a model is failing, a critical feature for high-stakes applications. CBMs require manually specified concepts and often under-perform their black box counterparts, preventing their broad adoption. We address these shortcomings and are first to show how to construct high-performance CBMs without manual specification of similar accuracy to black box models. Our approach, Language Guided Bottlenecks (LaBo), leverages a language model, GPT-3, to define a large space of possible bottlenecks. Given a problem domain, LaBo uses GPT-3 to produce factual sentences about categories to form candidate concepts. LaBo efficiently searches possible bottlenecks through a novel submodular utility that promotes the selection of discriminative and diverse information. Ultimately, GPT-3's sentential concepts can be aligned to images using CLIP, to form a bottleneck layer. Experiments demonstrate that LaBo is a highly effective prior for concepts important to visual recognition. In the evaluation with 11 diverse datasets, LaBo bottlenecks excel at few-shot classification: they are 11.7% more accurate than black box linear probes at 1 shot and comparable with more data. Overall, LaBo demonstrates that inherently interpretable models can be widely applied at similar, or better, performance than black box approaches.
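Submodular concept selection of the kind LaBo performs is typically run greedily: at each step, add the candidate whose marginal gain (discriminativeness minus redundancy against what is already selected) is largest. The utility below is a generic diversity-style stand-in, not the paper's exact function; `scores` and `sim` are hypothetical inputs (per-concept discriminativeness and pairwise concept similarity):

```python
def greedy_select(scores, sim, k, lam=0.5):
    """Greedily pick k concepts for a bottleneck.

    scores: per-concept discriminativeness; sim[c][s]: similarity of
    concepts c and s. Each step adds the concept with the best score
    minus a redundancy penalty against already-selected concepts.
    """
    selected, pool = [], list(range(len(scores)))
    for _ in range(k):
        best = max(
            pool,
            key=lambda c: scores[c]
            - lam * max((sim[c][s] for s in selected), default=0.0),
        )
        selected.append(best)
        pool.remove(best)
    return selected

# Two near-duplicate strong concepts (0, 1) and one weaker diverse one (2):
scores = [0.9, 0.85, 0.5]
sim = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
picked = greedy_select(scores, sim, k=2)
```

The redundancy penalty makes the selection prefer the diverse concept 2 over the near-duplicate concept 1, which is the behaviour a submodular utility is meant to induce.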

EXIF As Language: Learning Cross-Modal Associations Between Images and Camera Metadata
Zheng, Chenhao and Shrivastava, Ayush and Owens, Andrew



Research question: How to learn, by jointly training on image patches and EXIF metadata, a visual representation that captures information about the camera that recorded a photo.
Motivation: Existing features underperform on image forensics and calibration tasks; stronger features are needed.
Method: We process the EXIF metadata that cameras automatically insert into image files with a transformer: the metadata is converted to text, then trained in a multimodal embedding together with image patches.
Results: The learned features significantly outperform other self-supervised and supervised features on image forensics and calibration tasks, and can successfully localize spliced image regions "zero shot".

We learn a visual representation that captures information about the camera that recorded a given photo. To do this, we train a multimodal embedding between image patches and the EXIF metadata that cameras automatically insert into image files. Our model represents this metadata by simply converting it to text and then processing it with a transformer. The features that we learn significantly outperform other self-supervised and supervised features on downstream image forensics and calibration tasks. In particular, we successfully localize spliced image regions "zero shot" by clustering the visual embeddings for all of the patches within an image.
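The key design choice, treating camera metadata as language, amounts to serializing EXIF tags into a caption-like string before feeding them to a text transformer. A toy serializer follows; the exact formatting used in the paper may differ, and the tag names here are just illustrative EXIF fields:

```python
def exif_to_text(exif: dict) -> str:
    """Flatten EXIF tag/value pairs into a single string a text
    transformer can consume; sorting keeps the order deterministic."""
    return " ".join(f"{tag}: {value}" for tag, value in sorted(exif.items()))

caption = exif_to_text({"Model": "Canon EOS 5D", "FNumber": "f/2.8", "ISO": "200"})
```

Each image patch and its serialized metadata string would then be embedded by separate encoders and aligned with a contrastive objective, in the style of CLIP.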

ANetQA: A Large-Scale Benchmark for Fine-Grained Compositional Reasoning Over Untrimmed Videos
Yu, Zhou and Zheng, Lixiang and Zhao, Zhou and Wu, Fei and Fan, Jianping and Ren, Kui and Yu, Jun



Research question: Building benchmarks that systematically analyze the capabilities of video question answering (VideoQA) models is challenging yet crucial. Existing benchmarks often use non-compositional simple questions and suffer from language biases, making it difficult to diagnose model weaknesses precisely.
Motivation: To address these issues, AGQA was proposed: a benchmark that automatically generates QA pairs from pre-annotated scene graphs, measuring diverse reasoning abilities with granular control. However, its questions are limited in reasoning about fine-grained video semantics, since such information is absent from its scene graphs.
Method: We therefore present ANetQA, a large-scale benchmark supporting fine-grained compositional reasoning over the challenging untrimmed videos of ActivityNet. As in AGQA, the QA pairs in ANetQA are automatically generated from annotated video scene graphs.
Results: ANetQA's fine-grained properties are reflected in: (i) untrimmed videos with fine-grained semantics; (ii) spatio-temporal scene graphs with fine-grained taxonomies; and (iii) diverse questions generated from fine-grained templates. ANetQA attains 1.4 billion unbalanced and 13.4 million balanced QA pairs, an order of magnitude more than AGQA with a similar number of videos. Comprehensive experiments with state-of-the-art methods show that the best model achieves 44.5% accuracy while human performance tops out at 84.5%, leaving ample room for improvement.

Building benchmarks to systematically analyze different capabilities of video question answering (VideoQA) models is challenging yet crucial. Existing benchmarks often use non-compositional simple questions and suffer from language biases, making it difficult to diagnose model weaknesses incisively. A recent benchmark AGQA poses a promising paradigm to generate QA pairs automatically from pre-annotated scene graphs, enabling it to measure diverse reasoning abilities with granular control. However, its questions have limitations in reasoning about the fine-grained semantics in videos as such information is absent in its scene graphs. To this end, we present ANetQA, a large-scale benchmark that supports fine-grained compositional reasoning over the challenging untrimmed videos from ActivityNet. Similar to AGQA, the QA pairs in ANetQA are automatically generated from annotated video scene graphs. The fine-grained properties of ANetQA are reflected in the following: (i) untrimmed videos with fine-grained semantics; (ii) spatio-temporal scene graphs with fine-grained taxonomies; and (iii) diverse questions generated from fine-grained templates. ANetQA attains 1.4 billion unbalanced and 13.4 million balanced QA pairs, which is an order of magnitude larger than AGQA with a similar number of videos. Comprehensive experiments are performed for state-of-the-art methods. The best model achieves 44.5% accuracy while human performance tops out at 84.5%, leaving sufficient room for improvement.

CLAMP: Prompt-Based Contrastive Learning for Connecting Language and Animal Pose
Zhang, Xu and Wang, Wen and Chen, Zhe and Xu, Yufei and Zhang, Jing and Tao, Dacheng



Research question: Existing image-based methods struggle with animal pose estimation because training data is limited and intra- and inter-species variances are large.
Motivation: Motivated by progress in vision-language research, we propose that pre-trained language models (e.g., CLIP) can facilitate animal pose estimation by providing rich prior knowledge for describing animal keypoints in text.
Method: We introduce a novel prompt-based contrastive learning scheme for effectively connecting language and animal pose (CLAMP). CLAMP bridges the gap by adapting the text prompts to animal keypoints during network training; the adaptation is decomposed into spatial-aware and feature-aware processes, and two new contrastive losses are devised accordingly.
Results: Experiments show that our method achieves state-of-the-art performance in the supervised, few-shot, and zero-shot settings, outperforming image-based methods by a large margin.

Animal pose estimation is challenging for existing image-based methods because of limited training data and large intra- and inter-species variances. Motivated by the progress of visual-language research, we propose that pre-trained language models (e.g., CLIP) can facilitate animal pose estimation by providing rich prior knowledge for describing animal keypoints in text. However, we found that building effective connections between pre-trained language models and visual animal keypoints is non-trivial since the gap between text-based descriptions and keypoint-based visual features about animal pose can be significant. To address this issue, we introduce a novel prompt-based Contrastive learning scheme for connecting Language and AniMal Pose (CLAMP) effectively. The CLAMP attempts to bridge the gap by adapting the text prompts to the animal keypoints during network training. The adaptation is decomposed into spatial-aware and feature-aware processes, and two novel contrastive losses are devised correspondingly. In practice, the CLAMP enables the first cross-modal animal pose estimation paradigm. Experimental results show that our method achieves state-of-the-art performance under the supervised, few-shot, and zero-shot settings, outperforming image-based methods by a large margin. The code is available at https://github.com/xuzhang1199/CLAMP.

Mitigating Task Interference in Multi-Task Learning via Explicit Task Routing With Non-Learnable Primitives
Ding, Chuntao and Lu, Zhichao and Wang, Shangguang and Cheng, Ran and Boddeti, Vishnu Naresh



Research question: Existing multi-task learning models suffer from task interference; how can non-learnable primitives and explicit task routing mitigate it?
Motivation: Existing multi-task learning models under-perform due to interference between tasks; we therefore propose combining non-learnable primitives with explicit task routing to solve this problem.
Method: Non-learnable primitives extract a diverse set of task-agnostic features, which are recombined into a branch shared by all tasks plus task-specific branches reserved for each task. The learnable parameters are explicitly decoupled into shared and task-specific parts to minimize task interference.
Results: Experiments show that the method significantly outperforms state-of-the-art baselines on both image-level classification and pixel-level dense prediction multi-task problems, with fewer learnable parameters and similar FLOPs.

Multi-task learning (MTL) seeks to learn a single model to accomplish multiple tasks by leveraging shared information among the tasks. Existing MTL models, however, have been known to suffer from negative interference among tasks. Efforts to mitigate task interference have focused on either loss/gradient balancing or implicit parameter partitioning with partial overlaps among the tasks. In this paper, we propose ETR-NLP to mitigate task interference through a synergistic combination of non-learnable primitives (NLPs) and explicit task routing (ETR). Our key idea is to employ non-learnable primitives to extract a diverse set of task-agnostic features and recombine them into a shared branch common to all tasks and explicit task-specific branches reserved for each task. The non-learnable primitives and the explicit decoupling of learnable parameters into shared and task-specific ones afford the flexibility needed for minimizing task interference. We evaluate the efficacy of ETR-NLP networks for both image-level classification and pixel-level dense prediction MTL problems. Experimental results indicate that ETR-NLP significantly outperforms state-of-the-art baselines with fewer learnable parameters and similar FLOPs across all datasets. Code is available at this URL.

Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style
Lin, Fengyin and Li, Mingkang and Li, Da and Hospedales, Timothy and Song, Yi-Zhe and Qi, Yonggang



Research question: This paper studies zero-shot sketch-based image retrieval (ZS-SBIR), with two significant departures from prior art.
Motivation: Our goal is to tackle all ZS-SBIR variants (inter-category, intra-category, and cross-dataset) with a single network, and to understand how the sketch-photo matching operates.
Method: We reduce the cross-modal matching problem to comparisons of groups of key local patches, akin to the seasoned "bag-of-words" paradigm. Our innovation lies in achieving this while no longer requiring external semantic knowledge.
Results: Experiments show that our method delivers superior performance across all ZS-SBIR settings. By visualizing cross-modal token correspondences, the explainability goal is elegantly achieved.

This paper studies the problem of zero-shot sketch-based image retrieval (ZS-SBIR), however with two significant differentiators to prior art (i) we tackle all variants (inter-category, intra-category, and cross datasets) of ZS-SBIR with just one network ("everything"), and (ii) we would really like to understand how this sketch-photo matching operates ("explainable"). Our key innovation lies with the realization that such a cross-modal matching problem could be reduced to comparisons of groups of key local patches -- akin to the seasoned "bag-of-words" paradigm. Just with this change, we are able to achieve both of the aforementioned goals, with the added benefit of no longer requiring external semantic knowledge. Technically, ours is a transformer-based cross-modal network, with three novel components (i) a self-attention module with a learnable tokenizer to produce visual tokens that correspond to the most informative local regions, (ii) a cross-attention module to compute local correspondences between the visual tokens across two modalities, and finally (iii) a kernel-based relation network to assemble local putative matches and produce an overall similarity metric for a sketch-photo pair. Experiments show ours indeed delivers superior performances across all ZS-SBIR settings. The all-important explainable goal is elegantly achieved by visualizing cross-modal token correspondences, and for the first time, via sketch to photo synthesis by universal replacement of all matched photo patches.

Task Residual for Tuning Vision-Language Models
Yu, Tao and Lu, Zhihe and Jin, Xin and Chen, Zhibo and Wang, Xinchao



Research question: How to transfer the well-learned knowledge structure of large-scale vision-language models (VLMs) to downstream tasks with limited data while appropriately preserving the prior knowledge.
Motivation: Existing efficient transfer learning methods handle the knowledge structure of VLMs poorly, either damaging the prior knowledge or being excessively biased towards it.
Method: We propose a new efficient tuning approach named Task Residual Tuning (TaskRes), which operates directly on the text-based classifier and explicitly decouples the prior knowledge of the pre-trained model from new knowledge about the target task. Specifically, TaskRes keeps the original VLM classifier weights frozen and obtains a new classifier for the target task by tuning a set of prior-independent parameters as a residual added to the original classifier.
Results: On 11 benchmark datasets, TaskRes significantly outperforms previous ETL methods (e.g., PT and AT) while being simple to implement.

Large-scale vision-language models (VLMs) pre-trained on billion-level data have learned general visual representations and broad visual concepts. In principle, the well-learned knowledge structure of the VLMs should be inherited appropriately when being transferred to downstream tasks with limited data. However, most existing efficient transfer learning (ETL) approaches for VLMs either damage or are excessively biased towards the prior knowledge, e.g., prompt tuning (PT) discards the pre-trained text-based classifier and builds a new one while adapter-style tuning (AT) fully relies on the pre-trained features. To address this, we propose a new efficient tuning approach for VLMs named Task Residual Tuning (TaskRes), which performs directly on the text-based classifier and explicitly decouples the prior knowledge of the pre-trained models and new knowledge regarding a target task. Specifically, TaskRes keeps the original classifier weights from the VLMs frozen and obtains a new classifier for the target task by tuning a set of prior-independent parameters as a residual to the original one, which enables reliable prior knowledge preservation and flexible task-specific knowledge exploration. The proposed TaskRes is simple yet effective, which significantly outperforms previous ETL methods (e.g., PT and AT) on 11 benchmark datasets while requiring minimal effort for the implementation. Our code is available at https://github.com/geekyutao/TaskRes.
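The TaskRes mechanism reduces to one line: the task classifier is the frozen text-based classifier plus a tuned residual. A minimal numeric sketch (the scaling factor `alpha` is an assumption; in practice the residual would be a learned parameter updated by backpropagation while the base weights receive no gradients):

```python
import numpy as np

def taskres_classifier(frozen_w: np.ndarray, residual: np.ndarray,
                       alpha: float = 0.5) -> np.ndarray:
    """New task classifier = frozen VLM classifier + alpha * residual.

    `frozen_w` stays fixed, preserving prior knowledge; only `residual`
    is tuned, exploring task-specific knowledge.
    """
    return frozen_w + alpha * residual

W = np.eye(2)                              # stand-in for frozen text-classifier weights
r = np.array([[0.2, 0.0], [0.0, -0.2]])   # the (normally learned) residual
W_task = taskres_classifier(W, r)
```

Setting the residual to zero recovers the original zero-shot classifier exactly, which is why prior knowledge is preserved by construction.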

Hierarchical Prompt Learning for Multi-Task Learning
Liu, Yajing and Lu, Yuning and Liu, Hao and An, Yaozu and Xu, Zhuoran and Yao, Zhuokun and Zhang, Baofeng and Xiong, Zhiwei and Gui, Chenguang



Research question: This paper addresses how to effectively adapt vision-language models (VLMs) to multiple similar yet distinct vision tasks.
Motivation: Existing methods learn a specific prompt per task, limiting the ability to exploit information potentially shared across tasks.
Method: We present Hierarchical Prompt (HiPro) learning, a simple and effective method for jointly adapting a pre-trained VLM to multiple downstream tasks. The method quantifies inter-task affinity and then constructs a hierarchical task tree: task-shared prompts learned at internal nodes capture the information within the corresponding task group, while task-specific prompts learned at leaf nodes capture fine-grained information targeted at each task. Combining the hierarchical prompts provides high-quality content at different granularities.
Results: Evaluation of HiPro on four multi-task learning datasets demonstrates the effectiveness of the method.

Vision-language models (VLMs) can effectively transfer to various vision tasks via prompt learning. Real-world scenarios often require adapting a model to multiple similar yet distinct tasks. Existing methods focus on learning a specific prompt for each task, limiting the ability to exploit potentially shared information from other tasks. Naively training a task-shared prompt using a combination of all tasks ignores fine-grained task correlations. Significant discrepancies across tasks could cause negative transferring. Considering this, we present Hierarchical Prompt (HiPro) learning, a simple and effective method for jointly adapting a pre-trained VLM to multiple downstream tasks. Our method quantifies inter-task affinity and subsequently constructs a hierarchical task tree. Task-shared prompts learned by internal nodes explore the information within the corresponding task group, while task-individual prompts learned by leaf nodes obtain fine-grained information targeted at each task. The combination of hierarchical prompts provides high-quality content of different granularity. We evaluate HiPro on four multi-task learning datasets. The results demonstrate the effectiveness of our method.

Revealing the Dark Secrets of Masked Image Modeling
Xie, Zhenda and Geng, Zigang and Hu, Jingcheng and Zhang, Zheng and Hu, Han and Cao, Yue



Research question: This paper compares Masked Image Modeling (MIM) and supervised pre-trained models on vision tasks, via visualizations and experiments, to reveal their representational differences.
Motivation: Although MIM has proven effective for many vision downstream tasks, how and where it works remain unclear.
Method: Through visualizations and experiments, MIM is compared with the long-dominant supervised pre-trained models to uncover their key representational differences.
Results: Experiments show that MIM models perform significantly better than their supervised counterparts on geometric and motion tasks, and on tasks with weak semantics or fine-grained classification. Moreover, on semantic understanding datasets whose categories are sufficiently covered by supervised pre-training, MIM models can still achieve highly competitive transfer performance.

Masked image modeling (MIM) as pre-training is shown to be effective for numerous vision downstream tasks, but how and where MIM works remain unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from two perspectives, the visualizations and the experiments, to uncover their key representational differences. From the visualizations, we find that MIM brings locality inductive bias to all layers of the trained models, while supervised models tend to focus locally at lower layers but more globally at higher layers. That may be why MIM helps Vision Transformers, which have a very large receptive field, to optimize. Using MIM, the model can maintain a large diversity on attention heads in all layers. But for supervised models, the diversity on attention heads almost disappears in the last three layers, and less diversity harms the fine-tuning performance. From the experiments, we find that MIM models can perform significantly better than their supervised counterparts on geometric and motion tasks with weak semantics, as well as on fine-grained classification tasks. Without bells and whistles, a standard MIM pre-trained SwinV2-L could achieve state-of-the-art performance on pose estimation (78.9 AP on COCO test-dev and 78.0 AP on CrowdPose), depth estimation (0.287 RMSE on NYUv2 and 1.966 RMSE on KITTI), and video object tracking (70.7 SUC on LaSOT). For the semantic understanding datasets where the categories are sufficiently covered by the supervised pre-training, MIM models can still achieve highly competitive transfer performance. With a deeper understanding of MIM, we hope that our work can inspire new and solid research in this direction. Code will be available at https://github.com/zdaxie/MIM-DarkSecrets.

Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network
Pan, Zhengxin and Wu, Fangyu and Zhang, Bailing



Research question: Existing image-text matching methods implicitly align visual-semantic fragments via a cross-attention mechanism, but this may produce redundant or irrelevant region-word alignments, degrading retrieval accuracy and limiting efficiency.
Motivation: Although many researchers have made progress in mining meaningful alignments to improve accuracy, the problem of poor efficiency remains unresolved.
Method: We propose to learn fine-grained image-text matching from the perspective of information coding. Specifically, we present a coding framework to explain the fragment aligning process, providing a novel view to re-examine the cross-attention mechanism and analyze the problem of redundant alignments. Based on this framework, we design a Cross-modal Hard Aligning Network (CHAN) that fully exploits the most relevant region-word pairs and eliminates all other alignments.
Results: Extensive experiments on two public datasets, MS-COCO and Flickr30K, verify that the most associated word-region pairs are discriminative enough as an indicator of image-text similarity, with accuracy and efficiency superior to existing methods on bidirectional image and text retrieval tasks.

Current state-of-the-art image-text matching methods implicitly align the visual-semantic fragments, like regions in images and words in sentences, and adopt cross-attention mechanism to discover fine-grained cross-modal semantic correspondence. However, the cross-attention mechanism may bring redundant or irrelevant region-word alignments, degenerating retrieval accuracy and limiting efficiency. Although many researchers have made progress in mining meaningful alignments and thus improving accuracy, the problem of poor efficiency remains unresolved. In this work, we propose to learn fine-grained image-text matching from the perspective of information coding. Specifically, we suggest a coding framework to explain the fragments aligning process, which provides a novel view to reexamine the cross-attention mechanism and analyze the problem of redundant alignments. Based on this framework, a Cross-modal Hard Aligning Network (CHAN) is designed, which comprehensively exploits the most relevant region-word pairs and eliminates all other alignments. Extensive experiments conducted on two public datasets, MS-COCO and Flickr30K, verify that the relevance of the most associated word-region pairs is discriminative enough as an indicator of the image-text similarity, with superior accuracy and efficiency over the state-of-the-art approaches on the bidirectional image and text retrieval tasks. Our code will be available at https://github.com/ppanzx/CHAN.
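The "hard aligning" idea can be illustrated independently of the full network: instead of a softmax-weighted sum over all regions (soft cross-attention), each word keeps only its single most relevant region, and the image-text score aggregates those maxima. The plain-Python toy below uses made-up similarity values standing in for learned region/word features, and mean aggregation is an assumption for illustration.

```python
# Toy sketch of hard alignment: each word attends to exactly one region (its
# argmax), discarding all other region-word alignments. Similarity values are
# made-up stand-ins for dot-products of learned word and region features.

def hard_align_score(sim):
    """sim[w][r] is the similarity of word w to region r.
    Image-text score = mean over words of the best-matching region."""
    best_per_word = [max(region_sims) for region_sims in sim]
    return sum(best_per_word) / len(best_per_word)

# 3 words x 4 regions.
sim = [
    [0.1, 0.8, 0.2, 0.0],   # word 0 aligns hardest with region 1
    [0.5, 0.1, 0.4, 0.3],   # word 1 aligns hardest with region 0
    [0.0, 0.2, 0.1, 0.9],   # word 2 aligns hardest with region 3
]
score = hard_align_score(sim)   # (0.8 + 0.5 + 0.9) / 3
```

Compared with soft attention, this keeps only one pair per word, which is where the claimed efficiency gain would come from: no weighted aggregation over every region.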

Images Speak in Images: A Generalist Painter for In-Context Visual Learning
Wang, Xinlong and Wang, Wen and Cao, Yue and Shen, Chunhua and Huang, Tiejun



Research question: How to let computer vision models rapidly adapt to various tasks from only a handful of prompts and examples.
Motivation: In computer vision, the difficulty of in-context learning is that output representations vary significantly across tasks, so it is unclear how to define general-purpose task prompts that a vision model can understand and transfer to out-of-domain tasks.
Method: A generalist model named Painter is proposed, whose solution is "image"-centric: the outputs of core vision tasks are redefined as images, and task prompts are also specified as images. Training is very simple, performing standard masked image modeling on pairs of input and output images, which makes the model capable of performing tasks conditioned on visible image patches.
Results: Experiments show that Painter is competitive with well-established task-specific models on seven representative vision tasks, ranging from high-level visual understanding to low-level image processing, and significantly outperforms recent generalist models on several challenging tasks.

In-context learning, as a new paradigm in NLP, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. But in computer vision, the difficulties for in-context learning lie in that tasks vary significantly in the output representations, thus it is unclear how to define the general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks. In this work, we present Painter, a generalist model which addresses these obstacles with an "image"-centric solution, that is, to redefine the output of core vision tasks as images, and specify task prompts as also images. With this idea, our training process is extremely simple, which performs standard masked image modeling on the stitch of input and output image pairs. This makes the model capable of performing tasks conditioned on visible image patches. Thus, during inference, we can adopt a pair of input and output images from the same task as the input condition, to indicate which task to perform. Without bells and whistles, our generalist Painter can achieve competitive performance compared to well-established task-specific models, on seven representative vision tasks ranging from high-level visual understanding to low-level image processing. In addition, Painter significantly outperforms recent generalist models on several challenging tasks.

Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language
Li, Chuanhao and Li, Zhen and Jing, Chenchen and Jia, Yunde and Wu, Yuwei



Research question: This paper explores how primitives such as words, image regions, and video frames affect compositional generalization in the vision-and-language (V&L) field.
Motivation: Compositional generalization is key to simulating the compositional capability of humans, so understanding how primitives affect it is essential.
Method: A self-supervised learning based framework is proposed that equips V&L methods with two characteristics: semantic equivariance and semantic invariance. With these two characteristics, the methods understand primitives by perceiving the effect of primitive changes on sample semantics and ground truth.
Results: Experiments on two tasks, temporal video grounding and visual question answering, demonstrate the effectiveness of the framework.

Compositionality is one of the fundamental properties of human cognition (Fodor & Pylyshyn, 1988). Compositional generalization is critical to simulate the compositional capability of humans, and has received much attention in the vision-and-language (V&L) community. It is essential to understand the effect of the primitives, including words, image regions, and video frames, to improve the compositional generalization capability. In this paper, we explore the effect of primitives for compositional generalization in V&L. Specifically, we present a self-supervised learning based framework that equips V&L methods with two characteristics: semantic equivariance and semantic invariance. With the two characteristics, the methods understand primitives by perceiving the effect of primitive changes on sample semantics and ground-truth. Experimental results on two tasks: temporal video grounding and visual question answering, demonstrate the effectiveness of our framework.

MAGE: MAsked Generative Encoder To Unify Representation Learning and Image Synthesis
Li, Tianhong and Chang, Huiwen and Mishra, Shlok and Zhang, Han and Katabi, Dina and Krishnan, Dilip



Research question: In computer vision, generative modeling and representation learning are typically trained independently, which ignores the potential for each task to help the other and incurs training and maintenance overheads.
Motivation: The authors propose MAsked Generative Encoder (MAGE), the first framework to unify image generation and self-supervised representation learning. The key idea is to use variable masking ratios in masked image modeling pre-training, so that generative training (high masking ratio) and representation learning (low masking ratio) can proceed within the same training framework.
Method: MAGE uses semantic tokens learned by a vector-quantized GAN as inputs and outputs, combining this with masking. Adding a contrastive loss on the encoder output can further improve the representation.
Results: On ImageNet-1K, a single MAGE ViT-L model achieves 9.10 FID for class-unconditional image generation and 78.9% top-1 accuracy for linear probing, state-of-the-art performance in both image generation and representation learning.

Generative modeling and representation learning are two key tasks in computer vision. However, these models are typically trained independently, which ignores the potential for each task to help the other, and leads to training and model maintenance overheads. In this work, we propose MAsked Generative Encoder (MAGE), the first framework to unify SOTA image generation and self-supervised representation learning. Our key insight is that using variable masking ratios in masked image modeling pre-training can allow generative training (very high masking ratio) and representation learning (lower masking ratio) under the same training framework. Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs, combining this with masking. We can further improve the representation by adding a contrastive loss to the encoder output. We extensively evaluate the generation and representation learning capabilities of MAGE. On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation and 78.9% top-1 accuracy for linear probing, achieving state-of-the-art performance in both image generation and representation learning. Code is available at https://github.com/LTH14/mage.
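The variable-masking idea can be sketched in isolation: each training example draws its own masking ratio, so high-ratio draws resemble generative training while lower-ratio draws resemble representation learning, all in one loop. The plain-Python sketch below uses a uniform ratio range as an assumption for illustration; MAGE's actual sampling distribution and bounds may differ.

```python
import random

# Illustrative sketch of variable-ratio masking: each sample draws its own
# masking ratio, unifying generative training (high ratio) and representation
# learning (lower ratio) in a single pre-training loop. The uniform [lo, hi)
# range is an assumption, not MAGE's exact schedule.

def sample_mask(num_tokens, rng, lo=0.5, hi=1.0):
    """Mask a randomly drawn fraction of token positions in [lo, hi)."""
    ratio = rng.uniform(lo, hi)
    num_masked = int(round(ratio * num_tokens))
    masked = set(rng.sample(range(num_tokens), num_masked))
    return masked, ratio

rng = random.Random(0)
masked, ratio = sample_mask(num_tokens=256, rng=rng)
assert 0.5 <= ratio < 1.0
assert len(masked) == int(round(ratio * 256))
```

With ratios near 1.0 the model must synthesize almost the whole token grid (generation); with ratios near the lower bound it mostly encodes visible content (representation learning), which is how one objective serves both tasks.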

FashionSAP: Symbols and Attributes Prompt for Fine-Grained Fashion Vision-Language Pre-Training
Han, Yunpeng and Zhang, Lisai and Chen, Qingcai and Chen, Zhijian and Li, Zhonghua and Yang, Jianxin and Cao, Zhao



Research question: Existing vision-language pre-training models pay insufficient attention to fine-grained domain features, even though these features are important for distinguishing domain-specific tasks from general ones.
Motivation: Propose a fine-grained fashion vision-language pre-training method based on fashion Symbols and Attributes Prompt (FashionSAP) to model fine-grained multi-modal fashion attributes and characteristics.
Method: First, fashion symbols, a novel abstract fashion concept layer, are proposed to represent different fashion items and generalize various fine-grained fashion features, making the modeling of fine-grained attributes more effective. Second, an attributes prompt method is proposed so that the model explicitly learns specific attributes of fashion items, with proper prompt templates designed according to the format of fashion data.
Results: Comprehensive experiments on two public fashion benchmarks, FashionGen and FashionIQ, show that FashionSAP achieves state-of-the-art performance on four popular fashion tasks. Ablation studies further show that the proposed abstract fashion symbols and attributes prompt enable the model to acquire fine-grained fashion-domain semantics effectively. The clear performance gains of FashionSAP provide a new baseline for future fashion task research.

Fashion vision-language pre-training models have shown efficacy for a wide range of downstream tasks. However, general vision-language pre-training models pay less attention to fine-grained domain features, while these features are important in distinguishing the specific domain tasks from general tasks. We propose a method for fine-grained fashion vision-language pre-training based on fashion Symbols and Attributes Prompt (FashionSAP) to model fine-grained multi-modal fashion attributes and characteristics. Firstly, we propose the fashion symbols, a novel abstract fashion concept layer, to represent different fashion items and to generalize various kinds of fine-grained fashion features, making modelling fine-grained attributes more effective. Secondly, the attributes prompt method is proposed to make the model learn specific attributes of fashion items explicitly. We design proper prompt templates according to the format of fashion data. Comprehensive experiments are conducted on two public fashion benchmarks, i.e., FashionGen and FashionIQ, and FashionSAP achieves SOTA performance on four popular fashion tasks. The ablation study also shows that the proposed abstract fashion symbols and the attribute prompt method enable the model to acquire fine-grained semantics in the fashion domain effectively. The clear performance gains from FashionSAP provide a new baseline for future fashion task research.

PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models
Liu, Minghua and Zhu, Yinhao and Cai, Hong and Han, Shizhong and Ling, Zhan and Porikli, Fatih and Su, Hao



Research question: How to achieve low-cost, generalizable 3D part segmentation?
Motivation: Conventional supervised methods require large-scale 3D datasets with fine-grained part annotations, which are costly to collect.
Method: Leverage the pretrained image-language model GLIP for part detection on 3D point cloud renderings, transfer the 2D knowledge to 3D through a 2D-to-3D label lifting algorithm, and use multi-view 3D priors and few-shot prompt tuning to boost performance.
Results: Extensive evaluation on the PartNet and PartNet-Mobility datasets shows that the method enables excellent zero-shot 3D part segmentation; the few-shot version not only outperforms existing few-shot approaches by a large margin but is also competitive with the fully supervised counterpart. Furthermore, the method can be directly applied to iPhone-scanned point clouds without significant domain gaps.

Generalizable 3D part segmentation is important but challenging in vision and robotics. Training deep models via conventional supervised methods requires large-scale 3D datasets with fine-grained part annotations, which are costly to collect. This paper explores an alternative way for low-shot part segmentation of 3D point clouds by leveraging a pretrained image-language model, GLIP, which achieves superior performance on open-vocabulary 2D detection. We transfer the rich knowledge from 2D to 3D through GLIP-based part detection on point cloud rendering and a novel 2D-to-3D label lifting algorithm. We also utilize multi-view 3D priors and few-shot prompt tuning to boost performance significantly. Extensive evaluation on PartNet and PartNet-Mobility datasets shows that our method enables excellent zero-shot 3D part segmentation. Our few-shot version not only outperforms existing few-shot approaches by a large margin but also achieves highly competitive results compared to the fully supervised counterpart. Furthermore, we demonstrate that our method can be directly applied to iPhone-scanned point clouds without significant domain gaps.
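One way to picture the label-lifting step is simple multi-view voting: each 3D point collects the part labels of the 2D detections that cover its projection in each rendered view, then takes the majority vote. The plain-Python toy below (hypothetical votes, majority-vote aggregation) is only a sketch of that intuition; the paper's actual lifting algorithm additionally exploits 3D priors and is more involved.

```python
from collections import Counter

# Toy sketch of 2D-to-3D label lifting via multi-view voting: each 3D point
# votes over the part labels of the 2D detections covering its projections.
# The per-point vote lists below are hypothetical.

def lift_labels(point_votes):
    """point_votes[p] = labels observed for point p across rendered views.
    Returns the majority label per point (None if no view labeled it)."""
    labels = []
    for votes in point_votes:
        labels.append(Counter(votes).most_common(1)[0][0] if votes else None)
    return labels

votes = [
    ["handle", "handle", "body"],  # point 0: 'handle' wins 2 views to 1
    ["body"],                      # point 1: seen in one view only
    [],                            # point 2: occluded in every view
]
lifted = lift_labels(votes)        # ['handle', 'body', None]
```

Points left as `None` (never covered by any 2D detection) would then rely on the 3D priors mentioned in the abstract, e.g., label propagation from neighboring points.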

MAGVLT: Masked Generative Vision-and-Language Transformer
Kim, Sungwoong and Jo, Daejin and Lee, Donghoon and Kim, Jongmin



Research question: This paper explores a unified generative vision-and-language (VL) model that can produce both images and text sequences.
Motivation: Although generative modeling on multimodal image-text data with large-scale paired datasets has been actively developed, there have been limited attempts to generate both modalities with a single model, rather than conditionally generating one fixed modality from the other.
Method: We propose a generative VL transformer based on non-autoregressive mask prediction, named MAGVLT, and compare it with an autoregressive generative VL transformer (ARGVLT). Compared with ARGVLT, the proposed MAGVLT enables bidirectional context encoding, fast decoding, and extended editing capabilities such as image and text infilling.
Results: Experiments show that MAGVLT outperforms ARGVLT by a large margin on various downstream generation tasks of VL benchmarks, even with a significant inference speedup. In particular, MAGVLT achieves competitive results on zero-shot image-to-text and text-to-image generation on MS-COCO with a single moderate-sized model (fewer than 500M parameters), even without using monomodal data and networks.

While generative modeling on multimodal image-text data has been actively developed with large-scale paired datasets, there have been limited attempts to generate both image and text data by a single model rather than a generation of one fixed modality conditioned on the other modality. In this paper, we explore a unified generative vision-and-language (VL) model that can produce both images and text sequences. Especially, we propose a generative VL transformer based on the non-autoregressive mask prediction, named MAGVLT, and compare it with an autoregressive generative VL transformer (ARGVLT). In comparison to ARGVLT, the proposed MAGVLT enables bidirectional context encoding, fast decoding by parallel token predictions in an iterative refinement, and extended editing capabilities such as image and text infilling. For rigorous training of our MAGVLT with image-text pairs from scratch, we combine the image-to-text, text-to-image, and joint image-and-text mask prediction tasks. Moreover, we devise two additional tasks based on the step-unrolled mask prediction and the selective prediction on the mixture of two image-text pairs. Experimental results on various downstream generation tasks of VL benchmarks show that our MAGVLT outperforms ARGVLT by a large margin even with significant inference speedup. Particularly, MAGVLT achieves competitive results on both zero-shot image-to-text and text-to-image generation tasks from MS-COCO by one moderate-sized model (fewer than 500M parameters) even without the use of monomodal data and networks.

DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-Training via Word-Region Alignment
Yao, Lewei and Han, Jianhua and Liang, Xiaodan and Xu, Dan and Zhang, Wei and Li, Zhenguo and Xu, Hang



Research question: This paper aims to develop an efficient and scalable training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection (OVD).
Motivation: Existing OVD frameworks typically rely on a pre-trained vision-language model or exploit image-text pairs via a pseudo-labeling process, whereas DetCLIPv2 learns fine-grained word-region alignment directly from massive image-text pairs in an end-to-end manner.
Method: DetCLIPv2 employs the maximum word-region similarity to guide the contrastive objective, and is trained with hybrid supervision from detection, grounding, and image-text pair data to gain localization capability while learning broad concepts.
Results: Through joint training with an alternating scheme and low-resolution inputs for image-text pairs, DetCLIPv2 exploits image-text pair data efficiently. Pre-trained on 13M image-text pairs, DetCLIPv2 demonstrates superior open-vocabulary detection performance; for example, with a Swin-T backbone it achieves 40.4% zero-shot AP on the LVIS benchmark, surpassing the previous works GLIP/GLIPv2/DetCLIP by 14.4/11.4/4.5 AP respectively, and even beats its fully supervised counterpart by a large margin.

This paper presents DetCLIPv2, an efficient and scalable training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection (OVD). Unlike previous OVD frameworks that typically rely on a pre-trained vision-language model (e.g., CLIP) or exploit image-text pairs via a pseudo labeling process, DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner. To accomplish this, we employ a maximum word-region similarity between region proposals and textual words to guide the contrastive objective. To enable the model to gain localization capability while learning broad concepts, DetCLIPv2 is trained with a hybrid supervision from detection, grounding and image-text pair data under a unified data formulation. By jointly training with an alternating scheme and adopting low-resolution input for image-text pairs, DetCLIPv2 exploits image-text pair data efficiently and effectively: DetCLIPv2 utilizes 13x more image-text pairs than DetCLIP with a similar training time and improves performance. With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance, e.g., DetCLIPv2 with Swin-T backbone achieves 40.4% zero-shot AP on the LVIS benchmark, which outperforms previous works GLIP/GLIPv2/DetCLIP by 14.4/11.4/4.5% AP, respectively, and even beats its fully-supervised counterpart by a large margin.

Affordance Grounding From Demonstration Video To Target Image
Chen, Joya and Gao, Difei and Lin, Kevin Qinghong and Shou, Mike Zheng



Research question: How to let intelligent robots and assistants, such as AR glasses, learn human hand interactions from demonstration videos and ground them onto a target image, such as the user's AR glass view.
Motivation: Grounding human hand interactions from demonstration videos onto a target image is challenging, because fine-grained affordances must be predicted and the limited training data inadequately covers video-image discrepancies.
Method: Propose Affordance Transformer (Afformer), which uses a fine-grained transformer-based decoder to progressively refine affordance grounding. In addition, introduce Mask Affordance Hand (MaskAHand), a self-supervised pre-training technique for synthesizing video-image data and simulating context changes, enhancing affordance grounding across video-image discrepancies.
Results: Afformer with MaskAHand pre-training achieves state-of-the-art performance on multiple benchmarks, including a substantial 37% improvement on the OPRA dataset.

Humans excel at learning from expert demonstrations and solving their own problems. To equip intelligent robots and assistants, such as AR glasses, with this ability, it is essential to ground human hand interactions (i.e., affordances) from demonstration videos and apply them to a target image like a user's AR glass view. The video-to-image affordance grounding task is challenging due to (1) the need to predict fine-grained affordances, and (2) the limited training data, which inadequately covers video-image discrepancies and negatively impacts grounding. To tackle them, we propose Affordance Transformer (Afformer), which has a fine-grained transformer-based decoder that gradually refines affordance grounding. Moreover, we introduce Mask Affordance Hand (MaskAHand), a self-supervised pretraining technique for synthesizing video-image data and simulating context changes, enhancing affordance grounding across video-image discrepancies. Afformer with MaskAHand pre-training achieves state-of-the-art performance on multiple benchmarks, including a substantial 37% improvement on the OPRA dataset. Code is made available at https://github.com/showlab/afformer.

Unifying Vision, Text, and Layout for Universal Document Processing
Tang, Zineng and Yang, Ziyi and Wang, Guoxin and Fang, Yuwei and Liu, Yang and Zhu, Chenguang and Zeng, Michael and Zhang, Cha and Bansal, Mohit



Research question: This paper proposes a Universal Document Processing (UDOP) model that unifies text, image, and layout modalities with varied task formats, including document understanding and generation.
Motivation: Current document AI models often focus on a single text or image modality and lack unified handling of multiple modalities and task formats.
Method: With a novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain downstream tasks into a prompt-based sequence generation scheme. It is pretrained on large-scale unlabeled document corpora with innovative self-supervised objectives, as well as on diverse labeled data.
Results: Experiments show that UDOP achieves significant improvements on 8 document AI tasks, such as document understanding and QA, and ranks first on the Document Understanding Benchmark across data domains including finance reports, academic papers, and websites.

We propose Universal Document Processing (UDOP), a foundation Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. UDOP leverages the spatial correlation between textual content and document image to model image, text, and layout modalities with one uniform representation. With a novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain downstream tasks into a prompt-based sequence generation scheme. UDOP is pretrained on both large-scale unlabeled document corpora using innovative self-supervised objectives and diverse labeled data. UDOP also learns to generate document images from text and layout modalities via masked image reconstruction. To the best of our knowledge, this is the first time in the field of document AI that one model simultaneously achieves high-quality neural document editing and content customization. Our method sets the state-of-the-art on 8 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites. UDOP ranks first on the leaderboard of the Document Understanding Benchmark.

HOICLIP: Efficient Knowledge Transfer for HOI Detection With Vision-Language Models
Ning, Shan and Qiu, Longtian and Liu, Yongfei and He, Xuming



Research question: This paper addresses human-object interaction (HOI) detection, particularly the performance drop in few-shot and zero-shot scenarios.
Motivation: Although Contrastive Language-Image Pre-training (CLIP) shows great potential for providing interaction priors to HOI detectors, such approaches usually rely on large-scale training data and perform poorly in few/zero-shot scenarios.
Method: This paper proposes a novel HOI detection framework that efficiently extracts prior knowledge from CLIP and achieves better generalization. Specifically, a novel interaction decoder first extracts informative regions in CLIP's visual feature map via a cross-attention mechanism, which are then fused with the detection backbone by a knowledge integration block for more accurate human-object pair detection. In addition, prior knowledge in the CLIP text encoder is leveraged to generate a classifier by embedding HOI descriptions. To distinguish fine-grained interactions, a verb classifier is built from training data via visual semantic arithmetic and a lightweight verb representation adapter. Furthermore, a training-free enhancement exploits CLIP's global HOI predictions.
Results: Extensive experiments show that the method outperforms the state of the art by a large margin under various settings, e.g., +4.04 mAP on HICO-Det. The source code is available at https://github.com/Artanic30/HOICLIP.

Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions. Recently, Contrastive Language-Image Pre-training (CLIP) has shown great potential in providing interaction prior for HOI detectors via knowledge distillation. However, such approaches often rely on large-scale training data and suffer from inferior performance under few/zero-shot scenarios. In this paper, we propose a novel HOI detection framework that efficiently extracts prior knowledge from CLIP and achieves better generalization. In detail, we first introduce a novel interaction decoder to extract informative regions in the visual feature map of CLIP via a cross-attention mechanism, which is then fused with the detection backbone by a knowledge integration block for more accurate human-object pair detection. In addition, prior knowledge in CLIP text encoder is leveraged to generate a classifier by embedding HOI descriptions. To distinguish fine-grained interactions, we build a verb classifier from training data via visual semantic arithmetic and a lightweight verb representation adapter. Furthermore, we propose a training-free enhancement to exploit global HOI predictions from CLIP. Extensive experiments demonstrate that our method outperforms the state of the art by a large margin on various settings, e.g. +4.04 mAP on HICO-Det. The source code is available in https://github.com/Artanic30/HOICLIP.

SmallCap: Lightweight Image Captioning Prompted With Retrieval Augmentation
Ramos, Rita and Martins, Bruno and Elliott, Desmond and Kementchedjhieva, Yova



Research question: This paper addresses the rising pre-training and fine-tuning cost in image captioning caused by scaling up data and model size.
Motivation: As an alternative to large models, this paper presents SmallCap, which generates a caption conditioned on an input image and related captions retrieved from a datastore.
Method: The SmallCap model is lightweight and easy to train, since the only learned parameters are the newly introduced cross-attention layers between a CLIP encoder and a GPT-2 decoder.
Results: Experiments show that SmallCap, trained only on COCO, is competitive on that benchmark and transfers to other domains without retraining, solely through retrieval from target-domain data. Training-free exploitation of diverse human-labeled and web data further improves performance, which proves effective across a range of domains, including the nocaps benchmark designed to test generalization to unseen visual concepts.

Recent advances in image captioning have focused on scaling the data and model size, substantially increasing the cost of pre-training and finetuning. As an alternative to large models, we present SmallCap, which generates a caption conditioned on an input image and related captions retrieved from a datastore. Our model is lightweight and fast to train as the only learned parameters are in newly introduced cross-attention layers between a pre-trained CLIP encoder and GPT-2 decoder. SmallCap can transfer to new domains without additional finetuning and can exploit large-scale data in a training-free fashion since the contents of the datastore can be readily replaced. Our experiments show that SmallCap, trained only on COCO, has competitive performance on this benchmark, and also transfers to other domains without retraining, solely through retrieval from target-domain data. Further improvement is achieved through the training-free exploitation of diverse human-labeled and web data, which proves effective for a range of domains, including the nocaps benchmark, designed to test generalization to unseen visual concepts.
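The retrieval side of this design is easy to sketch: encode the image, retrieve the k most similar captions from the datastore, and splice them into the decoder prompt. The plain-Python toy below uses cosine similarity over made-up embedding vectors and an assumed prompt template; SmallCap's real components (CLIP encoder, GPT-2 decoder) are replaced by stand-ins here.

```python
import math

# Toy sketch of retrieval-augmented caption prompting: rank datastore captions
# by cosine similarity to an image embedding and build a decoder prompt from
# the top-k. Embeddings and the prompt template are illustrative assumptions.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_prompt(image_emb, datastore, k=2):
    """datastore: list of (caption_embedding, caption_text) pairs."""
    ranked = sorted(datastore, key=lambda item: cosine(image_emb, item[0]),
                    reverse=True)
    retrieved = [caption for _, caption in ranked[:k]]
    context = "".join(f"{c}\n" for c in retrieved)
    return f"Similar images show:\n{context}This image shows:"

datastore = [
    ([1.0, 0.0], "a dog running on grass"),
    ([0.9, 0.1], "a puppy chasing a ball"),
    ([0.0, 1.0], "a plate of pasta"),
]
prompt = build_prompt([1.0, 0.1], datastore, k=2)
```

Because the datastore is consulted only at inference time, swapping in target-domain captions changes the model's behavior without any retraining, which is the transfer mechanism the abstract describes.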

Probing Sentiment-Oriented Pre-Training Inspired by Human Sentiment Perception Mechanism
Feng, Tinglei and Liu, Jiaxuan and Yang, Jufeng



Research question: How to improve the performance of deep convolutional neural networks on visual sentiment analysis.
Motivation: Current pre-training mainly relies on large-scale object classification datasets (e.g., ImageNet); while this boosts performance substantially, it may make models focus excessively on object recognition while ignoring high-level sentiment concepts.
Method: Propose a sentiment-oriented pre-training method built upon the human visual sentiment perception mechanism. The perception process is factorized into three steps, namely stimuli taking, holistic organizing, and high-level perceiving; pre-training imitates each step to excavate sentiment-discriminated representations.
Results: Experiments show significant improvements on mainstream visual sentiment analysis tasks, covering single-label learning, multi-label learning, and label distribution learning.

Pre-training of deep convolutional neural networks (DCNNs) plays a crucial role in the field of visual sentiment analysis (VSA). Most proposed methods employ off-the-shelf backbones pre-trained on large-scale object classification datasets (i.e., ImageNet). While this boosts performance by a large margin over random initialization, we argue that DCNNs simply pre-trained on ImageNet may excessively focus on recognizing objects but fail to provide high-level concepts in terms of sentiment. To address this long-overlooked problem, we propose a sentiment-oriented pre-training method that is built upon the human visual sentiment perception (VSP) mechanism. Specifically, we factorize the process of VSP into three steps, namely stimuli taking, holistic organizing, and high-level perceiving. By imitating each VSP step, a total of three models are separately pre-trained via our devised sentiment-aware tasks, which contribute to excavating sentiment-discriminated representations. Moreover, along with our elaborated multi-model amalgamation strategy, the prior knowledge learned in each perception step can be effectively transferred into a single target model, yielding substantial performance gains. Finally, we verify the superiority of our proposed method through extensive experiments covering mainstream VSA tasks, from single-label learning (SLL) and multi-label learning (MLL) to label distribution learning (LDL). Experiment results demonstrate that our proposed method leads to unanimous improvements in these downstream tasks. Our code is released at https://github.com/tinglyfeng/sentiment_pretraining.

TOPLight: Lightweight Neural Networks With Task-Oriented Pretraining for Visible-Infrared Recognition
Yu, Hao and Cheng, Xu and Peng, Wei



Research question: How to overcome the enormous visual difference across heterogeneous images for visible-infrared recognition (VI recognition)?
Motivation: Existing methods mainly rely on pre-training and advanced neural architectures such as ResNet and ViT, but they ignore the negative influence of the pretrained colour prior knowledge, and their heavy computational burden makes them hard to deploy in real-world scenarios with limited resources.
Method: This paper proposes a task-oriented pretrained lightweight neural network (TOPLight), which simulates domain conflict and sample variations to guide the network to learn how to handle these difficulties, thereby learning a more general modality-shared feature representation for heterogeneous images. In addition, an effective fine-grained dependency reconstruction module (FDR) is developed to discover substantial pattern dependencies shared between the two modalities.
Results: Extensive experiments on VI person re-identification and VI face recognition datasets show that the proposed TOPLight outperforms current state-of-the-art methods while requiring fewer computational resources.

Visible-infrared recognition (VI recognition) is a challenging task due to the enormous visual difference across heterogeneous images. Most existing works achieve promising results through transfer learning, such as pretraining on ImageNet, based on advanced neural architectures like ResNet and ViT. However, such methods ignore the negative influence of the pretrained colour prior knowledge, and their heavy computational burden makes them hard to deploy in actual scenarios with limited resources. In this paper, we propose a novel task-oriented pretrained lightweight neural network (TOPLight) for VI recognition. Specifically, the TOPLight method simulates the domain conflict and sample variations with the proposed fake domain loss in the pretraining stage, which guides the network to learn how to handle those difficulties, such that a more general modality-shared feature representation is learned for the heterogeneous images. Moreover, an effective fine-grained dependency reconstruction module (FDR) is developed to discover substantial pattern dependencies shared between the two modalities. Extensive experiments on VI person re-identification and VI face recognition datasets demonstrate the superiority of the proposed TOPLight, which significantly outperforms the current state of the art while demanding fewer computational resources.

Where We Are and What We're Looking At: Query Based Worldwide Image Geo-Localization Using Hierarchies and Scenes
Clark, Brandon and Kerrigan, Alec and Kulkarni, Parth Parag and Cepeda, Vicente Vivanco and Shah, Mubarak



Research question: Despite progress in computer vision, determining the exact latitude and longitude at which a photo was taken remains a difficult task.
Motivation: Most previous approaches learn a single representation of the query image, which is then classified at different levels of geographic granularity; this fails to exploit the different visual cues that provide context at different levels.
Method: We introduce an end-to-end transformer-based architecture that, through hierarchical cross-attention, exploits the relationship between different geographic levels (which we call hierarchies) and the corresponding visual scene information in an image.
Results: We achieve state-of-the-art accuracy on four standard geo-localization datasets, Im2GPS, Im2GPS3k, YFCC4k, and YFCC26k, and qualitatively demonstrate how the method learns distinct representations for different visual hierarchies and scenes, which previous methods have not shown.

Determining the exact latitude and longitude at which a photo was taken is a useful and widely applicable task, yet it remains exceptionally difficult despite the accelerated progress of other computer vision tasks. Most previous approaches have opted to learn single representations of query images, which are then classified at different levels of geographic granularity. These approaches fail to exploit the different visual cues that give context to different hierarchies, such as the country, state, and city level. To this end, we introduce an end-to-end transformer-based architecture that exploits the relationship between different geographic levels (which we refer to as hierarchies) and the corresponding visual scene information in an image through hierarchical cross-attention. We achieve this by learning a query for each geographic hierarchy and scene type. Furthermore, we learn a separate representation for different environmental scenes, as different scenes in the same location are often defined by completely different visual features. We achieve state-of-the-art accuracy on 4 standard geo-localization datasets: Im2GPS, Im2GPS3k, YFCC4k, and YFCC26k, and qualitatively demonstrate how our method learns different representations for different visual hierarchies and scenes, which has not been demonstrated in previous methods. The previous testing datasets mostly consist of iconic landmarks or images taken from social media, which makes the dataset a simple memory task or biases it towards certain places. To address this issue, we introduce a much harder testing dataset, Google-World-Streets-15k, comprised of images taken from Google Streetview covering the whole planet, and present state-of-the-art results. Our code can be found at https://github.com/AHKerrigan/GeoGuessNet.

FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks
Han, Xiao and Zhu, Xiatian and Yu, Licheng and Zhang, Li and Song, Yi-Zhe and Xiang, Tao



Research question: The fashion domain contains a variety of vision-and-language (V+L) tasks, such as cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image captioning. These tasks differ drastically in input/output format and dataset size, and typically require designing task-specific models that are fine-tuned independently, which is parameter-inefficient and cannot exploit inter-task relatedness.
Motivation: To address these issues, this paper proposes FAME-ViL, a multi-task efficient learning method for the fashion domain.
Method: FAME-ViL handles multiple heterogeneous fashion tasks with a single model and is therefore much more parameter-efficient. It relies on two novel components: (1) a task-versatile architecture with cross-attention adapters and task-specific adapters integrated into a unified V+L model, and (2) a stable and effective multi-task training strategy that supports learning from heterogeneous data and prevents negative transfer.
Results: Extensive experiments on four fashion tasks show that FAME-ViL can save 61.5% of parameters over alternatives while significantly outperforming conventional independently trained single-task models. Code is available at https://github.com/BrandonHanx/FAME-ViL.

In the fashion domain, there exists a variety of vision-and-language (V+L) tasks, including cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image captioning. They differ drastically in each individual input/output format and dataset size. It has been common to design a task-specific model and fine-tune it independently from a pre-trained V+L model (e.g., CLIP). This results in parameter inefficiency and inability to exploit inter-task relatedness. To address such issues, we propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL) in this work. Compared with existing approaches, FAME-ViL applies a single model for multiple heterogeneous fashion tasks, therefore being much more parameter-efficient. It is enabled by two novel components: (1) a task-versatile architecture with cross-attention adapters and task-specific adapters integrated into a unified V+L model, and (2) a stable and effective multi-task training strategy that supports learning from heterogeneous data and prevents negative transfer. Extensive experiments on four fashion tasks show that our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models. Code is available at https://github.com/BrandonHanx/FAME-ViL.

Open-Vocabulary Attribute Detection
Bravo, María



Research question: This paper introduces the Open-Vocabulary Attribute Detection (OVAD) task and the corresponding OVAD benchmark, to probe the object-level attribute information learned by vision-language models.
Motivation: Due to the lack of a reliable attribute-focused evaluation benchmark, existing open-vocabulary tasks focus on object classes, while research on object attributes remains limited.
Method: A clean, densely annotated test set is created covering 117 attribute classes on the 80 object classes of MS COCO, including both positive and negative annotations to enable open-vocabulary evaluation.
Results: The benchmark's value is demonstrated by studying the attribute detection performance of several foundation models.

Vision-language modeling has enabled open-vocabulary tasks where predictions can be queried using any text prompt in a zero-shot manner. Existing open-vocabulary tasks focus on object classes, whereas research on object attributes is limited due to the lack of a reliable attribute-focused evaluation benchmark. This paper introduces the Open-Vocabulary Attribute Detection (OVAD) task and the corresponding OVAD benchmark. The objective of the novel task and benchmark is to probe object-level attribute information learned by vision-language models. To this end, we created a clean and densely annotated test set covering 117 attribute classes on the 80 object classes of MS COCO. It includes positive and negative annotations, which enables open-vocabulary evaluation. Overall, the benchmark consists of 1.4 million annotations. For reference, we provide a first baseline method for open-vocabulary attribute detection. Moreover, we demonstrate the benchmark's value by studying the attribute detection performance of several foundation models.

Test of Time: Instilling Video-Language Models With a Sense of Time
Bagad, Piyush and Tapaswi, Makarand and Snoek, Cees G. M.



Research question: Modelling and understanding time remains a challenge for current video understanding models; how can foundational video-language models be given temporal awareness?
Motivation: Language is a key driver of powerful generalization, so foundational video-language models need an understanding of time.
Method: This paper proposes a temporal adaptation recipe on top of the VideoCLIP model, based on post-pretraining on a small amount of video-text data, to endow the model with a sense of time.
Results: Experiments show that the adapted model performs well on tasks requiring higher time awareness, providing a first step towards instilling temporal awareness in existing video-language models without data- and compute-intensive retraining from scratch.

Modelling and understanding time remains a challenge in contemporary video understanding models. With language emerging as a key driver towards powerful generalization, it is imperative for foundational video-language models to have a sense of time. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We establish that seven existing video-language models struggle to understand even such simple temporal relations. We then question whether it is feasible to equip these foundational models with temporal awareness without re-training them from scratch. Towards this, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. We conduct a zero-shot evaluation of the adapted models on six datasets for three downstream tasks which require varying degrees of time awareness. We observe encouraging performance gains especially when the task needs higher time awareness. Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data and compute-intense training from scratch.

OpenScene: 3D Scene Understanding With Open Vocabularies
Peng, Songyou and Genova, Kyle and Jiang, Chiyu



Research question: This paper proposes OpenScene, an alternative approach in which a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space, enabling zero-shot learning.
Motivation: Traditional 3D scene understanding methods rely on labeled 3D datasets for single-task supervised training; OpenScene instead co-embeds 3D scene points with text and image pixels in CLIP feature space, enabling zero-shot learning and open-vocabulary queries.
Method: OpenScene first infers a CLIP feature for every 3D point and then classifies the points by their similarity to the embeddings of arbitrary class labels. The approach needs no labeled 3D data and can effectively identify objects, materials, affordances, activities, and room types in complex 3D scenes.
Results: Experiments show that this zero-shot approach enables task-agnostic training and open-vocabulary scene understanding applications, e.g., entering an arbitrary text query and seeing a heat map of which parts of a scene match, all with a single model trained without any labeled 3D data.

Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision. We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space. This zero-shot approach enables task-agnostic training and open-vocabulary queries. For example, to perform SOTA zero-shot 3D semantic segmentation it first infers CLIP features for every 3D point and later classifies them based on similarities to embeddings of arbitrary class labels. More interestingly, it enables a suite of open-vocabulary scene understanding applications that have never been done before. For example, it allows a user to enter an arbitrary text query and then see a heat map indicating which parts of a scene match. Our approach is effective at identifying objects, materials, affordances, activities, and room types in complex 3D scenes, all using a single model trained without any labeled 3D data.
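The zero-shot classification step described above is just a nearest-neighbor lookup in the shared embedding space: normalize per-point features and label-text embeddings, then assign each point the label with highest cosine similarity. A minimal sketch with toy 2-D features (real CLIP embeddings are higher-dimensional):

```python
import numpy as np

def classify_points(point_feats, label_embeds):
    """Assign each 3D point the label whose text embedding is most similar
    (cosine similarity), as in open-vocabulary zero-shot classification.
    The feature dimension and toy data below are illustrative only."""
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = label_embeds / np.linalg.norm(label_embeds, axis=1, keepdims=True)
    return (p @ t.T).argmax(axis=1)  # index of the best-matching label per point

# toy example: two "label" directions, points clustered around each
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
points = np.array([[0.9, 0.1], [0.2, 0.8], [1.0, 0.05]])
pred = classify_points(points, labels)
assert pred.tolist() == [0, 1, 0]
```

Because the label set only enters through its text embeddings, the same trained model can be queried with an arbitrary, previously unseen vocabulary.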

KiUT: Knowledge-Injected U-Transformer for Radiology Report Generation
Huang, Zhongzhen and Zhang, Xiaofan and Zhang, Shaoting



Research question: How to automatically generate a clinically accurate and coherent radiology report from an X-ray image, relieving radiologists of the heavy burden of report writing.
Motivation: Although image captioning methods perform remarkably on natural images, generating accurate reports for medical images requires knowledge across multiple modalities, including vision, language, and medical terminology.
Method: A Knowledge-injected U-Transformer (KiUT) is proposed to learn multi-level visual representations and adaptively distill information with contextual and clinical knowledge for word prediction. A U-connection schema between the encoder and decoder models interactions between modalities, and a symptom graph together with an injected knowledge distiller assists report generation.
Results: KiUT outperforms state-of-the-art methods on two widely used benchmarks, IU-Xray and MIMIC-CXR, and further experiments confirm the advantages of the architecture and the complementary benefits of the injected knowledge.

Radiology report generation aims to automatically generate a clinically accurate and coherent paragraph from the X-ray image, which could relieve radiologists from the heavy burden of report writing. Although various image caption methods have shown remarkable performance in the natural image field, generating accurate reports for medical images requires knowledge of multiple modalities, including vision, language, and medical terminology. We propose a Knowledge-injected U-Transformer (KiUT) to learn multi-level visual representation and adaptively distill the information with contextual and clinical knowledge for word prediction. In detail, a U-connection schema between the encoder and decoder is designed to model interactions between different modalities. And a symptom graph and an injected knowledge distiller are developed to assist the report generation. Experimentally, we outperform state-of-the-art methods on two widely used benchmark datasets: IU-Xray and MIMIC-CXR. Further experimental results prove the advantages of our architecture and the complementary benefits of the injected knowledge.

ShapeTalk: A Language Dataset and Framework for 3D Shape Edits and Deformations
Achlioptas, Panos and Huang, Ian and Sung, Minhyuk and Tulyakov, Sergey and Guibas, Leonidas



Research question: How to edit the geometry of 3D models through natural language.
Motivation: Existing techniques for editing 3D geometry require specialized skills; we aim to simplify the process through the use of natural language.
Method: We build ShapeTalk, the most extensive existing corpus of natural language utterances describing shape differences, and develop a generic framework, ChangeIt3D, which can use an arbitrary 3D generative model of shapes to produce outputs better aligned with an edit or deformation description.
Results: Our framework can be trained directly on 3D-to-language data, bypassing 2D-to-3D "lifting" methods such as neural rendering, which substantially improves editing efficiency and accuracy.

Editing 3D geometry is a challenging task requiring specialized skills. In this work, we aim to facilitate the task of editing the geometry of 3D models through the use of natural language. For example, we may want to modify a 3D chair model to "make its legs thinner" or to "open a hole in its back". To tackle this problem in a manner that promotes open-ended language use and enables fine-grained shape edits, we introduce the most extensive existing corpus of natural language utterances describing shape differences: ShapeTalk. ShapeTalk contains over half a million discriminative utterances produced by contrasting the shapes of common 3D objects for a variety of object classes and degrees of similarity. We also introduce a generic framework, ChangeIt3D, which builds on ShapeTalk and can use an arbitrary 3D generative model of shapes to produce edits that align the output better with the edit or deformation description. Finally, we introduce metrics for the quantitative evaluation of language-assisted shape editing methods that reflect key desiderata within this editing setup. We note that ShapeTalk allows methods to be trained with explicit 3D-to-language data, bypassing the necessity of "lifting" 2D to 3D using methods like neural rendering, as required by extant 2D image-language foundation models. Our code and data are publicly available at https://changeit3d.github.io/.

Region-Aware Pretraining for Open-Vocabulary Object Detection With Vision Transformers
Kim, Dahun and Angelova, Anelia and Kuo, Weicheng



Research question: How to bridge the gap between image-level pretraining and open-vocabulary object detection.
Motivation: Current image-text pretraining methods do not transfer effectively to open-vocabulary object detection.
Method: Region-aware Open-vocabulary Vision Transformers (RO-ViT) are proposed. At the pretraining phase, regions of the positional embeddings are randomly cropped and resized, to better match the region-level use of positional embeddings in the detection finetuning phase. In addition, the common softmax cross-entropy loss in contrastive learning is replaced with focal loss to better learn informative yet difficult examples, and recent advances in novel object proposals are leveraged to improve open-vocabulary detection finetuning.
Results: The full model is evaluated on the LVIS and COCO open-vocabulary detection benchmarks and on zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 APr on LVIS, surpassing the best existing approach by 5.8 points, alongside competitive zero-shot transfer detection. Surprisingly, RO-ViT also improves image-level representations, achieving state of the art on 9 of 12 metrics on the COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models.

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) -- a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we propose to randomly crop and resize regions of positional embeddings instead of using the whole image positional embeddings. This better matches the use of positional embeddings at region-level in the detection finetuning phase. In addition, we replace the common softmax cross entropy loss in contrastive learning with focal loss to better learn the informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 APr on LVIS, surpassing the best existing approach by +5.8 points in addition to competitive zero-shot transfer detection. Surprisingly, RO-ViT improves the image-level representation as well and achieves the state of the art on 9 out of 12 metrics on COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models.
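The focal-loss substitution described in the abstract can be pictured as binary cross-entropy over the pairwise image-text similarity matrix, modulated by the usual (1 - p_t)^gamma factor so that easy, well-separated pairs contribute little. A generic numpy sketch of that idea, not RO-ViT's exact recipe; the logits and batch size are toy values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def focal_contrastive_loss(logits, targets, gamma=2.0):
    """Focal loss over pairwise image-text similarity logits.

    logits:  (B, B) similarity matrix; targets: (B, B), 1 on matched pairs.
    gamma=0 reduces to plain binary cross-entropy; larger gamma down-weights
    easy pairs so hard, informative pairs dominate the gradient.
    """
    p = sigmoid(logits)
    p_t = np.where(targets == 1, p, 1.0 - p)   # prob. of the correct decision
    ce = -np.log(np.clip(p_t, 1e-8, 1.0))      # per-pair binary cross-entropy
    return float(np.mean((1.0 - p_t) ** gamma * ce))

B = 4
logits = np.full((B, B), -2.0) + 4.0 * np.eye(B)  # matched pairs score higher
targets = np.eye(B)
loss_bce = focal_contrastive_loss(logits, targets, gamma=0.0)   # plain BCE
loss_focal = focal_contrastive_loss(logits, targets, gamma=2.0)
assert loss_focal < loss_bce  # easy, well-separated pairs are down-weighted
```

On this toy batch every pair is already classified fairly well, so the focal modulation shrinks the loss relative to plain cross-entropy; in training, the effect is to concentrate gradient on the pairs the model still gets wrong.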

Learning Transferable Spatiotemporal Representations From Natural Script Knowledge
Zeng, Ziyun and Ge, Yuying and Liu, Xihui and Chen, Bin and Luo, Ping and Xia, Shu-Tao and Ge, Yixiao



Research question: Existing pre-trained video models are mostly trained on highly curated datasets and fail to capture spatiotemporal semantics well, limiting progress in video understanding.
Motivation: Inspired by the success of image-text pre-training, the authors exploit language semantics to boost transferable spatiotemporal representation learning.
Method: A new pretext task, Turning to Video for Transcript Sorting (TVTS), is introduced, which sorts shuffled automatic speech recognition (ASR) transcripts by attending to learned video representations. The method does not rely on descriptive captions and learns purely from video, leveraging naturally transcribed speech to provide noisy but useful spatiotemporal semantics.
Results: The method demonstrates strong out-of-the-box spatiotemporal representations on diverse benchmarks, e.g., +13.6% over VideoMAE on SSV2 via linear probing. Code is available at https://github.com/TencentARC/TVTS.

Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years. Despite some progress, existing methods are mostly limited to highly curated datasets (e.g., K400) and exhibit unsatisfactory out-of-the-box representations. We argue that it is due to the fact that they only capture pixel-level knowledge rather than spatiotemporal semantics, which hinders further progress in video understanding. Inspired by the great success of image-text pre-training (e.g., CLIP), we take the first step to exploit language semantics to boost transferable spatiotemporal representation learning. We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts shuffled ASR scripts by attending to learned video representations. We do not rely on descriptive captions and learn purely from video, i.e., leveraging the natural transcribed speech knowledge to provide noisy but useful semantics over time. Our method enforces the vision model to contextualize what is happening over time so that it can re-organize the narrative transcripts, and can seamlessly apply to large-scale uncurated video data in the real world. Our method demonstrates strong out-of-the-box spatiotemporal representations on diverse benchmarks, e.g., +13.6% gains over VideoMAE on SSV2 via linear probing. The code is available at https://github.com/TencentARC/TVTS.

3D Concept Learning and Reasoning From Multi-View Images
Hong, Yining and Lin, Chunru and Du, Yilun and Chen, Zhenfang and Tenenbaum, Joshua B. and Gan, Chuang



Research question: How to perform 3D visual reasoning from multi-view images.
Motivation: Humans reason accurately about 3D space by gathering multi-view observations of the surrounding world; inspired by this, the authors propose a new large-scale 3D multi-view visual question answering benchmark (3DMV-VQA).
Method: Using the Habitat simulator, an embodied agent actively moves through environments and captures RGB images, yielding a dataset of roughly 5k scenes, 600k images, and 50k questions. Various state-of-the-art visual reasoning models are evaluated, and a method is proposed that infers a compact 3D representation of the world from the multi-view images, grounds it on open-vocabulary semantic concepts, and executes reasoning on these 3D representations.
Results: Experimental results suggest that the framework outperforms baseline models by a large margin, but the challenge remains largely unsolved.

Humans are able to accurately reason in 3D by gathering multi-view observations of the surrounding world. Inspired by this insight, we introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA). This dataset is collected by an embodied agent actively moving and capturing RGB images in an environment using the Habitat simulator. In total, it consists of approximately 5k scenes, 600k images, paired with 50k questions. We evaluate various state-of-the-art models for visual reasoning on our benchmark and find that they all perform poorly. We suggest that a principled approach for 3D reasoning from multi-view images should be to infer a compact 3D representation of the world from the multi-view images, which is further grounded on open-vocabulary semantic concepts, and then to execute reasoning on these 3D representations. As the first step towards this approach, we propose a novel 3D concept learning and reasoning (3D-CLR) framework that seamlessly combines these components via neural fields, 2D pre-trained vision-language models, and neural reasoning operators. Experimental results suggest that our framework outperforms baseline models by a large margin, but the challenge remains largely unsolved. We further perform an in-depth analysis of the challenges and highlight potential future directions.

Integrally Pre-Trained Transformer Pyramid Networks
Tian, Yunjie and Xie, Lingxi and Wang, Zhaozhi and Wei, Longhui and Zhang, Xiaopeng and Jiao, Jianbin and Wang, Yaowei and Tian, Qi and Ye, Qixiang



Research question: This paper proposes a pre-training framework based on masked image modeling (MIM) that minimizes the transfer gap between MIM and downstream recognition tasks.
Motivation: Current pre-trained models exhibit a transfer gap on visual recognition tasks; the authors argue that jointly pre-training the backbone and the neck can reduce this gap.
Method: Two technical contributions are made. First, the reconstruction and recognition necks are unified by inserting a feature pyramid into the pre-training stage. Second, masked image modeling (MIM) is complemented with masked feature modeling (MFM), which offers multi-stage supervision to the feature pyramid. The pre-trained models, termed integrally pre-trained transformer pyramid networks (iTPNs), serve as powerful foundation models for visual recognition.
Results: The base/large-level iTPN achieves 86.2%/87.8% top-1 accuracy on ImageNet-1K, 53.2%/55.6% box AP on COCO object detection with Mask R-CNN, and 54.7%/57.7% mIoU on ADE20K semantic segmentation with UPerHead, all setting new records.

In this paper, we present an integral pre-training framework based on masked image modeling (MIM). We advocate for pre-training the backbone and neck jointly so that the transfer gap between MIM and downstream recognition tasks is minimal. We make two technical contributions. First, we unify the reconstruction and recognition necks by inserting a feature pyramid into the pre-training stage. Second, we complement mask image modeling (MIM) with masked feature modeling (MFM) that offers multi-stage supervision to the feature pyramid. The pre-trained models, termed integrally pre-trained transformer pyramid networks (iTPNs), serve as powerful foundation models for visual recognition. In particular, the base/large-level iTPN achieves an 86.2%/87.8% top-1 accuracy on ImageNet-1K, a 53.2%/55.6% box AP on COCO object detection with 1x training schedule using Mask-RCNN, and a 54.7%/57.7% mIoU on ADE20K semantic segmentation using UPerHead -- all these results set new records. Our work inspires the community to work on unifying upstream pre-training and downstream fine-tuning tasks. Code is available at https://github.com/sunsmarterjie/iTPN.

Open-Vocabulary Point-Cloud Object Detection Without 3D Annotation
Lu, Yuheng and Xu, Chenfeng and Wei, Xiaobao and Xie, Xiaodong and Tomizuka, Masayoshi and Keutzer, Kurt and Zhang, Shanghang



Research question: This paper addresses open-vocabulary 3D point-cloud detection, i.e., identifying novel objects from textual descriptions.
Motivation: Current 3D point-cloud detection methods require large amounts of annotated data; the goal here is open-vocabulary 3D object detection without any 3D annotations.
Method: A divide-and-conquer strategy is adopted: first, a point-cloud detector is developed that learns a general representation for localizing various objects; then textual and point-cloud representations are connected so the detector can classify novel object categories from text prompts. Specifically, rich image pre-trained models are leveraged so that the point-cloud detector learns to localize objects under the supervision of 2D bounding boxes predicted by 2D pre-trained detectors. Moreover, a novel de-biased triplet cross-modal contrastive learning scheme connects the image, point-cloud, and text modalities, letting the point-cloud detector benefit from vision-language pre-trained models such as CLIP.
Results: Experiments show that the method improves over a wide range of baselines by at least 3.03 points on ScanNet and 7.47 points on SUN RGB-D, and a comprehensive analysis explains why the approach works.

The goal of open-vocabulary detection is to identify novel objects based on arbitrary textual descriptions. In this paper, we address open-vocabulary 3D point-cloud detection by a dividing-and-conquering strategy, which involves: 1) developing a point-cloud detector that can learn a general representation for localizing various objects, and 2) connecting textual and point-cloud representations to enable the detector to classify novel object categories based on text prompting. Specifically, we resort to rich image pre-trained models, by which the point-cloud detector learns localizing objects under the supervision of predicted 2D bounding boxes from 2D pre-trained detectors. Moreover, we propose a novel de-biased triplet cross-modal contrastive learning to connect the modalities of image, point-cloud and text, thereby enabling the point-cloud detector to benefit from vision-language pre-trained models, i.e., CLIP. The novel use of image and vision-language pre-trained models for point-cloud detectors allows for open-vocabulary 3D object detection without the need for 3D annotations. Experiments demonstrate that the proposed method improves at least 3.03 points and 7.47 points over a wide range of baselines on the ScanNet and SUN RGB-D datasets, respectively. Furthermore, we provide a comprehensive analysis to explain why our approach works.

Detecting Backdoors in Pre-Trained Encoders
Feng, Shiwei and Tao, Guanhong and Cheng, Siyuan and Shen, Guangyu and Xu, Xiangzhe and Liu, Yingqi and Zhang, Kaiyuan and Ma, Shiqing and Zhang, Xiangyu



Research question: Existing backdoor detection methods mainly target supervised learning settings and cannot handle pre-trained encoders, especially when input labels are unavailable.
Motivation: Self-supervised learning in computer vision trains on unlabeled data to obtain high-quality input embeddings. Emerging backdoor attacks on encoders expose crucial vulnerabilities of self-supervised learning, since downstream classifiers (even when further trained on clean data) may inherit the encoder's backdoor behavior.
Method: This paper proposes DECREE, the first backdoor detection approach for pre-trained encoders, requiring neither classifier heads nor input labels.
Results: Evaluation covers over 400 encoders trojaned under 3 paradigms. The method is effective on image encoders pre-trained on ImageNet and on OpenAI's CLIP 400 million image-text pairs, maintaining high detection accuracy even with limited or no access to the pre-training dataset.

Self-supervised learning in computer vision trains on unlabeled data, such as images or (image, text) pairs, to obtain an image encoder that learns high-quality embeddings for input data. Emerging backdoor attacks towards encoders expose crucial vulnerabilities of self-supervised learning, since downstream classifiers (even further trained on clean data) may inherit backdoor behaviors from encoders. Existing backdoor detection methods mainly focus on supervised learning settings and cannot handle pre-trained encoders especially when input labels are not available. In this paper, we propose DECREE, the first backdoor detection approach for pre-trained encoders, requiring neither classifier headers nor input labels. We evaluate DECREE on over 400 encoders trojaned under 3 paradigms. We show the effectiveness of our method on image encoders pre-trained on ImageNet and OpenAI's CLIP 400 million image-text pairs. Our method consistently has a high detection accuracy even if we have only limited or no access to the pre-training dataset.

CREPE: Can Vision-Language Foundation Models Reason Compositionally?
Ma, Zixian and Hong, Jerry and Gul, Mustafa Omer and Gandhi, Mona and Gao, Irena and Krishna, Ranjay



Research question: Despite the performance gains from large vision-and-language pretraining, these models still struggle with compositionality.
Motivation: To probe this, the authors introduce CREPE, a new compositionality evaluation benchmark measuring two important aspects of compositionality identified in the cognitive science literature: systematicity and productivity.
Method: The CREPE test set contains over 370K image-text pairs with three different seen-unseen splits, together with 325K, 316K, and 309K generated hard negative captions.
Results: Experiments show that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to 9%. Retrieval success also decays as complexity increases, often nearing random chance at high complexity. These results hold regardless of model and training dataset size.

A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that--across 7 architectures trained with 4 algorithms on massive datasets--they struggle at compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over 370K image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate 325K, 316K, and 309K hard negative captions for a subset of the pairs. To test productivity, CREPE contains 17K image-text pairs with nine different complexities plus 278K hard negative captions with atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to 9%. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
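The Recall@1 numbers quoted above come from a simple retrieval metric: the fraction of queries whose top-ranked item is their ground-truth match. A minimal sketch, assuming the common convention that query i's match is item i (the diagonal of the similarity matrix); the toy matrix is illustrative:

```python
import numpy as np

def recall_at_1(sim):
    """Fraction of queries whose top-ranked item is the matched one,
    assuming query i's ground-truth match is item i (the diagonal)."""
    return float(np.mean(sim.argmax(axis=1) == np.arange(sim.shape[0])))

# toy similarity matrix: queries 0 and 2 rank their match first, query 1 does not
sim = np.array([[0.9, 0.1, 0.0],
                [0.8, 0.2, 0.1],
                [0.1, 0.0, 0.7]])
assert recall_at_1(sim) == 2 / 3
```

CREPE's systematicity splits then compare this number between retrieval sets dominated by seen versus novel compositions.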

Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks
Wang, Wenhui and Bao, Hangbo and Dong, Li and Bjorck, Johan and Peng, Zhiliang and Liu, Qiang and Aggarwal, Kriti and Mohammed, Owais Khan and Singhal, Saksham and Som, Subhojit and Wei, Furu



Research question: This paper introduces BEiT-3, a general-purpose multimodal foundation model that achieves excellent transfer performance on both vision and vision-language tasks.
Motivation: Language, vision, and multimodal pretraining are converging; this work advances the big convergence from three aspects: backbone architecture, pretraining task, and model scaling.
Method: Multiway Transformers are used for general-purpose modeling, whose modular architecture enables both deep fusion and modality-specific encoding. On the shared backbone, unified masked "language" modeling is performed on images (Imglish), texts (English), and image-text pairs ("parallel sentences").
Results: Experiments show that BEiT-3 performs remarkably on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).

A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves excellent transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We use Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEiT-3 obtains remarkable performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).

Weakly Supervised Posture Mining for Fine-Grained Classification
Tang, Zhenchao and Yang, Hualin and Chen, Calvin Yu-Chian



Research question: How to improve accuracy on fine-grained classification tasks, where sub-categories of common visual categories such as bird species differ only subtly.
Motivation: Previous work has focused on the features of individual discriminative regions in isolation, neglecting the connections among discriminative regions across the whole image. The relationships between discriminative regions carry rich posture information; with posture information, a model can learn the behavior of the object and thereby improve classification performance.
Method: A novel fine-grained framework, PMRC (posture mining and reverse cross-entropy), is proposed that can be combined with different backbones. PMRC uses a Deep Navigator to generate discriminative regions from images and builds a graph from them; the graph is aggregated by message passing to obtain the classification results. To force PMRC to learn to mine posture information, a new training paradigm lets the Deep Navigator and the message passing communicate and train together. In addition, reverse cross-entropy (RCE) is proposed and shown, compared with cross-entropy (CE), to improve not only this model's accuracy but also that of other kinds of fine-grained classification models.
Results: Experiments on benchmark datasets show that PMRC achieves state-of-the-art performance.

Because of the subtle differences between the different sub-categories of common visual categories such as bird species, fine-grained classification has been seen as a challenging task for many years. Most previous works focus on the features of a single discriminative region in isolation, while neglecting the connections between the different discriminative regions in the whole image. However, the relationships between different discriminative regions contain rich posture information, and by adding the posture information the model can learn the behavior of the object, which helps improve the classification performance. In this paper, we propose a novel fine-grained framework named PMRC (posture mining and reverse cross-entropy), which can be combined with different backbones to good effect. In PMRC, we use the Deep Navigator to generate the discriminative regions from the images, and then use them to construct a graph. We aggregate the graph by message passing and obtain the classification results. Specifically, in order to force PMRC to learn how to mine the posture information, we design a novel training paradigm, which makes the Deep Navigator and message passing communicate and train together. In addition, we propose the reverse cross-entropy (RCE) and demonstrate that, compared to the cross-entropy (CE), RCE can not only improve the accuracy of our model but also generalize to improve the accuracy of other kinds of fine-grained classification models. Experimental results on benchmark datasets confirm that PMRC can achieve state-of-the-art performance.

LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling
Li, Linjie and Gan, Zhe and Lin, Kevin and Lin, Chung-Ching and Liu, Zicheng and Liu, Ce and Wang, Lijuan



Research question: This paper develops a unified video-language (VidL) framework that simplifies the model architecture and unifies tasks.
Motivation: Existing VidL models require task-specific designs in model architecture and training objectives for each task, lacking unification.
Method: A unified VidL framework, LAVENDER, is proposed, in which Masked Language Modeling (MLM) serves as the common interface for all pre-training and downstream tasks. This unification means that only a lightweight MLM head, rather than a decoder with many more parameters, is needed on top of the multimodal encoder.
Results: Experiments show that this unified framework achieves competitive performance on 14 VidL benchmarks, covering video question answering, text-to-video retrieval, and video captioning. Further extensive analysis shows that LAVENDER can seamlessly support all downstream tasks with a single set of parameter values under multi-task finetuning, generalize to various downstream tasks with limited training samples, and enable zero-shot evaluation on video question answering tasks.

Unified vision-language frameworks have greatly advanced in recent years, most of which adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence generation. However, existing video-language (VidL) models still require task-specific designs in model architecture and training objectives for each task. In this work, we explore a unified VidL framework LAVENDER, where Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks. Such unification leads to a simplified model architecture, where only a lightweight MLM head, instead of a decoder with much more parameters, is needed on top of the multimodal encoder. Surprisingly, experimental results show that this unified framework achieves competitive performance on 14 VidL benchmarks, covering video question answering, text-to-video retrieval and video captioning. Extensive analyses further demonstrate LAVENDER can (i) seamlessly support all downstream tasks with just a single set of parameter values when multi-task finetuned; (ii) generalize to various downstream tasks with limited training samples; and (iii) enable zero-shot evaluation on video question answering tasks.
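The "lightweight MLM head" idea can be sketched in a few lines: every task is cast as predicting vocabulary tokens at [MASK] positions, so the only task head is a single projection from encoder features to the vocabulary. The shapes and toy data below are illustrative assumptions, not LAVENDER's actual dimensions:

```python
import numpy as np

def mlm_head(hidden, vocab_proj, mask_positions):
    """A lightweight MLM head: one linear projection to the vocabulary,
    read out at the masked positions. Under an MLM-as-interface design,
    QA answers, retrieval decisions ("true"/"false"), and caption words
    are all produced this same way."""
    logits = hidden @ vocab_proj            # (seq_len, vocab_size)
    return logits[mask_positions].argmax(axis=-1)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(10, 32))          # multimodal encoder outputs
vocab_proj = rng.normal(size=(32, 100))     # projection to a 100-word toy vocabulary
preds = mlm_head(hidden, vocab_proj, mask_positions=[3, 7])
assert preds.shape == (2,)
```

Because the head is shared, switching tasks changes only where the [MASK] tokens are placed in the input, not the architecture.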

Shifted Diffusion for Text-to-Image Generation
Zhou, Yufan and Liu, Bingchen and Zhu, Yizhe and Yang, Xiao and Chen, Changyou and Xu, Jinhui



Research question: This paper proposes Corgi, a new method for text-to-image generation.
Motivation: The baseline diffusion model used in DALL-E 2 leaves room for improvement; this work aims to make image-embedding generation from text more efficient and effective.
Method: Based on the proposed shifted diffusion model, prior knowledge of the pre-trained CLIP model is seamlessly encoded into the diffusion process by designing a new initialization distribution and a new transition step for the diffusion.
Results: Experiments show that the method generates image embeddings from text more effectively than the strong DALL-E 2 baseline, yielding better text-to-image generation. The method also enables semi-supervised and language-free training, performing effective text-to-image generation even when only part or none of the training images have associated captions.

We present Corgi, a novel method for text-to-image generation. Corgi is based on our proposed shifted diffusion model, which achieves better image embedding generation from input text. Different from the baseline diffusion model used in DALL-E 2, our method seamlessly encodes prior knowledge of the pre-trained CLIP model in its diffusion process by designing a new initialization distribution and a new transition step of the diffusion. Compared to the strong DALL-E 2 baseline, our method performs better in generating image embedding from the text in terms of both efficiency and effectiveness, which consequently results in better text-to-image generation. Extensive large-scale experiments are conducted and evaluated in terms of both quantitative measures and human evaluation, indicating a stronger generation ability of our method compared to existing ones. Furthermore, our model enables semi-supervised and language-free training for text-to-image generation, where only part or none of the images in the training dataset have an associated caption. Trained with only 1.7% of the images being captioned, our semi-supervised model obtains FID results comparable to DALL-E 2 on zero-shot text-to-image generation evaluated on MS-COCO. Corgi also achieves new state-of-the-art results across different datasets on downstream language-free text-to-image generation tasks, outperforming the previous method, Lafite, by a large margin.
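The "new initialization distribution" can be illustrated with a toy contrast: a standard diffusion prior starts sampling from N(0, I), whereas a shifted one starts from a Gaussian centered on statistics of the CLIP image-embedding distribution. The paper's actual shift parameterization differs; this sketch, with a hypothetical embedding mean, only conveys the idea of baking CLIP knowledge into the starting point:

```python
import numpy as np

def sample_init(shape, clip_mean=None, rng=None):
    """Draw the diffusion starting point x_T.

    clip_mean=None  -> standard Gaussian initialization (DALL-E 2 style).
    clip_mean given -> Gaussian shifted towards the mean CLIP image embedding,
    an illustrative stand-in for the paper's shifted initialization.
    """
    rng = rng or np.random.default_rng(0)
    noise = rng.normal(size=shape)
    return noise if clip_mean is None else noise + clip_mean

mu = np.full(8, 0.5)                     # hypothetical mean CLIP embedding
x_T = sample_init((1000, 8), clip_mean=mu)
assert abs(x_T.mean() - 0.5) < 0.1       # samples center near the CLIP mean
```

Starting the reverse process inside the region where real CLIP embeddings live gives the denoiser less distance to cover, which is one intuition for the reported efficiency gains.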

OvarNet: Towards Open-Vocabulary Object Attribute Recognition
Chen, Keyan and Jiang, Xiaolong and Hu, Yao and Tang, Xu and Gao, Yan and Chen, Jianqi and Xie, Weidi



Research question: This paper addresses simultaneously detecting objects and inferring their visual attributes in an image, even for attributes with no manual annotations at the training stage.
Motivation: Most current models treat object detection and attribute classification separately, with little research on both under an open-vocabulary setting.
Method: First, a two-stage approach, CLIP-Attr, performs open-vocabulary object detection and attribute classification; then the CLIP model is finetuned with a federated strategy over all available datasets to align the visual representation with attributes; finally, for efficiency, an end-to-end Faster R-CNN-type model is trained with knowledge distillation, performing class-agnostic object proposals and classification over semantic categories and attributes.
Results: Experiments show that recognizing semantic categories and attributes is complementary for visual scene understanding: jointly training object detection and attribute prediction largely outperforms existing approaches that treat the two tasks independently, demonstrating strong generalization to novel attributes and categories.

In this paper, we consider the problem of simultaneously detecting objects and inferring their visual attributes in an image, even for those with no manual annotations provided at the training stage, resembling an open-vocabulary scenario. To achieve this goal, we make the following contributions: (i) we start with a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr. The candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes; (ii) we combine all available datasets and train with a federated strategy to finetune the CLIP model, aligning the visual representation with attributes, additionally, we investigate the efficacy of leveraging freely available online image-caption pairs under weakly supervised learning; (iii) in pursuit of efficiency, we train a Faster-RCNN type model end-to-end with knowledge distillation, that performs class-agnostic object proposals and classification on semantic categories and attributes with classifiers generated from a text encoder; Finally, (iv) we conduct extensive experiments on VAW, MS-COCO, LSA, and OVAD datasets, and show that recognition of semantic category and attributes is complementary for visual scene understanding, i.e., jointly training object detection and attributes prediction largely outperform existing approaches that treat the two tasks independently, demonstrating strong generalization ability to novel attributes and categories.

TarViS: A Unified Approach for Target-Based Video Segmentation
Athar, Ali and Hermans, Alexander and Luiten, Jonathon and Ramanan, Deva and Leibe, Bastian



Research question: Video segmentation is currently fragmented into different tasks spanning multiple benchmarks; despite rapid progress in the state of the art, current methods are overwhelmingly task-specific and cannot conceptually generalize to other tasks.
Motivation: Inspired by recent multi-task-capable methods and architectures, TarViS is proposed: a novel, unified network architecture applicable to any task that requires segmenting a set of arbitrarily defined "targets" in video.
Method: The approach is flexible with respect to how tasks define these targets, modeling them as abstract "queries" that are then used to predict pixel-precise target masks. A single TarViS model can be trained jointly on a collection of datasets spanning different tasks and can hot-swap between tasks at inference without any task-specific retraining.
Results: To demonstrate its effectiveness, TarViS is applied to four tasks: Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Object Segmentation (VOS), and Point Exemplar-guided Tracking (PET). The unified, jointly trained model achieves state-of-the-art performance on five of the seven benchmarks spanning these four tasks and competitive performance on the remaining two. Code and model weights are available at: https://github.com/Ali2500/TarViS

The general domain of video segmentation is currently fragmented into different tasks spanning multiple benchmarks. Despite rapid progress in the state-of-the-art, current methods are overwhelmingly task-specific and cannot conceptually generalize to other tasks. Inspired by recent approaches with multi-task capability, we propose TarViS: a novel, unified network architecture that can be applied to any task that requires segmenting a set of arbitrarily defined 'targets' in video. Our approach is flexible with respect to how tasks define these targets, since it models the latter as abstract 'queries' which are then used to predict pixel-precise target masks. A single TarViS model can be trained jointly on a collection of datasets spanning different tasks, and can hot-swap between tasks during inference without any task-specific retraining. To demonstrate its effectiveness, we apply TarViS to four different tasks, namely Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Object Segmentation (VOS) and Point Exemplar-guided Tracking (PET). Our unified, jointly trained model achieves state-of-the-art performance on 5/7 benchmarks spanning these four tasks, and competitive performance on the remaining two. Code and model weights are available at: https://github.com/Ali2500/TarViS

EC2: Emergent Communication for Embodied Control
Mu, Yao and Yao, Shunyu and Ding, Mingyu and Luo, Ping and Gan, Chuang



Research question: How to leverage multi-modal pre-training to quickly learn to act in new environments, jointly learning from video demonstrations and language instructions.
Motivation: Existing methods apply contrastive learning to force alignment between the two modalities, but better modeling their complementary differences can yield more holistic representations for downstream adaptation.
Method: Emergent Communication for Embodied Control (EC^2) is proposed, a novel video-language representation pre-training scheme for few-shot embodied control. The key idea is to learn an unsupervised "language" of videos via emergent communication, bridging the semantics of video details and the structures of natural language.
Results: On the Metaworld and Franka Kitchen embodied benchmarks, EC^2 consistently outperforms previous contrastive learning methods with both videos and texts as task inputs. Further ablations confirm the importance of the emergent language, which benefits both video and language learning and is significantly superior to using pre-trained video captions.

Embodied control requires agents to leverage multi-modal pre-training to quickly learn how to act in new environments, where video demonstrations contain visual and motion details needed for low-level perception and control, and language instructions support generalization with abstract, symbolic structures. While recent approaches apply contrastive learning to force alignment between the two modalities, we hypothesize better modeling their complementary differences can lead to more holistic representations for downstream adaptation. To this end, we propose Emergent Communication for Embodied Control (EC^2), a novel scheme to pre-train video-language representations for few-shot embodied control. The key idea is to learn an unsupervised "language" of videos via emergent communication, which bridges the semantics of video details and structures of natural language. We learn embodied representations of video trajectories, emergent language, and natural language using a language model, which is then used to finetune a lightweight policy network for downstream control. Through extensive experiments in Metaworld and Franka Kitchen embodied benchmarks, EC^2 is shown to consistently outperform previous contrastive learning methods for both videos and texts as task inputs. Further ablations confirm the importance of the emergent language, which is beneficial for both video and language learning, and significantly superior to using pre-trained video captions. We also present a quantitative and qualitative analysis of the emergent language and discuss future directions toward better understanding and leveraging emergent communication in embodied tasks.

I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification
Naeem, Muhammad Ferjad and Khan, Muhammad Gul Zain Ali and Xian, Yongqin and Afzal, Muhammad Zeshan and Stricker, Didier and Van Gool, Luc and Tombari, Federico



Research question: How to use large language models to provide text supervision for zero-shot image classification.
Motivation: Existing methods require access to a high-quality information source and are limited to a single source, whereas large language models trained on web-scale text show an impressive ability to repurpose their learned knowledge for a multitude of tasks.
Method: The LLM is conditioned on a few text descriptions from different annotators to generate multiple text descriptions per class (referred to as views); these class views are then used to learn multi-view semantic embeddings for zero-shot image classification.
Results: Experiments show that each text view of a class provides complementary information, allowing the model to learn a highly discriminative class embedding. Moreover, I2MVFormer consumes the multi-view text supervision from the LLM better than baseline models and establishes a new state of the art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings.

Recent works have shown that unstructured text (documents) from online sources can serve as useful auxiliary information for zero-shot image classification. However, these methods require access to a high-quality source like Wikipedia and are limited to a single source of information. Large Language Models (LLM) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks. In this work, we provide a novel perspective on using an LLM to provide text supervision for a zero-shot image classification model. The LLM is provided with a few text descriptions from different annotators as examples. The LLM is conditioned on these examples to generate multiple text descriptions for each class (referred to as views). Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification with these class views. We show that each text view of a class provides complementary information allowing a model to learn a highly discriminative class embedding. Moreover, we show that I2MVFormer is better at consuming the multi-view text supervision from LLM compared to baseline models. I2MVFormer establishes a new state-of-the-art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings.

Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training
Jin, Zhao and Hayat, Munawar and Yang, Yuwei and Guo, Yulan and Lei, Yinjie



Research question: This paper addresses the importance of 3D vision-language reasoning for effective human-computer interaction, and the problem that current methods are task-specific and lack transferable generic representations.
Motivation: Despite encouraging progress in vision-language pre-training on image-text data, 3D-language pre-training remains an open problem due to the highly sparse and irregular structure of point clouds and the ambiguity of 3D object spatial relations under viewpoint changes.
Method: This paper presents a generic 3D-language pre-training approach that tackles multiple facets of 3D-language reasoning by learning universal representations. The method has two main components: 1) context-aware spatial-semantic alignment, which establishes fine-grained correspondence between point clouds and text; 2) mutual 3D-language masked modeling, which enables cross-modal information exchange.
Results: Experimental results show that the proposed 3D-language pre-training method achieves promising results once adapted to various downstream tasks, including 3D visual grounding, 3D dense captioning, and 3D question answering.

3D visual language reasoning plays an important role in effective human-computer interaction. The current approaches for 3D visual reasoning are task-specific, and lack pre-training methods to learn generic representations that can transfer across various tasks. Despite the encouraging progress in vision-language pre-training for image-text data, 3D-language pre-training is still an open issue due to limited 3D-language paired data, highly sparse and irregular structure of point clouds and ambiguities in spatial relations of 3D objects with viewpoint changes. In this paper, we present a generic 3D-language pre-training approach, that tackles multiple facets of 3D-language reasoning by learning universal representations. Our learning objective constitutes two main parts. 1) Context aware spatial-semantic alignment to establish fine-grained correspondence between point clouds and texts. It reduces relational ambiguities by aligning 3D spatial relationships with textual semantic context. 2) Mutual 3D-Language Masked modeling to enable cross-modality information exchange. Instead of reconstructing sparse 3D points for which language can hardly provide cues, we propose masked proposal reasoning to learn semantic class and mask-invariant representations. Our proposed 3D-language pre-training method achieves promising results once adapted to various downstream tasks, including 3D visual grounding, 3D dense captioning and 3D question answering. Our codes are available at https://github.com/leolyj/3D-VLP

Generalized Decoding for Pixel, Image, and Language
Zou, Xueyan and Dou, Zi-Yi and Yang, Jianwei and Gan, Zhe and Li, Linjie and Li, Chunyuan and Dai, Xiyang and Behl, Harkirat and Wang, Jianfeng and Yuan, Lu and Peng, Nanyun and Wang, Lijuan and Lee, Yong Jae and Gao, Jianfeng



Research question: This paper presents X-Decoder, a generalized decoder that can seamlessly predict pixel-level segmentation and language tokens.
Motivation: Existing models cannot handle image segmentation and vision-language tasks simultaneously; X-Decoder addresses this by decoding different pixel-level and token-level outputs in the same semantic space.
Method: X-Decoder takes two types of queries as input: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, in order to decode different pixel-level and token-level outputs in the same semantic space.
Results: Experimental results show that X-Decoder transfers strongly to a wide range of downstream tasks and achieves state-of-the-art results on tasks such as open-vocabulary segmentation and referring segmentation.

We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) better or competitive finetuned performance to other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition. Code, demo, video and visualization are available at: https://x-decoder-vl.github.io.

Towards Unified Scene Text Spotting Based on Sequence Generation
Kil, Taeho and Kim, Seonghyeon and Seo, Sukmin and Kim, Yoonsik and Kim, Daehee



Research question: This paper addresses the limitations of auto-regressive models in end-to-end text spotting, such as fixed detection formats, ignoring diverse text shapes, and a cap on the maximum number of detectable text instances.
Motivation: Current auto-regressive models have made progress on end-to-end text spotting, but they use specific detection formats, ignore various text shapes, and are limited in the number of text instances they can detect.
Method: We propose UNITS, a unified scene text spotter that unifies various detection formats (including quadrilaterals and polygons), allowing it to detect text of arbitrary shapes. We also apply starting-point prompting, which lets the model extract text from an arbitrary starting point and thereby extract more text instances than it was trained on.
Results: Experimental results show that our method achieves performance competitive with state-of-the-art methods. Further analysis shows that UNITS can extract more text instances than it was trained on. Our code is available at https://github.com/clovaai/units.

Sequence generation models have recently made significant progress in unifying various vision tasks. Although some auto-regressive models have demonstrated promising results in end-to-end text spotting, they use specific detection formats while ignoring various text shapes and are limited in the maximum number of text instances that can be detected. To overcome these limitations, we propose a UNIfied scene Text Spotter, called UNITS. Our model unifies various detection formats, including quadrilaterals and polygons, allowing it to detect text in arbitrary shapes. Additionally, we apply starting-point prompting to enable the model to extract texts from an arbitrary starting point, thereby extracting more texts beyond the number of instances it was trained on. Experimental results demonstrate that our method achieves competitive performance compared to state-of-the-art methods. Further analysis shows that UNITS can extract a larger number of texts than it was trained on. We provide the code for our method at https://github.com/clovaai/units.

SpaText: Spatio-Textual Representation for Controllable Image Generation
Avrahami, Omri and Hayes, Thomas and Gafni, Oran and Gupta, Sonal and Taigman, Yaniv and Parikh, Devi and Lischinski, Dani and Fried, Ohad and Yin, Xi



Research question: Current text-to-image diffusion models cannot control the shapes or layout of different regions/objects in a fine-grained fashion.
Motivation: To address this, we present SpaText, a new method for text-to-image generation with open-vocabulary scene control.
Method: In addition to a global text prompt describing the entire scene, the user provides a segmentation map in which each region of interest is annotated with a free-form natural-language description. Because no large-scale dataset provides a detailed textual description for every region of an image, we leverage existing large-scale text-to-image datasets and base our approach on a novel CLIP-based spatio-textual representation.
Results: Experiments show that the method is effective on two state-of-the-art diffusion models (pixel-based and latent-based). We also show how to extend classifier-free guidance to the multi-conditional case and present a new accelerated inference algorithm. Finally, we offer several automatic evaluation metrics and use them, together with FID scores and a user study, to show that the method achieves state-of-the-art results for image generation with free-form textual scene control.

Recent text-to-image diffusion models are able to generate convincing results of unprecedented quality. However, it is nearly impossible to control the shapes of different regions/objects or their layout in a fine-grained fashion. Previous attempts to provide such controls were hindered by their reliance on a fixed set of labels. To this end, we present SpaText --- a new method for text-to-image generation using open-vocabulary scene control. In addition to a global text prompt that describes the entire scene, the user provides a segmentation map where each region of interest is annotated by a free-form natural language description. Due to lack of large-scale datasets that have a detailed textual description for each region in the image, we choose to leverage the current large-scale text-to-image datasets and base our approach on a novel CLIP-based spatio-textual representation, and show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based. In addition, we show how to extend the classifier-free guidance method in diffusion models to the multi-conditional case and present an alternative accelerated inference algorithm. Finally, we offer several automatic evaluation metrics and use them, in addition to FID scores and a user study, to evaluate our method and show that it achieves state-of-the-art results on image generation with free-form textual scene control.
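As a rough illustration of the spatio-textual representation described above, the sketch below builds a spatial map in which every pixel of an annotated region holds that region's text embedding. The function name `spatio_textual_map` and the toy two-dimensional embeddings are illustrative assumptions; the paper uses CLIP text embeddings, not the vectors shown here.

```python
import numpy as np

def spatio_textual_map(seg_map, text_embs, emb_dim):
    """Place each region's text embedding at that region's pixels.

    seg_map   -- (H, W) integer map; 0 = background, i > 0 indexes text_embs[i-1]
    text_embs -- list of (emb_dim,) embeddings, one per annotated region
    Returns a (H, W, emb_dim) array; background pixels stay zero.
    """
    H, W = seg_map.shape
    out = np.zeros((H, W, emb_dim))
    for i, emb in enumerate(text_embs, start=1):
        out[seg_map == i] = emb
    return out
```

The resulting map can then be fed to a diffusion model as an extra conditioning channel alongside the global prompt.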

Leveraging per Image-Token Consistency for Vision-Language Pre-Training
Gou, Yunhao and Ko, Tom and Yang, Hansi and Kwok, James and Zhang, Yu and Wang, Mingxuan



Research question: Existing vision-language pre-training (VLP) methods mostly adopt cross-modal masked language modeling (CMLM) to learn vision-language associations, but the authors find this approach deficient.
Motivation: CMLM has several problems for vision-language pre-training, such as modality bias and under-utilization of the unmasked tokens.
Method: To address these problems, the authors propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training). For each image-sentence pair, EPIC masks the tokens that are salient to the image (a saliency-based masking strategy), replaces them with alternatives sampled from a language model, and then requires the model to determine, for each token in the sentence, whether it is consistent with the image (the image-token consistency task).
Results: Experiments show that combining EPIC with state-of-the-art pre-training methods, including ViLT, ALBEF, METER, and X-VLM, leads to significant improvements on downstream tasks.

Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeling (CMLM) to learn vision-language associations. However, we find that CMLM is insufficient for this purpose according to our observations: (1) Modality bias: a considerable amount of masked tokens in CMLM can be recovered with only the language information, ignoring the visual inputs. (2) Under-utilization of the unmasked tokens: CMLM primarily focuses on the masked token but it cannot simultaneously leverage other tokens to learn vision-language associations. To handle those limitations, we propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training). In EPIC, for each image-sentence pair, we mask tokens that are salient to the image (i.e., Saliency-based Masking Strategy) and replace them with alternatives sampled from a language model (i.e., Inconsistent Token Generation Procedure), and then the model is required to determine for each token in the sentence whether it is consistent with the image (i.e., Image-Token Consistency Task). The proposed EPIC method is easily combined with pre-training methods. Extensive experiments show that the combination of the EPIC method and state-of-the-art pre-training approaches, including ViLT, ALBEF, METER, and X-VLM, leads to significant improvements on downstream tasks. Our code is released at https://github.com/gyhdog99/epic
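The image-token consistency task above reduces to per-token binary classification: each token is labeled 1 if it is the original (image-consistent) token and 0 if it was swapped in by the language model. A minimal NumPy sketch of that objective, assuming the multimodal encoder emits one raw score per token (the function name `itc_loss` is an assumption for illustration, not the paper's implementation):

```python
import numpy as np

def itc_loss(logits, is_consistent):
    """Binary cross-entropy over per-token consistency predictions.

    logits        -- (seq_len,) raw scores from the multimodal encoder
    is_consistent -- (seq_len,) 1.0 for original tokens, 0.0 for replaced ones
    """
    logits = np.asarray(logits, dtype=float)
    y = np.asarray(is_consistent, dtype=float)
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid per token
    eps = 1e-12
    return float(-np.mean(y * np.log(probs + eps) + (1 - y) * np.log(1 - probs + eps)))
```

Unlike CMLM, every token in the sentence contributes to this loss, which is the under-utilization argument the paper makes.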

Neural Congealing: Aligning Images to a Joint Semantic Atlas
Ofri-Amar, Dolev and Geyer, Michal and Kasten, Yoni and Dekel, Tali



Research question: This paper presents a zero-shot self-supervised framework for detecting and jointly aligning semantically common content across a given set of images.
Motivation: Current pre-trained models typically require large amounts of training data and complex auxiliary inputs, such as segmentation masks, when processing image sets.
Method: Using pre-trained DINO-ViT features, the method learns a joint semantic atlas (capturing the mode of DINO-ViT features in the input set) and dense mappings from the unified atlas to each input image. The atlas representation and per-image mappings are optimized with only a handful of real-world images as input.
Results: Experimental results show that the method performs well on a variety of challenging image sets, including sets of mixed domains (e.g., images depicting sculptures and artwork of cats), sets depicting related but different object categories (e.g., dogs and tigers), and domains with scarce training data (e.g., coffee mugs).

We present Neural Congealing -- a zero-shot self-supervised framework for detecting and jointly aligning semantically-common content across a given set of images. Our approach harnesses the power of pre-trained DINO-ViT features to learn: (i) a joint semantic atlas -- a 2D grid that captures the mode of DINO-ViT features in the input set, and (ii) dense mappings from the unified atlas to each of the input images. We derive a new robust self-supervised framework that optimizes the atlas representation and mappings per image set, requiring only a few real-world images as input without any additional input information (e.g., segmentation masks). Notably, we design our losses and training paradigm to account only for the shared content under severe variations in appearance, pose, background clutter or other distracting objects. We demonstrate results on a plethora of challenging image sets including sets of mixed domains (e.g., aligning images depicting sculpture and artwork of cats), sets depicting related yet different object categories (e.g., dogs and tigers), or domains for which large-scale training data is scarce (e.g., coffee mugs). We thoroughly evaluate our method and show that our test-time optimization approach performs favorably compared to a state-of-the-art method that requires extensive training on large-scale datasets.

Learning Expressive Prompting With Residuals for Vision Transformers
Das, Rajshekhar and Dukler, Yonatan and Ravichandran, Avinash and Swaminathan, Ashwin



Research question: This paper proposes an effective adaptation method for vision transformers that inserts a set of learnable parameters into the input and intermediate representations of a pre-trained model.
Motivation: Existing vision-transformer adaptation methods usually require substantial computational resources and deliver unsatisfactory results.
Method: This paper proposes Expressive Prompts with Residuals (EXPRES), which constructs downstream representations via learnable "output" tokens akin to the ViT's learned class token. Additionally, to better steer the downstream representations processed by the frozen transformer, residual learnable tokens are added to the outputs of various computations.
Results: Experimental results show that EXPRES achieves state-of-the-art prompt tuning on tasks including image classification, few-shot learning, and semantic segmentation. Moreover, the method is an order of magnitude more prompt-efficient than existing visual prompting baselines.

Prompt learning is an efficient approach to adapt transformers by inserting a learnable set of parameters into the input and intermediate representations of a pre-trained model. In this work, we present Expressive Prompts with Residuals (EXPRES), which modifies the prompt learning paradigm specifically for effective adaptation of vision transformers (ViT). Our method constructs downstream representations via learnable "output" tokens, which are akin to the learned class tokens of the ViT. Further, for better steering of the downstream representation processed by the frozen transformer, we introduce residual learnable tokens that are added to the output of various computations. We apply EXPRES for image classification, few-shot learning, and semantic segmentation, and show our method is capable of achieving state-of-the-art prompt tuning on 3/3 categories of the VTAB benchmark. In addition to strong performance, we observe that our approach is an order of magnitude more prompt-efficient than existing visual prompting baselines. We analytically show the computational benefits of our approach over weight-space adaptation techniques like finetuning. Lastly, we systematically corroborate the architectural design of our method via a series of ablation experiments.

Learning To Generate Text-Grounded Mask for Open-World Semantic Segmentation From Only Image-Text Pairs
Cha, Junbum and Mun, Jonghwan and Roh, Byungseok



Research question: How to perform open-world semantic segmentation using only image-text pairs, without dense annotations.
Motivation: Existing open-world segmentation methods learn diverse visual concepts via contrastive learning and transfer the learned image-level understanding to segmentation, but suffer from a train-test discrepancy.
Method: A novel Text-grounded Contrastive Learning (TCL) framework is proposed that enables the model to learn region-text alignment directly. The method generates a segmentation mask for a given text, extracts a text-grounded image embedding from the masked region, and aligns it with the text embedding via TCL.
Results: By learning region-text alignment directly, the framework encourages the model to improve the quality of the generated segmentation masks. Under a unified evaluation protocol, TCL achieves state-of-the-art zero-shot segmentation performance on all datasets.

We tackle open-world semantic segmentation, which aims at learning to segment arbitrary visual concepts in images, by using only image-text pairs without dense annotations. Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and transferring the learned image-level understanding to the segmentation task. However, these CL-based methods suffer from a train-test discrepancy, since they only consider image-text alignment during training, whereas segmentation requires region-text alignment during testing. In this paper, we propose a novel Text-grounded Contrastive Learning (TCL) framework that enables a model to directly learn region-text alignment. Our method generates a segmentation mask for a given text, extracts a text-grounded image embedding from the masked region, and aligns it with the text embedding via TCL. By learning region-text alignment directly, our framework encourages a model to directly improve the quality of generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol with 8 widely used semantic segmentation datasets. TCL achieves state-of-the-art zero-shot segmentation performance by large margins on all datasets. Code is available at https://github.com/kakaobrain/tcl.
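The extract-and-align step above can be sketched in a few lines: pool the image feature map under the predicted soft mask, then score the pooled region embedding against the text embedding by cosine similarity. This is a minimal NumPy sketch under assumed shapes; the function names `masked_pool` and `grounded_alignment` are illustrative, not the paper's API.

```python
import numpy as np

def masked_pool(feat_map, mask):
    """Mask-weighted average pooling.

    feat_map -- (H, W, D) per-pixel image features
    mask     -- (H, W) soft segmentation mask in [0, 1]
    """
    w = mask[..., None]
    return (feat_map * w).sum(axis=(0, 1)) / (w.sum() + 1e-8)

def grounded_alignment(feat_map, mask, text_emb):
    """Cosine similarity between the text-grounded region embedding and the text."""
    region = masked_pool(feat_map, mask)
    a = region / (np.linalg.norm(region) + 1e-12)
    b = text_emb / (np.linalg.norm(text_emb) + 1e-12)
    return float(a @ b)
```

In training, this score would feed a contrastive loss over a batch of (image, text) pairs, so gradients push the mask toward regions that match the text.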

Learning Video Representations From Large Language Models
Zhao, Yue and Misra, Ishan and Krähenbühl, Philipp and Girdhar, Rohit



Research question: How to leverage large language models to learn video-language representations.
Motivation: Existing pre-trained language models cannot process visual input; our goal is to repurpose pre-trained language models into automatic video narrators.
Method: We condition pre-trained language models on visual input and fine-tune them to generate automatic video narrations. We then use these narrations for contrastive learning to obtain video-language embeddings.
Results: Experimental results show that our method outperforms the previous state of the art on multiple first-person and third-person video tasks, in both zero-shot and fine-tuned settings. In particular, LAVILA obtains absolute gains of 10.1% on EGTEA classification and 5.9% on the Epic-Kitchens-100 multi-instance retrieval benchmark. Moreover, LAVILA trained with only half the narrations from the Ego4D dataset outperforms models trained on the full set, and shows positive scaling behavior with increasing pre-training data and model size.

We introduce LAVILA, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-language embedding learned contrastively with these narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, both in zero-shot and finetuned setups. Most notably, LAVILA obtains an absolute gain of 10.1% on EGTEA classification and 5.9% Epic-Kitchens-100 multi-instance retrieval benchmarks. Furthermore, LAVILA trained with only half the narrations from the Ego4D dataset outperforms models trained on the full set, and shows positive scaling behavior on increasing pre-training data and model size.
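The contrastive step described above pairs each clip with its auto-generated narration, which in its simplest form is a standard InfoNCE objective over a batch of (video, narration) embedding pairs. The sketch below is a generic NumPy version of that loss under assumed inputs (pre-computed embeddings); the name `narration_infonce` and the temperature value are illustrative assumptions, not LAVILA's exact implementation.

```python
import numpy as np

def narration_infonce(video_embs, text_embs, temperature=0.07):
    """InfoNCE loss: row i of video_embs should match row i of text_embs.

    video_embs -- (B, D) clip embeddings
    text_embs  -- (B, D) narration embeddings
    """
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return float(-np.mean(np.log(np.diag(probs) + 1e-12)))
```

Correctly matched pairs drive the diagonal probabilities toward 1 and the loss toward 0; mismatched pairings yield a larger loss.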

MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
Dong, Xiaoyi and Bao, Jianmin and Zheng, Yinglin and Zhang, Ting and Chen, Dongdong and Yang, Hao and Zeng, Ming and Zhang, Weiming and Yuan, Lu and Chen, Dong and Wen, Fang and Yu, Nenghai



Research question: This paper presents a simple yet effective framework, MaskCLIP, which incorporates a newly proposed masked self-distillation into contrastive language-image pre-training.
Motivation: The core idea of masked self-distillation is to distill the representation of a full image into the representation predicted from a masked image. This incorporation has two vital benefits. First, masked self-distillation targets local patch representation learning, which is complementary to vision-language contrast's focus on text-related representations. Second, masked self-distillation is also consistent with vision-language contrast in its training objective, since both use the visual encoder for feature alignment and can thus learn local semantics with indirect supervision from language.
Method: Specially designed experiments with comprehensive analysis validate these two benefits. Local semantic supervision is also introduced into the text branch, which further improves pre-training performance.
Results: Extensive experiments show that, with the guidance of the language encoder, MaskCLIP achieves superior linear-probing, fine-tuning, and zero-shot performance on a variety of challenging downstream tasks.

This paper presents a simple yet effective framework MaskCLIP, which incorporates a newly proposed masked self-distillation into contrastive language-image pretraining. The core idea of masked self-distillation is to distill representation from a full image to the representation predicted from a masked image. Such incorporation enjoys two vital benefits. First, masked self-distillation targets local patch representation learning, which is complementary to vision-language contrastive focusing on text-related representation. Second, masked self-distillation is also consistent with vision-language contrastive from the perspective of training objective as both utilize the visual encoder for feature aligning, and thus is able to learn local semantics getting indirect supervision from the language. We provide specially designed experiments with a comprehensive analysis to validate the two benefits. Symmetrically, we also introduce the local semantic supervision into the text branch, which further improves the pretraining performance. With extensive experiments, we show that MaskCLIP, when applied to various challenging downstream tasks, achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder. We will release the code and data after the publication.
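Masked self-distillation of this kind is typically realized with an EMA teacher that sees the full image and a student that sees the masked image, with a loss that pulls the student's prediction toward the teacher's representation. The sketch below shows those two pieces in NumPy; the cosine-distance form of the loss and the function names are assumptions for illustration, not MaskCLIP's exact objective.

```python
import numpy as np

def ema_update(teacher_w, student_w, momentum=0.99):
    """Exponential-moving-average update of the teacher parameters."""
    return momentum * teacher_w + (1.0 - momentum) * student_w

def masked_distill_loss(student_masked_repr, teacher_full_repr):
    """Cosine distance between the student's masked-image prediction
    and the teacher's full-image representation."""
    s = student_masked_repr / (np.linalg.norm(student_masked_repr) + 1e-12)
    t = teacher_full_repr / (np.linalg.norm(teacher_full_repr) + 1e-12)
    return float(1.0 - s @ t)
```

The teacher receives no gradients; it only tracks the student via `ema_update`, so the full-image representation provides a stable target for the masked view.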

Open-Vocabulary Semantic Segmentation With Mask-Adapted CLIP
Liang, Feng and Wu, Bichen and Dai, Xiaoliang and Li, Kunpeng and Zhao, Yinan and Zhang, Hang and Zhang, Peizhao and Vajda, Peter and Marculescu, Diana



Research question: This paper addresses the performance bottleneck in open-vocabulary semantic segmentation: pre-trained CLIP models do not perform well on masked images.
Motivation: Current two-stage methods first generate class-agnostic mask proposals and then use a pre-trained vision-language model such as CLIP to classify the masked regions, but the bottleneck of this paradigm is the pre-trained CLIP model's poor performance on masked images.
Method: We propose to address this by fine-tuning CLIP on a collection of masked image regions and their corresponding text descriptions. We collect training data by mining an existing image-caption dataset (e.g., COCO Captions) and using CLIP to match masked image regions to nouns in the captions.
Results: Experiments show that our mask prompt tuning method, which exploits the "blank" areas of masked images, brings significant improvement without modifying any CLIP weights, and can further improve a fully fine-tuned model. In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, +8.5% over the previous state of the art. For the first time, open-vocabulary generalist models match the performance of 2017-era supervised specialist models without dataset-specific adaptation.

Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images. To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. We collect training data by mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to match masked image regions to nouns in the image captions. Compared with the more precise and manually annotated segmentation labels with fixed classes (e.g., COCO-Stuff), we find our noisy but diverse dataset can better retain CLIP's generalization ability. Along with finetuning the entire model, we utilize the "blank" areas in masked images using a method we dub mask prompt tuning. Experiments demonstrate mask prompt tuning brings significant improvement without modifying any weights of CLIP, and it can further improve a fully finetuned model. In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-of-the-art. For the first time, open-vocabulary generalist models match the performance of supervised specialist models in 2017 without dataset-specific adaptations.
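The data-mining step above (matching each masked region to a caption noun via CLIP) boils down to a cosine-similarity argmax between region embeddings and noun embeddings. A minimal NumPy sketch, assuming the embeddings have already been computed by CLIP's image and text encoders (the function name `match_regions_to_nouns` is an assumption for illustration):

```python
import numpy as np

def match_regions_to_nouns(region_embs, noun_embs):
    """Assign each masked region the index of its best-matching caption noun.

    region_embs -- (R, D) CLIP embeddings of masked image regions
    noun_embs   -- (N, D) CLIP text embeddings of nouns from the caption
    """
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    n = noun_embs / np.linalg.norm(noun_embs, axis=1, keepdims=True)
    sim = r @ n.T                 # (R, N) cosine similarities
    return sim.argmax(axis=1)
```

The resulting (region, noun) pairs form the noisy-but-diverse dataset on which CLIP is then fine-tuned.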

Supervised Masked Knowledge Distillation for Few-Shot Transformers
Lin, Han and Han, Guangxing and Ma, Jiawei and Huang, Shiyuan and Lin, Xudong and Chang, Shih-Fu



Research question: This paper addresses the problem that, under few-shot learning settings, vision transformers tend to overfit and suffer severe performance degradation due to the absence of CNN-like inductive biases.
Motivation: Although vision transformers achieve impressive performance on data-abundant computer-vision tasks, under few-shot settings on small datasets with only a few labeled samples they tend to overfit and degrade severely.
Method: Inspired by recent advances in self-supervised knowledge distillation and masked image modeling (MIM), we propose a novel Supervised Masked Knowledge Distillation (SMKD) model for few-shot transformers that incorporates label information into a self-distillation framework.
Results: Experimental results show that our model outperforms previous self-supervised methods on four few-shot classification benchmark datasets and sets a new state of the art. Detailed ablation studies confirm the effectiveness of each component of our model.

Vision Transformers (ViTs) emerge to achieve impressive performance on many data-abundant computer vision tasks by capturing long-range dependencies among local features. However, under few-shot learning (FSL) settings on small datasets with only a few labeled data, ViT tends to overfit and suffers from severe performance degradation due to its absence of CNN-alike inductive bias. Previous works in FSL avoid this problem either through the help of self-supervised auxiliary losses, or through the dexterous use of label information under supervised settings. But the gap between self-supervised and supervised few-shot Transformers is still unfilled. Inspired by recent advances in self-supervised knowledge distillation and masked image modeling (MIM), we propose a novel Supervised Masked Knowledge Distillation model (SMKD) for few-shot Transformers which incorporates label information into self-distillation frameworks. Compared with previous self-supervised methods, we allow intra-class knowledge distillation on both class and patch tokens, and introduce the challenging task of masked patch tokens reconstruction across intra-class images. Experimental results on four few-shot classification benchmark datasets show that our method with simple design outperforms previous methods by a large margin and achieves a new state-of-the-art. Detailed ablation studies confirm the effectiveness of each component of our model. Code for this paper is available here: https://github.com/HL-hanlin/SMKD.

Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning
Yu, Youngjae and Chung, Jiwan and Yun, Heeseung and Hessel, Jack and Park, Jae Sung and Lu, Ximing and Zellers, Rowan and Ammanabrolu, Prithviraj and Le Bras, Ronan and Kim, Gunhee and Choi, Yejin



Research question: How can the knowledge of text-only pre-trained models be extended to multimodal inputs, such as images and audio, without paired domain data?
Motivation: Current models are capable of commonsense reasoning, but their knowledge needs to be extended to handle multimodal tasks such as visual commonsense reasoning.
Method: We propose ESPER, which uses reinforcement learning to align multimodal inputs with language-model generations without direct supervision; for example, the reward optimization relies only on cosine similarity derived from CLIP and requires no additional paired (image, text) data.
Results: Experiments show that ESPER outperforms baselines and prior work on a variety of multimodal text-generation tasks, including a new benchmark, the ESP dataset, which requires models to generate text from several different domains for each image.

Language models are capable of commonsense reasoning: domain-specific models can learn from explicit knowledge (e.g. commonsense graphs [6], ethical norms [25]), while larger models like GPT-3 manifest broad commonsense reasoning capacity. Can their knowledge be extended to multimodal inputs such as images and audio without paired domain data? In this work, we propose ESPER (Extending Sensory PErception with Reinforcement learning) which enables text-only pretrained models to address multimodal tasks such as visual commonsense reasoning. Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision: for example, our reward optimization relies only on cosine similarity derived from CLIP and requires no additional paired (image, text) data. Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of multimodal text generation tasks ranging from captioning to commonsense reasoning; these include a new benchmark we collect and release, the ESP dataset, which tasks models with generating the text of several different domains for each image. Our code and data are publicly released at https://github.com/JiwanChung/esper.

Meta-Personalizing Vision-Language Models To Find Named Instances in Video
Yeh, Chun-Hsiao and Russell, Bryan and Sivic, Josef and Heilbron, Fabian Caba and Jenni, Simon



Research question: Large vision-language models support category-level queries in video-search applications but cannot yet perform personalized searches for specific object instances.
Motivation: To address this, we propose a method for meta-personalizing pre-trained vision-language models, i.e., learning how to personalize the model at test time to search for specific object instances in video.
Method: We extend the vision-language model's vocabulary by learning word embeddings specific to each instance. To capture only instance-specific features, each instance embedding is represented as a combination of shared, learned global category features.
Results: We evaluate our method on the This-Is-My personal video instance retrieval benchmark and on DeepFashion2, obtaining a 15% relative improvement over the state of the art on the latter dataset.

Large-scale vision-language models (VLM) have shown impressive results for language-guided search applications. While these models allow category-level queries, they currently struggle with personalized searches for moments in a video where a specific object instance such as "My dog Biscuit" appears. We present the following three contributions to address this problem. First, we describe a method to meta-personalize a pre-trained VLM, i.e., learning how to learn to personalize a VLM at test time to search in video. Our method extends the VLM's token vocabulary by learning novel word embeddings specific to each instance. To capture only instance-specific features, we represent each instance embedding as a combination of shared and learned global category features. Second, we propose to learn such personalization without explicit human supervision. Our approach automatically identifies moments of named visual instances in video using transcripts and vision-language similarity in the VLM's embedding space. Finally, we introduce This-Is-My, a personal video instance retrieval benchmark. We evaluate our approach on This-Is-My and DeepFashion2 and show that we obtain a 15% relative improvement over the state of the art on the latter dataset.

Soft-Landing Strategy for Alleviating the Task Discrepancy Problem in Temporal Action Localization Tasks
Kang, Hyolim and Kim, Hanjung and An, Joungbin and Cho, Minsu and Kim, Seon Joo



Research question: How to bridge the transferability gap between pre-trained encoders and downstream tasks.
Motivation: Existing temporal action localization (TAL) methods handle the task-discrepancy problem either by retraining the encoder or by end-to-end fine-tuning, both of which demand heavy memory and computation, or by mitigation through pretext tasks; all of these are inefficient.
Method: We propose the Soft-Landing (SoLa) strategy, which bridges the pre-trained encoder and downstream tasks by adding a lightweight neural module, the SoLa module, on top of the frozen encoder. We also propose an unsupervised training scheme in which the SoLa module learns with the frame interval as its supervisory signal, eliminating the need for temporal annotations.
Results: Experimental evaluation on several downstream TAL benchmarks shows that our method effectively alleviates the task-discrepancy problem with remarkable computational efficiency.

Temporal Action Localization (TAL) methods typically operate on top of feature sequences from a frozen snippet encoder that is pretrained with the Trimmed Action Classification (TAC) tasks, resulting in a task discrepancy problem. While existing TAL methods mitigate this issue either by retraining the encoder with a pretext task or by end-to-end finetuning, they commonly require an overload of high memory and computation. In this work, we introduce Soft-Landing (SoLa) strategy, an efficient yet effective framework to bridge the transferability gap between the pretrained encoder and the downstream tasks by incorporating a light-weight neural network, i.e., a SoLa module, on top of the frozen encoder. We also propose an unsupervised training scheme for the SoLa module; it learns with inter-frame Similarity Matching that uses the frame interval as its supervisory signal, eliminating the need for temporal annotations. Experimental evaluation on various benchmarks for downstream TAL tasks shows that our method effectively alleviates the task discrepancy problem with remarkable computational efficiency.
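The frame-interval supervision described above can be sketched as similarity matching: the target similarity between two snippet features is a function of how far apart their frames are, so no temporal annotations are needed. The exponential-decay target, the `tau` parameter, and both function names below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def interval_target(frame_idx, tau=2.0):
    """Target similarity matrix: nearby frames should be more similar."""
    d = np.abs(np.subtract.outer(frame_idx, frame_idx)).astype(float)
    return np.exp(-d / tau)

def sola_similarity_loss(feats, frame_idx, tau=2.0):
    """MSE between the feature cosine-similarity matrix and the interval target.

    feats     -- (T, D) snippet features from the SoLa module
    frame_idx -- (T,) frame indices of the snippets
    """
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    return float(np.mean((sim - interval_target(frame_idx, tau)) ** 2))
```

Because the target depends only on frame indices, the module learns temporal structure from unlabeled video.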

Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images
Lu, Ming Y. and Chen, Bowen and Zhang, Andrew and Williamson, Drew F. K. and Chen, Richard J. and Ding, Tong and Le, Long Phi and Chuang, Yung-Sung and Mahmood, Faisal



Research question: Existing contrastive vision-language pre-training methods train on large datasets of image-text pairs and are designed for downstream tasks involving only small- to medium-sized images, neither of which suits computational pathology.
Motivation: Computational pathology lacks publicly available paired image-text datasets, and each image can span up to 100,000 x 100,000 pixels.
Method: We present MI-Zero, a framework that unleashes the zero-shot transfer capabilities of contrastively aligned image and text models on gigapixel histopathology whole-slide images, reformulating zero-shot transfer under the multiple instance learning framework.
Results: The text encoder is pre-trained on over 550k pathology reports and other available in-domain text corpora; our best model, pre-trained on over 33k histopathology image-caption pairs, achieves an average median zero-shot accuracy of 70.2% across three different real-world cancer subtyping tasks.

Contrastive visual language pretraining has emerged as a powerful method for either training new language-aware image encoders or augmenting existing pretrained models with zero-shot visual recognition capabilities. However, existing works typically train on large datasets of image-text pairs and have been designed to perform downstream tasks involving only small- to medium-sized images, neither of which is applicable to the emerging field of computational pathology, where there are limited publicly available paired image-text datasets and each image can span up to 100,000 x 100,000 pixels in dimensions. In this paper we present MI-Zero, a simple and intuitive framework for unleashing the zero-shot transfer capabilities of contrastively aligned image and text models to gigapixel histopathology whole slide images, enabling multiple downstream diagnostic tasks to be carried out by pretrained encoders without requiring any additional labels. MI-Zero reformulates zero-shot transfer under the framework of multiple instance learning to overcome the computational challenge of inference on extremely large images. We used over 550k pathology reports and other available in-domain text corpora to pretrain our text encoder. By effectively leveraging strong pretrained encoders, our best model pretrained on over 33k histopathology image-caption pairs achieves an average median zero-shot accuracy of 70.2% across three different real-world cancer subtyping tasks. Our code is available at: https://github.com/mahmoodlab/MI-Zero.
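The multiple-instance reformulation above scores each slide patch against every class text embedding and then aggregates the patch-level scores into a slide-level prediction. The sketch below uses top-k mean pooling as the aggregator; the specific pooling choice, the `k` default, and the function name `mi_zero_predict` are assumptions for illustration, not necessarily the paper's exact operator.

```python
import numpy as np

def mi_zero_predict(patch_embs, class_text_embs, k=5):
    """Slide-level zero-shot prediction by pooling patch-level similarities.

    patch_embs      -- (P, D) embeddings of patches tiled from the slide
    class_text_embs -- (C, D) text embeddings of the class prompts
    """
    p = patch_embs / np.linalg.norm(patch_embs, axis=1, keepdims=True)
    t = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sim = p @ t.T                         # (P, C) patch-vs-class cosine scores
    k = min(k, sim.shape[0])
    topk = np.sort(sim, axis=0)[-k:]      # top-k patch scores per class
    return int(topk.mean(axis=0).argmax())
```

Pooling over patches keeps memory bounded regardless of slide size, which is what makes gigapixel inference tractable.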

PMR: Prototypical Modal Rebalance for Multimodal Learning
Fan, Yunfeng and Xu, Wenchao and Wang, Haozhao and Wang, Junxiao and Guo, Song



Research problem: Multimodal learning (MML) aims to jointly exploit the common priors of different modalities to compensate for their inherent limitations, but existing MML methods often optimize a uniform objective across modalities, causing the "modality imbalance" problem and counterproductive MML performance.
Motivation: To address this, we propose Prototypical Modality Rebalance (PMR), which stimulates the particular slow-learning modality without interference from other modalities, so as to better exploit multimodal features.
Method: We introduce prototypes that represent the general features of each class to build non-parametric classifiers for uni-modal performance evaluation, and then accelerate the slow-learning modality by enhancing its clustering toward the prototypes. In addition, to prevent premature convergence, we introduce a prototype-based entropy regularization term in the early training stage to alleviate suppression by the dominant modality.
Results: PMR relies only on the representations of each modality, with no restrictions on model structures or fusion methods, and thus has great application potential in various scenarios.

Multimodal learning (MML) aims to jointly exploit the common priors of different modalities to compensate for their inherent limitations. However, existing MML methods often optimize a uniform objective for different modalities, leading to the notorious "modality imbalance" problem and counterproductive MML performance. To address the problem, some existing methods modulate the learning pace based on the fused modality, which is dominated by the better modality and eventually results in a limited improvement on the worse modality. To better exploit the features of multimodal, we propose Prototypical Modality Rebalance (PMR) to perform stimulation on the particular slow-learning modality without interference from other modalities. Specifically, we introduce the prototypes that represent general features for each class, to build the non-parametric classifiers for uni-modal performance evaluation. Then, we try to accelerate the slow-learning modality by enhancing its clustering toward prototypes. Furthermore, to alleviate the suppression from the dominant modality, we introduce a prototype-based entropy regularization term during the early training stage to prevent premature convergence. Besides, our method relies only on the representations of each modality, without restrictions from model structures and fusion methods, which gives it great application potential for various scenarios. The source code is available here.
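The non-parametric classifier at the heart of this idea can be sketched in a few lines: each class prototype is the mean feature of its samples, and a sample is assigned to its nearest prototype. This is a generic nearest-prototype classifier assuming features are given, not the authors' code.

```python
import numpy as np

def class_prototypes(features, labels, n_classes):
    """Prototype of each class = mean feature vector of its samples."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(n_classes)])

def prototype_predict(features, prototypes):
    """Assign each sample to its nearest prototype (Euclidean distance)."""
    d = np.linalg.norm(features[:, None, :] - prototypes[None, :, :], axis=-1)
    return d.argmin(axis=1)

feats = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
labels = np.array([0, 0, 1, 1])
protos = class_prototypes(feats, labels, n_classes=2)
print(prototype_predict(feats, protos))  # [0 0 1 1]
```

Because the classifier has no trainable parameters of its own, the per-modality accuracy it yields can serve as a probe of how fast each modality is learning.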

Trainable Projected Gradient Method for Robust Fine-Tuning
Tian, Junjiao and He, Zecheng and Dai, Xiaoliang and Ma, Chih-Yao and Liu, Yen-Cheng and Kira, Zsolt



Research problem: How to improve the robustness of pre-trained models to out-of-distribution data while retaining their generalization ability.
Motivation: Most current transfer learning methods rely on manually crafted heuristics or expensive hyper-parameter search, which limits their scalability to large datasets and neural networks.
Method: Propose the Trainable Projected Gradient Method (TPGM), which treats fine-tuning as a bi-level constrained optimization problem and automatically learns the constraint imposed on each layer.
Results: Experiments show TPGM outperforms existing fine-tuning methods in out-of-distribution performance while matching the best in-distribution performance; for example, when fine-tuned on DomainNet-Real and ImageNet, TPGM shows 22% and 10% relative OOD improvement on their sketch counterparts compared with vanilla fine-tuning.

Recent studies on transfer learning have shown that selectively fine-tuning a subset of layers or customizing different learning rates for each layer can greatly improve robustness to out-of-distribution (OOD) data and retain generalization capability in the pre-trained models. However, most of these methods employ manually crafted heuristics or expensive hyper-parameter search, which prevent them from scaling up to large datasets and neural networks. To solve this problem, we propose Trainable Projected Gradient Method (TPGM) to automatically learn the constraint imposed for each layer for a fine-grained fine-tuning regularization. This is motivated by formulating fine-tuning as a bi-level constrained optimization problem. Specifically, TPGM maintains a set of projection radii, i.e., distance constraints between the fine-tuned model and the pre-trained model, for each layer, and enforces them through weight projections. To learn the constraints, we propose a bi-level optimization to automatically learn the best set of projection radii in an end-to-end manner. Theoretically, we show that the bi-level optimization formulation is the key to learn different constraints for each layer. Empirically, with little hyper-parameter search cost, TPGM outperforms existing fine-tuning methods in OOD performance while matching the best in-distribution (ID) performance. For example, when fine-tuned on DomainNet-Real and ImageNet, compared to vanilla fine-tuning, TPGM shows 22% and 10% relative OOD improvement respectively on their sketch counterparts.
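The per-layer distance constraint enforced through weight projection can be sketched as an L2-ball projection around the pre-trained weights. This is an illustrative sketch of the projection step only, with an externally supplied radius standing in for the learned one; it is not the released implementation.

```python
import numpy as np

def project_layer(w_finetuned, w_pretrained, radius):
    """Project fine-tuned weights onto the L2 ball of the given radius
    centered at the pre-trained weights, i.e. ||w - w_pre|| <= radius."""
    delta = w_finetuned - w_pretrained
    norm = np.linalg.norm(delta)
    if norm <= radius:
        return w_finetuned          # already inside the constraint
    return w_pretrained + delta * (radius / norm)

w_pre = np.zeros(4)
w_ft = np.array([3.0, 0.0, 0.0, 4.0])   # drifted distance 5.0 from w_pre
w_proj = project_layer(w_ft, w_pre, radius=1.0)
print(np.linalg.norm(w_proj - w_pre))    # ~1.0 -- constraint enforced
```

In TPGM the radii themselves are learnable, one per layer, so layers that need to stay close to the pre-trained model are held tightly while others are allowed to move.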

Are Deep Neural Networks SMARTer Than Second Graders?
Cherian, Anoop and Peng, Kuan-Chuan and Lohit, Suhas and Smith, Kevin A. and Tenenbaum, Joshua B.



Research problem: How generalizable are deep neural networks at solving problems that demand broad skills?
Motivation: To answer this question, we propose the SMART task and the SMART-101 dataset for evaluating the abstraction, deduction, and generalization abilities of neural networks on visuo-linguistic puzzles.
Method: We designed visuo-linguistic puzzles targeted at children in the 6-8 age group and built a dataset of 101 unique puzzles; we also developed a vision-and-language meta-learning model that can incorporate various state-of-the-art neural backbones.
Results: Experiments show that while powerful deep models perform reasonably on the puzzles in a supervised setting, they are no better than random accuracy when analyzed for generalization; closing this gap may require new multimodal learning approaches.

Recent times have witnessed an increasing number of applications of deep neural networks towards solving tasks that require superior cognitive abilities, e.g., playing Go, generating art, question answering (such as ChatGPT), etc. Such a dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task and the associated SMART-101 dataset, for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6--8 age group. Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and their solution needs a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning, among others. To scale our dataset towards training deep neural networks, we programmatically generate entirely new instances for each puzzle while retaining their solution algorithm. To benchmark the performance on the SMART-101 dataset, we propose a vision-and-language meta-learning model that can incorporate varied state-of-the-art neural backbones. Our experiments reveal that while powerful deep models offer reasonable performances on puzzles in a supervised setting, they are not better than random accuracy when analyzed for generalization -- filling this gap may demand new multimodal learning approaches.

Multi-Modal Learning With Missing Modality via Shared-Specific Feature Modelling
Wang, Hu and Chen, Yuanhong and Ma, Congbo and Avery, Jodie and Hull, Louise and Carneiro, Gustavo



Research problem: Solving the missing-modality problem in multi-modal models.
Motivation: Current methods either handle missing modalities only during evaluation or train separate models for specific missing-modality settings; moreover, these models are designed for specific tasks and are not easily adapted to others.
Method: Propose the Shared-Specific Feature Modelling (ShaSpec) method, which learns shared and specific features to better represent the input data, exploiting all available input modalities during both training and evaluation.
Results: Experiments show ShaSpec outperforms competing methods on medical image segmentation and computer vision classification; for example, on BraTS2018, ShaSpec improves the SOTA by more than 3% for enhancing tumour, 5% for tumour core, and 3% for whole tumour.

The missing modality issue is critical but non-trivial for multi-modal models to solve. Current methods aiming to handle the missing modality problem in multi-modal tasks, either deal with missing modalities only during evaluation or train separate models to handle specific missing modality settings. In addition, these models are designed for specific tasks, so for example, classification models are not easily adapted to segmentation tasks and vice versa. In this paper, we propose the Shared-Specific Feature Modelling (ShaSpec) method that is considerably simpler and more effective than competing approaches that address the issues above. ShaSpec is designed to take advantage of all available input modalities during training and evaluation by learning shared and specific features to better represent the input data. This is achieved from a strategy that relies on auxiliary tasks based on distribution alignment and domain classification, in addition to a residual feature fusion procedure. Also, the design simplicity of ShaSpec enables its easy adaptation to multiple tasks, such as classification and segmentation. Experiments are conducted on both medical image segmentation and computer vision classification, with results indicating that ShaSpec outperforms competing methods by a large margin. For instance, on BraTS2018, ShaSpec improves the SOTA by more than 3% for enhancing tumour, 5% for tumour core and 3% for whole tumour.

Stare at What You See: Masked Image Modeling Without Reconstruction
Xue, Hongwei and Gao, Peng and Li, Hongyang and Qiao, Yu and Sun, Hao and Li, Houqiang and Luo, Jiebo



Research problem: Is reconstruction necessary in masked image modeling (MIM) with a semantically rich teacher model?
Motivation: Features extracted by powerful teacher models already encode rich semantic correlation across regions of an intact image, which raises the question of whether reconstruction is necessary in MIM with a teacher model.
Method: This paper proposes an efficient MIM paradigm, MaskAlign, which simply learns the consistency between visible-patch features extracted by the student model and intact-image features extracted by the teacher model. To further improve performance and address the input inconsistency between the student and teacher models, a Dynamic Alignment (DA) module is proposed to apply learnable alignment.
Results: Experiments show that masked modeling remains effective even without reconstructing the masked regions; combined with Dynamic Alignment, MaskAlign achieves state-of-the-art performance with much higher efficiency.

Masked Autoencoders (MAE) have been prevailing paradigms for large-scale vision representation pre-training. By reconstructing masked image patches from a small portion of visible image regions, MAE forces the model to infer semantic correlation within an image. Recently, some approaches apply semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance. However, unlike the low-level features such as pixel values, we argue the features extracted by powerful teacher models already encode rich semantic correlation across regions in an intact image. This raises one question: is reconstruction necessary in Masked Image Modeling (MIM) with a teacher model? In this paper, we propose an efficient MIM paradigm named MaskAlign. MaskAlign simply learns the consistency of visible patch feature extracted by the student model and intact image features extracted by the teacher model. To further advance the performance and tackle the problem of input inconsistency between the student and teacher model, we propose a Dynamic Alignment (DA) module to apply learnable alignment. Our experimental results demonstrate that masked modeling does not lose effectiveness even without reconstruction on masked regions. Combined with Dynamic Alignment, MaskAlign can achieve state-of-the-art performance with much higher efficiency.
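The reconstruction-free objective amounts to a feature-consistency loss between the student's visible-patch features and the teacher's intact-image features at the same patch positions. The sketch below uses a cosine-consistency loss under assumed shapes and omits the Dynamic Alignment module; it is a schematic, not the paper's code.

```python
import numpy as np

def align_loss(student_visible, teacher_full, visible_idx):
    """Consistency between student features of visible patches and the
    teacher's intact-image features at the same patch positions.
    Loss = mean (1 - cosine similarity); no masked-region reconstruction."""
    t = teacher_full[visible_idx]
    s = student_visible / np.linalg.norm(student_visible, axis=-1, keepdims=True)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

teacher = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # all 3 patch features
visible = np.array([0, 2])                                # patches the student sees
student = teacher[visible].copy()                         # perfectly aligned student
print(align_loss(student, teacher, visible))              # ~0.0 (up to float error)
```

Note that the masked positions never appear in the loss at all, which is what makes the paradigm cheaper than reconstruction-based MIM.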

Joint Visual Grounding and Tracking With Natural Language Specification
Zhou, Li and Zhou, Zikun and Mao, Kaige and He, Zhenyu



Research problem: This paper addresses the issue that existing visual tracking algorithms handle visual grounding and tracking separately, ignoring that natural language descriptions provide global semantic cues for both steps.
Motivation: Current algorithms split visual grounding and tracking into two separate steps; this separated framework overlooks the link that natural language descriptions provide global semantic cues for both steps, and it is hard to train end-to-end.
Method: We propose a joint visual grounding and tracking framework that reformulates grounding and tracking as a unified task: localizing the referred target based on the given visual-language references. Specifically, we design a multi-source relation modeling module to effectively build relations between the visual-language references and the test image, and a temporal modeling module that provides temporal cues under the guidance of global semantic information, improving the model's adaptability to target appearance variations.
Results: Extensive experiments on TNL2K, LaSOT, OTB99, and RefCOCOg show that our method outperforms state-of-the-art algorithms on both tracking and grounding.

Tracking by natural language specification aims to locate the referred target in a sequence based on the natural language description. Existing algorithms solve this issue in two steps, visual grounding and tracking, and accordingly deploy the separated grounding model and tracking model to implement these two steps, respectively. Such a separated framework overlooks the link between visual grounding and tracking, which is that the natural language descriptions provide global semantic cues for localizing the target for both two steps. Besides, the separated framework can hardly be trained end-to-end. To handle these issues, we propose a joint visual grounding and tracking framework, which reformulates grounding and tracking as a unified task: localizing the referred target based on the given visual-language references. Specifically, we propose a multi-source relation modeling module to effectively build the relation between the visual-language references and the test image. In addition, we design a temporal modeling module to provide a temporal clue with the guidance of the global semantic information for our model, which effectively improves the adaptability to the appearance variations of the target. Extensive experimental results on TNL2K, LaSOT, OTB99, and RefCOCOg demonstrate that our method performs favorably against state-of-the-art algorithms for both tracking and grounding. Code is available at https://github.com/lizhou-cs/JointNLT.

Fake It Till You Make It: Learning Transferable Representations From Synthetic ImageNet Clones
Sar{\i



Research problem: Can recent image generation models such as Stable Diffusion fully replace real images for training image prediction models?
Motivation: Explore whether effective ImageNet classification models can be trained using only the class names, without any real images.
Method: Use Stable Diffusion to generate synthetic clones of ImageNet and train classification models on them.
Results: With minimal and class-agnostic prompt engineering, the synthetic clones close a large part of the gap between models trained on synthetic images and those trained on real images, performing well on several standard classification benchmarks. More importantly, models trained on synthetic images exhibit strong generalization, with transfer learning performance on par with models trained on real data.

Recent image generation models such as Stable Diffusion have exhibited an impressive ability to generate fairly realistic images starting from a simple text prompt. Could such models render real images obsolete for training image prediction models? In this paper, we answer part of this provocative question by investigating the need for real images when training models for ImageNet classification. Provided only with the class names that have been used to build the dataset, we explore the ability of Stable Diffusion to generate synthetic clones of ImageNet and measure how useful these are for training classification models from scratch. We show that with minimal and class-agnostic prompt engineering, ImageNet clones are able to close a large part of the gap between models produced by synthetic images and models trained with real images, for the several standard classification benchmarks that we consider in this study. More importantly, we show that models trained on synthetic images exhibit strong generalization properties and perform on par with models trained on real data for transfer. Project page: https://europe.naverlabs.com/imagenet-sd

HIER: Metric Learning Beyond Class Labels via Hierarchical Regularization
Kim, Sungyeon and Jeong, Boseung and Kwak, Suha



Research problem: The conventional form of supervision limits further progress in metric learning.
Motivation: Propose a new regularization method that discovers the latent semantic hierarchy of the training data, providing richer and more fine-grained supervision than the inter-class separability induced by common metric learning losses.
Method: Propose a new method named HIER, which requires no annotation of the semantic hierarchy; instead, it learns hierarchical proxies in hyperbolic space.
Results: On four standard benchmarks, HIER consistently outperforms conventional methods and achieves the best records in almost all settings, surpassing even existing hyperbolic metric learning methods.

Supervision for metric learning has long been given in the form of equivalence between human-labeled classes. Although this type of supervision has been a basis of metric learning for decades, we argue that it hinders further advances in the field. In this regard, we propose a new regularization method, dubbed HIER, to discover the latent semantic hierarchy of training data, and to deploy the hierarchy to provide richer and more fine-grained supervision than inter-class separability induced by common metric learning losses. HIER achieves this goal with no annotation for the semantic hierarchy but by learning hierarchical proxies in hyperbolic spaces. The hierarchical proxies are learnable parameters, and each of them is trained to serve as an ancestor of a group of data or other proxies to approximate the semantic hierarchy among them. HIER deals with the proxies along with data in hyperbolic space since the geometric properties of the space are well-suited to represent their hierarchical structure. The efficacy of HIER is evaluated on four standard benchmarks, where it consistently improved the performance of conventional methods when integrated with them, and consequently achieved the best records, surpassing even the existing hyperbolic metric learning technique, in almost all settings.
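The geometry underlying the hierarchical proxies is the Poincaré ball, whose distance grows rapidly toward the boundary and so leaves room to embed deep tree-like structure near the edge. A minimal sketch of that distance follows (the standard Poincaré-ball formula, not the paper's training code):

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Geodesic distance in the Poincare ball (requires ||x||, ||y|| < 1)."""
    sq = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq / (denom + eps)))

origin = np.zeros(2)
near = np.array([0.1, 0.0])
boundary = np.array([0.95, 0.0])
# distance blows up near the boundary, which suits hierarchical embeddings:
# ancestors sit near the origin, leaves spread out toward the edge
print(poincare_distance(origin, near) < poincare_distance(origin, boundary))  # True
```

A proxy acting as an ancestor can then be placed close (in this metric) to every member of its group while the groups themselves stay far apart.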

Interactive and Explainable Region-Guided Radiology Report Generation
Tanida, Tim and Müller, Philip and Kaissis, Georgios and Rueckert, Daniel



Research problem: How to effectively generate radiology reports to relieve radiologists of the burden of report writing.
Motivation: Existing methods generate the full report from image-level features, failing to explicitly focus on anatomical regions in the image.
Method: Propose a simple yet effective region-guided report generation model that detects anatomical regions and then describes the individual salient regions to form the final report.
Results: Experiments show the method is highly effective at report generation, outperforming previous state-of-the-art models, and highlight its interactive capabilities.

The automatic generation of radiology reports has the potential to assist radiologists in the time-consuming task of report writing. Existing methods generate the full report from image-level features, failing to explicitly focus on anatomical regions in the image. We propose a simple yet effective region-guided report generation model that detects anatomical regions and then describes individual, salient regions to form the final report. While previous methods generate reports without the possibility of human intervention and with limited explainability, our method opens up novel clinical use cases through additional interactive capabilities and introduces a high degree of transparency and explainability. Comprehensive experiments demonstrate our method's effectiveness in report generation, outperforming previous state-of-the-art models, and highlight its interactive capabilities. The code and checkpoints are available at https://github.com/ttanida/rgrg.

Benchmarking Self-Supervised Learning on Diverse Pathology Datasets
Kang, Mingu and Song, Heon and Park, Seonwook and Yoo, Donggeun and Pereira, Sérgio



Research problem: How to effectively exploit unlabeled pathology image data for pre-training to improve downstream task performance.
Motivation: Annotating pathology images is costly, and self-supervised learning is an effective way to exploit unlabeled data; however, there has been no principled study of how to adapt it to the pathology domain.
Method: In this paper, we conduct a large-scale study of four representative self-supervised learning methods and evaluate them on diverse downstream tasks; we also propose a set of pathology-specific techniques and validate them experimentally.
Results: Experiments show that large-scale domain-aligned pathology pre-training consistently outperforms ImageNet pre-training in standard SSL settings (linear and fine-tuning evaluation) as well as in low-label regimes. The proposed domain-specific techniques also bring significant gains. Finally, we apply SSL for the first time to the challenging task of nuclei instance segmentation and achieve large, consistent improvements across diverse settings.

Computational pathology can lead to saving human lives, but models are annotation-hungry and pathology images are notoriously expensive to annotate. Self-supervised learning has been shown to be an effective method for utilizing unlabeled data, and its application to pathology could greatly benefit its downstream tasks. Yet, there are no principled studies that compare SSL methods and discuss how to adapt them for pathology. To address this need, we execute the largest-scale study of SSL pre-training on pathology image data, to date. Our study is conducted using 4 representative SSL methods on diverse downstream tasks. We establish that large-scale domain-aligned pre-training in pathology consistently outperforms ImageNet pre-training in standard SSL settings such as linear and fine-tuning evaluations, as well as in low-label regimes. Moreover, we propose a set of domain-specific techniques that we experimentally show leads to a performance boost. Lastly, for the first time, we apply SSL to the challenging task of nuclei instance segmentation and show large and consistent performance improvements under diverse settings.

From Images to Textual Prompts: Zero-Shot Visual Question Answering With Frozen Large Language Models
Guo, Jiaxian and Li, Junnan and Li, Dongxu and Tiong, Anthony Meng Huat and Li, Boyang and Tao, Dacheng and Hoi, Steven



Research problem: How to effectively use large language models for zero-shot visual question answering.
Motivation: There is a modality disconnection and a task disconnection between large language models and the VQA task, and direct end-to-end training is both inflexible and computationally expensive.
Method: Propose the Img2Prompt module, which uses LLM-agnostic models to provide prompts describing image content together with self-constructed question-answer pairs, guiding large language models to perform zero-shot VQA.
Results: Img2Prompt works flexibly with various large language models for VQA without end-to-end training, greatly reducing deployment cost, and performs better than methods that rely on end-to-end training.

Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, effective utilization of LLMs for zero-shot visual question-answering (VQA) remains challenging, primarily due to the modality disconnection and task disconnection between LLM and VQA task. End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive. To address this issue, we propose Img2Prompt, a plug-and-play module that provides the prompts that can bridge the aforementioned modality and task disconnections, so that LLMs can perform zero-shot VQA tasks without end-to-end training. In order to provide such prompts, we further employ LLM-agnostic models to provide prompts that can describe image content and self-constructed question-answer pairs, which can effectively guide LLM to perform zero-shot VQA tasks. Img2Prompt offers the following benefits: 1) It can flexibly work with various LLMs to perform VQA. 2) Without the needing of end-to-end training, it significantly reduces the cost of deploying LLM for zero-shot VQA tasks. 3) It achieves comparable or better performance than methods relying on end-to-end training. For example, we outperform Flamingo by 5.6% on VQAv2. On the challenging A-OKVQA dataset, our method even outperforms few-shot methods by as much as 20%.
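The plug-and-play idea reduces to string assembly: captions describe the image, synthetic question-answer exemplars demonstrate the task, and the real question is appended for the frozen LLM to complete. The sketch below shows one plausible prompt layout; the exact template and field names are illustrative assumptions, not the paper's format.

```python
def build_vqa_prompt(captions, exemplar_qas, question):
    """Assemble a zero-shot VQA prompt for a frozen LLM from image captions
    and self-constructed question-answer exemplars (format is illustrative)."""
    context = " ".join(captions)
    exemplars = "\n".join(f"Question: {q} Answer: {a}" for q, a in exemplar_qas)
    return f"Context: {context}\n{exemplars}\nQuestion: {question} Answer:"

p = build_vqa_prompt(["a dog running on a beach"],
                     [("What animal is shown?", "a dog")],
                     "Where is the dog?")
print(p.endswith("Answer:"))  # True -- the LLM continues from here
```

Since the prompt is plain text, the same construction can be fed to any LLM, which is what makes the approach LLM-agnostic.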

Neuralizer: General Neuroimage Analysis Without Re-Training
Czolbe, Steffen and Dalca, Adrian V.



Research problem: How to eliminate the substantial time and expertise needed to train and tune deep learning models for neuroimage processing tasks.
Motivation: When faced with a new task or a dataset with different visual characteristics, existing deep learning strategies and architectures for neuroimage processing usually require retraining or fine-tuning, which poses a substantial barrier for neuroscientists and clinical researchers who lack the resources or machine-learning expertise.
Method: We propose Neuralizer, a single model that generalizes to previously unseen neuroimaging tasks and modalities without retraining or fine-tuning; generalization happens in a single forward pass, and tasks are solved at inference time.
Results: Our experiments on coronal slices show that when only a few annotated subjects are available, our multi-task network outperforms task-specific baselines without training on the task.

Neuroimage processing tasks like segmentation, reconstruction, and registration are central to the study of neuroscience. Robust deep learning strategies and architectures used to solve these tasks are often similar. Yet, when presented with a new task or a dataset with different visual characteristics, practitioners most often need to train a new model, or fine-tune an existing one. This is a time-consuming process that poses a substantial barrier for the thousands of neuroscientists and clinical researchers who often lack the resources or machine-learning expertise to train deep learning models. In practice, this leads to a lack of adoption of deep learning, and neuroscience tools being dominated by classical frameworks. We introduce Neuralizer, a single model that generalizes to previously unseen neuroimaging tasks and modalities without the need for re-training or fine-tuning. Tasks do not have to be known a priori, and generalization happens in a single forward pass during inference. The model can solve processing tasks across multiple image modalities, acquisition methods, and datasets, and generalize to tasks and modalities it has not been trained on. Our experiments on coronal slices show that when few annotated subjects are available, our multi-task network outperforms task-specific baselines without training on the task.

Visual Prompt Multi-Modal Tracking
Zhu, Jiawen and Lai, Simiao and Chen, Xin and Wang, Dong and Lu, Huchuan



Research problem: How to effectively exploit a pre-trained foundation model for multi-modal tracking.
Motivation: Full fine-tuning of the RGB-based parameters, though effective, is suboptimal due to the scarcity of downstream data and poor transferability.
Method: Inspired by prompt learning in language models, we develop Visual Prompt multi-modal Tracking (ViPT), which learns modal-relevant prompts to adapt the pre-trained foundation model to various downstream multimodal tracking tasks.
Results: ViPT outperforms the full fine-tuning paradigm on downstream tracking tasks including RGB+Depth, RGB+Thermal, and RGB+Event, while introducing only a few trainable parameters (less than 1% of model parameters), achieving parameter efficiency and state-of-the-art performance.

Visible-modal object tracking gives rise to a series of downstream multi-modal tracking tributaries. To inherit the powerful representations of the foundation model, a natural modus operandi for multi-modal tracking is full fine-tuning on the RGB-based parameters. Albeit effective, this manner is not optimal due to the scarcity of downstream data and poor transferability, etc. In this paper, inspired by the recent success of the prompt learning in language models, we develop Visual Prompt multi-modal Tracking (ViPT), which learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multimodal tracking tasks. ViPT finds a better way to stimulate the knowledge of the RGB-based model that is pre-trained at scale, meanwhile only introducing a few trainable parameters (less than 1% of model parameters). ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks including RGB+Depth, RGB+Thermal, and RGB+Event tracking. Extensive experiments show the potential of visual prompt learning for multi-modal tracking, and ViPT can achieve state-of-the-art performance while satisfying parameter efficiency. Code and models are available at https://github.com/jiawen-zhu/ViPT.

GIVL: Improving Geographical Inclusivity of Vision-Language Models With Pre-Training Methods
Yin, Da and Gao, Feng and Thattai, Govind and Johnston, Michael and Chang, Kai-Wei



Research problem: How to develop AI technologies that serve all communities, not just those in one particular region.
Motivation: Because of cultural differences, knowledge from some regions may not apply in others; if a model is unaware of regional characteristics, it may show performance disparities across regions and exhibit bias against underrepresented groups.
Method: Propose GIVL, a Geographically Inclusive Vision-and-Language pre-trained model, and design new pre-training objectives, Image-Knowledge Matching (IKM) and Image Edit Checking (IEC), to pre-train it.
Results: Compared with similar-size models pre-trained on a similar scale of data, GIVL achieves state-of-the-art and more balanced performance on geo-diverse vision-and-language tasks.

A key goal for the advancement of AI is to develop technologies that serve the needs not just of one group but of all communities regardless of their geographical region. In fact, a significant proportion of knowledge is locally shared by people from certain regions but may not apply equally in other regions because of cultural differences. If a model is unaware of regional characteristics, it may lead to performance disparity across regions and result in bias against underrepresented groups. We propose GIVL, a Geographically Inclusive Vision-and-Language Pre-trained model. There are two attributes of geo-diverse visual concepts which can help to learn geo-diverse knowledge: 1) concepts under similar categories have unique knowledge and visual characteristics, 2) concepts with similar visual features may fall in completely different categories. Motivated by the attributes, we design new pre-training objectives Image-Knowledge Matching (IKM) and Image Edit Checking (IEC) to pre-train GIVL. Compared with similar-size models pre-trained with similar scale of data, GIVL achieves state-of-the-art (SOTA) and more balanced performance on geo-diverse V&L tasks. Code and data are released at https://github.com/WadeYin9712/GIVL.

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-Channel Video-Language Retrieval
Lin, Xudong and Tiwari, Simran and Huang, Shiyuan and Li, Manling and Shou, Mike Zheng and Ji, Heng and Chang, Shih-Fu



Research problem: How to effectively combine video and text information for multi-channel video-language retrieval.
Motivation: Existing contrastive multimodal models excel at aligning entities in images/videos and text, but it remains unclear how to quickly adapt them to multi-channel video-language retrieval with limited data and resources.
Method: This paper identifies a principled model design space along two axes (how to represent videos, and how to fuse video and text information); based on a categorization of recent methods, it investigates representing videos as continuous feature vectors or as discrete text tokens, and fusing with a multimodal transformer or a pretrained contrastive text model.
Results: Experiments show that discrete text tokens coupled with a pretrained contrastive text model yield the best performance, even outperforming the state of the art on the iVQA and How2QA datasets without additional training on millions of video-text pairs.

Multi-channel video-language retrieval requires models to understand information from different channels (e.g. video+question, video+speech) to correctly link a video with a textual response or query. Fortunately, contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text, e.g., CLIP; text contrastive models are extensively studied recently for their strong ability of producing discriminative sentence embeddings, e.g., SimCSE. However, there is not a clear way to quickly adapt these two lines to multi-channel video-language retrieval with limited data and resources. In this paper, we identify a principled model design space with two axes: how to represent videos and how to fuse video and text information. Based on categorization of recent methods, we investigate the options of representing videos using continuous feature vectors or discrete text tokens; for the fusion method, we explore the use of a multimodal transformer or a pretrained contrastive text model. We extensively evaluate the four combinations on five video-language datasets. We surprisingly find that discrete text tokens coupled with a pretrained contrastive text model yields the best performance, which can even outperform state-of-the-art on the iVQA and How2QA datasets without additional training on millions of video-text data. Further analysis shows that this is because representing videos as text tokens captures the key visual information and text tokens are naturally aligned with text models that are strong retrievers after the contrastive pretraining process. All the empirical analysis establishes a solid foundation for future research on affordable and upgradable multimodal intelligence. The code will be released at https://github.com/XudongLinthu/upgradable-multimodal-intelligence to facilitate future research.

Hierarchical Discriminative Learning Improves Visual Representations of Biomedical Microscopy
Jiang, Cheng and Hou, Xinhai and Kondepudi, Akhil and Chowdury, Asadur and Freudiger, Christian W. and Orringer, Daniel A. and Lee, Honglak and Hollon, Todd C.



Research problem: How to advance the role of computer vision in biomedical microscopy and clinical medicine.
Motivation: Existing self-supervised representation learning methods are designed mainly for instance discrimination and are applied directly to image patches, or fields-of-view, sampled from the gigapixel whole-slide images (WSIs) used for cancer diagnosis, a strategy with inherent limitations.
Method: Propose HiDisc, which leverages the patient-slide-patch hierarchy inherent to clinical biomedical microscopy to define a hierarchical discriminative learning task that implicitly learns features of the underlying diagnosis. HiDisc uses a self-supervised contrastive learning framework in which positive patch pairs are defined by common ancestry in the data hierarchy, and a unified patch, slide, and patient discriminative learning objective is used for visual SSL.
Results: Benchmarking HiDisc representations on two vision tasks using two biomedical microscopy datasets shows that (1) HiDisc pretraining outperforms current state-of-the-art self-supervised pretraining methods for cancer diagnosis and genetic mutation prediction, and (2) HiDisc learns high-quality visual representations using natural patch diversity, without strong data augmentations.

Learning high-quality, self-supervised, visual representations is essential to advance the role of computer vision in biomedical microscopy and clinical medicine. Previous work has focused on self-supervised representation learning (SSL) methods developed for instance discrimination and applied them directly to image patches, or fields-of-view, sampled from gigapixel whole-slide images (WSIs) used for cancer diagnosis. However, this strategy is limited because it (1) assumes patches from the same patient are independent, (2) neglects the patient-slide-patch hierarchy of clinical biomedical microscopy, and (3) requires strong data augmentations that can degrade downstream performance. Importantly, sampled patches from WSIs of a patient's tumor are a diverse set of image examples that capture the same underlying cancer diagnosis. This motivated HiDisc, a data-driven method that leverages the inherent patient-slide-patch hierarchy of clinical biomedical microscopy to define a hierarchical discriminative learning task that implicitly learns features of the underlying diagnosis. HiDisc uses a self-supervised contrastive learning framework in which positive patch pairs are defined based on a common ancestry in the data hierarchy, and a unified patch, slide, and patient discriminative learning objective is used for visual SSL. We benchmark HiDisc visual representations on two vision tasks using two biomedical microscopy datasets, and demonstrate that (1) HiDisc pretraining outperforms current state-of-the-art self-supervised pretraining methods for cancer diagnosis and genetic mutation prediction, and (2) HiDisc learns high-quality visual representations using natural patch diversity without strong data augmentations.
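The hierarchy-defined positives can be sketched as a pair-labeling rule: two views of the same patch are patch-level positives, two patches from the same slide are slide-level positives, and two patches from the same patient are patient-level positives. This is a schematic of the pairing rule only, with illustrative identifiers, not the training code.

```python
def hidisc_positive_level(meta_a, meta_b):
    """Return the hierarchy level at which two patches form a positive pair.

    meta_* = (patient_id, slide_id, patch_id); the shared ancestor closest
    to the leaves defines the discrimination level, else the pair is negative.
    """
    if meta_a == meta_b:                  # two augmented views of one patch
        return "patch"
    if meta_a[:2] == meta_b[:2]:          # same patient, same slide
        return "slide"
    if meta_a[0] == meta_b[0]:            # same patient, different slides
        return "patient"
    return None                           # different patients: negative pair

print(hidisc_positive_level(("p1", "s1", 7), ("p1", "s1", 7)))  # patch
print(hidisc_positive_level(("p1", "s1", 7), ("p1", "s2", 3)))  # patient
print(hidisc_positive_level(("p1", "s1", 7), ("p2", "s1", 7)))  # None
```

Because patches from one patient's tumor naturally vary while sharing a diagnosis, these hierarchy-level positives supply diversity that would otherwise have to come from strong augmentations.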

ProD: Prompting-To-Disentangle Domain Knowledge for Cross-Domain Few-Shot Image Classification
Ma, Tianyi and Sun, Yifan and Yang, Zongxin and Yang, Yi



Research problem: This paper addresses the impact of the train-to-test domain gap on accuracy in cross-domain few-shot image classification.
Motivation: The existing multi-domain training scheme and CNN-extracted backbone features alone cannot effectively close the train-to-test domain gap.
Method: Propose a prompting-to-disentangle (ProD) method. It adopts the popular multi-domain training scheme and extracts the backbone feature with a standard convolutional neural network. Its key point is using the prompting mechanism in a transformer to disentangle domain-general (DG) and domain-specific (DS) knowledge from the backbone feature. Specifically, ProD concatenates a DG and a DS prompt to the backbone feature and feeds them into a lightweight transformer. The DG prompt is a learnable prompt shared by all training domains, while the DS prompt is generated on the fly from the domain of interest. As a result, the transformer outputs DG and DS features in parallel with the two prompts, yielding the disentangling effect.
Results: Experiments show that 1) simply sharing a single DG prompt across all training domains already improves generalization to the novel test domain; 2) making the DG prompt neutral toward the training domains further reinforces cross-domain generalization; and 3) at inference, the DS prompt can be generated from the support samples and capture test-domain knowledge through the prompting mechanism. Combining these three benefits, ProD significantly improves cross-domain few-shot classification; for example, on CUB, ProD improves the 5-way 5-shot accuracy from 73.56% (baseline) to 79.19%, a new state of the art.

This paper considers few-shot image classification under the cross-domain scenario, where the train-to-test domain gap compromises classification accuracy. To mitigate the domain gap, we propose a prompting-to-disentangle (ProD) method through a novel exploration with the prompting mechanism. ProD adopts the popular multi-domain training scheme and extracts the backbone feature with a standard Convolutional Neural Network. Based on these two common practices, the key point of ProD is using the prompting mechanism in the transformer to disentangle the domain-general (DG) and domain-specific (DS) knowledge from the backbone feature. Specifically, ProD concatenates a DG and a DS prompt to the backbone feature and feeds them into a lightweight transformer. The DG prompt is learnable and shared by all the training domains, while the DS prompt is generated from the domain-of-interest on the fly. As a result, the transformer outputs DG and DS features in parallel with the two prompts, yielding the disentangling effect. We show that: 1) Simply sharing a single DG prompt for all the training domains already improves generalization towards the novel test domain. 2) The cross-domain generalization can be further reinforced by making the DG prompt neutral towards the training domains. 3) At inference, the DS prompt is generated from the support samples and can capture test domain knowledge through the prompting mechanism. Combining all three benefits, ProD significantly improves cross-domain few-shot classification. For instance, on CUB, ProD improves the 5-way 5-shot accuracy from 73.56% (baseline) to 79.19%, setting a new state of the art.
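The prompt-concatenation step can be sketched as simple token assembly: the shared DG prompt and an on-the-fly DS prompt are prepended to the backbone tokens before the lightweight transformer. The mean-of-support DS generator below is a simple stand-in assumption, and all names are illustrative.

```python
import numpy as np

def build_transformer_input(backbone_feat, dg_prompt, support_feats):
    """Concatenate the shared DG prompt and an on-the-fly DS prompt
    to the backbone tokens before the lightweight transformer.

    backbone_feat: (n_tokens, d) feature tokens from the CNN backbone
    dg_prompt:     (d,) learnable prompt shared across all training domains
    support_feats: (n_support, d) support samples of the domain of interest
    """
    ds_prompt = support_feats.mean(axis=0)   # stand-in for the DS generator
    return np.concatenate([dg_prompt[None, :], ds_prompt[None, :], backbone_feat],
                          axis=0)            # (n_tokens + 2, d)

tokens = build_transformer_input(np.zeros((4, 8)), np.ones(8), np.full((5, 8), 2.0))
print(tokens.shape)  # (6, 8)
```

The transformer then attends over this sequence, and the outputs at the two prompt positions serve as the disentangled DG and DS features.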

ViLEM: Visual-Language Error Modeling for Image-Text Retrieval
Chen, Yuxin and Ma, Zongyang and Zhang, Ziqi and Qi, Zhongang and Yuan, Chunfeng and Shan, Ying and Li, Bing and Hu, Weiming and Qie, Xiaohu and Wu, Jianping



Research problem: Existing pre-trained models for image-text retrieval adopt a "dual-encoder" architecture for efficient global alignment, but ignore the detailed semantic associations between image and text.
Motivation: To address this, we propose a novel proxy task, Visual-Language Error Modeling (ViLEM), which injects detailed image-text association into the "dual-encoder" model by "proofreading" each word in the text against the corresponding image.
Method: First, we use pre-trained language models to automatically generate diverse, plausible negative texts. ViLEM then forces the model to discriminate the correctness of each word in these plausible negative texts and to further correct the wrong words by resorting to image information. In addition, we propose a multi-granularity interaction framework that performs ViLEM by interacting text features with both global and local image features, associating local text semantics with high-level visual context and multi-level local visual information.
Results: Experiments show our method surpasses state-of-the-art "dual-encoder" methods by a large margin on image-text retrieval and significantly improves discriminativeness to local textual semantics; the model also generalizes well to video-text retrieval.

Dominant pre-training works for image-text retrieval adopt "dual-encoder" architecture to enable high efficiency, where two encoders are used to extract image and text representations and contrastive learning is employed for global alignment. However, coarse-grained global alignment ignores detailed semantic associations between image and text. In this work, we propose a novel proxy task, named Visual-Language Error Modeling (ViLEM), to inject detailed image-text association into "dual-encoder" model by "proofreading" each word in the text against the corresponding image. Specifically, we first edit the image-paired text to automatically generate diverse plausible negative texts with pre-trained language models. ViLEM then enforces the model to discriminate the correctness of each word in the plausible negative texts and further correct the wrong words via resorting to image information. Furthermore, we propose a multi-granularity interaction framework to perform ViLEM via interacting text features with both global and local image features, which associates local text semantics with both high-level visual context and multi-level local visual information. Our method surpasses state-of-the-art "dual-encoder" methods by a large margin on the image-text retrieval task and significantly improves discriminativeness to local textual semantics. Our model can also generalize well to video-text retrieval.

Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens
Chen, Yuxiao and Yuan, Jianbo and Tian, Yu and Geng, Shijie and Li, Xinyu and Zhou, Ding and Metaxas, Dimitris N. and Yang, Hongxia



Research question: How to bridge the gap in semantic level and granularity between visual patches and text tokens when aligning cross-modal information in contrastive vision-language pre-training.
Motivation: CLIP-style methods align a matched image-text pair through embeddings aggregated from visual patches and language tokens, but directly aligning cross-modal information with such representations is difficult because patches and tokens differ in semantic level and granularity.
Method: We propose a multimodal representation based on Finite Discrete Tokens (FDT), a set of learnable tokens representing visual-semantic concepts. Both images and texts are embedded with shared FDT by first grounding the inputs to the FDT space and then aggregating the activated token representations; a sparse activation constraint forces matched visual and semantic concepts onto the same discrete tokens, reducing the granularity gap between the two modalities.
Results: Quantitative and qualitative analyses show that using FDT representations in CLIP-style models improves cross-modal alignment and performance on visual recognition and vision-language downstream tasks, and that the learned FDT capture meaningful cross-modal correspondences, from objects to actions and attributes.

Contrastive learning-based vision-language pre-training approaches, such as CLIP, have demonstrated great success in many vision-language tasks. These methods achieve cross-modal alignment by encoding a matched image-text pair with similar feature embeddings, which are generated by aggregating information from visual patches and language tokens. However, directly aligning cross-modal information using such representations is challenging, as visual patches and text tokens differ in semantic levels and granularities. To alleviate this issue, we propose a Finite Discrete Tokens (FDT) based multimodal representation. FDT is a set of learnable tokens representing certain visual-semantic concepts. Both images and texts are embedded using shared FDT by first grounding multimodal inputs to FDT space and then aggregating the activated FDT representations. The matched visual and semantic concepts are enforced to be represented by the same set of discrete tokens by a sparse activation constraint. As a result, the granularity gap between the two modalities is reduced. Through both quantitative and qualitative analyses, we demonstrate that using FDT representations in CLIP-style models improves cross-modal alignment and performance in visual recognition and vision-language downstream tasks. Furthermore, we show that our method can learn more comprehensive representations, and the learned FDT capture meaningful cross-modal correspondence, ranging from objects to actions and attributes.

HumanBench: Towards General Human-Centric Perception With Projector Assisted Pretraining
Tang, Shixiang and Chen, Cheng and Xie, Qingsong and Chen, Meilin and Wang, Yizhou and Ci, Yuanzheng and Bai, Lei and Zhu, Feng and Yang, Haiyang and Yi, Li and Zhao, Rui and Ouyang, Wanli



Research question: How to develop a general pre-trained model that adapts to diverse human-centric downstream tasks.
Motivation: Human-centric vision tasks are widespread in industrial applications such as surveillance, autonomous driving, and the metaverse, calling for a general pre-trained model that serves them all.
Method: We propose HumanBench, built on existing datasets, to comprehensively evaluate the generalization ability of different pre-training methods on 19 datasets from 6 diverse downstream tasks. To learn both coarse-grained and fine-grained knowledge in human bodies, we further propose a Projector AssisTed Hierarchical pre-training method (PATH).
Results: In comprehensive evaluations on HumanBench, PATH achieves new state-of-the-art results on 17 downstream datasets and on-par results on the other 2.

Human-centric perceptions include a variety of vision tasks, which have widespread industrial applications, including surveillance, autonomous driving, and the metaverse. It is desirable to have a general pretrain model for versatile human-centric downstream tasks. This paper forges ahead along this path from the aspects of both benchmark and pretraining methods. Specifically, we propose a HumanBench based on existing datasets to comprehensively evaluate on the common ground the generalization abilities of different pretraining methods on 19 datasets from 6 diverse downstream tasks, including person ReID, pose estimation, human parsing, pedestrian attribute recognition, pedestrian detection, and crowd counting. To learn both coarse-grained and fine-grained knowledge in human bodies, we further propose a Projector AssisTed Hierarchical pretraining method (PATH) to learn diverse knowledge at different granularity levels. Comprehensive evaluations on HumanBench show that our PATH achieves new state-of-the-art results on 17 downstream datasets and on-par results on the other 2 datasets. The code will be publicly available at https://github.com/OpenGVLab/HumanBench.

PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers
Grainger, Ryan and Paniagua, Thomas and Song, Xi and Cuntoor, Naresh and Lee, MunWai and Wu, Tianfu



Research question: This paper addresses the problems that arise from treating image patches as "visual tokens" and learning patch-to-patch attention in Vision Transformers (ViTs), namely the semantic gap of the patch-embedding tokenizer and the quadratic complexity of attention.
Motivation: Current ViTs suffer from a semantic gap, quadratic complexity, and models that are hard to interpret.
Method: We propose learning Patch-to-Cluster attention (PaCa) in ViT. Queries in our PaCa-ViT start from patches, while keys and values are based directly on clustering (with a predefined small number of clusters). The clusters are learned end-to-end, yielding better tokenizers and inducing joint clustering-for-attention and attention-for-clustering for better and interpretable models. The quadratic complexity is relaxed to linear.
Results: Experiments on ImageNet-1k image classification, MS-COCO object detection and instance segmentation, and MIT-ADE20k semantic segmentation show better performance than Swin and PVTs on all three benchmarks. Thanks to the linear complexity, PaCa is also more efficient than PVT models on MS-COCO and MIT-ADE20k, and the learned clusters are semantically meaningful.

Vision Transformers (ViTs) are built on the assumption of treating image patches as "visual tokens" and learn patch-to-patch attention. The patch embedding based tokenizer has a semantic gap with respect to its counterpart, the textual tokenizer. The patch-to-patch attention suffers from the quadratic complexity issue, and also makes it non-trivial to explain learned ViTs. To address these issues in ViT, this paper proposes to learn Patch-to-Cluster attention (PaCa) in ViT. Queries in our PaCa-ViT start with patches, while keys and values are directly based on clustering (with a predefined small number of clusters). The clusters are learned end-to-end, leading to better tokenizers and inducing joint clustering-for-attention and attention-for-clustering for better and interpretable models. The quadratic complexity is relaxed to linear complexity. The proposed PaCa module is used in designing efficient and interpretable ViT backbones and semantic segmentation head networks. In experiments, the proposed methods are tested on ImageNet-1k image classification, MS-COCO object detection and instance segmentation and MIT-ADE20k semantic segmentation. Compared with the prior art, it obtains better performance than Swin and PVTs on all three benchmarks, by significant margins on ImageNet-1k and MIT-ADE20k. It is also significantly more efficient than PVT models in MS-COCO and MIT-ADE20k due to the linear complexity. The learned clusters are semantically meaningful. Code and model checkpoints are available at https://github.com/iVMCL/PaCaViT.
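The linear-complexity idea can be sketched as follows. This is a minimal NumPy illustration, not the PaCa module itself: the learned clustering head is replaced by a random linear map with a softmax, and a single attention head with random weights stands in for the full block. The point is the shape of the attention matrix: N patches attend to M clusters, so the cost is O(N*M) rather than O(N^2).

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d = 196, 8, 32   # patches, clusters (M << N), feature dim

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

x = rng.normal(size=(N, d))
w_cluster = rng.normal(size=(d, M)) * 0.1            # stand-in for the learned clustering head
assign = softmax(x @ w_cluster, axis=1)              # (N, M) soft cluster assignment
clusters = (assign.T @ x) / assign.sum(0)[:, None]   # (M, d) cluster tokens

wq, wk, wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
q = x @ wq                              # queries come from patches
k, v = clusters @ wk, clusters @ wv     # keys/values come from the clusters
attn = softmax(q @ k.T / np.sqrt(d), axis=1)   # (N, M): linear in N for fixed M
out = attn @ v                          # (N, d) patch-to-cluster attention output
```

Since M is a small constant, doubling the number of patches only doubles the attention cost, instead of quadrupling it as in patch-to-patch attention.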

Vision Transformers Are Parameter-Efficient Audio-Visual Learners
Lin, Yan-Bo and Sung, Yi-Lin and Lei, Jie and Bansal, Mohit and Bertasius, Gedas



Research question: This work studies whether frozen Vision Transformers (ViTs), pre-trained only on visual data, can generalize to audio-visual data without fine-tuning any of their original parameters.
Motivation: Existing audio-visual methods rely on costly audio pre-training or external audio encoders and require many tunable parameters; we therefore propose a latent audio-visual hybrid adapter (LAVISH) to reduce the parameter count while improving performance.
Method: A small number of trainable parameters is injected into every layer of a frozen ViT to adapt it to audio-visual tasks. To fuse visual and audio cues efficiently, the adapter uses a small set of latent tokens that form an attention bottleneck, eliminating the quadratic cost of standard cross-attention.
Results: Compared with existing modality-specific audio-visual methods, the approach achieves competitive or even better performance on various audio-visual tasks while using fewer tunable parameters and without relying on costly audio pre-training or external audio encoders.

Vision transformers (ViTs) have achieved impressive results on various computer vision tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained only on visual data, to generalize to audio-visual data without finetuning any of its original parameters. To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT. To efficiently fuse visual and audio cues, our LAVISH adapter uses a small set of latent tokens, which form an attention bottleneck, thus, eliminating the quadratic cost of standard cross-attention. Compared to the existing modality-specific audio-visual methods, our approach achieves competitive or even better performance on various audio-visual tasks while using fewer tunable parameters and without relying on costly audio pretraining or external audio encoders. Our code is available at https://genjib.github.io/project_page/LAVISH/
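The attention-bottleneck idea can be illustrated with a short NumPy sketch. This is a simplified toy, not the LAVISH adapter: projection matrices and the frozen ViT are omitted, and attention is computed on raw features. The key structure is that a small set of L latent tokens first summarizes both modalities and each modality then reads back from the latents, so the total cost is O((Na+Nv)*L) instead of O(Na*Nv) for direct cross-attention.

```python
import numpy as np

rng = np.random.default_rng(0)
Na, Nv, L, d = 100, 196, 4, 16   # audio tokens, visual tokens, latent tokens (L small), dim

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv):
    # simplified attention: no learned projections, single head
    a = softmax(q @ kv.T / np.sqrt(d), axis=1)
    return a @ kv

audio = rng.normal(size=(Na, d))
visual = rng.normal(size=(Nv, d))
latents = rng.normal(size=(L, d))   # small trainable set forming the bottleneck

# latents summarize both modalities ...
latents = attend(latents, np.concatenate([audio, visual]))
# ... then each modality reads the fused summary back through a residual connection
audio_fused = audio + attend(audio, latents)
visual_fused = visual + attend(visual, latents)
```

With L fixed at a handful of tokens, the fusion cost grows linearly with the total number of audio and visual tokens.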

Towards Modality-Agnostic Person Re-Identification With Descriptive Query
Chen, Cuiqun and Ye, Mang and Jiang, Ding



Research question: This paper addresses cross-modality and multi-modality person re-identification, in particular how to retrieve a person with a text or sketch query when no photo query is available.
Motivation: Existing person re-identification methods usually focus on a single matching modality, such as text-to-image or sketch-to-photo. In practice, however, it is uncertain whether a text or a sketch is available, motivating a new and challenging modality-agnostic person re-identification problem.
Method: We propose a unified person re-identification (UNIReID) architecture that adapts effectively to cross-modality and multi-modality tasks. UNIReID uses a simple dual-encoder with task-specific modality learning to mine and fuse visual and textual information. To handle the imbalanced training of different tasks within UNIReID, we further propose a task-aware dynamic training strategy based on task difficulty that adaptively adjusts the training focus.
Results: We construct three multi-modal ReID datasets by collecting sketches corresponding to photos. Experiments show that UNIReID greatly improves retrieval accuracy and generalization across different tasks and unseen scenarios.

Person re-identification (ReID) with descriptive query (text or sketch) provides an important supplement for general image-image paradigms, which is usually studied in a single cross-modality matching manner, e.g., text-to-image or sketch-to-photo. However, without a camera-captured photo query, it is uncertain whether the text or sketch is available or not in practical scenarios. This motivates us to study a new and challenging modality-agnostic person re-identification problem. Towards this goal, we propose a unified person re-identification (UNIReID) architecture that can effectively adapt to cross-modality and multi-modality tasks. Specifically, UNIReID incorporates a simple dual-encoder with task-specific modality learning to mine and fuse visual and textual modality information. To deal with the imbalanced training problem of different tasks in UNIReID, we propose a task-aware dynamic training strategy in terms of task difficulty, adaptively adjusting the training focus. Besides, we construct three multi-modal ReID datasets by collecting the corresponding sketches from photos to support this challenging task. The experimental results on three multi-modal ReID datasets show that our UNIReID greatly improves the retrieval accuracy and generalization ability on different tasks and unseen scenarios.

Learning To Exploit Temporal Structure for Biomedical Vision-Language Processing
Bannur, Shruthi and Hyland, Stephanie and Liu, Qianchu and Pérez-García



Research question: How to better exploit the semantic alignment between imaging and text modalities for self-supervised learning, particularly in the biomedical domain.
Motivation: Prior biomedical vision-language processing (VLP) work mostly relies on aligning single image-report pairs, even though clinical notes commonly refer to prior images; this both introduces poor cross-modal alignment and misses the rich self-supervision available in the temporal content of the data.
Method: Our approach, named BioViL-T, trains a CNN-Transformer hybrid multi-image encoder jointly with a text model. When prior images and reports are available, we explicitly account for them during both training and fine-tuning. The design handles challenges such as pose variation and missing input images across time.
Results: The model excels on downstream tasks in both single- and multi-image setups, achieving state-of-the-art performance on progression classification, phrase grounding, and report generation, while offering consistent improvements on disease classification and sentence-similarity tasks. We release a new multi-modal temporal benchmark dataset, CXR-T, to quantify the quality of vision-language representations; the results show the significant advantages of incorporating prior images and reports to make the most of the data.

Self-supervised learning in vision--language processing (VLP) exploits semantic alignment between imaging and text modalities. Prior work in biomedical VLP has mostly relied on the alignment of single image and report pairs even though clinical notes commonly refer to prior images. This not only introduces poor alignment between the modalities but also misses an opportunity to exploit rich self-supervision through existing temporal content in the data. In this work, we explicitly account for prior images and reports when available during both training and fine-tuning. Our approach, named BioViL-T, uses a CNN--Transformer hybrid multi-image encoder trained jointly with a text model. It is designed to be versatile to arising challenges such as pose variations and missing input images across time. The resulting model excels on downstream tasks both in single- and multi-image setups, achieving state-of-the-art (SOTA) performance on (I) progression classification, (II) phrase grounding, and (III) report generation, whilst offering consistent improvements on disease classification and sentence-similarity tasks. We release a novel multi-modal temporal benchmark dataset, CXR-T, to quantify the quality of vision--language representations in terms of temporal semantics. Our experimental results show the significant advantages of incorporating prior images and reports to make the most of the data.

Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval
Jiang, Ding and Ye, Mang



Research question: This paper addresses text-to-image person retrieval, i.e., how to map the visual and textual modalities into a common latent space given a textual description query.
Motivation: Existing methods mainly rely on separately pre-trained unimodal models to extract visual and textual features, but lack the underlying alignment ability needed to match multimodal data effectively. Moreover, they use prior information to explore explicit part alignment, which may distort intra-modality information.
Method: We propose IRRA, a cross-modal Implicit Relation Reasoning and Aligning framework that learns relations between local visual-textual tokens and enhances global image-text matching without extra prior supervision. Specifically, an Implicit Relation Reasoning module in a masked-language-modeling paradigm achieves cross-modal interaction by integrating visual cues into the textual tokens through a cross-modal multimodal interaction encoder. Secondly, to globally align the visual and textual embeddings, Similarity Distribution Matching is proposed to minimize the KL divergence between image-text similarity distributions and the normalized label matching distributions.
Results: The proposed method achieves new state-of-the-art results on all three public datasets, with a notable Rank-1 accuracy margin of about 3%-9% over prior methods.

Text-to-image person retrieval aims to identify the target person based on a given textual description query. The primary challenge is to learn the mapping of visual and textual modalities into a common latent space. Prior works have attempted to address this challenge by leveraging separately pre-trained unimodal models to extract visual and textual features. However, these approaches lack the necessary underlying alignment capabilities required to match multimodal data effectively. Besides, these works use prior information to explore explicit part alignments, which may lead to the distortion of intra-modality information. To alleviate these issues, we present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework that learns relations between local visual-textual tokens and enhances global image-text matching without requiring additional prior supervision. Specifically, we first design an Implicit Relation Reasoning module in a masked language modeling paradigm. This achieves cross-modal interaction by integrating the visual cues into the textual tokens with a cross-modal multimodal interaction encoder. Secondly, to globally align the visual and textual embeddings, Similarity Distribution Matching is proposed to minimize the KL divergence between image-text similarity distributions and the normalized label matching distributions. The proposed method achieves new state-of-the-art results on all three public datasets, with a notable margin of about 3%-9% for Rank-1 accuracy compared to prior methods.
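The Similarity Distribution Matching objective can be sketched in NumPy. This is a minimal illustration under assumed shapes and a made-up temperature, not the IRRA implementation: random vectors stand in for image and text embeddings, and the identity labels are hypothetical. The predicted image-to-text similarity distribution is matched to the normalized label matching distribution via a KL term.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d = 4, 8
img = rng.normal(size=(B, d))
txt = rng.normal(size=(B, d))
img /= np.linalg.norm(img, axis=1, keepdims=True)   # L2-normalize embeddings
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
labels = np.array([0, 0, 1, 2])                     # person IDs; matching pairs share an ID

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

sim = img @ txt.T / 0.07                  # temperature-scaled cosine similarities
p = softmax(sim, axis=1)                  # predicted image-to-text distribution
match = (labels[:, None] == labels[None, :]).astype(float)
q = match / match.sum(1, keepdims=True)   # normalized label matching distribution
eps = 1e-8
sdm_loss = np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1))
```

Minimizing this loss pushes the similarity of each image toward all texts that share its identity, rather than toward a single positive as in a plain contrastive loss.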

REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory
Hu, Ziniu and Iscen, Ahmet and Sun, Chen and Wang, Zirui and Chang, Kai-Wei and Sun, Yizhou and Schmid, Cordelia and Ross, David A. and Fathi, Alireza



Research question: This paper proposes an end-to-end Retrieval-Augmented Visual Language Model (REVEAL) that encodes world knowledge into a large-scale memory and retrieves from it to answer knowledge-intensive queries.
Motivation: A key novelty of the approach is that the memory, encoder, retriever, and generator are all pre-trained end-to-end on a massive amount of data, and that a diverse set of multimodal knowledge sources can be used, which is shown to yield significant gains.
Method: REVEAL consists of four components: the memory, the encoder, the retriever, and the generator. The large-scale memory encodes various sources of multimodal world knowledge (e.g., image-text pairs, question-answering pairs, knowledge graph triplets) via a unified encoder; the retriever finds the most relevant knowledge entries in the memory, and the generator fuses the retrieved knowledge with the input query to produce the output.
Results: REVEAL achieves state-of-the-art results on visual question answering and image captioning.

In this paper, we propose an end-to-end Retrieval-Augmented Visual Language Model (REVEAL) that learns to encode world knowledge into a large-scale memory, and to retrieve from it to answer knowledge-intensive queries. REVEAL consists of four key components: the memory, the encoder, the retriever and the generator. The large-scale memory encodes various sources of multimodal world knowledge (e.g. image-text pairs, question answering pairs, knowledge graph triplets, etc.) via a unified encoder. The retriever finds the most relevant knowledge entries in the memory, and the generator fuses the retrieved knowledge with the input query to produce the output. A key novelty in our approach is that the memory, encoder, retriever and generator are all pre-trained end-to-end on a massive amount of data. Furthermore, our approach can use a diverse set of multimodal knowledge sources, which is shown to result in significant gains. We show that REVEAL achieves state-of-the-art results on visual question answering and image captioning.

Generic-to-Specific Distillation of Masked Autoencoders
Huang, Wei and Peng, Zhiliang and Dong, Li and Wei, Furu and Jiao, Jianbin and Ye, Qixiang



Research question: How to exploit the knowledge of large pre-trained Vision Transformer (ViT) models to improve the performance of small ViT models.
Motivation: Although large ViTs have made remarkable progress through self-supervised pre-training, small ViT models limited by their capacity benefit little from those mechanisms.
Method: We propose generic-to-specific distillation (G2SD), which transfers knowledge from large models pre-trained by masked autoencoders to small models. In generic distillation, the decoder of the small model is encouraged to align its feature predictions with the hidden representations of the large model, transferring task-agnostic knowledge; in specific distillation, the predictions of the small model are constrained to be consistent with those of the large model, transferring the task-specific features that guarantee task performance.
Results: With G2SD, the vanilla ViT-Small model reaches 98.7%, 98.1%, and 99.3% of its teacher's (ViT-Base) performance on image classification, object detection, and semantic segmentation respectively, setting a solid baseline for two-stage vision distillation.

Large vision Transformers (ViTs) driven by self-supervised pre-training mechanisms achieved unprecedented progress. Lightweight ViT models limited by the model capacity, however, benefit little from those pre-training mechanisms. Knowledge distillation defines a paradigm to transfer representations from large (teacher) models to small (student) ones. However, the conventional single-stage distillation easily gets stuck on task-specific transfer, failing to retain the task-agnostic knowledge crucial for model generalization. In this study, we propose generic-to-specific distillation (G2SD), to tap the potential of small ViT models under the supervision of large models pre-trained by masked autoencoders. In generic distillation, the decoder of the small model is encouraged to align feature predictions with hidden representations of the large model, so that task-agnostic knowledge can be transferred. In specific distillation, predictions of the small model are constrained to be consistent with those of the large model, to transfer task-specific features which guarantee task performance. With G2SD, the vanilla ViT-Small model respectively achieves 98.7%, 98.1% and 99.3% of the performance of its teacher (ViT-Base) for image classification, object detection, and semantic segmentation, setting a solid baseline for two-stage vision distillation. Code will be available at https://github.com/pengzhiliang/G2SD.
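The two distillation stages can be written as two simple losses. This is a schematic NumPy sketch under toy shapes, not the G2SD training code: random arrays stand in for the teacher's hidden representations and logits, and the student outputs are simulated as noisy copies. Stage 1 aligns the student's feature predictions with the teacher's hidden states (here, an MSE term); stage 2 matches the students' predictions to the teacher's (here, a KL term over softened predictions).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# --- generic distillation: task-agnostic feature alignment ---
teacher_hidden = rng.normal(size=(49, 32))                        # large-model representations
student_pred = teacher_hidden + 0.1 * rng.normal(size=(49, 32))   # student decoder feature predictions
generic_loss = np.mean((student_pred - teacher_hidden) ** 2)

# --- specific distillation: task-specific prediction consistency ---
teacher_logits = rng.normal(size=(4, 10))
student_logits = teacher_logits + 0.1 * rng.normal(size=(4, 10))
pt, ps = softmax(teacher_logits), softmax(student_logits)
specific_loss = np.mean(np.sum(pt * (np.log(pt) - np.log(ps)), axis=1))   # KL(teacher || student)
```

The key design point is the ordering: the generic stage runs during pre-training so the student keeps task-agnostic knowledge, and the specific stage runs during fine-tuning so task performance is preserved.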

Improving Cross-Modal Retrieval With Set of Diverse Embeddings
Kim, Dongwon and Kim, Namyup and Kwak, Suha



Research question: Cross-modal retrieval across image and text modalities is challenging due to the inherent ambiguity between the two modalities.
Motivation: To address this problem, we propose a new set-based embedding method.
Method: We present a new similarity function, smooth-Chamfer similarity, designed to alleviate the side effects of existing similarity functions for set-based embedding, and a novel set prediction module that effectively captures the diverse semantics of the input via the slot attention mechanism.
Results: Evaluated on the COCO and Flickr30K datasets, the method outperforms existing approaches, including ones that demand substantially larger computation at inference.

Cross-modal retrieval across image and text modalities is a challenging task due to its inherent ambiguity: An image often exhibits various situations, and a caption can be coupled with diverse images. Set-based embedding has been studied as a solution to this problem. It seeks to encode a sample into a set of different embedding vectors that capture different semantics of the sample. In this paper, we present a novel set-based embedding method, which is distinct from previous work in two aspects. First, we present a new similarity function called smooth-Chamfer similarity, which is designed to alleviate the side effects of existing similarity functions for set-based embedding. Second, we propose a novel set prediction module to produce a set of embedding vectors that effectively captures diverse semantics of input by the slot attention mechanism. Our method is evaluated on the COCO and Flickr30K datasets across different visual backbones, where it outperforms existing methods including ones that demand substantially larger computation at inference.
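A set-to-set similarity of this flavor can be sketched as follows. This is a hedged illustration of one plausible smoothed Chamfer variant, not necessarily the paper's exact smooth-Chamfer formula: each element's hard nearest-neighbor match is replaced by a temperature-controlled log-sum-exp, averaged over both matching directions, with `alpha` a hypothetical smoothing temperature.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, alpha = 4, 8, 16.0   # embeddings per set, dim, smoothing temperature (illustrative)
s1 = rng.normal(size=(K, d))
s2 = rng.normal(size=(K, d))
s1 /= np.linalg.norm(s1, axis=1, keepdims=True)   # each set holds unit embedding vectors
s2 /= np.linalg.norm(s2, axis=1, keepdims=True)

def lse(x, axis):
    # numerically stable log-sum-exp
    m = x.max(axis=axis)
    return m + np.log(np.exp(x - np.expand_dims(m, axis)).sum(axis=axis))

cos = s1 @ s2.T   # pairwise cosine similarities between the two embedding sets
# soft max-matching in both directions, averaged; as alpha -> inf this
# recovers the plain Chamfer similarity built on hard max matching
sim = (lse(alpha * cos, axis=1).mean() + lse(alpha * cos, axis=0).mean()) / (2 * alpha)
```

The smoothing matters for training: a hard max passes gradient to only one pair per element, while the log-sum-exp spreads gradient across all pairs in proportion to their similarity.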

Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning
Lu, Xiaocheng and Guo, Song and Liu, Ziming and Guo, Jingcai



Research question: This paper addresses Compositional Zero-Shot Learning (CZSL), i.e., recognizing novel concepts formed from states and objects seen during training.
Motivation: Existing methods either learn a combined state-object representation, which challenges generalization to unseen compositions, or design two classifiers to recognize state and object separately from image features, ignoring the intrinsic relationship between them.
Method: We propose a new framework, Decomposed Fusion with Soft Prompt (DFSP), which involves vision-language models (VLMs) for unseen composition recognition. Specifically, DFSP builds a vector combination of learnable soft prompts with state and object to establish their joint representation. In addition, a cross-modal decomposed fusion module is designed between the language and image branches, which decomposes state and object in the language features rather than the image features. Notably, fused with the decomposed features, the image features can learn the relationship with states and objects more expressively, improving the response to unseen compositions in the pair space and narrowing the domain gap between seen and unseen sets.
Results: Experiments on three challenging benchmarks show that our approach significantly outperforms other state-of-the-art methods.

Compositional Zero-Shot Learning (CZSL) aims to recognize novel concepts formed by known states and objects during training. Existing methods either learn the combined state-object representation, challenging the generalization of unseen compositions, or design two classifiers to identify state and object separately from image features, ignoring the intrinsic relationship between them. To jointly eliminate the above issues and construct a more robust CZSL system, we propose a novel framework termed Decomposed Fusion with Soft Prompt (DFSP), by involving vision-language models (VLMs) for unseen composition recognition. Specifically, DFSP constructs a vector combination of learnable soft prompts with state and object to establish the joint representation of them. In addition, a cross-modal decomposed fusion module is designed between the language and image branches, which decomposes state and object among language features instead of image features. Notably, being fused with the decomposed features, the image features can be more expressive for learning the relationship with states and objects, respectively, to improve the response of unseen compositions in the pair space, hence narrowing the domain gap between seen and unseen sets. Experimental results on three challenging benchmarks demonstrate that our approach significantly outperforms other state-of-the-art methods by large margins.

Uncurated Image-Text Datasets: Shedding Light on Demographic Bias
Garcia, Noa and Hirota, Yusuke and Wu, Yankun and Nakashima, Yuta



Research question: Large uncurated datasets used to train vision-and-language models can raise fairness concerns.
Motivation: Even small, manually annotated datasets such as MSCOCO are affected by societal bias, and data crawled from the Internet without much control may make the problem worse.
Method: We annotate part of the Google Conceptual Captions dataset, widely used for training vision-and-language models, with four demographic and two contextual attributes.
Results: Evaluating three prevailing vision-and-language tasks (image captioning, text-image CLIP embeddings, and text-to-image generation) shows that societal bias is a persistent problem in all of them.

The increasing tendency to collect large and uncurated datasets to train vision-and-language models has raised concerns about fair representations. It is known that even small but manually annotated datasets, such as MSCOCO, are affected by societal bias. This problem, far from being solved, may be getting worse with data crawled from the Internet without much control. In addition, the lack of tools to analyze societal bias in big collections of images makes addressing the problem extremely challenging. Our first contribution is to annotate part of the Google Conceptual Captions dataset, widely used for training vision-and-language models, with four demographic and two contextual attributes. Our second contribution is to conduct a comprehensive analysis of the annotations, focusing on how different demographic groups are represented. Our last contribution lies in evaluating three prevailing vision-and-language tasks: image captioning, text-image CLIP embeddings, and text-to-image generation, showing that societal bias is a persistent problem in all of them.

FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation
Qin, Jie and Wu, Jie and Yan, Pengxiang and Li, Ming and Yuxi, Ren and Xiao, Xuefeng and Wang, Yitong and Wang, Rui and Wen, Shilei and Pan, Xin and Wang, Xingang



Research question: Existing methods design specialized architectures or parameters for specific segmentation tasks, fragmenting the various segmentation tasks and hindering the uniformity of segmentation models.
Motivation: To serve more general-purpose application scenarios, open-vocabulary learning has emerged to segment arbitrary categories from text descriptions, popularizing segmentation systems; yet existing methods remain devoted to task-specific architectures or parameters.
Method: We propose FreeSeg, a generic framework for unified, universal, and open-vocabulary image segmentation. FreeSeg optimizes an all-in-one network via one-shot training and, at inference, handles diverse segmentation tasks seamlessly with the same architecture and parameters. Adaptive prompt learning further helps the unified model capture task-aware and category-sensitive concepts, improving robustness across multiple tasks and varied scenarios.
Results: Extensive experiments show that FreeSeg establishes new state-of-the-art results in performance and generalization on three segmentation tasks, outperforming the best task-specific architectures by large margins on COCO for unseen classes: +5.5% mIoU on semantic segmentation, +17.6% mAP on instance segmentation, and +20.1% PQ on panoptic segmentation.

Recently, open-vocabulary learning has emerged to accomplish segmentation for arbitrary categories of text-based descriptions, which popularizes the segmentation system to more general-purpose application scenarios. However, existing methods are devoted to designing specialized architectures or parameters for specific segmentation tasks. These customized design paradigms lead to fragmentation between various segmentation tasks, thus hindering the uniformity of segmentation models. Hence in this paper, we propose FreeSeg, a generic framework to accomplish Unified, Universal and Open-Vocabulary Image Segmentation. FreeSeg optimizes an all-in-one network via one-shot training and employs the same architecture and parameters to handle diverse segmentation tasks seamlessly in the inference procedure. Additionally, adaptive prompt learning facilitates the unified model to capture task-aware and category-sensitive concepts, improving model robustness in multi-task and varied scenarios. Extensive experimental results demonstrate that FreeSeg establishes new state-of-the-art results in performance and generalization on three segmentation tasks, which outperforms the best task-specific architectures by a large margin: 5.5% mIoU on semantic segmentation, 17.6% mAP on instance segmentation, 20.1% PQ on panoptic segmentation for the unseen class on COCO. Project page: https://FreeSeg.github.io.

AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR
Seo, Paul Hongsuck and Nagrani, Arsha and Schmid, Cordelia



Research question: This work aims to improve the robustness of speech recognition systems by incorporating visual information.
Motivation: Training fully supervised multimodal models from scratch requires large labelled audio-visual datasets, which limits practical use.
Method: We present AVFormer, a simple way to augment a frozen audio-only ASR model with visual information by injecting visual embeddings through lightweight trainable adapters.
Results: Experiments show that the method achieves state-of-the-art zero-shot results on three AV-ASR benchmarks (How2, VisSpeech, and Ego4D), while also preserving decent performance on the traditional audio-only speech recognition benchmark LibriSpeech.

Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a speech recognition system by incorporating visual information. Training fully supervised multimodal models for this task from scratch, however, is limited by the need for large labelled audiovisual datasets (in each downstream domain of interest). We present AVFormer, a simple method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation. We do this by (i) injecting visual embeddings into a frozen ASR model using lightweight trainable adaptors. We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters. (ii) We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively; and finally (iii) we show that our model achieves state-of-the-art zero-shot results on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (LibriSpeech). Qualitative results show that our model effectively leverages visual information for robust speech recognition.

CLIP2: Contrastive Language-Image-Point Pretraining From Real-World Point Cloud Data
Zeng, Yihan and Jiang, Chenhan and Mao, Jiageng and Han, Jianhua and Ye, Chaoqiang and Huang, Qingqiu and Yeung, Dit-Yan and Yang, Zhen and Liang, Xiaodan and Xu, Hang



Research question: How to transfer the success of 2D vision-language models to 3D space for open-world 3D vision understanding.
Motivation: Because text-3D data pairs are limited, existing methods usually construct intermediate 2D representations for the 3D data, at the cost of losing 3D geometry information.
Method: We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP^2), which exploits naturally existing 2D-3D scene correspondences to build well-aligned, instance-based text-image-point proxies, and a cross-modal contrastive objective to learn semantically and instance-level aligned point cloud representations.
Results: Experiments show that our learned 3D representation transfers strongly to downstream tasks, including zero-shot and few-shot 3D recognition, boosting state-of-the-art methods by large margins.

Contrastive Language-Image Pre-training, benefiting from large-scale unlabeled text-image pairs, has demonstrated great performance in open-world vision understanding tasks. However, due to the limited Text-3D data pairs, adapting the success of 2D Vision-Language Models (VLM) to the 3D space remains an open problem. Existing works that leverage VLM for 3D understanding generally resort to constructing intermediate 2D representations for the 3D data, but at the cost of losing 3D geometry information. To take a step toward open-world 3D vision understanding, we propose Contrastive Language-Image-Point Cloud Pretraining (CLIP^2) to directly learn the transferable 3D point cloud representation in realistic scenarios with a novel proxy alignment mechanism. Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios, and build well-aligned and instance-based text-image-point proxies from those complex scenarios. On top of that, we propose a cross-modal contrastive objective to learn semantic and instance-level aligned point cloud representation. Experimental results on both indoor and outdoor scenarios show that our learned 3D representation has great transfer ability in downstream tasks, including zero-shot and few-shot 3D recognition, which boosts the state-of-the-art methods by large margins. Furthermore, we provide analyses of the capability of different representations in real scenarios and present the optional ensemble scheme.

CLIPPO: Image-and-Language Understanding From Pixels Only
Tschannen, Michael and Mustafa, Basil and Houlsby, Neil



Research question: Can a single encoder process both images and text to unify image, text, and multimodal tasks?
Motivation: Current multimodal models often consist of many task- and modality-specific components and complex training procedures.
Method: We propose CLIP-Pixels Only (CLIPPO), a pure pixel-based model that uses a single encoder to process both regular images and text rendered as images, trained with a contrastive loss alone.
Results: CLIPPO performs image tasks such as retrieval and zero-shot image classification almost as well as CLIP-style models, with half the number of parameters and no text-specific tower or embedding. When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO also performs well on natural language understanding tasks without any word-level loss (language modelling or masked language modelling), outperforming pixel-based prior work. Surprisingly, CLIPPO obtains good visual question answering accuracy simply by rendering the question and image together. Finally, because CLIPPO requires no tokenizer, it achieves strong multilingual multimodal retrieval performance without modifications.

Multimodal models are becoming increasingly effective, in part due to unified components, such as the Transformer architecture. However, multimodal models still often consist of many task- and modality-specific pieces and training procedures. For example, CLIP (Radford et al., 2021) trains independent text and image towers via a contrastive loss. We explore an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks. Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both regular images and text rendered as images. CLIPPO performs image-based tasks such as retrieval and zero-shot image classification almost as well as CLIP-style models, with half the number of parameters and no text-specific tower or embedding. When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks, without any word-level loss (language modelling or masked language modelling), outperforming pixel-based prior work. Surprisingly, CLIPPO can obtain good accuracy in visual question answering, simply by rendering the question and image together. Finally, we exploit the fact that CLIPPO does not require a tokenizer to show that it can achieve strong performance on multilingual multimodal retrieval without modifications. Code and pretrained models are available at https://github.com/google-research/big_vision.

Masked Auto-Encoders Meet Generative Adversarial Networks and Beyond
Fei, Zhengcong and Fan, Mingyuan and Zhu, Li and Huang, Junshi and Wei, Xiaoming and Wei, Xiaolin



Research question: How to use generative adversarial training to improve the efficiency and performance of pre-trained vision Transformer models.
Motivation: Although Masked Auto-Encoder (MAE) pre-training performs well on image tasks, it requires a large amount of training resources.
Method: We propose a new framework, GAN-MAE, in which a generator produces the masked image patches from the visible ones, and a discriminator predicts whether each patch was synthesized by the generator. The vision Transformer backbones of the generator and discriminator share parameters.
Results: Experiments show that the adversarial training of GAN-MAE is more efficient than standard MAE and outperforms it given the same model size, training data, and computation resources. The approach also transfers well to downstream tasks.

Masked Auto-Encoder (MAE) pretraining methods randomly mask image patches and then train a vision Transformer to reconstruct the original pixels based on the unmasked patches. While they demonstrate impressive performance for downstream vision tasks, they generally require a large amount of training resource. In this paper, we introduce a novel Generative Adversarial Networks alike framework, referred to as GAN-MAE, where a generator is used to generate the masked patches according to the remaining visible patches, and a discriminator is employed to predict whether the patch is synthesized by the generator. We believe this capacity of distinguishing whether the image patch is predicted or original is beneficial to representation learning. Another key point is that the parameters of the vision Transformer backbone in the generator and discriminator are shared. Extensive experiments demonstrate that adversarial training of the GAN-MAE framework is more efficient and accordingly outperforms the standard MAE given the same model size, training data, and computation resource. The gains are substantially robust for different model sizes and datasets; in particular, a ViT-B model trained with GAN-MAE for 200 epochs outperforms the MAE with 1600 epochs on fine-tuning top-1 accuracy of ImageNet-1k with much less FLOPs. Besides, our approach also works well at transferring downstream tasks.

iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition
Wei, Yixuan and Cao, Yue and Zhang, Zheng and Peng, Houwen and Yao, Zhuliang and Xie, Zhenda and Hu, Han and Guo, Baining



Research question: How to effectively combine two prevalent visual recognition approaches, image classification and contrastive language-image pre-training.
Motivation: Existing multi-task learning approaches handle image classification and language-image pre-training with two separate heads, which is limiting.
Method: This paper proposes iCLIP, which deeply fuses the two tasks so that image classification shares the same formulation and model weights with language-image pre-training. The two tasks are further bridged by enriching the category names in image classification with external knowledge, such as their dictionary descriptions.
Results: Experiments show that the method combines the strengths of both tasks: strong discriminative ability from the clean category labels of image classification, and good zero-shot ability from the rich semantics of the text descriptions in CLIP. It reaches 82.9% top-1 accuracy on IN-1K and surpasses CLIP by 1.8% on zero-shot recognition over the Kornblith 12-dataset benchmark with a similar model size. Code and models are publicly released.

This paper presents a method that effectively combines two prevalent visual recognition methods, i.e., image classification and contrastive language-image pre-training, dubbed iCLIP. Instead of naive multi-task learning that uses two separate heads for each task, we fuse the two tasks in a deep fashion that adapts the image classification to share the same formula and the same model weights with the language-image pre-training. To further bridge these two tasks, we propose to enhance the category names in image classification tasks using external knowledge, such as their descriptions in dictionaries. Extensive experiments show that the proposed method combines the advantages of the two tasks well: the strong discrimination ability of image classification tasks due to the clear and clean category labels, and the good zero-shot ability of CLIP tasks ascribed to the richer semantics in the text descriptions. In particular, it reaches 82.9% top-1 accuracy on IN-1K, and surpasses CLIP by 1.8%, with similar model size, on zero-shot recognition on the Kornblith 12-dataset benchmark. The code and models are publicly available at https://github.com/weiyx16/iCLIP.

CapDet: Unifying Dense Captioning and Open-World Detection Pretraining
Long, Yanxin and Wen, Youpeng and Han, Jianhua and Xu, Hang and Ren, Pengzhen and Zhang, Wei and Zhao, Shen and Liang, Xiaodan



Research question: Existing open-world detection methods still require a pre-defined category space at inference time and can only predict objects belonging to that space.
Motivation: Toward truly open-world detection, this paper proposes CapDet, which can either predict under a given category list or directly generate the category of a predicted bounding box.
Method: Open-world detection and dense captioning are unified into a single effective framework by adding a dense captioning head that generates region-grounded captions.
Results: Experiments show that by unifying the dense captioning task, CapDet improves over the baseline by 2.1% mAP on LVIS rare classes (1203 classes in total) and achieves state-of-the-art dense captioning performance on VG V1.2 and VG-COCO.

Benefiting from large-scale vision-language pre-training on image-text pairs, open-world detection methods have shown superior generalization ability under the zero-shot or few-shot detection settings. However, a pre-defined category space is still required during the inference stage of existing methods and only the objects belonging to that space will be predicted. To introduce a "real" open-world detector, in this paper, we propose a novel method named CapDet to either predict under a given category list or directly generate the category of predicted bounding boxes. Specifically, we unify the open-world detection and dense caption tasks into a single yet effective framework by introducing an additional dense captioning head to generate the region-grounded captions. Besides, adding the captioning task will in turn benefit the generalization of detection performance since the captioning dataset covers more concepts. Experimental results show that by unifying the dense caption task, our CapDet has obtained significant performance improvements (e.g., +2.1% mAP on LVIS rare classes) over the baseline method on LVIS (1203 classes). Besides, our CapDet also achieves state-of-the-art performance on dense captioning tasks, e.g., 15.44% mAP on VG V1.2 and 13.98% on the VG-COCO dataset.

RILS: Masked Visual Reconstruction in Language Semantic Space
Yang, Shusheng and Ge, Yixiao and Yi, Kun and Li, Dian and Shan, Ying and Qie, Xiaohu and Wang, Xinggang



Research question: This paper explores the synergy between masked image modeling (MIM) and natural language supervision, and the properties that emerge when the two paradigms are combined.
Motivation: Combining MIM with natural language supervision lets vision models capture structured information, while predicting proper semantics for masked tokens in turn improves the text encoder.
Method: A masked visual Reconstruction In Language semantic Space (RILS) pre-training framework is proposed, in which sentence representations serve as prototypes that transform vision-only signals into patch-sentence probabilities, used as semantically meaningful MIM reconstruction targets.
Results: Experiments show the method not only enjoys the best of prior MIM and CLIP but also achieves further gains on various tasks from their mutual benefit. RILS transfers well to downstream classification, detection, and segmentation, especially in low-shot regimes.

Both masked image modeling (MIM) and natural language supervision have facilitated the progress of transferable visual pre-training. In this work, we seek the synergy between two paradigms and study the emerging properties when MIM meets natural language supervision. To this end, we present a novel masked visual Reconstruction In Language semantic Space (RILS) pre-training framework, in which sentence representations, encoded by the text encoder, serve as prototypes to transform the vision-only signals into patch-sentence probabilities as semantically meaningful MIM reconstruction targets. The vision models can therefore capture useful components with structured information by predicting the proper semantics of masked tokens. Better visual representations could, in turn, improve the text encoder via the image-text alignment objective, which is essential for the effective MIM target transformation. Extensive experimental results demonstrate that our method not only enjoys the best of previous MIM and CLIP but also achieves further improvements on various tasks due to their mutual benefits. RILS exhibits advanced transferability on downstream classification, detection, and segmentation, especially for low-shot regimes. Code is available at https://github.com/hustvl/RILS.
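The core transformation in RILS, turning a patch feature into a distribution over sentence prototypes, amounts to a temperature-scaled softmax over cosine similarities. A minimal sketch (function name and temperature value are illustrative, not from the paper):

```python
import math

def patch_sentence_probs(patch_feat, sentence_protos, temperature=0.1):
    """Map one patch feature to a probability over sentence prototypes.

    Cosine similarity to each prototype, then a temperature-scaled
    softmax; these probabilities serve as the MIM reconstruction target.
    Hedged sketch of the idea, not the paper's implementation.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    sims = [cosine(patch_feat, p) / temperature for p in sentence_protos]
    m = max(sims)  # max-shift for numerical stability
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]
```

The masked patch's predicted distribution is then trained to match the target distribution computed from the unmasked view.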

Improving Visual Grounding by Encouraging Consistent Gradient-Based Explanations
Yang, Ziyan and Kafle, Kushal and Dernoncourt, Franck and Ordonez, Vicente



Research question: How to tune joint vision-language models so that their gradient-based explanations are consistent with region-level annotations provided by humans on relatively small grounding datasets.
Motivation: Prior methods rely on vision-language models to score the outputs of object detectors; this work instead improves visual grounding by aligning the model's explanations with human annotations.
Method: A margin-based loss, called Attention Mask Consistency (AMC), tunes joint vision-language models so that their explanations agree with region-level annotations.
Results: A standard vision-language model trained with AMC reaches 86.49% accuracy on the Flickr30k visual grounding benchmark, a 5.38% absolute improvement over the best prior model trained under the same level of supervision. It also performs strongly on referring expression comprehension, with 80.34% accuracy on the easy test of RefCOCO+ and 64.55% on the difficult split.

We propose a margin-based loss for tuning joint vision-language models so that their gradient-based explanations are consistent with region-level annotations provided by humans for relatively smaller grounding datasets. We refer to this objective as Attention Mask Consistency (AMC) and demonstrate that it produces superior visual grounding results than previous methods that rely on using vision-language models to score the outputs of object detectors. Particularly, a model trained with AMC on top of standard vision-language modeling objectives obtains a state-of-the-art accuracy of 86.49% in the Flickr30k visual grounding benchmark, an absolute improvement of 5.38% when compared to the best previous model trained under the same level of supervision. Our approach also performs exceedingly well on established benchmarks for referring expression comprehension where it obtains 80.34% accuracy in the easy test of RefCOCO+, and 64.55% in the difficult split. AMC is effective, easy to implement, and general, as it can be adopted by any vision-language model and can use any type of region annotations.
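The margin-based idea behind AMC can be illustrated with a simplified scalar form: push the peak explanation value inside the annotated region above the peak outside it by a margin. This is a hedged single-term sketch; the paper's actual objective operates on attention heatmaps and combines additional terms:

```python
def amc_margin_loss(heatmap, region_mask, margin=0.2):
    """Hinge loss encouraging the explanation to peak inside the region.

    heatmap: flat list of attention/explanation values for one image.
    region_mask: same length, 1 inside the human-annotated box, 0 outside.
    Simplified illustration of Attention Mask Consistency, not the
    authors' implementation.
    """
    inside = max(h for h, m in zip(heatmap, region_mask) if m == 1)
    outside = max(h for h, m in zip(heatmap, region_mask) if m == 0)
    return max(0.0, outside - inside + margin)
```

When the explanation already peaks inside the annotated box by more than the margin, the loss is zero and gradients leave the model untouched.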

Learning Visual Representations via Language-Guided Sampling
ElBanani, Mohamed and Desai, Karan and Johnson, Justin



Research question: How to improve visual representation learning by using language similarity to sample semantically similar image pairs for contrastive learning.
Motivation: Existing visual representation learning mostly relies on hand-crafted augmentations or learned clusters; the proposed method samples view pairs by language similarity, which better captures semantic information.
Method: Pre-trained language models guide the learning rather than directly minimizing a cross-modal loss: specifically, language similarity is used to sample semantically similar image pairs for contrastive learning.
Results: Experiments show that language-guided learning yields better features than image-based and image-text representation learning approaches.

Although an object may appear in numerous contexts, we often describe it in a limited number of ways. Language allows us to abstract away visual variation to represent and communicate concepts. Building on this intuition, we propose an alternative approach to visual representation learning: using language similarity to sample semantically similar image pairs for contrastive learning. Our approach diverges from image-based contrastive learning by sampling view pairs using language similarity instead of hand-crafted augmentations or learned clusters. Our approach also differs from image-text contrastive learning by relying on pre-trained language models to guide the learning rather than directly minimizing a cross-modal loss. Through a series of experiments, we show that language-guided learning yields better features than image-based and image-text representation learning approaches.
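The sampling step can be sketched as a nearest-neighbor lookup in caption-embedding space: each image's positive view is the image whose caption embedding is most similar. Illustrative code under the assumption that caption embeddings are precomputed (a brute-force sketch, not the paper's pipeline):

```python
import math

def language_guided_pairs(caption_embs):
    """For each image i, pick the partner j whose caption embedding is
    most cosine-similar to i's (excluding i itself).

    caption_embs: list of caption embedding vectors, one per image.
    Returns a list of (i, j) positive pairs for contrastive learning.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))

    pairs = []
    for i, a in enumerate(caption_embs):
        best_j = max((j for j in range(len(caption_embs)) if j != i),
                     key=lambda j: cosine(a, caption_embs[j]))
        pairs.append((i, best_j))
    return pairs
```

In practice one would use an approximate nearest-neighbor index rather than this O(n²) scan.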

Logical Implications for Visual Question Answering Consistency
Tascon-Morales, Sergio and Márquez-Neila, Pablo and Sznitman, Raphael



Research question: Despite considerable recent progress in Visual Question Answering (VQA), inconsistent or contradictory answers continue to cast doubt on models' true reasoning capabilities.
Motivation: Most existing VQA methods enforce consistency through indirect strategies or strong assumptions on question-answer pairs; this work instead proposes a novel strategy that improves performance by directly reducing logical inconsistencies.
Method: A new consistency loss term, applicable to a wide range of VQA models, relies on knowing the logical relation between question-answer pairs. Since such information is typically unavailable in VQA datasets, a dedicated language model is used to infer these logical relations, which then feed the proposed consistency loss function.
Results: Extensive experiments on the VQA Introspect and DME datasets show that the method improves state-of-the-art VQA models while remaining robust across different architectures and settings.

Despite considerable recent progress in Visual Question Answering (VQA) models, inconsistent or contradictory answers continue to cast doubt on their true reasoning capabilities. However, most proposed methods use indirect strategies or strong assumptions on pairs of questions and answers to enforce model consistency. Instead, we propose a novel strategy intended to improve model performance by directly reducing logical inconsistencies. To do this, we introduce a new consistency loss term that can be used by a wide range of the VQA models and which relies on knowing the logical relation between pairs of questions and answers. While such information is typically not available in VQA datasets, we propose to infer these logical relations using a dedicated language model and use these in our proposed consistency loss function. We conduct extensive experiments on the VQA Introspect and DME datasets and show that our method brings improvements to state-of-the-art VQA models while being robust across different architectures and settings.

AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning With Masked Autoencoders
Bandara, Wele Gedara Chaminda and Patel, Naman and Gholami, Ali and Nikkhah, Mehdi and Agrawal, Motilal and Patel, Vishal M.



Research question: This paper proposes an adaptive masking strategy for learning generalizable representations of images, text, audio, video, etc., by reconstructing masked input data.
Motivation: Current Masked Autoencoder (MAE) approaches for video rely on random patch, tube, or frame based masking strategies to select tokens, which demands large amounts of memory and computation and slows pre-training.
Method: AdaMAE is an end-to-end trainable adaptive masking strategy. It samples visible tokens based on semantic context with an auxiliary sampling network that estimates a categorical distribution over spacetime-patch tokens. Tokens that increase the expected reconstruction error are rewarded and selected as visible, motivated by the policy-gradient algorithm in reinforcement learning.
Results: Experiments show that AdaMAE samples more tokens from high-spatiotemporal-information regions, allowing 95% of tokens to be masked, which lowers memory requirements and speeds up pre-training. Ablations on Something-Something v2 demonstrate the efficacy of the adaptive sampling, and the method reports state-of-the-art top-1 accuracy of 70.0% on SSv2 and 81.7% on Kinetics-400 action classification.

Masked Autoencoders (MAEs) learn generalizable representations for image, text, audio, video, etc., by reconstructing masked input data from tokens of the visible data. Current MAE approaches for videos rely on random patch, tube, or frame based masking strategies to select these tokens. This paper proposes AdaMAE, an adaptive masking strategy for MAEs that is end-to-end trainable. Our adaptive masking strategy samples visible tokens based on the semantic context using an auxiliary sampling network. This network estimates a categorical distribution over spacetime-patch tokens. The tokens that increase the expected reconstruction error are rewarded and selected as visible tokens, motivated by the policy gradient algorithm in reinforcement learning. We show that AdaMAE samples more tokens from the high spatiotemporal information regions, thereby allowing us to mask 95% of tokens, resulting in lower memory requirements and faster pre-training. We conduct ablation studies on the Something-Something v2 (SSv2) dataset to demonstrate the efficacy of our adaptive sampling approach and report state-of-the-art results of 70.0% and 81.7% in top-1 accuracy on SSv2 and Kinetics-400 action classification datasets with a ViT-Base backbone and 800 pre-training epochs. Code and pre-trained models are available at: https://github.com/wgcban/adamae.git
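The token-selection step, sampling visible tokens without replacement from the sampling network's categorical distribution, can be sketched with the Gumbel top-k trick, which draws an exact such sample. The logits and sizes below are placeholders, not the paper's network outputs:

```python
import math
import random

def sample_visible_tokens(logits, num_visible, seed=None):
    """Draw `num_visible` distinct token indices from the categorical
    distribution softmax(logits), via Gumbel top-k sampling.

    Hedged sketch of AdaMAE's selection step; the real method trains the
    logits with a REINFORCE-style reward (expected reconstruction error).
    """
    rng = random.Random(seed)
    # Perturb each logit with Gumbel noise; the top-k perturbed logits
    # form a without-replacement sample from the categorical distribution.
    keyed = [(l - math.log(-math.log(rng.random())), i)
             for i, l in enumerate(logits)]
    keyed.sort(reverse=True)
    return sorted(i for _, i in keyed[:num_visible])

# Tokens with higher logits (higher expected reconstruction error) are
# proportionally more likely to be kept visible.
visible = sample_visible_tokens([0.1] * 90 + [3.0] * 10, num_visible=5, seed=0)
```

Masking 95% of a 1568-token video clip this way leaves only ~78 tokens for the encoder, which is where the memory and speed savings come from.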

Towards Flexible Multi-Modal Document Models
Inoue, Naoto and Kikuchi, Kotaro and Simo-Serra, Edgar and Otani, Mayu and Yamaguchi, Kota



Research question: This paper aims to build a holistic model that can jointly solve many different design tasks.
Motivation: Creative workflows for generating graphical documents involve many complex, inter-related tasks, such as aligning elements, choosing appropriate fonts, or employing aesthetically harmonious colors.
Method: The proposed model, FlexDM, treats a vector graphic document as a set of multi-modal elements and learns to predict masked fields such as element type, position, styling attributes, image, or text with a unified architecture. Explicit multi-task learning and in-domain pre-training help the model capture the multi-modal relationships among different document fields.
Results: Experiments show that a single FlexDM successfully solves a multitude of different design tasks while achieving performance competitive with task-specific and costly baselines.

Creative workflows for generating graphical documents involve complex inter-related tasks, such as aligning elements, choosing appropriate fonts, or employing aesthetically harmonious colors. In this work, we attempt to build a holistic model that can jointly solve many different design tasks. Our model, which we denote FlexDM, treats vector graphic documents as a set of multi-modal elements, and learns to predict masked fields such as element type, position, styling attributes, image, or text, using a unified architecture. Through the use of explicit multi-task learning and in-domain pre-training, our model can better capture the multi-modal relationships among the different document fields. Experimental results corroborate that our single FlexDM is able to successfully solve a multitude of different design tasks, while achieving performance that is competitive with task-specific and costly baselines.

DegAE: A New Pretraining Paradigm for Low-Level Vision
Liu, Yihao and He, Jingwen and Gu, Jinjin and Kong, Xiangtao and Qiao, Yu and Dong, Chao



Research question: This paper addresses the fact that pretraining remains ambiguous and not well-established in low-level vision, and asks what the primitive intention and core problem of pretraining are.
Motivation: Although self-supervised pretraining has achieved remarkable success in high-level vision, its application in low-level vision is unclear. The authors argue that pretraining matters more for high-cost tasks where data acquisition is difficult.
Method: After examining previous pretraining methods in both high-level and low-level vision, existing low-level vision tasks are divided into two groups, low-cost and high-cost. For high-cost tasks, a new pretraining paradigm, the degradation autoencoder (DegAE), is proposed.
Results: With DegAE pretraining, SwinIR gains 6.88dB on the image dehazing task, while Uformer improves by 3.22dB and 0.54dB on the dehazing and deraining tasks, respectively.

Self-supervised pretraining has achieved remarkable success in high-level vision, but its application in low-level vision remains ambiguous and not well-established. What is the primitive intention of pretraining? What is the core problem of pretraining in low-level vision? In this paper, we aim to answer these essential questions and establish a new pretraining scheme for low-level vision. Specifically, we examine previous pretraining methods in both high-level and low-level vision, and categorize current low-level vision tasks into two groups based on the difficulty of data acquisition: low-cost and high-cost tasks. Existing literature has mainly focused on pretraining for low-cost tasks, where the observed performance improvement is often limited. However, we argue that pretraining is more significant for high-cost tasks, where data acquisition is more challenging. To learn a general low-level vision representation that can improve the performance of various tasks, we propose a new pretraining paradigm called degradation autoencoder (DegAE). DegAE follows the philosophy of designing pretext tasks for self-supervised pretraining and is elaborately tailored to low-level vision. With DegAE pretraining, SwinIR achieves a 6.88dB performance gain on the image dehazing task, while Uformer obtains 3.22dB and 0.54dB improvements on the dehazing and deraining tasks, respectively.

ScaleDet: A Scalable Multi-Dataset Object Detector
Chen, Yanbei and Wang, Manchen and Mittal, Abhay and Xu, Zhenlin and Favaro, Paolo and Tighe, Joseph and Modolo, Davide



Research question: How to exploit heterogeneous large-scale datasets for multi-dataset training without extra annotation cost.
Motivation: Existing multi-dataset learners mostly rely on manual relabelling or sophisticated optimization to unify labels across datasets; this work introduces a simple yet scalable formulation that derives a unified semantic label space for multi-dataset training.
Method: A scalable multi-dataset detector (ScaleDet) is trained by visual-textual alignment to learn label assignment with label semantic similarities across datasets.
Results: Extensive experiments with LVIS, COCO, Objects365, and OpenImages as upstream datasets and the 13 ODinW datasets downstream show that ScaleDet achieves compellingly strong performance with the same backbone, reaching 50.7 mAP on LVIS, 58.8 on COCO, 46.8 on Objects365, 76.2 on OpenImages, and 71.8 on ODinW, surpassing state-of-the-art detectors.

Multi-dataset training provides a viable solution for exploiting heterogeneous large-scale datasets without extra annotation cost. In this work, we propose a scalable multi-dataset detector (ScaleDet) that can scale up its generalization across datasets when increasing the number of training datasets. Unlike existing multi-dataset learners that mostly rely on manual relabelling efforts or sophisticated optimizations to unify labels across datasets, we introduce a simple yet scalable formulation to derive a unified semantic label space for multi-dataset training. ScaleDet is trained by visual-textual alignment to learn the label assignment with label semantic similarities across datasets. Once trained, ScaleDet can generalize well on any given upstream and downstream datasets with seen and unseen classes. We conduct extensive experiments using LVIS, COCO, Objects365, OpenImages as upstream datasets, and 13 datasets from Object Detection in the Wild (ODinW) as downstream datasets. Our results show that ScaleDet achieves compelling strong model performance with an mAP of 50.7 on LVIS, 58.8 on COCO, 46.8 on Objects365, 76.2 on OpenImages, and 71.8 on ODinW, surpassing state-of-the-art detectors with the same backbone.

Language-Guided Music Recommendation for Video via Prompt Analogies
McKee, Daniel and Salamon, Justin and Sivic, Josef and Russell, Bryan



Research question: Recommend music for an input video while allowing the user to guide music selection with free-form natural language.
Motivation: Existing music video datasets provide the needed (video, music) training pairs but lack text descriptions of the music.
Method: A text-synthesis approach uses an analogy-based prompting procedure to generate natural language music descriptions from pre-trained music tagger outputs and a small number of human text descriptions. These synthesized descriptions are then used to train a new trimodal model that fuses text and video input representations to query music samples.
Results: On a test set built by annotating a subset of 4k clips from the YT8M-MusicVideo dataset with natural language music descriptions (made publicly available), the method matches or exceeds prior methods on video-to-music retrieval while significantly improving retrieval accuracy when text guidance is used.

We propose a method to recommend music for an input video while allowing a user to guide music selection with free-form natural language. A key challenge of this problem setting is that existing music video datasets provide the needed (video, music) training pairs, but lack text descriptions of the music. This work addresses this challenge with the following three contributions. First, we propose a text-synthesis approach that relies on an analogy-based prompting procedure to generate natural language music descriptions from a large-scale language model (BLOOM-176B) given pre-trained music tagger outputs and a small number of human text descriptions. Second, we use these synthesized music descriptions to train a new trimodal model, which fuses text and video input representations to query music samples. For training, we introduce a text dropout regularization mechanism which we show is critical to model performance. Our model design allows for the retrieved music audio to agree with the two input modalities by matching visual style depicted in the video and musical genre, mood, or instrumentation described in the natural language query. Third, to evaluate our approach, we collect a testing dataset for our problem by annotating a subset of 4k clips from the YT8M-MusicVideo dataset with natural language music descriptions which we make publicly available. We show that our approach can match or exceed the performance of prior methods on video-to-music retrieval while significantly improving retrieval accuracy when using text guidance.

LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models
Bulat, Adrian and Tzimiropoulos, Georgios



Research question: Existing soft prompt learning methods significantly overfit the training data, suffering large accuracy drops on unseen classes from the same domain.
Motivation: To alleviate base-class overfitting, increase the representation capacity of prompts, and address the vision-language misalignment introduced by prompt learning and LASP.
Method: A novel Language-Aware Soft Prompting (LASP) learning method mitigates base-class overfitting by maximizing the probability that the learned prompts are correctly classified with respect to pre-defined hand-crafted textual prompts. Grouped LASP is further proposed, where each group of prompts is optimized against a separate subset of textual prompts, and a re-calibration mechanism addresses the identified vision-language misalignment.
Results: Experiments on 11 datasets show the method significantly outperforms all prior soft prompting work and, for the first time, matches and surpasses the novel-class accuracy of hand-crafted prompts and CLIP on 8 of the 11 test datasets.

Soft prompt learning has recently emerged as one of the methods of choice for adapting V&L models to a downstream task using a few training examples. However, current methods significantly overfit the training data, suffering from large accuracy degradation when tested on unseen classes from the same domain. To this end, in this paper, we make the following 4 contributions: (1) To alleviate base class overfitting, we propose a novel Language-Aware Soft Prompting (LASP) learning method by means of a text-to-text cross-entropy loss that maximizes the probability of the learned prompts to be correctly classified with respect to pre-defined hand-crafted textual prompts. (2) To increase the representation capacity of the prompts, we propose grouped LASP where each group of prompts is optimized with respect to a separate subset of textual prompts. (3) We identify a visual-language misalignment introduced by prompt learning and LASP, and more importantly, propose a re-calibration mechanism to address it. (4) We show that LASP is inherently amenable to including, during training, virtual classes, i.e. class names for which no visual samples are available, further increasing the robustness of the learned prompts. Through evaluations on 11 datasets, we show that our approach (a) significantly outperforms all prior works on soft prompting, and (b) matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets. Code will be made available.

AutoAD: Movie Description in Context
Han, Tengda and Bain, Max and Nagrani, Arsha and Varol, Gül and Xie, Weidi and Zisserman, Andrew



Research question: This paper develops an automatic Audio Description (AD) model that ingests a movie and outputs AD in text form.
Motivation: Generating high-quality movie AD is challenging because the descriptions depend on context and the available training data is limited.
Method: Leveraging pretrained foundation models such as GPT and CLIP, only a mapping network bridging the two models is trained for visually-conditioned text generation. Context is drawn from the movie clip, the AD of previous clips, and the subtitles, and the lack of training data is addressed by pretraining on large-scale datasets.
Results: The existing AD datasets are also improved by removing label noise in the MAD dataset and adding character naming information. Experiments show the model outperforms previous methods on the movie AD task.

The objective of this paper is an automatic Audio Description (AD) model that ingests movies and outputs AD in text form. Generating high-quality movie AD is challenging due to the dependency of the descriptions on context, and the limited amount of training data available. In this work, we leverage the power of pretrained foundation models, such as GPT and CLIP, and only train a mapping network that bridges the two models for visually-conditioned text generation. In order to obtain high-quality AD, we make the following four contributions: (i) we incorporate context from the movie clip, AD from previous clips, as well as the subtitles; (ii) we address the lack of training data by pretraining on large-scale datasets, where visual or contextual information is unavailable, e.g. text-only AD without movies or visual captioning datasets without context; (iii) we improve on the currently available AD datasets, by removing label noise in the MAD dataset, and adding character naming information; and (iv) we obtain strong results on the movie AD task compared with previous methods.

MaPLe: Multi-Modal Prompt Learning
Khattak, Muhammad Uzair and Rasheed, Hanoona and Maaz, Muhammad and Khan, Salman and Khan, Fahad Shahbaz



Research question: Pre-trained vision-language (V-L) models such as CLIP are sensitive to the choice of input text prompts and require careful prompt-template selection to perform well.
Motivation: Inspired by the Natural Language Processing (NLP) literature, recent CLIP adaptation approaches learn prompts as textual inputs to fine-tune CLIP for downstream tasks. The authors note that prompting only one branch of CLIP (language or vision) is sub-optimal, since it does not allow both representation spaces to be adjusted dynamically on a downstream task.
Method: Multi-modal Prompt Learning (MaPLe) prompts both the vision and language branches to improve alignment between the vision and language representations. The design promotes strong coupling between the vision-language prompts to ensure mutual synergy and discourages learning independent uni-modal solutions. Separate prompts are also learned across different early stages to progressively model stage-wise feature relationships, allowing rich context learning.
Results: Evaluated on three representative tasks (generalization to novel classes, new target datasets, and unseen domain shifts), MaPLe performs favorably against the state-of-the-art Co-CoOp, achieving a 3.45% absolute gain on novel classes averaged over 11 diverse image recognition datasets.

Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by the Natural Language Processing (NLP) literature, recent CLIP adaptation approaches learn prompts as the textual inputs to fine-tune CLIP for downstream tasks. We note that using prompting to adapt representations in a single branch of CLIP (language or vision) is sub-optimal since it does not allow the flexibility to dynamically adjust both representation spaces on a downstream task. In this work, we propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations. Our design promotes strong coupling between the vision-language prompts to ensure mutual synergy and discourages learning independent uni-modal solutions. Further, we learn separate prompts across different early stages to progressively model the stage-wise feature relationships to allow rich context learning. We evaluate the effectiveness of our approach on three representative tasks of generalization to novel classes, new target datasets and unseen domain shifts. Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes and 2.72% on overall harmonic-mean, averaged over 11 diverse image recognition datasets. Our code and pre-trained models are available at https://github.com/muzairkhattak/multimodal-prompt-learning.

Multi-Modal Representation Learning With Text-Driven Soft Masks
Park, Jaeyoo and Han, Bohyung



Research question: This paper proposes a visual-linguistic representation learning approach that introduces a new operation, loss, and data augmentation strategy.
Motivation: Current pretrained models rely mostly on image-caption pairs for training, without fine-grained annotations, and suffer from overfitting and bias.
Method: First, the image regions most relevant to a given word are soft-masked rather than removed, producing diverse features; a multi-modal encoder computes word-conditional visual attention to identify the regions relevant to each word. Second, a focal loss for the image-text contrastive objective encourages the model to focus on hard but diverse examples. Last, multi-modal data augmentation is performed by masking texts and rendering distortions on images.
Results: Experiments show the combination of these three innovations is effective for learning a pretrained model, leading to outstanding performance on multiple vision-language downstream tasks.

We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image that are most relevant to a certain word in the corresponding caption, instead of completely removing them. Since our framework relies only on image-caption pairs with no fine-grained annotations, we identify the regions relevant to each word by computing the word-conditional visual attention using a multi-modal encoder. Second, we encourage the model to focus more on hard but diverse examples by proposing a focal loss for the image-text contrastive learning (ITC) objective, which alleviates the inherent limitations of overfitting and bias issues. Last, we perform multi-modal data augmentations for self-supervised learning via mining various examples by masking texts and rendering distortions on images. We show that the combination of these three innovations is effective for learning a pretrained model, leading to outstanding performance on multiple vision-language downstream tasks.
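The focal re-weighting of the contrastive (ITC) term can be illustrated in isolation: easy pairs, where the model already assigns high probability to the correct match, are down-weighted so hard examples dominate the gradient. A hedged one-term sketch (the paper applies this inside the full ITC objective):

```python
import math

def focal_itc_term(p_correct, gamma=2.0):
    """Focal re-weighting of one contrastive cross-entropy term.

    p_correct: the model's probability of the correct image-text match.
    The usual -log(p) term is scaled by (1 - p)^gamma, so confident
    (easy) pairs contribute almost nothing. Illustrative sketch only.
    """
    return -((1.0 - p_correct) ** gamma) * math.log(p_correct)
```

With gamma = 0 this reduces to the standard cross-entropy term; larger gamma shifts the objective further toward hard examples.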

VindLU: A Recipe for Effective Video-and-Language Pretraining
Cheng, Feng and Wang, Xizi and Lei, Jie and Crandall, David and Bansal, Mohit and Bertasius, Gedas



Research question: What are the most important factors in video-and-language (VidL) model design?
Motivation: Modern VidL approaches use complex, specialized model architectures and sophisticated pretraining protocols, making these frameworks hard to reproduce, analyze, and compare.
Method: An empirical study investigates multiple factors in VidL model design, including spatiotemporal architecture design, multimodal fusion schemes, pretraining objectives, choice of pretraining data, pretraining and finetuning protocols, and dataset and model scaling.
Results: The study finds the most important design factors to be temporal modeling, video-to-text multimodal fusion, masked modeling objectives, and joint training on images and videos. Based on these empirical insights, a step-by-step recipe called VindLU is developed, achieving results comparable to or better than the state of the art on several VidL tasks without relying on external CLIP pretraining. On text-to-video retrieval it obtains 61.2% on DiDeMo and 55.0% on ActivityNet, outperforming the current state of the art by 7.8% and 6.1% respectively, and it also achieves state-of-the-art video question-answering results on ActivityNet-QA, MSRVTT-QA, MSRVTT-MC, and TVQA.

The last several years have witnessed remarkable progress in video-and-language (VidL) understanding. However, most modern VidL approaches use complex and specialized model architectures and sophisticated pretraining protocols, making the reproducibility, analysis and comparisons of these frameworks difficult. Hence, instead of proposing yet another new VidL model, this paper conducts a thorough empirical study demystifying the most important factors in the VidL model design. Among the factors that we investigate are (i) the spatiotemporal architecture design, (ii) the multimodal fusion schemes, (iii) the pretraining objectives, (iv) the choice of pretraining data, (v) pretraining and finetuning protocols, and (vi) dataset and model scaling. Our empirical study reveals that the most important design factors include: temporal modeling, video-to-text multimodal fusion, masked modeling objectives, and joint training on images and videos. Using these empirical insights, we then develop a step-by-step recipe, dubbed VindLU, for effective VidL pretraining. Our final model trained using our recipe achieves comparable or better than state-of-the-art results on several VidL tasks without relying on external CLIP pretraining. In particular, on the text-to-video retrieval task, our approach obtains 61.2% on DiDeMo, and 55.0% on ActivityNet, outperforming current SOTA by 7.8% and 6.1% respectively. Furthermore, our model also obtains state-of-the-art video question-answering results on ActivityNet-QA, MSRVTT-QA, MSRVTT-MC and TVQA. Our code and pretrained models are publicly available at: https://github.com/klauscc/VindLU.

Scaling Language-Image Pre-Training via Masking
Li, Yanghao and Fan, Haoqi and Hu, Ronghang and Feichtenhofer, Christoph and He, Kaiming



Research question: This paper proposes Fast Language-Image Pre-training (FLIP), a simpler and more efficient method for training CLIP.
Motivation: Existing language-image pre-training must process a large number of image patches during training, which is both time- and memory-consuming.
Method: FLIP randomly masks out and removes a large portion of image patches during training. This allows learning from more image-text pairs in the same wall-clock time and contrasting more samples per iteration with a similar memory footprint.
Results: Experiments show FLIP improves both accuracy and training speed over the no-masking baseline, and on a large diversity of downstream tasks it clearly outperforms CLIP models trained on the same data. The scaling behavior of increasing model size, data size, or training length is also explored, with encouraging results and comparisons.

We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs given the same wall-clock time and contrast more samples per iteration with similar memory footprint. It leads to a favorable trade-off between accuracy and training time. In our experiments on 400 million image-text pairs, FLIP improves both accuracy and speed over the no-masking baseline. On a large diversity of downstream tasks, FLIP dominantly outperforms the CLIP counterparts trained on the same data. Facilitated by the speedup, we explore the scaling behavior of increasing the model size, data size, or training length, and report encouraging results and comparisons. We hope that our work will foster future research on scaling vision-language learning.
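FLIP's masking step simply keeps a random subset of patch indices and drops the rest before encoding. A minimal sketch (the function name is ours; the paper explores masking ratios around 50-75%):

```python
import random

def flip_visible_patches(num_patches, mask_ratio=0.75, seed=None):
    """Return the indices of patches kept (not masked) for one image.

    With mask_ratio=0.75, only a quarter of the patches are encoded,
    which is what lets FLIP see more image-text pairs per unit of
    wall-clock time. Illustrative sketch, not the authors' code.
    """
    rng = random.Random(seed)
    num_keep = max(1, round(num_patches * (1.0 - mask_ratio)))
    # Uniform sample without replacement over patch indices.
    return sorted(rng.sample(range(num_patches), num_keep))
```

For a 224x224 image with 16x16 patches (196 patches), a 75% ratio leaves 49 patches per image in the encoder's input.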

Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting
Wasim, Syed Talal and Naseer, Muzammal and Khan, Salman and Khan, Fahad Shahbaz and Shah, Mubarak



Research question: How to balance the performance of pretrained models on supervised and zero-shot action recognition.
Motivation: Current work faces a trade-off between supervised performance and zero-shot generalization, since improving one degrades the other.
Method: A multimodal prompt learning scheme balances supervised and zero-shot performance under a single unified training. The vision-side prompts comprise global video-level prompts, local frame-level prompts, and a summary prompt; a text-side prompting scheme augments the textual context.
Results: The method achieves state-of-the-art zero-shot performance on Kinetics-600, HMDB51, and UCF101 while remaining competitive in the supervised setting. By keeping the pretrained backbone frozen, far fewer parameters are optimized and the existing general representation is retained, which underpins the strong zero-shot performance.

Adopting contrastive image-text pretrained models like CLIP towards video classification has gained attention due to its cost-effectiveness and competitive performance. However, recent works in this area face a trade-off. Finetuning the pretrained model to achieve strong supervised performance results in low zero-shot generalization. Similarly, freezing the backbone to retain zero-shot capability causes a significant drop in supervised accuracy. Because of this, recent works in the literature typically train separate models for supervised and zero-shot action recognition. In this work, we propose a multimodal prompt learning scheme that works to balance the supervised and zero-shot performance under a single unified training. Our prompting approach on the vision side caters for three aspects: 1) Global video-level prompts to model the data distribution; 2) Local frame-level prompts to provide per-frame discriminative conditioning; and 3) a summary prompt to extract a condensed video representation. Additionally, we define a prompting scheme on the text side to augment the textual context. Through this prompting scheme, we can achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting. By keeping the pretrained backbone frozen, we optimize a much lower number of parameters and retain the existing general representation which helps achieve the strong zero-shot performance. Our codes and models will be publicly released.

Learning Attribute and Class-Specific Representation Duet for Fine-Grained Fashion Analysis
Jiao, Yang and Gao, Yan and Meng, Jingjing and Shang, Jin and Sun, Yi



Research question: This paper addresses the limitation that existing fashion representation learning focuses only on the fine-grained attribute level, ignoring the relationships and inter-dependencies of attributes across different classes.
Motivation: Prior knowledge about the taxonomy of fashion attributes and classes can be leveraged to better model these attribute relationships and inter-dependencies.
Method: An embedding network with two sub-networks, one for attributes and one for classes, is proposed; a multi-granularity loss function introduces appropriate inductive bias for learning across different granularities of the fashion representations.
Results: Experiments on three benchmark datasets show the method outperforms state-of-the-art methods by a large margin.

Fashion representation learning involves the analysis and understanding of various visual elements at different granularities and the interactions among them. Existing works often learn fine-grained fashion representations at the attribute level without considering their relationships and inter-dependencies across different classes. In this work, we propose to learn an attribute and class-specific fashion representation duet to better model such attribute relationships and inter-dependencies by leveraging prior knowledge about the taxonomy of fashion attributes and classes. Through two sub-networks for the attributes and classes, respectively, our proposed embedding network progressively learns and refines the visual representation of a fashion image to improve its robustness for fashion retrieval. A multi-granularity loss consisting of attribute-level and class-level losses is proposed to introduce appropriate inductive bias for learning across different granularities of the fashion representations. Experimental results on three benchmark datasets demonstrate the effectiveness of our method, which outperforms the state-of-the-art methods by a large margin.

Clover: Towards a Unified Video-Language Alignment and Fusion Model
Huang, Jingjia and Li, Yinan and Feng, Jiashi and Wu, Xinglong and Sun, Xiaoshuai and Ji, Rongrong



Research question: Build a universal video-language model for solving various video understanding tasks, such as text-video retrieval and video question answering.
Motivation: Most current methods stack uni-modal and cross-modal feature encoders and train them with pair-wise contrastive pretext tasks; despite attractive generality, the resulting models must compromise between efficiency and performance, and mostly adopt different architectures for different downstream tasks.
Method: Propose Clover, a Correlated Video-Language pre-training method that improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task, further enhanced by learning from semantic masked samples and a new pair-wise ranking loss.
Results: Clover establishes new state-of-the-art results on multiple downstream tasks, including three retrieval tasks in both zero-shot and fine-tuning settings and eight video question answering tasks.

Building a universal video-language model for solving various video understanding tasks (e.g., text-video retrieval, video question answering) is an open challenge to the machine learning field. Towards this goal, most recent works build the model by stacking uni-modal and cross-modal feature encoders and train it with pair-wise contrastive pretext tasks. Though offering attractive generality, the resulting models have to compromise between efficiency and performance. They mostly adopt different architectures to deal with different downstream tasks. We find this is because the pair-wise training cannot properly align and fuse features from different modalities. We then introduce Clover--a Correlated Video-Language pre-training method--towards a universal video-language model for solving multiple video understanding tasks without compromising either performance or efficiency. It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task. Additionally, we propose to enhance the tri-modal alignment via incorporating learning from semantic masked samples and a new pair-wise ranking loss. Clover establishes new state-of-the-art results on multiple downstream tasks, including three retrieval tasks for both zero-shot and fine-tuning settings, and eight video question answering tasks. Codes and pre-trained models will be released at https://github.com/LeeYN-43/Clover.

Self-Supervised Learning From Images With a Joint-Embedding Predictive Architecture
Assran, Mahmoud and Duval, Quentin and Misra, Ishan and Bojanowski, Piotr and Vincent, Pascal and Rabbat, Michael and LeCun, Yann and Ballas, Nicolas



Research question: How to learn highly semantic image representations without relying on hand-crafted data augmentations.
Motivation: Existing methods depend on hand-crafted data augmentations, whereas the proposed method does not.
Method: Introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative self-supervised learning method that learns by predicting the representations of various target blocks within the same image.
Results: Experiments show that I-JEPA is highly scalable when combined with Vision Transformers; for example, a ViT-Huge/14 trained with 16 A100 GPUs in under 72 hours achieves strong downstream performance on tasks ranging from linear classification to object counting and depth prediction.

This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
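The masking strategy can be illustrated with a toy sampler. This is a hedged sketch: the block count, block size, and uniform placement below are illustrative stand-ins for the paper's actual sampling distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_blocks(grid=8, n_targets=2, target_size=3):
    """Sample largish target blocks on a patch grid, plus a context set that
    excludes every target patch, so a predictor must infer the targets'
    representations from the context alone."""
    taken = np.zeros((grid, grid), dtype=bool)
    targets = []
    for _ in range(n_targets):
        r = rng.integers(0, grid - target_size + 1)
        c = rng.integers(0, grid - target_size + 1)
        block = [(r + i, c + j) for i in range(target_size)
                 for j in range(target_size)]
        targets.append(block)
        for i, j in block:
            taken[i, j] = True
    # context = every patch not covered by any target block
    context = [(i, j) for i in range(grid) for j in range(grid) if not taken[i, j]]
    return context, targets

context, targets = sample_blocks()
```

The requirement that target blocks be sufficiently large and the context sufficiently informative corresponds here to choosing `target_size` well above one patch while leaving most of the grid in `context`.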

DeAR: Debiasing Vision-Language Models With Additive Residuals
Seth, Ashish and Hemani, Mayur and Agarwal, Chirag



Research question: Large pre-trained vision-language models (VLMs) provide rich, adaptable image and text representations for various vision-grounded downstream tasks, yet they carry societal biases due to the skewed distribution of identity groups in the training data.
Motivation: These biases manifest as skewed similarity between the representations of specific text concepts and images of different identity groups, limiting the usefulness of such models in real-world high-stakes applications.
Method: Propose DeAR (Debiasing with Additive Residuals), a novel debiasing method that learns additive residual image representations to offset the original representations, ensuring fair output representations.
Results: Experiments on fairness and zero-shot performance preservation across multiple datasets demonstrate the effectiveness of the framework; a new context-based bias benchmark, the Protected Attribute Tag Association (PATA) dataset, is also introduced for evaluating the fairness of large pre-trained VLMs.

Large pre-trained vision-language models (VLMs) reduce the time for developing predictive models for various vision-grounded language downstream tasks by providing rich, adaptable image and text representations. However, these models suffer from societal biases owing to the skewed distribution of various identity groups in the training data. These biases manifest as the skewed similarity between the representations for specific text concepts and images of people of different identity groups and, therefore, limit the usefulness of such models in real-world high-stakes applications. In this work, we present DeAR (Debiasing with Additive Residuals), a novel debiasing method that learns additive residual image representations to offset the original representations, ensuring fair output representations. In doing so, it reduces the ability of the representations to distinguish between the different identity groups. Further, we observe that the current fairness tests are performed on limited face image datasets that fail to indicate why a specific text concept should/should not apply to them. To bridge this gap and better evaluate DeAR, we introduce a new context-based bias benchmarking dataset - the Protected Attribute Tag Association (PATA) dataset for evaluating the fairness of large pre-trained VLMs. Additionally, PATA provides visual context for a diverse human population in different scenarios with both positive and negative connotations. Experimental results for fairness and zero-shot performance preservation using multiple datasets demonstrate the efficacy of our framework.
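The core operation, adding a learned residual to the VLM's image representation, is easy to picture. Below is a minimal sketch with a toy linear map standing in for DeAR's trained residual network; the weights and shapes are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(16, 16))  # toy residual weights, a stand-in
                                           # for the trained residual network

def debias(image_emb):
    """Additive-residual debiasing: the residual offsets the original
    representation while downstream code keeps consuming the same shape."""
    return image_emb + image_emb @ W

emb = rng.normal(size=(4, 16))   # 4 images, 16-d CLIP-like embeddings
fair_emb = debias(emb)
```

Because the correction is purely additive, the frozen encoder and every consumer of its embeddings are left untouched; only the residual map is trained.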

Understanding Masked Image Modeling via Learning Occlusion Invariant Feature
Kong, Xiangwen and Zhang, Xiangyu



Research question: Although Masked Image Modeling (MIM) has achieved great success in self-supervised visual recognition, how this reconstruction-based framework works remains an open question.
Motivation: MIM differs substantially from previously well-studied siamese approaches such as contrastive learning, so understanding how MIM works is still an open problem.
Method: Propose a new viewpoint: MIM implicitly learns occlusion-invariant features, analogous to other siamese methods that learn different invariances. By relaxing the MIM formulation into an equivalent siamese form, MIM methods can be interpreted in a framework unified with conventional methods, in which only a) the data transformations, i.e., the invariance to be learned, and b) the similarity measurements differ.
Results: Taking MAE (He et al., 2021) as an example, the success of MIM models is found to relate little to the choice of similarity function but rather to the occlusion-invariant features learned from masked images, which prove to be a favored initialization for vision transformers even though the learned features may be less semantic. The authors hope these findings inspire researchers in the computer vision community to develop more powerful self-supervised methods.

Recently, Masked Image Modeling (MIM) has achieved great success in self-supervised visual recognition. However, as a reconstruction-based framework, it is still an open question to understand how MIM works, since MIM appears very different from previous well-studied siamese approaches such as contrastive learning. In this paper, we propose a new viewpoint: MIM implicitly learns occlusion-invariant features, which is analogous to other siamese methods, while the latter learn other invariances. By relaxing the MIM formulation into an equivalent siamese form, MIM methods can be interpreted in a unified framework with conventional methods, among which only a) data transformations, i.e., what invariance to learn, and b) similarity measurements are different. Furthermore, taking MAE (He et al., 2021) as a representative example of MIM, we empirically find that the success of MIM models relates little to the choice of similarity functions, but rather to the learned occlusion-invariant features introduced by masked images -- these turn out to be a favored initialization for vision transformers, even though the learned features could be less semantic. We hope our findings can inspire researchers to develop more powerful self-supervised methods in the computer vision community.

Grounding Counterfactual Explanation of Image Classifiers to Textual Concept Space
Kim, Siwon and Oh, Jinoh and Lee, Sungjin and Yu, Seunghak and Do, Jaeyoung and Taghavi, Tara



Research question: This paper addresses the problem that existing concept-based explanation methods require large amounts of manually collected concept-annotated images.
Motivation: Manually collected concept-annotated images are both costly and prone to human bias.
Method: Propose counterfactual explanation with text-driven concepts (CounTEX), which defines concepts from text alone via a pre-trained multi-modal joint embedding space, requiring no additional concept-annotated datasets.
Results: CounTEX generates faithful explanations that provide a semantic understanding of model decision rationale and are robust to human bias.

Concept-based explanation aims to provide concise and human-understandable explanations of an image classifier. However, existing concept-based explanation methods typically require a significant amount of manually collected concept-annotated images. This is costly and runs the risk of human biases being involved in the explanation. In this paper, we propose counterfactual explanation with text-driven concepts (CounTEX), where the concepts are defined only from text by leveraging a pre-trained multi-modal joint embedding space without additional concept-annotated datasets. A conceptual counterfactual explanation is generated with text-driven concepts. To utilize the text-driven concepts defined in the joint embedding space to interpret target classifier outcome, we present a novel projection scheme for mapping the two spaces with a simple yet effective implementation. We show that CounTEX generates faithful explanations that provide a semantic understanding of model decision rationale robust to human bias.

Fine-Tuned CLIP Models Are Efficient Video Learners
Rasheed, Hanoona and Khattak, Muhammad Uzair and Maaz, Muhammad and Khan, Salman and Khan, Fahad Shahbaz



Research question: How to effectively transfer image-level CLIP representations to videos?
Motivation: Since training at a comparable scale on videos is infeasible, recent approaches focus on effectively transferring image-based CLIP to the video domain.
Method: Propose a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline: frame-level processing by the CLIP image encoder, followed by feature pooling and similarity matching with the corresponding text embeddings, implicitly models temporal cues within ViFi-CLIP.
Results: Experiments show that this simple yet strong baseline performs well on five video benchmarks spanning zero-shot, base-to-novel generalization, few-shot, and fully supervised settings.

Large-scale multi-modal training with image-text pairs imparts strong generalization to CLIP model. Since training on a similar scale for videos is infeasible, recent approaches focus on the effective transfer of image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships which require meticulous design efforts. Furthermore, when the resulting models are learned on videos, they tend to overfit on the given task distribution and lack in generalization aspect. This begs the following question: How to effectively transfer image-level CLIP representations to videos? In this work, we show that a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos. Our qualitative analysis illustrates that the frame-level processing from CLIP image-encoder followed by feature pooling and similarity matching with corresponding text embeddings helps in implicitly modeling the temporal cues within ViFi-CLIP. Such fine-tuning helps the model to focus on scene dynamics, moving objects and inter-object relationships. For low-data regimes where full fine-tuning is not viable, we propose a 'bridge and prompt' approach that first uses finetuning to bridge the domain gap and then learns prompts on language and vision side to adapt CLIP representations. We extensively evaluate this simple yet strong baseline on zero-shot, base-to-novel generalization, few-shot and fully supervised settings across five video benchmarks. Our code and models will be publicly released.
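The frame-level pipeline (encode frames, pool, match against text) reduces to a few lines. The following is a minimal NumPy sketch assuming pre-computed frame and text embeddings; mean pooling and the shapes are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def video_text_scores(frame_feats, text_feats):
    """Average-pool per-frame image-encoder features into one video embedding,
    then score it against each class text embedding by cosine similarity.

    frame_feats: (T, D) per-frame features
    text_feats:  (C, D) class-prompt text features
    returns:     (C,)   similarity scores
    """
    video_emb = l2_normalize(frame_feats.mean(axis=0))  # temporal pooling
    return l2_normalize(text_feats) @ video_emb

rng = np.random.default_rng(0)
scores = video_text_scores(rng.normal(size=(8, 16)),   # 8 frames
                           rng.normal(size=(5, 16)))   # 5 classes
```

Fine-tuning both encoders through this score, rather than adding new temporal modules, is the whole recipe: pooling across frames is where the temporal cues are implicitly aggregated.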

Visual Recognition by Request
Tang, Chufeng and Xie, Lingxi and Zhang, Xiaopeng and Hu, Xiaolin and Tian, Qi



Research question: How to achieve visual semantic recognition at unlimited granularity, addressing a shortcoming of existing visual recognition algorithms.
Motivation: Humans can recognize visual semantics at unlimited granularity, but existing visual recognition algorithms cannot reach this goal.
Method: Propose a new visual recognition paradigm, visual recognition by request (ViRReq), which decomposes visual recognition into atomic tasks called requests and leverages a knowledge base, a hierarchical text-based dictionary, to assist task definition.
Results: ViRReq can learn complicated whole-part hierarchies from highly incomplete annotations and insert new concepts with minimal effort; it demonstrates flexible recognition ability on CPP and ADE20K, two datasets with hierarchical whole-part annotations.

Humans have the ability of recognizing visual semantics in an unlimited granularity, but existing visual recognition algorithms cannot achieve this goal. In this paper, we establish a new paradigm named visual recognition by request (ViRReq) to bridge the gap. The key lies in decomposing visual recognition into atomic tasks named requests and leveraging a knowledge base, a hierarchical and text-based dictionary, to assist task definition. ViRReq allows for (i) learning complicated whole-part hierarchies from highly incomplete annotations and (ii) inserting new concepts with minimal efforts. We also establish a solid baseline by integrating language-driven recognition into recent semantic and instance segmentation methods, and demonstrate its flexible recognition ability on CPP and ADE20K, two datasets with hierarchical whole-part annotations.

MAP: Multimodal Uncertainty-Aware Vision-Language Pre-Training Model
Ji, Yatai and Wang, Junjie and Gong, Yuan and Zhang, Lin and Zhu, Yanru and Wang, Hongfa and Zhang, Jiaxing and Sakai, Tetsuya and Yang, Yujiu



Research question: Uncertainty in multimodal semantic understanding, including inter- and intra-modal uncertainty, and the lack of work on modeling this uncertainty both when pre-training on unlabeled datasets and when fine-tuning on task-specific downstream datasets.
Motivation: Existing deterministic methods cannot adequately convey rich multimodal semantic information and complex relationships, so this uncertainty needs to be modeled.
Method: Project the representations of all modalities as probability distributions via a Probability Distribution Encoder (PDE) that exploits sequence-level interactions, and integrate uncertainty modeling into popular pre-training frameworks through suitable pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM).
Results: The fine-tuned models achieve state-of-the-art results on challenging downstream tasks, including image-text retrieval, visual question answering, visual reasoning, and visual entailment.

Multimodal semantic understanding often has to deal with uncertainty, which means the obtained messages tend to refer to multiple targets. Such uncertainty is problematic for our interpretation, including inter- and intra-modal uncertainty. Little effort has studied the modeling of this uncertainty, particularly in pre-training on unlabeled datasets and fine-tuning in task-specific downstream datasets. In this paper, we project the representations of all modalities as probabilistic distributions via a Probability Distribution Encoder (PDE) by utilizing sequence-level interactions. Compared to the existing deterministic methods, such uncertainty modeling can convey richer multimodal semantic information and more complex relationships. Furthermore, we integrate uncertainty modeling with popular pre-training frameworks and propose suitable pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM). The fine-tuned models are applied to challenging downstream tasks, including image-text retrieval, visual question answering, visual reasoning, and visual entailment, and achieve state-of-the-art results.
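A distributional representation of this kind can be sketched as a Gaussian head with reparameterized sampling. This is an assumed reading of the PDE idea, not the paper's architecture; the linear heads `w_mu` and `w_logvar` are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def pde_sample(features, w_mu, w_logvar, n_samples=4):
    """Project features to a Gaussian (mean, log-variance) per token/sequence
    and draw reparameterized samples, so downstream losses can operate on
    distributions instead of single points."""
    mu = features @ w_mu                       # (..., D_out) distribution mean
    logvar = features @ w_logvar               # (..., D_out) log-variance
    eps = rng.normal(size=(n_samples,) + mu.shape)
    return mu + np.exp(0.5 * logvar) * eps     # (n_samples, ..., D_out)

feats = rng.normal(size=(2, 8))                # 2 sequences, 8-d features
samples = pde_sample(feats, rng.normal(size=(8, 4)), np.zeros((8, 4)))
```

Objectives such as D-VLC or D-ITM would then compare these sampled (or analytic) distributions across modalities rather than point embeddings.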

Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
Wu, Wenhao and Luo, Haipeng and Fang, Bo and Wang, Jingdong and Ouyang, Wanli



Research question: How to exploit the text information accompanying videos, such as titles, tags, and subtitles, for text-video retrieval.
Motivation: Existing text-video retrieval methods focus mainly on cross-modal matching between the visual content of videos and textual query sentences, ignoring the relevant text information that accompanies videos.
Method: Propose a novel text-video retrieval approach that directly generates associated captions from videos via zero-shot video captioning, leveraging knowledge from web-scale pre-trained models such as CLIP and GPT-2.
Results: The proposed Cap4Video framework achieves state-of-the-art performance on four standard text-video retrieval benchmarks, validating the effectiveness of the method.

Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences. However, in real-world scenarios, online videos are often accompanied by relevant text information such as titles, tags, and even subtitles, which can be utilized to match textual queries. This insight has motivated us to propose a novel approach to text-video retrieval, where we directly generate associated captions from videos using zero-shot video captioning with knowledge from web-scale pre-trained models (e.g., CLIP and GPT-2). Given the generated captions, a natural question arises: what benefits do they bring to text-video retrieval? To answer this, we introduce Cap4Video, a new framework that leverages captions in three ways: i) Input data: video-caption pairs can augment the training data. ii) Intermediate feature interaction: we perform cross-modal feature interaction between the video and caption to produce enhanced video representations. iii) Output score: the Query-Caption matching branch can complement the original Query-Video matching branch for text-video retrieval. We conduct comprehensive ablation studies to demonstrate the effectiveness of our approach. Without any post-processing, Cap4Video achieves state-of-the-art performance on four standard text-video retrieval benchmarks: MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%). The code is available at https://github.com/whwu95/Cap4Video.
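The output-level use of captions (iii) amounts to score fusion. Below is a sketch with toy similarity matrices; the interpolation weight `alpha` is an assumption for illustration, not a value from the paper.

```python
import numpy as np

def fused_retrieval_scores(query_video, query_caption, alpha=0.5):
    """Complement the query-video matching scores with query-caption matching
    scores; both matrices are (num_queries, num_videos)."""
    return alpha * query_video + (1 - alpha) * query_caption

qv = np.array([[0.9, 0.1], [0.2, 0.8]])  # query x video similarities
qc = np.array([[0.7, 0.3], [0.1, 0.6]])  # query x caption similarities
scores = fused_retrieval_scores(qv, qc)
ranked = np.argsort(-scores, axis=1)     # per-query video ranking
```

The other two uses of captions (training-data augmentation and intermediate feature interaction) happen earlier in the pipeline; this fusion is the only part that touches the final ranking.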

Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles
Ye, Shuquan and Xie, Yujia and Chen, Dongdong and Xu, Yichong and Yuan, Lu and Zhu, Chenguang and Liao, Jing



Research question: Analyze and improve the commonsense ability of recent popular vision-language (VL) models.
Motivation: Despite their great success, existing VL models still lack commonsense knowledge and reasoning ability, a vital component on the path toward artificial general intelligence.
Method: Propose a data augmentation strategy named "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE), which injects commonsense knowledge from a commonsense knowledge graph (e.g., ConceptNet) into existing VL datasets on the fly during training.
Results: Extensive experiments on representative VL models demonstrate that the DANCE technique significantly improves commonsense ability while maintaining performance on vanilla retrieval tasks.

This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models. Despite the great success, we observe that existing VL-models still lack commonsense knowledge/reasoning ability (e.g., "Lemons are sour"), which is a vital component towards artificial general intelligence. Through our analysis, we find one important reason is that existing large-scale VL datasets do not contain much commonsense knowledge, which motivates us to improve the commonsense of VL-models from the data perspective. Rather than collecting a new VL training dataset, we propose a more scalable strategy, i.e., "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE). It can be viewed as one type of data augmentation technique, which can inject commonsense knowledge into existing VL datasets on the fly during training. More specifically, we leverage the commonsense knowledge graph (e.g., ConceptNet) and create variants of text description in VL datasets via bidirectional sub-graph sequentialization. For better commonsense evaluation, we further propose the first retrieval-based commonsense diagnostic benchmark. By conducting extensive experiments on some representative VL-models, we demonstrate that our DANCE technique is able to significantly improve the commonsense ability while maintaining the performance on vanilla retrieval tasks.
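The linearization step can be illustrated on ConceptNet-style triples. The relation-to-template mapping below is a hypothetical simplification of the paper's bidirectional sub-graph sequentialization, shown only to make the idea concrete.

```python
def linearize(triples):
    """Turn (head, relation, tail) triples into text descriptions that can be
    injected into a VL dataset on the fly during training."""
    phrases = {"HasProperty": "{} is {}", "IsA": "{} is a kind of {}"}
    return [phrases[rel].format(head, tail) for head, rel, tail in triples]

sents = linearize([("lemon", "HasProperty", "sour"),
                   ("lemon", "IsA", "fruit")])
```

Each generated sentence can then be paired with existing images of the head concept, turning graph knowledge into extra image-text training pairs.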

S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning
Suo, Wei and Sun, Mengyang and Liu, Weisong and Gao, Yiqi and Wang, Peng and Zhang, Yanning and Wu, Qi



Research question: Explain the decision-making process of VQA models in natural language, and overcome the bottlenecks of existing methods in logical consistency and in obtaining human-annotated explanations.
Motivation: Traditional attention or gradient analysis cannot faithfully reflect the reasoning process, whereas free-text rationales are easier to understand and gain users' trust; existing methods mostly rely on post-hoc or self-rationalization models to obtain plausible explanations, and suffer from logical inconsistency and the expense of collecting human-annotated explanations.
Method: Propose a new Semi-Supervised VQA-NLE via Self-Critical Learning (S3C) approach, which evaluates candidate explanations with answering rewards to improve the logical consistency between answers and rationales; with a semi-supervised learning framework, S3C can benefit from a tremendous number of samples without human-annotated explanations.
Results: Extensive automatic measures and human evaluations show the effectiveness of the method, and the framework achieves new state-of-the-art performance on two VQA-NLE datasets.

VQA Natural Language Explanation (VQA-NLE) task aims to explain the decision-making process of VQA models in natural language. Unlike traditional attention or gradient analysis, free-text rationales can be easier to understand and gain users' trust. Existing methods mostly use post-hoc or self-rationalization models to obtain a plausible explanation. However, these frameworks are bottlenecked by the following challenges: 1) the reasoning process cannot be faithfully responded to and suffer from the problem of logical inconsistency. 2) Human-annotated explanations are expensive and time-consuming to collect. In this paper, we propose a new Semi-Supervised VQA-NLE via Self-Critical Learning (S3C), which evaluates the candidate explanations by answering rewards to improve the logical consistency between answers and rationales. With a semi-supervised learning framework, the S3C can benefit from a tremendous amount of samples without human-annotated explanations. A large number of automatic measures and human evaluations all show the effectiveness of our method. Meanwhile, the framework achieves a new state-of-the-art performance on the two VQA-NLE datasets.

LEMaRT: Label-Efficient Masked Region Transform for Image Harmonization
Liu, Sheng and Huynh, Cong Phuoc and Chen, Cong and Arap, Maxim and Hamid, Raffay



Research question: Develop a simple yet effective self-supervised pre-training method for image harmonization that can leverage large-scale unannotated image datasets.
Motivation: Existing image harmonization models require large amounts of annotated training data, which is impractical; the proposed self-supervised pre-training method can instead train on unannotated images.
Method: First, generate pre-training data online with the Label-Efficient Masked Region Transform (LEMaRT) pipeline, which produces a foreground mask and applies a set of transformations to the visual attributes of the specified region, such as defocus blur, contrast, and saturation; image harmonization models are then pre-trained by recovering the original image from the perturbed one. Second, propose SwinIH, a new image harmonization model that retrofits the Swin Transformer with a combination of local and global self-attention mechanisms.
Results: Experiments show that SwinIH pre-trained with LEMaRT reaches a new state of the art for image harmonization while being more label-efficient than existing methods, i.e., requiring less annotated data for fine-tuning. On iHarmony4, SwinIH outperforms the state-of-the-art SCS-Co by 0.4 dB when fine-tuned on only 50% of the training data, and by 1.0 dB when trained on the full training set.

We present a simple yet effective self-supervised pretraining method for image harmonization which can leverage large-scale unannotated image datasets. To achieve this goal, we first generate pre-training data online with our Label-Efficient Masked Region Transform (LEMaRT) pipeline. Given an image, LEMaRT generates a foreground mask and then applies a set of transformations to perturb various visual attributes, e.g., defocus blur, contrast, saturation, of the region specified by the generated mask. We then pre-train image harmonization models by recovering the original image from the perturbed image. Secondly, we introduce an image harmonization model, namely SwinIH, by retrofitting the Swin Transformer [27] with a combination of local and global self-attention mechanisms. Pretraining SwinIH with LEMaRT results in a new state of the art for image harmonization, while being label-efficient, i.e., consuming less annotated data for fine-tuning than existing methods. Notably, on the iHarmony4 dataset [8], SwinIH outperforms the state of the art, i.e., SCS-Co [16], by a margin of 0.4 dB when it is fine-tuned on only 50% of the training data, and by 1.0 dB when it is trained on the full training dataset.
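The pre-training pair construction, perturbing only the masked region so the model can learn to undo the perturbation, can be sketched directly. The perturbation below (a contrast-like boost) is only one stand-in for the paper's defocus blur/contrast/saturation set, and the mask is a toy example.

```python
import numpy as np

def masked_region_transform(img, mask, perturb):
    """Build a LEMaRT-style pre-training pair: perturb only the masked
    (foreground) region; a harmonization model is then trained to recover
    `img` from the returned image."""
    return img * (1 - mask) + perturb(img) * mask

img = np.full((4, 4), 0.5)                    # toy grayscale image
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                          # toy foreground region
out = masked_region_transform(img, mask, lambda x: np.clip(x * 1.4, 0.0, 1.0))
```

Because the unperturbed image is the reconstruction target, the pipeline produces supervision for free from any unannotated image.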

Multi-Concept Customization of Text-to-Image Diffusion
Kumari, Nupur and Zhang, Bingliang and Zhang, Richard and Shechtman, Eli and Zhu, Jun-Yan



Research question: How can a model quickly learn new concepts and synthesize multiple new concepts together?
Motivation: Users wish to synthesize their own new concepts, such as their family, pets, or items.
Method: Propose Custom Diffusion, which represents new concepts by optimizing the text-to-image conditioning mechanism, enabling fast tuning; multiple concepts can be trained jointly, or multiple fine-tuned models can be combined into one.
Results: The method outperforms or is on par with several baselines and concurrent works in both qualitative and quantitative evaluations, while being memory- and computationally efficient.

While generative models produce high-quality images of concepts learned from a large-scale database, a user often wishes to synthesize instantiations of their own concepts (for example, their family, pets, or items). Can we teach a model to quickly acquire a new concept, given a few examples? Furthermore, can we compose multiple new concepts together? We propose Custom Diffusion, an efficient method for augmenting existing text-to-image models. We find that only optimizing a few parameters in the text-to-image conditioning mechanism is sufficiently powerful to represent new concepts while enabling fast tuning (~6 minutes). Additionally, we can jointly train for multiple concepts or combine multiple fine-tuned models into one via closed-form constrained optimization. Our fine-tuned model generates variations of multiple new concepts and seamlessly composes them with existing concepts in novel settings. Our method outperforms or performs on par with several baselines and concurrent works in both qualitative and quantitative evaluations, while being memory and computationally efficient.

Advancing Visual Grounding With Scene Knowledge: Benchmark and Method
Chen, Zhihong and Zhang, Ruifei and Song, Yibing and Wan, Xiang and Li, Guanbin



Research question: Address the visual grounding task, i.e., establishing fine-grained alignment between vision and language.
Motivation: Most existing visual grounding datasets are built from simple description texts, which cannot adequately test a model's understanding of and reasoning over images and texts.
Method: Propose a new Scene Knowledge-guided Visual Grounding (SK-VG) benchmark, in which image content and referring expressions alone are insufficient to determine the target objects, forcing models to reason over long-form scene knowledge. Two approaches accepting the triple-type input are proposed: one embeds the knowledge into the image features, and the other leverages linguistic structure to assist in computing image-text matching.
Results: Experimental results show that the proposed methods achieve promising results but still leave room for improvement in both performance and interpretability.

Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it can be a testbed for vision-and-language models to evaluate their understanding of the images and texts and their reasoning abilities over their joint space. However, most existing VG datasets are constructed using simple description texts, which do not require sufficient reasoning over the images and texts. This has been demonstrated in a recent study, where a simple LSTM-based text encoder without pretraining can achieve state-of-the-art performance on mainstream VG datasets. Therefore, in this paper, we propose a novel benchmark of Scene Knowledge-guided Visual Grounding (SK-VG), where the image content and referring expressions are not sufficient to ground the target objects, forcing the models to have a reasoning ability on the long-form scene knowledge. To perform this task, we propose two approaches to accept the triple-type input, where the former embeds knowledge into the image features before the image-query interaction; the latter leverages linguistic structure to assist in computing the image-text matching. We conduct extensive experiments to analyze the above methods and show that the proposed approaches achieve promising results but still leave room for improvement, including performance and interpretability.

Multiview Compressive Coding for 3D Reconstruction
Wu, Chao-Yuan and Johnson, Justin and Malik, Jitendra and Feichtenhofer, Christoph and Gkioxari, Georgia



Research question: How to understand the 3D structure of objects and scenes from a single image.
Motivation: 2D recognition has made tremendous progress thanks to large-scale learning and general-purpose representations, but 3D brings new challenges stemming from occlusions not depicted in the image.
Method: Propose a simple framework that learns generalizable representations inspired by advances in self-supervised learning. The model, Multiview Compressive Coding (MCC), learns from diverse RGB-D videos and compresses the input appearance and geometry to predict 3D structure.
Results: MCC's generality and efficiency allow it to learn from large-scale and diverse data sources, with strong generalization to novel objects imagined by DALL·E 2 or captured in the wild with an iPhone.

A central goal of visual recognition is to understand objects and scenes from a single image. 2D recognition has witnessed tremendous progress thanks to large-scale learning and general-purpose representations. But, 3D poses new challenges stemming from occlusions not depicted in the image. Prior works try to overcome these by inferring from multiple views or by relying on scarce CAD models and category-specific priors, which hinder scaling to novel settings. In this work, we explore single-view 3D reconstruction by learning generalizable representations inspired by advances in self-supervised learning. We introduce a simple framework that operates on 3D points of single objects or whole scenes coupled with category-agnostic large-scale training from diverse RGB-D videos. Our model, Multiview Compressive Coding (MCC), learns to compress the input appearance and geometry to predict the 3D structure by querying a 3D-aware decoder. MCC's generality and efficiency allow it to learn from large-scale and diverse data sources with strong generalization to novel objects imagined by DALL·E 2 or captured in-the-wild with an iPhone.

Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
Piergiovanni, AJ and Kuo, Weicheng and Angelova, Anelia



Research question: How to turn a ViT encoder into an efficient video model that can handle both image and video inputs.
Motivation: Existing methods must process the two input types separately and lack efficiency.
Method: Sparsely sample the inputs so that the model can train and run inference on both; the approach requires no full fine-tuning and scales easily.
Results: The model achieves state-of-the-art results and can be adapted to large-scale pre-trained ViTs without full fine-tuning.

We present a simple approach which can turn a ViT encoder into an efficient video model, which can seamlessly work with both image and video inputs. By sparsely sampling the inputs, the model is able to do training and inference from both inputs. The model is easily scalable and can be adapted to large-scale pre-trained ViTs without requiring full finetuning. The model achieves SOTA results.

Token Boosting for Robust Self-Supervised Visual Transformer Pre-Training
Li, Tianjiao and Foo, Lin Geng and Hu, Ping and Shang, Xindi and Rahmani, Hossein and Yuan, Zehuan and Liu, Jun



Research question: When pre-training Visual Transformers (VTs), the input data may be corrupted and unreliable, which is a challenge in real-world scenarios.
Motivation: Existing pre-training methods tend to overlook the potential unreliability of input data; in masked autoencoding pre-training in particular, both the inputs and the masked "ground truth" targets can be unreliable.
Method: Propose a plug-and-play component for VTs, the Token Boosting Module (TBM), which enables them to learn to extract clean and robust features during masked autoencoding pre-training.
Results: Theoretical analysis and extensive experiments show that TBM yields more robust and generalizable representations during pre-training, benefiting downstream tasks; experiments on four corrupted datasets show that TBM consistently improves downstream performance.

Learning with large-scale unlabeled data has become a powerful tool for pre-training Visual Transformers (VTs). However, prior works tend to overlook that, in real-world scenarios, the input data may be corrupted and unreliable. Pre-training VTs on such corrupted data can be challenging, especially when we pre-train via the masked autoencoding approach, where both the inputs and masked "ground truth" targets can potentially be unreliable in this case. To address this limitation, we introduce the Token Boosting Module (TBM) as a plug-and-play component for VTs that effectively allows the VT to learn to extract clean and robust features during masked autoencoding pre-training. We provide theoretical analysis to show how TBM improves model pre-training with more robust and generalizable representations, thus benefiting downstream tasks. We conduct extensive experiments to analyze TBM's effectiveness, and results on four corrupted datasets demonstrate that TBM consistently improves performance on downstream tasks.

Probabilistic Prompt Learning for Dense Prediction
Kwon, Hyeongjun and Song, Taeyong and Jeong, Somi and Kim, Jin and Jang, Jinhyun and Sohn, Kwanghoon



Research question: Propose a novel probabilistic prompt learning method to fully exploit vision-language knowledge for dense prediction tasks.
Motivation: Current deterministic prompt learning methods achieve limited performance on dense prediction tasks that must handle more complex and diverse objects, since a single deterministic description cannot sufficiently represent the entire image.
Method: Introduce learnable class-agnostic attribute prompts that describe universal attributes across object classes; combined with class information and visual-context knowledge, these define class-specific textual distributions. Text representations are then sampled and used to guide the dense prediction task with a probabilistic pixel-text matching loss, improving the stability and generalization capability of the method.
Results: Extensive experiments and ablation studies on different dense prediction tasks demonstrate the effectiveness of the proposed method.

Recent progress in deterministic prompt learning has become a promising alternative to various downstream vision tasks, enabling models to learn powerful visual representations with the help of pre-trained vision-language models. However, this approach results in limited performance for dense prediction tasks that require handling more complex and diverse objects, since a single and deterministic description cannot sufficiently represent the entire image. In this paper, we present a novel probabilistic prompt learning to fully exploit the vision-language knowledge in dense prediction tasks. First, we introduce learnable class-agnostic attribute prompts to describe universal attributes across the object class. The attributes are combined with class information and visual-context knowledge to define the class-specific textual distribution. Text representations are sampled and used to guide the dense prediction task using the probabilistic pixel-text matching loss, enhancing the stability and generalization capability of the proposed method. Extensive experiments on different dense prediction tasks and ablation studies demonstrate the effectiveness of our proposed method.

WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation
Jeong, Jongheon and Zou, Yang and Kim, Taewan and Zhang, Dongqing and Ravichandran, Avinash and Dabeer, Onkar



Research question: How to automate visual anomaly classification and segmentation for industrial quality inspection.
Motivation: Prior research trains a custom model for each quality inspection task, which requires task-specific images and annotations.
Method: Propose window-based CLIP (WinCLIP) and its few-normal-shot extension WinCLIP+. WinCLIP uses a compositional ensemble of state words and prompt templates and extracts and aggregates window/patch/image-level features aligned with text; WinCLIP+ additionally uses complementary information from normal images.
Results: On MVTec-AD (and VisA), without further tuning, WinCLIP achieves 91.8%/85.1% (78.1%/79.6%) AUROC in zero-shot anomaly classification and segmentation, while WinCLIP+ reaches 93.1%/95.2% (83.8%/96.4%) in the 1-normal-shot setting, surpassing the prior state of the art by large margins.

Visual anomaly classification and segmentation are vital for automating industrial quality inspection. The focus of prior research in the field has been on training custom models for each quality inspection task, which requires task-specific images and annotation. In this paper we move away from this regime, addressing zero-shot and few-normal-shot anomaly classification and segmentation. Recently CLIP, a vision-language model, has shown revolutionary generality with competitive zero/few-shot performance in comparison to full-supervision. But CLIP falls short on anomaly classification and segmentation tasks. Hence, we propose window-based CLIP (WinCLIP) with (1) a compositional ensemble on state words and prompt templates and (2) efficient extraction and aggregation of window/patch/image-level features aligned with text. We also propose its few-normal-shot extension WinCLIP+, which uses complementary information from normal images. In MVTec-AD (and VisA), without further tuning, WinCLIP achieves 91.8%/85.1% (78.1%/79.6%) AUROC in zero-shot anomaly classification and segmentation while WinCLIP+ does 93.1%/95.2% (83.8%/96.4%) in 1-normal-shot, surpassing state-of-the-art by large margins.
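The compositional ensemble (1) pairs every state word with every prompt template. Below is a sketch with hypothetical state words and templates; in WinCLIP the resulting per-state texts would be encoded and averaged before matching against window/patch/image-level features.

```python
# hypothetical state words and templates, for illustration only
states = ["flawless", "damaged"]
templates = ["a photo of a {} {}", "a cropped photo of the {} {}"]

def compositional_prompts(obj):
    """Combine every state word with every template for a given object name;
    each state's prompt list would be text-encoded and averaged into one
    per-state embedding."""
    return {s: [t.format(s, obj) for t in templates] for s in states}

prompts = compositional_prompts("bottle")
```

Scoring then reduces to comparing a visual feature against the "normal-state" and "anomaly-state" text embeddings, with no task-specific training.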

Learning Geometric-Aware Properties in 2D Representation Using Lightweight CAD Models, or Zero Real 3D Pairs
Arsomngern, Pattaramanee and Nutanong, Sarana and Suwajanakorn, Supasorn



Research question: How to enhance 2D scene understanding using lightweight 3D data such as CAD models.
Motivation: The need for large-scale scene datasets limits scalability and further improvement, motivating an alternative way of learning.
Method: Construct a 3D space with geometric-aware alignment, in which similarity reflects the geometric similarity of CAD models under the Chamfer distance, and induce the acquired geometric-aware properties into 2D features.
Results: The method outperforms existing RGB-CAD approaches on various 2D understanding tasks; even using only lightweight CAD models or pseudo data, it achieves results comparable to state-of-the-art scene-scan methods on four tasks in NYUv2, SUNRGB-D, indoor ADE20k, and indoor/outdoor COCO.

Cross-modal training using 2D-3D paired datasets, such as those containing multi-view images and 3D scene scans, presents an effective way to enhance 2D scene understanding by introducing geometric and view-invariance priors into 2D features. However, the need for large-scale scene datasets can impede scalability and further improvements. This paper explores an alternative learning method by leveraging a lightweight and publicly available type of 3D data in the form of CAD models. We construct a 3D space with geometric-aware alignment where the similarity in this space reflects the geometric similarity of CAD models based on the Chamfer distance. The acquired geometric-aware properties are then induced into 2D features, which boost performance on downstream tasks more effectively than existing RGB-CAD approaches. Our technique is not limited to paired RGB-CAD datasets. By training exclusively on pseudo pairs generated from CAD-based reconstruction methods, we enhance the performance of SOTA 2D pre-trained models that use ResNet-50 or ViT-B backbones on various 2D understanding tasks. We also achieve comparable results to SOTA methods trained on scene scans on four tasks in NYUv2, SUNRGB-D, indoor ADE20k, and indoor/outdoor COCO, despite using lightweight CAD models or pseudo data.
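The Chamfer distance that defines similarity in the geometric-aware space is standard. A minimal NumPy version of the squared-distance, symmetric-sum variant (other normalizations exist; the paper does not pin one down here):

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean nearest-neighbor squared distance, summed over both directions."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# toy point set: four corners of a unit tetrahedron
corners = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
```

Two CAD models with a small Chamfer distance should land close together in the learned 3D-aligned embedding space, and that structure is then distilled into the 2D features.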

Texts as Images in Prompt Tuning for Multi-Label Image Recognition
Guo, Zixian and Dong, Bowen and Ji, Zhilong and Bai, Jinfeng and Guo, Yiwen and Zuo, Wangmeng



Research question: How can text descriptions be used for prompt tuning, to adapt to downstream tasks with limited data or labels?
Motivation: Existing methods require visual data (i.e., images) to learn prompts, whereas text descriptions are easy to collect and their class labels can be derived directly.
Method: Propose TaI prompting, which treats text descriptions as images for prompt tuning, and further propose double-grained prompt tuning (TaI-DPT) to improve multi-label recognition performance.
Results: Experiments show that TaI-DPT outperforms zero-shot CLIP on multiple benchmarks and can be combined with existing image-based prompting methods to further improve recognition performance.

Prompt tuning has been employed as an efficient way to adapt large vision-language pre-trained models (e.g. CLIP) to various downstream tasks in data-limited or label-limited settings. Nonetheless, visual data (e.g., images) is by default prerequisite for learning prompts in existing methods. In this work, we advocate that the effectiveness of image-text contrastive learning in aligning the two modalities (for training CLIP) further makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting. In contrast to the visual data, text descriptions are easy to collect, and their class labels can be directly derived. Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning. Moreover, with TaI, double-grained prompt tuning (TaI-DPT) is further presented to extract both coarse-grained and fine-grained embeddings for enhancing the multi-label recognition performance. Experimental results show that our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks, e.g., MS-COCO, VOC2007, and NUS-WIDE, while it can be combined with existing methods of prompting from images to improve recognition performance further. The code is released at https://github.com/guozix/TaI-DPT.

Finetune Like You Pretrain: Improved Finetuning of Zero-Shot Vision Models
Goyal, Sachin and Kumar, Ananya and Garg, Sankalp and Kolter, Zico and Raghunathan, Aditi



Research question: How to address the problem that subtle differences in the finetuning of image-text models such as CLIP can lead to large differences in final performance.
Motivation: Recent work has shown that even small differences in the finetuning process can produce surprisingly large differences in final performance on both in-distribution (ID) and out-of-distribution (OOD) data.
Method: Propose a natural and simple approach that mimics contrastive pretraining: cast downstream class labels as text prompts and continue optimizing the contrastive loss between image embeddings and class-descriptive prompt embeddings (contrastive finetuning).
Results: The method consistently outperforms baselines across 7 distribution-shift, 6 transfer-learning, and 3 few-shot-learning benchmarks. On WILDS-iWildCam, the proposed FLYP beats the top of the leaderboard by 2.3% ID and 2.7% OOD, giving the highest reported accuracy. Across 7 OOD datasets (2 WILDS and 5 ImageNet-associated shifts), FLYP gains 4.2% OOD over standard finetuning and outperforms the state-of-the-art LP-FT by more than 1% both ID and OOD. Similarly, on 3 few-shot benchmarks, FLYP improves over standard finetuning and the state of the art by up to 4.6% and 4.4%, respectively. Contrastive finetuning is thus established as a simple and intuitive state-of-the-art method for supervised finetuning of image-text models such as CLIP.

Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety of benchmarks. However, recent works (Kumar et al., 2022; Wortsman et al., 2021) have shown that even subtle differences in the finetuning process can lead to surprisingly large differences in the final performance, both for in-distribution (ID) and out-of-distribution (OOD) data. In this work, we show that a natural and simple approach of mimicking contrastive pretraining consistently outperforms alternative finetuning approaches. Specifically, we cast downstream class labels as text prompts and continue optimizing the contrastive loss between image embeddings and class-descriptive prompt embeddings (contrastive finetuning). Our method consistently outperforms baselines across 7 distribution shift, 6 transfer learning, and 3 few-shot learning benchmarks. On WILDS-iWILDCam, our proposed approach FLYP outperforms the top of the leaderboard by 2.3% ID and 2.7% OOD, giving the highest reported accuracy. Averaged across 7 OOD datasets (2 WILDS and 5 ImageNet associated shifts), FLYP gives gains of 4.2% OOD over standard finetuning and outperforms current state-of-the-art (LP-FT) by more than 1% both ID and OOD. Similarly, on 3 few-shot learning benchmarks, FLYP gives gains up to 4.6% over standard finetuning and 4.4% over the state-of-the-art. Thus we establish our proposed method of contrastive finetuning as a simple and intuitive state-of-the-art for supervised finetuning of image-text models like CLIP. Code is available at https://github.com/locuslab/FLYP.
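The core of contrastive finetuning can be sketched as a CLIP-style cross-entropy between L2-normalized image embeddings and class-prompt embeddings. Only the image-to-text direction is shown, the shapes and temperature are illustrative, and the actual objective is symmetric over both modalities:

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, labels, temperature=0.07):
    """Contrastive finetuning sketch: image embeddings scored against the
    prompt embeddings of all classes; labels[i] indexes the prompt row
    matching image i. One direction of CLIP's symmetric objective."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature             # (batch, num_prompts)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

The point of FLYP is that this is the *same* loss CLIP was pretrained with, merely continued with class-descriptive prompts standing in for captions.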

Hint-Aug: Drawing Hints From Foundation Vision Transformers Towards Boosted Few-Shot Parameter-Efficient Tuning
Yu, Zhongzhi and Wu, Shang and Fu, Yonggan and Zhang, Shunyao and Lin, Yingyan (Celine)



Research question: How to fully unleash the potential of foundation vision transformers (FViTs) when tuning them for downstream tasks with limited data.
Motivation: FViTs' data-hungry nature, together with the limited features contained in few-shot tuning data, makes exploiting their full potential under data-limited scenarios a challenge.
Method: Propose a hint-based data augmentation (Hint-Aug) framework that boosts few-shot FViT tuning by augmenting the over-fitted parts of tuning samples with features already learned by the pretrained FViTs.
Results: Extensive experiments and ablation studies on five datasets and three parameter-efficient tuning techniques validate Hint-Aug's effectiveness: 0.04% to 32.91% higher accuracy than state-of-the-art data augmentation methods under various low-shot settings. For example, on the Pet dataset, Hint-Aug achieves 2.22% higher accuracy than state-of-the-art data augmentation methods while using 50% less training data.

Despite the growing demand for tuning foundation vision transformers (FViTs) on downstream tasks, fully unleashing FViTs' potential under data-limited scenarios (e.g., few-shot tuning) remains a challenge due to FViTs' data-hungry nature. Common data augmentation techniques fall short in this context due to the limited features contained in the few-shot tuning data. To tackle this challenge, we first identify an opportunity for FViTs in few-shot tuning: pretrained FViTs themselves have already learned highly representative features from large-scale pretraining data, which are fully preserved during widely used parameter-efficient tuning. We thus hypothesize that leveraging those learned features to augment the tuning data can boost the effectiveness of few-shot FViT tuning. To this end, we propose a framework called Hint-based Data Augmentation (Hint-Aug), which aims to boost FViT in few-shot tuning by augmenting the over-fitted parts of tuning samples with the learned features of pretrained FViTs. Specifically, Hint-Aug integrates two key enablers: (1) an Attentive Over-fitting Detector (AOD) to detect over-confident patches of foundation ViTs for potentially alleviating their over-fitting on the few-shot tuning data and (2) a Confusion-based Feature Infusion (CFI) module to infuse easy-to-confuse features from the pretrained FViTs with the over-confident patches detected by the above AOD in order to enhance the feature diversity during tuning. Extensive experiments and ablation studies on five datasets and three parameter-efficient tuning techniques consistently validate Hint-Aug's effectiveness: 0.04% to 32.91% higher accuracy over the state-of-the-art (SOTA) data augmentation method under various low-shot settings. For example, on the Pet dataset, Hint-Aug achieves a 2.22% higher accuracy with 50% less training data over SOTA data augmentation methods.

Explicit Visual Prompting for Low-Level Structure Segmentations
Liu, Weihuang and Shen, Xi and Pun, Chi-Man and Cun, Xiaodong



Research question: The generic problem of detecting low-level structures in images, including segmenting manipulated parts, identifying out-of-focus pixels, separating shadow regions, and detecting concealed objects.
Motivation: Although each of these problems typically has a dedicated solution, the authors argue that a unified approach can perform well across all of them.
Method: Inspired by the pre-training and prompt-tuning protocols widely used in NLP, the authors propose a new visual prompting model named Explicit Visual Prompting (EVP). Unlike previous visual prompts, which are typically dataset-level implicit embeddings, the key insight is to force the tunable parameters to focus on the explicit visual content of each individual image, i.e., the features from the frozen patch embeddings and the input's high-frequency components.
Results: EVP significantly outperforms other parameter-efficient tuning protocols under the same number of tunable parameters (5.7% extra trainable parameters per task). Compared with task-specific solutions, EVP also achieves state-of-the-art performance on various low-level structure segmentation tasks.

We consider the generic problem of detecting low-level structures in images, which includes segmenting the manipulated parts, identifying out-of-focus pixels, separating shadow regions, and detecting concealed objects. Whereas each such topic has been typically addressed with a domain-specific solution, we show that a unified approach performs well across all of them. We take inspiration from the widely-used pre-training and then prompt tuning protocols in NLP and propose a new visual prompting model, named Explicit Visual Prompting (EVP). Different from the previous visual prompting which is typically a dataset-level implicit embedding, our key insight is to enforce the tunable parameters focusing on the explicit visual content from each individual image, i.e., the features from frozen patch embeddings and the input's high-frequency components. The proposed EVP significantly outperforms other parameter-efficient tuning protocols under the same amount of tunable parameters (5.7% extra trainable parameters of each task). EVP also achieves state-of-the-art performances on diverse low-level structure segmentation tasks compared to task-specific solutions. Our code is available at: https://github.com/NiFangBaAGe/Explicit-Visual-Prompt.
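The input's high-frequency components mentioned above can be obtained by masking out the low-frequency centre of the Fourier spectrum and inverting; the square mask and its ratio here are illustrative choices, not the paper's exact recipe:

```python
import numpy as np

def split_frequencies(image, mask_ratio=0.25):
    """Split a grayscale image into high- and low-frequency components by
    keeping (resp. removing) a centred square of the shifted FFT spectrum."""
    f = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    ch, cw = int(h * mask_ratio / 2), int(w * mask_ratio / 2)
    low = np.zeros_like(f)
    low[h//2 - ch:h//2 + ch, w//2 - cw:w//2 + cw] = \
        f[h//2 - ch:h//2 + ch, w//2 - cw:w//2 + cw]
    hfc = np.fft.ifft2(np.fft.ifftshift(f - low)).real  # high frequencies
    lfc = np.fft.ifft2(np.fft.ifftshift(low)).real      # low frequencies
    return hfc, lfc
```

Because the split is linear, the two components sum back to the original image, so no information is discarded, only repartitioned for the prompt.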

PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection
Chen, Anthony and Zhang, Kevin and Zhang, Renrui and Wang, Zihan and Lu, Yuheng and Guo, Yandong and Zhang, Shanghang



Research question: The capability of masked autoencoders in multi-modality settings, in particular for point cloud and RGB image data, two modalities that often appear together in the real world.
Motivation: Although masked autoencoders learn strong visual representations and achieve state-of-the-art results in several independent modalities, their application to multi-modality settings has rarely been studied.
Method: Propose PiMAE, a self-supervised pre-training framework that promotes 3D-2D interaction through three aspects. First, noting the importance of masking strategies between the two sources, a projection module complementarily aligns the masked and visible tokens of the two modalities. Then, a well-crafted two-branch MAE pipeline with a novel shared decoder promotes cross-modal interaction in the mask tokens. Finally, a unique cross-modal reconstruction module enhances representation learning for both modalities.
Results: Extensive experiments on large-scale RGB-D scene-understanding benchmarks (SUN RGB-D and ScanNetV2) show that interactively learning point-image features is non-trivial; the model substantially improves multiple 3D detectors, 2D detectors, and few-shot classifiers by 2.9%, 6.7%, and 2.4%, respectively. Code is available at https://github.com/BLVLab/PiMAE.

Masked Autoencoders learn strong visual representations and achieve state-of-the-art results in several independent modalities, yet very few works have addressed their capabilities in multi-modality settings. In this work, we focus on point cloud and RGB image data, two modalities that are often presented together in the real world and explore their meaningful interactions. To improve upon the cross-modal synergy in existing works, we propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D interaction through three aspects. Specifically, we first notice the importance of masking strategies between the two sources and utilize a projection module to complementarily align the mask and visible tokens of the two modalities. Then, we utilize a well-crafted two-branch MAE pipeline with a novel shared decoder to promote cross-modality interaction in the mask tokens. Finally, we design a unique cross-modal reconstruction module to enhance representation learning for both modalities. Through extensive experiments performed on large-scale RGB-D scene understanding benchmarks (SUN RGB-D and ScannetV2), we discover it is nontrivial to interactively learn point-image features, where we greatly improve multiple 3D detectors, 2D detectors and few-shot classifiers by 2.9%, 6.7%, and 2.4%, respectively. Code is available at https://github.com/BLVLab/PiMAE.

Referring Image Matting
Li, Jizhizi and Zhang, Jing and Tao, Dacheng



Research question: This paper proposes a new task named Referring Image Matting (RIM), which aims to extract the meticulous alpha matte of the specific object that best matches a given natural-language description, enabling a more natural and simpler instruction for image matting.
Motivation: Conventional image matting either requires user-defined scribbles/trimaps to extract a specific foreground object or indiscriminately extracts all foreground objects in the image; RIM is introduced to overcome these limitations.
Method: First design a comprehensive image composition and expression generation engine that automatically produces high-quality images and diverse text attributes from public datasets, establishing the large-scale challenging dataset RefMatte. Then construct a real-world test set of 100 high-resolution natural images with manually annotated complex phrases to evaluate the out-of-domain generalization of RIM methods. Finally, propose a novel baseline method for RIM named CLIPMat, which includes a context-embedded prompt, a text-driven semantic pop-up, and a multi-level details extractor.
Results: Extensive experiments on RefMatte, in both the keyword and expression settings, validate the superiority of CLIPMat over representative methods. The authors hope this work provides new insights into image matting and encourages more follow-up studies.

Different from conventional image matting, which either requires user-defined scribbles/trimap to extract a specific foreground object or directly extracts all the foreground objects in the image indiscriminately, we introduce a new task named Referring Image Matting (RIM) in this paper, which aims to extract the meticulous alpha matte of the specific object that best matches the given natural language description, thus enabling a more natural and simpler instruction for image matting. First, we establish a large-scale challenging dataset RefMatte by designing a comprehensive image composition and expression generation engine to automatically produce high-quality images along with diverse text attributes based on public datasets. RefMatte consists of 230 object categories, 47,500 images, 118,749 expression-region entities, and 474,996 expressions. Additionally, we construct a real-world test set with 100 high-resolution natural images and manually annotate complex phrases to evaluate the out-of-domain generalization abilities of RIM methods. Furthermore, we present a novel baseline method CLIPMat for RIM, including a context-embedded prompt, a text-driven semantic pop-up, and a multi-level details extractor. Extensive experiments on RefMatte in both keyword and expression settings validate the superiority of CLIPMat over representative methods. We hope this work could provide novel insights into image matting and encourage more follow-up studies. The dataset, code and models are available at https://github.com/JizhiziLi/RIM.

ConStruct-VL: Data-Free Continual Structured VL Concepts Learning
Smith, James Seale and Cascante-Bonilla, Paola and Arbelle, Assaf and Kim, Donghyun and Panda, Rameswar and Cox, David and Yang, Diyi and Kira, Zsolt and Feris, Rogerio and Karlinsky, Leonid



Research question: Large pre-trained vision-and-language (VL) foundation models excel at zero-shot downstream tasks but remain brittle in structured vision-language concept (SVLC) reasoning.
Motivation: Fix VL models' deficiencies in SVLC reasoning, such as recognizing object attributes, states, and inter-object relations, where corrections often must be made using the private data in which the issue was found, which naturally leads to a data-free continual learning setting.
Method: Introduce the first continual, task-id-free structured VL concepts learning (ConStruct-VL) benchmark, and design a data-free method based on Adversarial Pseudo-Replay (APR), which improves the model by generating adversarial reminders of past tasks from past-task models. Also propose a continual parameter-efficient Layered-LoRA (LaLo) neural architecture that allows no-memory-cost access to all past models at training time.
Results: The method outperforms all data-free methods by as much as 7% and even matches some levels of experience replay, which is infeasible for applications that must preserve data privacy.

Recently, large-scale pre-trained Vision-and-Language (VL) foundation models have demonstrated remarkable capabilities in many zero-shot downstream tasks, achieving competitive results for recognizing objects defined by as little as short text prompts. However, it has also been shown that VL models are still brittle in Structured VL Concept (SVLC) reasoning, such as the ability to recognize object attributes, states, and inter-object relations. This leads to reasoning mistakes, which need to be corrected as they occur by teaching VL models the missing SVLC skills; often this must be done using private data where the issue was found, which naturally leads to a data-free continual (no task-id) VL learning setting. In this work, we introduce the first Continual Data-Free Structured VL Concepts Learning (ConStruct-VL) benchmark and show it is challenging for many existing data-free CL strategies. We, therefore, propose a data-free method comprised of a new approach of Adversarial Pseudo-Replay (APR) which generates adversarial reminders of past tasks from past task models. To use this method efficiently, we also propose a continual parameter-efficient Layered-LoRA (LaLo) neural architecture allowing no-memory-cost access to all past models at train time. We show this approach outperforms all data-free methods by as much as 7% while even matching some levels of experience-replay (prohibitive for applications where data-privacy must be preserved). Our code is publicly available at https://github.com/jamessealesmith/ConStruct-VL

Delivering Arbitrary-Modal Semantic Segmentation
Zhang, Jiaming and Liu, Ruiping and Shi, Hao and Yang, Kailun and Reiß



Research question: How to fuse an arbitrary number of modalities to make semantic segmentation more robust?
Motivation: Fusing an arbitrary number of modalities remains underexplored by current multimodal fusion methods.
Method: Create the DeLiVER arbitrary-modal segmentation benchmark, covering five modalities: Depth, LiDAR, multiple Views, Events, and RGB. The data are also provided under four severe weather conditions and five sensor-failure cases, to exploit modal complementarity and resolve partial outages. Propose the arbitrary cross-modal segmentation model CMNeXt, which contains a Self-Query Hub (SQ-Hub) that extracts effective information from any modality for subsequent fusion with the RGB representation, adding only a negligible number of parameters (about 0.01M) per additional modality. In addition, to efficiently and flexibly harvest discriminative cues from the auxiliary modalities, a simple Parallel Pooling Mixer (PPX) is introduced.
Results: Extensive experiments on six benchmarks show that CMNeXt achieves state-of-the-art performance, scaling from 1 to 81 modalities on the DeLiVER, KITTI-360, MFNet, NYU Depth V2, UrbanLF, and MCubeS datasets. On the newly collected DeLiVER, the quad-modal CMNeXt reaches 66.30% mIoU, a 9.10% gain over the mono-modal baseline.

Multimodal fusion can make semantic segmentation more robust. However, fusing an arbitrary number of modalities remains underexplored. To delve into this problem, we create the DeLiVER arbitrary-modal segmentation benchmark, covering Depth, LiDAR, multiple Views, Events, and RGB. Aside from this, we provide this dataset in four severe weather conditions as well as five sensor failure cases to exploit modal complementarity and resolve partial outages. To facilitate this data, we present the arbitrary cross-modal segmentation model CMNeXt. It encompasses a Self-Query Hub (SQ-Hub) designed to extract effective information from any modality for subsequent fusion with the RGB representation and adds only negligible amounts of parameters (~0.01M) per additional modality. On top, to efficiently and flexibly harvest discriminative cues from the auxiliary modalities, we introduce the simple Parallel Pooling Mixer (PPX). With extensive experiments on a total of six benchmarks, our CMNeXt achieves state-of-the-art performance, allowing to scale from 1 to 81 modalities on the DeLiVER, KITTI-360, MFNet, NYU Depth V2, UrbanLF, and MCubeS datasets. On the freshly collected DeLiVER, the quad-modal CMNeXt reaches up to 66.30% in mIoU with a +9.10% gain as compared to the mono-modal baseline.

Hyperbolic Contrastive Learning for Visual Representations Beyond Objects
Ge, Songwei and Mishra, Shlok and Kornblith, Simon and Li, Chun-Liang and Jacobs, David



Research question: Self-/un-supervised methods have driven rapid progress in visual representation learning, but they generally treat objects and scenes through the same lens.
Motivation: Observing that visually similar objects are close in the representation space, the authors argue that scenes and objects should instead follow a hierarchical structure based on their compositionality.
Method: Propose a contrastive learning framework in which a Euclidean loss is used to learn object representations and a hyperbolic loss encourages scene representations to lie close to the representations of their constituent objects. The novel hyperbolic objective encourages scene-object hypernymy among the representations by optimizing the magnitude of their norms.
Results: When pretraining on the COCO and OpenImages datasets, the hyperbolic loss improves the downstream performance of several baselines across multiple datasets and tasks, including image classification, object detection, and semantic segmentation. The properties of the learned representations also make it possible to solve, in a zero-shot fashion, various vision tasks that involve interactions between scenes and objects.

Although self-/un-supervised methods have led to rapid progress in visual representation learning, these methods generally treat objects and scenes using the same lens. In this paper, we focus on learning representations of objects and scenes that preserve the structure among them. Motivated by the observation that visually similar objects are close in the representation space, we argue that the scenes and objects should instead follow a hierarchical structure based on their compositionality. To exploit such a structure, we propose a contrastive learning framework where a Euclidean loss is used to learn object representations and a hyperbolic loss is used to encourage representations of scenes to lie close to representations of their constituent objects in hyperbolic space. This novel hyperbolic objective encourages the scene-object hypernymy among the representations by optimizing the magnitude of their norms. We show that when pretraining on the COCO and OpenImages datasets, the hyperbolic loss improves the downstream performance of several baselines across multiple datasets and tasks, including image classification, object detection, and semantic segmentation. We also show that the properties of the learned representations allow us to solve various vision tasks that involve the interaction between scenes and objects in a zero-shot fashion.
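The role of embedding norms in the hyperbolic objective can be sketched with the standard Poincaré-ball distance; the specific loss and curvature the paper uses may differ, but the norm intuition carries over:

```python
import numpy as np

def poincare_distance(u, v, eps=1e-7):
    """Distance in the Poincare ball (embedding norms < 1). In hyperbolic
    hierarchies, general concepts (scenes) sit near the origin with small
    norm while specific ones (objects) sit nearer the boundary, so a loss
    can shape scene-object hypernymy by acting on the norms."""
    uu = np.sum(u * u)
    vv = np.sum(v * v)
    dd = np.sum((u - v) ** 2)
    return np.arccosh(1 + 2 * dd / ((1 - uu) * (1 - vv) + eps))

scene = np.array([0.1, 0.0])      # small norm: near the origin
obj = np.array([0.8, 0.0])        # large norm: near the boundary
far_obj = np.array([-0.8, 0.0])   # boundary point on the opposite side
```

Note that distances blow up near the boundary, which is what gives hyperbolic space its tree-like capacity for hierarchies.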

Non-Contrastive Learning Meets Language-Image Pre-Training
Zhou, Jinghao and Dong, Li and Gan, Zhe and Wang, Lijuan and Wei, Furu



Research question: Explore the validity of non-contrastive language-image pre-training (nCLIP) and study whether the nice properties exhibited in visual self-supervised models can emerge.
Motivation: Although contrastive language-image pre-training (CLIP) has become the de-facto standard for aligning images and texts, the loose correlation between images and texts in web-crawled data makes the contrastive objective data-inefficient and dependent on large training batches.
Method: Empirically study the role of the non-contrastive objective in representation learning as well as its shortcomings under zero-shot recognition. Based on this study, further introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show how nCLIP helps CLIP enhance feature semantics. The synergy between the two objectives lets xCLIP excel at both zero-shot transfer and representation learning.
Results: Systematic evaluation across a wide variety of downstream tasks, including zero-shot classification, out-of-domain classification, retrieval, visual representation learning, and textual representation learning, shows consistent performance gains and validates xCLIP's effectiveness.

Contrastive language-image pre-training (CLIP) serves as a de-facto standard to align images and texts. Nonetheless, the loose correlation between images and texts of web-crawled data renders the contrastive objective data inefficient and craving for a large training batch size. In this work, we explore the validity of non-contrastive language-image pre-training (nCLIP) and study whether nice properties exhibited in visual self-supervised models can emerge. We empirically observe that the non-contrastive objective nourishes representation learning while sufficiently underperforming under zero-shot recognition. Based on the above study, we further introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics. The synergy between two objectives lets xCLIP enjoy the best of both worlds: superior performance in both zero-shot transfer and representation learning. Systematic evaluation is conducted spanning a wide variety of downstream tasks including zero-shot classification, out-of-domain classification, retrieval, visual representation learning, and textual representation learning, showcasing a consistent performance gain and validating the effectiveness of xCLIP.

Teaching Structured Vision & Language Concepts to Vision & Language Models
Doveh, Sivan and Arbelle, Assaf and Harary, Sivan and Schwartz, Eli and Herzig, Roei and Giryes, Raja and Feris, Rogerio and Panda, Rameswar and Ullman, Shimon and Karlinsky, Leonid



Research question: Address the challenge that vision-language models face in understanding complex language structures such as object attributes, relations, and states.
Motivation: Although vision-language models perform well on a variety of tasks, they still struggle with complex language structure; collecting dedicated datasets to teach each structure would be expensive and time-consuming.
Method: Propose a data-driven approach based on language-structure understanding that makes use of existing vision-language pre-training datasets and requires no additional data. By manipulating the textual part of existing paired vision-language datasets, the trained models improve markedly at understanding complex language structures.
Results: Experiments show that models trained with the updated data improve by up to 15% in understanding complex language structures, with only a mild degradation in zero-shot capabilities.

Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vision & Language Concepts (SVLC) which includes object attributes, relations, and states which are present in the text and visible in the image. Recent studies have shown that even the best VL models struggle with SVLC. A possible way of fixing this issue is by collecting dedicated datasets for teaching each SVLC type, yet this might be expensive and time-consuming. Instead, we propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs that makes more effective use of existing VL pre-training datasets and does not require any additional data. While automatic understanding of image structure still remains largely unsolved, language structure is much better modeled and understood, allowing for its effective utilization in teaching VL models. In this paper, we propose various techniques based on language structure understanding that can be used to manipulate the textual part of off-the-shelf paired VL datasets. VL models trained with the updated data exhibit a significant improvement of up to 15% in their SVLC understanding with only a mild degradation in their zero-shot capabilities both when training from scratch or fine-tuning a pre-trained model. Our code and pretrained models are available at: https://github.com/SivanDoveh/TSVLC

Open-Set Representation Learning Through Combinatorial Embedding
Kim, Geeho and Kang, Junoh and Han, Bohyung



Research question: Visual recognition tasks are often limited to a small subset of classes because labels for the remaining classes are unavailable. The goal is to identify novel concepts in a dataset through representation learning on both labeled and unlabeled examples, extending recognition to both known and novel classes.
Motivation: Because of label limitations, current recognition tasks can effectively handle only some classes and cannot recognize the rest; combining labeled and unlabeled examples through representation learning offers a way to discover novel classes.
Method: Propose a combinatorial learning approach that naturally clusters examples of unseen classes using the compositional knowledge given by multiple supervised meta-classifiers over heterogeneous label spaces. Unsupervised pairwise relation learning makes the representations given by the combinatorial embedding more robust.
Results: Extensive experiments on public datasets show significant performance gains on image retrieval and image categorization, effectively discovering novel classes.

Visual recognition tasks are often limited to dealing with a small subset of classes simply because the labels for the remaining classes are unavailable. We are interested in identifying novel concepts in a dataset through representation learning based on both labeled and unlabeled examples, and extending the horizon of recognition to both known and novel classes. To address this challenging task, we propose a combinatorial learning approach, which naturally clusters the examples in unseen classes using the compositional knowledge given by multiple supervised meta-classifiers on heterogeneous label spaces. The representations given by the combinatorial embedding are made more robust by unsupervised pairwise relation learning. The proposed algorithm discovers novel concepts via a joint optimization for enhancing the discriminativeness of unseen classes as well as learning the representations of known classes generalizable to novel ones. Our extensive experiments demonstrate remarkable performance gains by the proposed approach on public datasets for image retrieval and image categorization with novel class discovery.

Top-Down Visual Attention From Analysis by Synthesis
Shi, Baifeng and Darrell, Trevor and Wang, Xin



Research question: Current visual attention algorithms are mainly stimulus-driven, whereas human agents can guide their attention according to the high-level task at hand.
Motivation: Explore a top-down attention mechanism based on the Analysis-by-Synthesis (AbS) theory, so that models can better simulate human visual attention.
Method: Propose the Analysis-by-Synthesis Vision Transformer (AbSViT), which achieves controllable top-down attention by optimizing a sparse reconstruction objective modulated by a top-down signal.
Results: Experiments show that AbSViT outperforms baseline models on vision-language tasks and performs well on classification, semantic segmentation, and model robustness.

Current attention algorithms (e.g., self-attention) are stimulus-driven and highlight all the salient objects in an image. However, intelligent agents like humans often guide their attention based on the high-level task at hand, focusing only on task-related objects. This ability of task-guided top-down attention provides task-adaptive representation and helps the model generalize to various tasks. In this paper, we consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision. Prior work indicates a functional equivalence between visual attention and sparse reconstruction; we show that an AbS visual system that optimizes a similar sparse reconstruction objective modulated by a goal-directed top-down signal naturally simulates top-down attention. We further propose Analysis-by-Synthesis Vision Transformer (AbSViT), which is a top-down modulated ViT model that variationally approximates AbS, and achieves controllable top-down attention. For real-world applications, AbSViT consistently improves over baselines on Vision-Language tasks such as VQA and zero-shot retrieval where language guides the top-down attention. AbSViT can also serve as a general backbone, improving performance on classification, semantic segmentation, and model robustness. Project page: https://sites.google.com/view/absvit.

topic-6

Topic words :  adversarial,  image,  images,  attacks,  noise,  robustness,  attack,  robust

TrojViT: Trojan Insertion in Vision Transformers
Zheng, Mengxin and Lou, Qian and Jiang, Lei



Research question: This paper studies backdoor attacks against vision transformers (ViTs).
Motivation: Although the vulnerability of traditional CNNs to backdoor attacks is well known, backdoor attacks on ViTs have rarely been studied. ViTs extract global context information through patches and attention, unlike CNNs, which capture pixel-wise local features by convolution.
Method: Propose TrojViT, a stealthy and practical ViT-specific backdoor attack. It generates a patch-wise trigger and, through patch salience ranking and an attention-target loss, builds a Trojan composed of a few vulnerable bits in the ViT parameters stored in DRAM. TrojViT further uses parameter distillation to reduce the Trojan's bit count. Once the attacker inserts the Trojan into the ViT model by flipping the vulnerable bits, the model still produces normal inference accuracy on benign inputs; but when the attacker embeds the trigger into an input, the model is forced to classify it into a predefined target class.
Results: Experiments show that flipping only the few vulnerable bits identified by TrojViT, using the well-known RowHammer technique, transforms a ViT model into a backdoored one. In extensive experiments on multiple datasets and various ViT models, TrojViT classifies 99.64% of test images into the target class by flipping 345 bits on a ViT for ImageNet.

Vision Transformers (ViTs) have demonstrated the state-of-the-art performance in various vision-related tasks. The success of ViTs motivates adversaries to perform backdoor attacks on ViTs. Although the vulnerability of traditional CNNs to backdoor attacks is well-known, backdoor attacks on ViTs are seldom-studied. Compared to CNNs capturing pixel-wise local features by convolutions, ViTs extract global context information through patches and attentions. Naively transplanting CNN-specific backdoor attacks to ViTs yields only a low clean data accuracy and a low attack success rate. In this paper, we propose a stealthy and practical ViT-specific backdoor attack TrojViT. Rather than an area-wise trigger used by CNN-specific backdoor attacks, TrojViT generates a patch-wise trigger designed to build a Trojan composed of some vulnerable bits on the parameters of a ViT stored in DRAM memory through patch salience ranking and attention-target loss. TrojViT further uses parameter distillation to reduce the bit number of the Trojan. Once the attacker inserts the Trojan into the ViT model by flipping the vulnerable bits, the ViT model still produces normal inference accuracy with benign inputs. But when the attacker embeds a trigger into an input, the ViT model is forced to classify the input to a predefined target class. We show that flipping only a few vulnerable bits identified by TrojViT on a ViT model using the well-known RowHammer can transform the model into a backdoored one. We perform extensive experiments of multiple datasets on various ViT models. TrojViT can classify 99.64% of test images to a target class by flipping 345 bits on a ViT for ImageNet.

WeatherStream: Light Transport Automation of Single Image Deweathering
Zhang, Howard and Ba, Yunhao and Yang, Ethan and Mehra, Varan and Gella, Blake and Suzuki, Akira and Pfahnl, Arnold and Chandrappa, Chethan Chinder and Wong, Alex and Kadambi, Achuta



Research question: Existing image deweathering methods are constrained by the type of dataset they are trained on, and their performance on diverse real-world weather effects leaves room for improvement.
Motivation: To address this, the authors propose WeatherStream, an automatic pipeline that captures all real-world weather effects along with their corresponding clean image pairs.
Method: Use light-transport physics and a model trained on an initial seed dataset to reject approximately 99.6% of unwanted scenes, while generalizing to new scenes and degradations.
Results: Training on a dataset collected through this pipeline significantly improves existing weather-removal methods on a carefully collected real-world test set of weather effects.

Today single image deweathering is arguably more sensitive to the dataset type, rather than the model. We introduce WeatherStream, an automatic pipeline capturing all real-world weather effects (rain, snow, and rain fog degradations), along with their clean image pairs. Previous state-of-the-art methods that have attempted the all-weather removal task train on synthetic pairs, and are thus limited by the Sim2Real domain gap. Recent work has attempted to manually collect time multiplexed pairs, but the use of human labor limits the scale of such a dataset. We introduce a pipeline that uses the power of light-transport physics and a model trained on a small, initial seed dataset to reject approximately 99.6% of unwanted scenes. The pipeline is able to generalize to new scenes and degradations that can, in turn, be used to train existing models just like fully human-labeled data. Training on a dataset collected through this procedure leads to significant improvements on multiple existing weather removal methods on a carefully human-collected test set of real-world weather effects. The dataset and code can be found in the following website: http://visual.ee.ucla.edu/wstream.htm/.

Towards Compositional Adversarial Robustness: Generalizing Adversarial Training to Composite Semantic Perturbations
Hsiung, Lei and Tsai, Yun-Yun and Chen, Pin-Yu and Ho, Tsung-Yi



Research question: How to improve model robustness to multiple semantic perturbations and their compositions, especially in realistic scenarios.
Motivation: Existing adversarial training methods handle single perturbation types (e.g., Lp-norm) well, but often fall short against more complex compositions of multiple semantic perturbations.
Method: Propose a novel method for generating composite adversarial examples that finds the optimal attack composition via component-wise projected gradient descent and automatic attack-order scheduling. Also propose generalized adversarial training (GAT) to extend model robustness from the Lp-ball to composite semantic perturbations, such as combinations of hue, saturation, brightness, contrast, and rotation.
Results: Experiments on ImageNet and CIFAR-10 show that GAT is robust not only to every tested single attack type but also to any combination of such attacks, and it clearly outperforms L-infinity-norm-based adversarial training baselines.

Model robustness against adversarial examples of single perturbation type such as the Lp-norm has been widely studied, yet its generalization to more realistic scenarios involving multiple semantic perturbations and their composition remains largely unexplored. In this paper, we first propose a novel method for generating composite adversarial examples. Our method can find the optimal attack composition by utilizing component-wise projected gradient descent and automatic attack-order scheduling. We then propose generalized adversarial training (GAT) to extend model robustness from Lp-ball to composite semantic perturbations, such as the combination of Hue, Saturation, Brightness, Contrast, and Rotation. Results obtained using ImageNet and CIFAR-10 datasets indicate that GAT can be robust not only to all the tested types of a single attack, but also to any combination of such attacks. GAT also outperforms baseline L-infinity-norm bounded adversarial training approaches by a significant margin.
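Why the attack order is itself worth scheduling can be illustrated with two simple semantic components: brightness and contrast do not commute, so composing them in different orders yields different images. The fixed parameters below are illustrative, and the gradient-based parameter search is omitted:

```python
import numpy as np

def brightness(img, beta):
    """Additive brightness shift, clipped to the valid [0, 1] range."""
    return np.clip(img + beta, 0.0, 1.0)

def contrast(img, alpha):
    """Contrast scaling about mid-gray, clipped to [0, 1]."""
    return np.clip(alpha * (img - 0.5) + 0.5, 0.0, 1.0)

def composite(img, ops, params, order):
    """Apply semantic perturbations in the given order (GAT additionally
    optimizes each component's parameter and the order itself)."""
    out = img
    for i in order:
        out = ops[i](out, params[i])
    return out

img = np.full((4, 4), 0.5)
ops = [brightness, contrast]
```

For a mid-gray image, brightness-then-contrast gives 0.8 while contrast-then-brightness gives 0.7, so the composition is genuinely order-dependent.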

Structured Kernel Estimation for Photon-Limited Deconvolution
Sanghvi, Yash and Mao, Zhiyuan and Chan, Stanley H.



Research question: How to restore images taken in low light with camera shake, particularly in the presence of strong photon shot noise.
Motivation: Although existing image restoration networks perform impressively in some settings, they are largely limited to well-illuminated scenes, and their performance drops significantly when photon shot noise is strong.
Method: Propose a new blur-estimation technique customized for photon-limited conditions. It estimates the blur kernel with a gradient-based backpropagation method and models the kernel with a low-dimensional representation based on key points on the motion trajectory, greatly reducing the search space and improving the regularity of the kernel-estimation problem.
Results: When plugged into an iterative framework, this novel low-dimensional representation yields improved kernel estimates and hence significantly better deconvolution performance than end-to-end trained neural networks.

Images taken in a low light condition with the presence of camera shake suffer from motion blur and photon shot noise. While state-of-the-art image restoration networks show promising results, they are largely limited to well-illuminated scenes and their performance drops significantly when photon shot noise is strong. In this paper, we propose a new blur estimation technique customized for photon-limited conditions. The proposed method employs a gradient-based backpropagation method to estimate the blur kernel. By modeling the blur kernel using a low-dimensional representation with the key points on the motion trajectory, we significantly reduce the search space and improve the regularity of the kernel estimation problem. When plugged into an iterative framework, our novel low-dimensional representation provides improved kernel estimates and hence significantly better deconvolution performance when compared to end-to-end trained neural networks.
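The low-dimensional kernel parametrization can be sketched by rasterising a motion trajectory, defined by a few (x, y) key points, into a normalised blur kernel. Linear interpolation stands in for whatever smooth trajectory model the method actually uses:

```python
import numpy as np

def kernel_from_keypoints(keypoints, size=15, samples=200):
    """Rasterise a trajectory given by a few (x, y) key points in kernel
    coordinates into a normalised (energy-preserving) blur kernel."""
    pts = np.asarray(keypoints, dtype=float)
    kernel = np.zeros((size, size))
    t = np.linspace(0, len(pts) - 1, samples)
    idx = np.minimum(t.astype(int), len(pts) - 2)   # segment index per sample
    frac = (t - idx)[:, None]
    path = (1 - frac) * pts[idx] + frac * pts[idx + 1]  # linear interpolation
    for x, y in path:
        kernel[int(round(y)), int(round(x))] += 1.0
    return kernel / kernel.sum()

k = kernel_from_keypoints([(2, 2), (7, 7), (12, 7)])
```

A handful of key points thus determines the entire size x size kernel, which is the dimensionality reduction that regularises the estimation problem.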

Minimizing Maximum Model Discrepancy for Transferable Black-Box Targeted Attacks
Zhao, Anqi and Chu, Tong and Liu, Yahao and Li, Wen and Li, Jingjing and Duan, Lixin



Research question: Study the black-box targeted attack problem from the model-discrepancy perspective.
Motivation: Existing black-box targeted attacks suffer from low success rates; a theoretical analysis that can guarantee attack success is needed.
Method: Propose a new model-discrepancy-based black-box targeted attack that generates adversarial examples by minimizing the maximum model discrepancy among substitute models, thereby improving the attack success rate.
Results: Extensive experiments on the ImageNet dataset and comparisons with existing methods show higher success rates and stronger robustness.

In this work, we study the black-box targeted attack problem from the model discrepancy perspective. On the theoretical side, we present a generalization error bound for black-box targeted attacks, which gives a rigorous theoretical analysis for guaranteeing the success of the attack. We reveal that the attack error on a target model mainly depends on the empirical attack error on the substitute model and the maximum model discrepancy among substitute models. On the algorithmic side, we derive a new algorithm for black-box targeted attacks based on our theoretical analysis, in which we additionally minimize the maximum model discrepancy (M3D) of the substitute models when training the generator to generate adversarial examples. In this way, our model is capable of crafting highly transferable adversarial examples that are robust to model variation, thus improving the success rate for attacking the black-box model. We conduct extensive experiments on the ImageNet dataset with different classification models, and our proposed approach outperforms existing state-of-the-art methods by a significant margin.

Deep Random Projector: Accelerated Deep Image Prior
Li, Taihui and Wang, Hengkang and Zhuang, Zhong and Sun, Ju



Research question: How to speed up deep image prior (DIP) for image restoration and general visual inverse problems so that it suits time-sensitive scenarios.
Motivation: Although DIP has shown great promise in solving various image restoration and general visual inverse problems without training data, its optimization is typically very slow, which inevitably hinders DIP's practical use in time-sensitive scenarios.
Method: Focusing on image restoration, two key modifications to DIP achieve substantial speedup: 1) optimizing the DIP seed while freezing the randomly initialized network weights; and 2) reducing the network depth. In addition, explicit priors are reintroduced, such as the sparse-gradient prior encoded by total-variation regularization, to preserve DIP's peak performance.
Results: The proposed method is evaluated on three image restoration tasks, including image denoising, image super-resolution, and image inpainting, against the original DIP and its variants, as well as the competing metaDIP, which uses meta-learning to learn good initializers from extra data. The method is a clear winner, obtaining competitive restoration quality in a minimal amount of time.

Deep image prior (DIP) has shown great promise in tackling a variety of image restoration (IR) and general visual inverse problems, needing no training data. However, the resulting optimization process is often very slow, inevitably hindering DIP's practical usage for time-sensitive scenarios. In this paper, we focus on IR, and propose two crucial modifications to DIP that help achieve substantial speedup: 1) optimizing the DIP seed while freezing randomly-initialized network weights, and 2) reducing the network depth. In addition, we reintroduce explicit priors, such as sparse gradient prior---encoded by total-variation regularization, to preserve the DIP peak performance. We evaluate the proposed method on three IR tasks, including image denoising, image super-resolution, and image inpainting, against the original DIP and variants, as well as the competing metaDIP that uses meta-learning to learn good initializers with extra data. Our method is a clear winner in obtaining competitive restoration quality in a minimal amount of time. Our code is available at https://github.com/sun-umn/Deep-Random-Projector.
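
The sparse-gradient prior mentioned above is commonly encoded as anisotropic total variation; a minimal sketch of just the regularizer (in the full DIP objective this term would be added to the reconstruction loss, which is omitted here):

```python
import numpy as np

def total_variation(x):
    """Anisotropic total variation of a 2-D image: sum of absolute
    horizontal and vertical finite differences (the sparse-gradient prior)."""
    dh = np.abs(np.diff(x, axis=1)).sum()
    dv = np.abs(np.diff(x, axis=0)).sum()
    return dh + dv

flat = np.full((8, 8), 0.5)  # constant image: zero gradient everywhere
noisy = flat + 0.1 * np.random.default_rng(0).standard_normal((8, 8))
```

A constant image has zero total variation while a noisy one does not, which is why penalizing this quantity favors piecewise-smooth reconstructions.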

Revisiting Residual Networks for Adversarial Robustness
Huang, Shihua and Lu, Zhichao and Deb, Kalyanmoy and Boddeti, Vishnu Naresh



Research question: This study aims to fill a gap in existing research, which focuses mostly on methods for improving the adversarial robustness of convolutional neural networks while understudying how architectural design elements (e.g., topology, depth, and width) affect robustness.
Motivation: Most research concentrates on developing more effective adversarial training methods to improve CNN adversarial robustness, with relatively little attention to how architectural design elements influence it.
Method: Taking residual networks as the subject, architecture design is considered at both the block level and the network scaling level. Systematic experiments first yield initial insights; then a robust residual block named RobustResBlock and a compound scaling rule named RobustScaling are designed to distribute depth and width at a desired FLOP count. Finally, RobustResBlock and RobustScaling are combined into a family of adversarially robust residual networks, RobustResNets, spanning a broad range of model capacities.
Results: Experiments across multiple datasets and adversarial attacks show that RobustResNets consistently outperform standard WRNs and other existing robust architectures, achieving state-of-the-art AutoAttack robust accuracy of 63.7% while being 2x more compact in parameters than other methods.

Efforts to improve the adversarial robustness of convolutional neural networks have primarily focused on developing more effective adversarial training methods. In contrast, little attention has been devoted to analyzing the role of architectural elements (e.g., topology, depth, and width) in adversarial robustness. This paper seeks to bridge this gap and presents a holistic study of the impact of architectural design on adversarial robustness. We focus on residual networks and consider architecture design at the block level as well as at the network scaling level. In both cases, we first derive insights through systematic experiments. Then we design a robust residual block, dubbed RobustResBlock, and a compound scaling rule, dubbed RobustScaling, to distribute depth and width at the desired FLOP count. Finally, we combine RobustResBlock and RobustScaling and present a portfolio of adversarially robust residual networks, RobustResNets, spanning a broad spectrum of model capacities. Experimental validation across multiple datasets and adversarial attacks demonstrates that RobustResNets consistently outperform both the standard WRNs and other existing robust architectures, achieving state-of-the-art AutoAttack robust accuracy of 63.7% with 500K external data while being 2x more compact in terms of parameters. The code is available at https://github.com/zhichao-lu/robust-residual-network.

How to Backdoor Diffusion Models?
Chou, Sheng-Yen and Chen, Pin-Yu and Ho, Tsung-Yi



Research question: This paper studies the robustness of diffusion models against backdoor attacks.
Motivation: Diffusion models are state-of-the-art deep generative models, but their limitations and potential risks under backdoor attacks are poorly understood.
Method: Propose the BadDiffusion attack framework, which implants backdoors during model training.
Results: Experiments show that BadDiffusion effectively produces compromised diffusion models with high utility and target specificity, and that backdoors can be implanted simply by fine-tuning a clean pre-trained diffusion model. Possible risk-mitigation countermeasures are also explored.

Diffusion models are state-of-the-art deep learning empowered generative models that are trained based on the principle of learning forward and reverse diffusion processes via progressive noise-addition and denoising. To gain a better understanding of the limitations and potential risks, this paper presents the first study on the robustness of diffusion models against backdoor attacks. Specifically, we propose BadDiffusion, a novel attack framework that engineers compromised diffusion processes during model training for backdoor implantation. At the inference stage, the backdoored diffusion model will behave just like an untampered generator for regular data inputs, while falsely generating some targeted outcome designed by the bad actor upon receiving the implanted trigger signal. Such a critical risk can be dreadful for downstream tasks and applications built upon the problematic model. Our extensive experiments on various backdoor attack settings show that BadDiffusion can consistently lead to compromised diffusion models with high utility and target specificity. Even worse, BadDiffusion can be made cost-effective by simply finetuning a clean pre-trained diffusion model to implant backdoors. We also explore some possible countermeasures for risk mitigation. Our results call attention to potential risks and possible misuse of diffusion models.

Revisiting the Stack-Based Inverse Tone Mapping
Zhang, Ning and Ye, Yuyao and Zhao, Yang and Wang, Ronggang



Research question: Existing stack-based inverse tone mapping (ITM) methods recover high dynamic range (HDR) radiance by predicting a set of multi-exposure images from a single low dynamic range image, but they have several limitations.
Motivation: On the one hand, these methods estimate a fixed number of images (e.g., three exposure-up and three exposure-down), which may introduce unnecessary computational cost or reconstruct incorrect results. On the other hand, they neglect the connections between the up-exposure and down-exposure models and thus fail to fully exploit effective features.
Method: We revisit stack-based ITM approaches and propose a new method that reconstructs HDR radiance from a single image by estimating only two exposure images. First, we design an exposure-adaptive block that adjusts the exposure according to the luminance distribution of the input image. Second, we design a cross-model attention block to connect the exposure adjustment models. Third, we propose an end-to-end ITM pipeline that integrates a multi-exposure fusion model. In addition, we propose and release a multi-exposure dataset that indicates the optimal exposure-up/down levels.
Results: Experiments show that the proposed method outperforms several state-of-the-art methods.

Current stack-based inverse tone mapping (ITM) methods can recover high dynamic range (HDR) radiance by predicting a set of multi-exposure images from a single low dynamic range image. However, there are still some limitations. On the one hand, these methods estimate a fixed number of images (e.g., three exposure-up and three exposure-down), which may introduce unnecessary computational cost or reconstruct incorrect results. On the other hand, they neglect the connections between the up-exposure and down-exposure models and thus fail to fully excavate effective features. In this paper, we revisit the stack-based ITM approaches and propose a novel method to reconstruct HDR radiance from a single image, which only needs to estimate two exposure images. At first, we design the exposure adaptive block that can adaptively adjust the exposure based on the luminance distribution of the input image. Secondly, we devise the cross-model attention block to connect the exposure adjustment models. Thirdly, we propose an end-to-end ITM pipeline by incorporating the multi-exposure fusion model. Furthermore, we propose and open a multi-exposure dataset that indicates the optimal exposure-up/down levels. Experimental results show that the proposed method outperforms some state-of-the-art methods.

Backdoor Defense via Deconfounded Representation Learning
Zhang, Zaixi and Liu, Qi and Wang, Zhicai and Lu, Zepu and Hu, Qingyong



Research question: Deep neural networks are vulnerable to backdoor attacks; although many methods detect and remove backdoors, it remains unclear whether a backdoor-free clean model can be obtained directly from a poisoned dataset.
Motivation: By constructing a causal graph to model the generation process of poisoned data, the backdoor attack is found to act as a confounder that introduces spurious associations between input images and target labels, making model predictions unreliable.
Method: Inspired by this causal understanding, propose the Causality-inspired Backdoor Defense (CBD), which learns deconfounded representations via front-door adjustment. Specifically, one backdoored model is deliberately trained to capture the confounding effects, while a clean model captures the desired causal effects by minimizing mutual information with the backdoored model's confounding representations and employing a sample-wise re-weighting scheme.
Results: Extensive experiments on multiple benchmark datasets show that the proposed defense effectively reduces backdoor threats while maintaining high prediction accuracy on benign samples. Further analysis shows that CBD also resists potential adaptive attacks.

Deep neural networks (DNNs) are recently shown to be vulnerable to backdoor attacks, where attackers embed hidden backdoors in the DNN model by injecting a few poisoned examples into the training dataset. While extensive efforts have been made to detect and remove backdoors from backdoored DNNs, it is still not clear whether a backdoor-free clean model can be directly obtained from poisoned datasets. In this paper, we first construct a causal graph to model the generation process of poisoned data and find that the backdoor attack acts as the confounder, which brings spurious associations between the input images and target labels, making the model predictions less reliable. Inspired by the causal understanding, we propose the Causality-inspired Backdoor Defense (CBD), to learn deconfounded representations by employing the front-door adjustment. Specifically, a backdoored model is intentionally trained to capture the confounding effects. The other clean model dedicates to capturing the desired causal effects by minimizing the mutual information with the confounding representations from the backdoored model and employing a sample-wise re-weighting scheme. Extensive experiments on multiple benchmark datasets against 6 state-of-the-art attacks verify that our proposed defense method is effective in reducing backdoor threats while maintaining high accuracy in predicting benign samples. Further analysis shows that CBD can also resist potential adaptive attacks.

Color Backdoor: A Robust Poisoning Attack in Color Space
Jiang, Wenbo and Li, Hongwei and Xu, Guowen and Zhang, Tianwei



Research question: This paper addresses backdoor attacks against neural networks, in particular how to make the attack harder to notice and defend against.
Motivation: Existing backdoor attack methods often sacrifice robustness and are easily defeated by common preprocessing-based defenses.
Method: Propose a new color backdoor attack that applies a uniform color-space shift to all pixels as the trigger, achieving both robustness and stealthiness. Naturalness restrictions are defined via the PSNR, SSIM, and LPIPS metrics, and the Particle Swarm Optimization (PSO) algorithm is used to search for the optimal trigger.
Results: Experiments show that the method remains superior and robust under various mainstream backdoor defenses.

Backdoor attacks against neural networks have been intensively investigated, where the adversary compromises the integrity of the victim model, causing it to make wrong predictions for inference samples containing a specific trigger. To make the trigger more imperceptible and human-unnoticeable, a variety of stealthy backdoor attacks have been proposed: some works employ imperceptible perturbations as the backdoor triggers, which restrict the pixel differences between the triggered image and the clean image, while other works use special image styles (e.g., reflection, Instagram filters) as the backdoor triggers. However, these attacks sacrifice robustness and can be easily defeated by common preprocessing-based defenses. This paper presents a novel color backdoor attack, which can exhibit robustness and stealthiness at the same time. The key insight of our attack is to apply a uniform color space shift to all pixels as the trigger. This global feature is robust to image transformation operations, and the triggered samples remain natural-looking. To find the optimal trigger, we first define naturalness restrictions through the metrics of PSNR, SSIM and LPIPS. Then we employ the Particle Swarm Optimization (PSO) algorithm to search for the optimal trigger that can achieve high attack effectiveness and robustness while satisfying the restrictions. Extensive experiments demonstrate the superiority of PSO and the robustness of the color backdoor against different mainstream backdoor defenses.
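
A minimal sketch of the trigger mechanism and one of the naturalness metrics (the PSO search itself is omitted, and the shift values below are hypothetical):

```python
import numpy as np

def apply_color_shift(img, shift):
    """Uniform color-space shift applied to every pixel (the backdoor trigger).
    `img` is H x W x 3 in [0, 1]; `shift` is one offset per channel."""
    return np.clip(img + np.asarray(shift), 0.0, 1.0)

def psnr(a, b):
    """Peak signal-to-noise ratio in dB for images in [0, 1]."""
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(1.0 / mse))

img = np.random.default_rng(0).random((16, 16, 3))
shift = (0.03, -0.02, 0.01)  # hypothetical small per-channel trigger
triggered = apply_color_shift(img, shift)
quality = psnr(img, triggered)  # PSO would keep this above a naturalness bound
```

Because the shift is global rather than a localized pattern, cropping, resizing, or mild filtering leaves the trigger feature largely intact, which is the robustness argument of the abstract.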

Learning Distortion Invariant Representation for Image Restoration From a Causality Perspective
Li, Xin and Li, Bingchen and Jin, Xin and Lan, Cuiling and Chen, Zhibo



Research question: Existing deep neural networks for image restoration generalize poorly to real-world degradations of different types or degrees.
Motivation: A new training strategy is proposed to improve the generalization ability of deep neural networks to unknown degradations.
Method: Propose Distortion Invariant representation Learning (DIL), which treats each distortion type and degree as a specific confounder and learns distortion-invariant representations by eliminating the harmful confounding effect of each degradation. The back-door criterion from causality is instantiated by simulating interventions of different distortions from the optimization perspective.
Results: Extensive experiments show that DIL generalizes well to unseen distortion types and degrees.

In recent years, we have witnessed the great advancement of Deep neural networks (DNNs) in image restoration. However, a critical limitation is that they cannot generalize well to real-world degradations with different degrees or types. In this paper, we are the first to propose a novel training strategy for image restoration from the causality perspective, to improve the generalization ability of DNNs for unknown degradations. Our method, termed Distortion Invariant representation Learning (DIL), treats each distortion type and degree as one specific confounder, and learns the distortion-invariant representation by eliminating the harmful confounding effect of each degradation. We derive our DIL with the back-door criterion in causality by modeling the interventions of different distortions from the optimization perspective. Particularly, we introduce counterfactual distortion augmentation to simulate the virtual distortion types and degrees as the confounders. Then, we instantiate the intervention of each distortion with a virtual model updating based on corresponding distorted images, and eliminate them from the meta-learning perspective. Extensive experiments demonstrate the generalization capability of our DIL on unseen distortion types and degrees. Our code will be available at https://github.com/lixinustc/Causal-IR-DIL.

Learning a Simple Low-Light Image Enhancer From Paired Low-Light Instances
Fu, Zhenqi and Yang, Yan and Tu, Xiaotong and Huang, Yue and Ding, Xinghao and Ma, Kai-Kuang



Research question: This paper addresses the poor contrast and lost detail of images captured in low-light conditions.
Motivation: Existing low-light image enhancement algorithms typically adjust illumination using a single input image and several handcrafted priors, which often fail to reveal image details because a single image carries limited information and handcrafted priors adapt poorly.
Method: We propose PairLIE, an unsupervised approach that learns adaptive priors from pairs of low-light images. First, the network is expected to generate the same clean image from both inputs, since they share the same image content; to achieve this, we impose the Retinex theory on the network and make the two reflectance components consistent. Second, to assist the Retinex decomposition, we propose a simple self-supervised mechanism that removes inappropriate features from the raw image.
Results: Extensive experiments show that PairLIE achieves performance comparable to state-of-the-art approaches on public datasets with a simpler network and fewer handcrafted priors. Code is available at: https://github.com/zhenqifu/PairLIE.

Low-light Image Enhancement (LIE) aims at improving contrast and restoring details for images captured in low-light conditions. Most of the previous LIE algorithms adjust illumination using a single input image with several handcrafted priors. Those solutions, however, often fail in revealing image details due to the limited information in a single image and the poor adaptability of handcrafted priors. To this end, we propose PairLIE, an unsupervised approach that learns adaptive priors from low-light image pairs. First, the network is expected to generate the same clean images as the two inputs share the same image content. To achieve this, we impose the network with the Retinex theory and make the two reflectance components consistent. Second, to assist the Retinex decomposition, we propose to remove inappropriate features in the raw image with a simple self-supervised mechanism. Extensive experiments on public datasets show that the proposed PairLIE achieves comparable performance against the state-of-the-art approaches with a simpler network and fewer handcrafted priors. Code is available at: https://github.com/zhenqifu/PairLIE.

Backdoor Cleansing With Unlabeled Data
Pang, Lu and Sun, Tao and Ling, Haibin and Chen, Chao



Research question: How to defend externally trained deep neural networks against possible backdoor attacks, i.e., how to remove their abnormal backdoor behavior without compromising the model's normal prediction ability on clean inputs.
Motivation: With the growing computational demands of deep neural networks, companies and organizations have begun to outsource the training process. However, externally trained DNNs may be backdoor attacked, so defending against such attacks is crucial.
Method: This paper proposes a new defense method that requires no training labels. Through carefully designed layer-wise weight re-initialization and knowledge distillation, the method effectively cleanses the backdoor behavior of a suspicious network with negligible impact on its normal behavior.
Results: Experiments show that the method, trained without labels, is on par with state-of-the-art defenses trained with labels. Promising defense results are observed even on out-of-distribution data, making the method very practical.

Due to the increasing computational demand of Deep Neural Networks (DNNs), companies and organizations have begun to outsource the training process. However, the externally trained DNNs can potentially be backdoor attacked. It is crucial to defend against such attacks, i.e., to postprocess a suspicious model so that its backdoor behavior is mitigated while its normal prediction power on clean inputs remains uncompromised. To remove the abnormal backdoor behavior, existing methods mostly rely on additional labeled clean samples. However, such a requirement may be unrealistic as the training data are often unavailable to end users. In this paper, we investigate the possibility of circumventing such a barrier. We propose a novel defense method that does not require training labels. Through a carefully designed layer-wise weight re-initialization and knowledge distillation, our method can effectively cleanse the backdoor behaviors of a suspicious network with negligible compromise to its normal behavior. In experiments, we show that our method, trained without labels, is on par with state-of-the-art defense methods trained using labels. We also observe promising defense results even on out-of-distribution data. This makes our method very practical. Code is available at: https://github.com/luluppang/BCU.

On the Difficulty of Unpaired Infrared-to-Visible Video Translation: Fine-Grained Content-Rich Patches Transfer
Yu, Zhenjie and Li, Shuang and Shen, Yirui and Liu, Chi Harold and Wang, Shuigen



Research question: How to translate infrared video into clear visible video with fine-grained semantic patterns, bridging the visual gap.
Motivation: Most existing vision models are trained on massive clear visible data and face a huge visual gap when deployed to infrared imaging scenarios.
Method: Propose a new CPTrans framework that tackles the challenge by balancing the gradients of different patches, achieving content-rich patch transfer. Specifically, a content-aware optimization module encourages model optimization along the gradients of target patches, ensuring improved visual details, and a content-aware temporal normalization module enforces the generator's robustness to the motion of target patches.
Results: Experiments show that CPTrans achieves state-of-the-art performance across diverse scenes while requiring less training time than competing methods.

Explicit visible videos can provide sufficient visual information and facilitate vision applications. Unfortunately, the image sensors of visible cameras are sensitive to light conditions like darkness or overexposure. To make up for this, recently, infrared sensors capable of stable imaging have received increasing attention in autonomous driving and monitoring. However, most prosperous vision models are still trained on massive clear visible data, facing huge visual gaps when deploying to infrared imaging scenarios. In such cases, transferring the infrared video to a distinct visible one with fine-grained semantic patterns is a worthwhile endeavor. Previous works improve the outputs by equally optimizing each patch on the translated visible results, which is unfair for enhancing the details on content-rich patches due to the long-tail effect of pixel distribution. Here we propose a novel CPTrans framework to tackle the challenge via balancing gradients of different patches, achieving the fine-grained Content-rich Patches Transferring. Specifically, the content-aware optimization module encourages model optimization along gradients of target patches, ensuring the improvement of visual details. Additionally, the content-aware temporal normalization module enforces the generator to be robust to the motions of target patches. Moreover, we extend the existing dataset InfraredCity to more challenging adverse weather conditions (rain and snow), dubbed as InfraredCity-Adverse. Extensive experiments show that the proposed CPTrans achieves state-of-the-art performance under diverse scenes while requiring less training time than competitive methods.

Nighttime Smartphone Reflective Flare Removal Using Optical Center Symmetry Prior
Dai, Yuekun and Luo, Yihang and Zhou, Shangchen and Li, Chongyi and Loy, Chen Change



Research question: How to effectively remove reflective flare from photos.
Motivation: Existing reflective flare removal methods often rely on manually designed features, cannot accurately identify reflective flares produced by various types of light, and may mistakenly remove light sources in scenes with multiple light sources.
Method: Propose an optical center symmetry prior: the reflective flare and the light source are always symmetric about the lens's optical center, which locates the flare's proposal region more accurately. Based on this prior, the first reflective flare removal dataset, BracketFlare, is created: continuous bracketing captures the reflective flare pattern in an underexposed image, which is combined with a normally exposed image to synthesize pairs of flare-corrupted and flare-free images.
Results: Experiments show that the method effectively removes reflective flares on both synthetic and real-world datasets.

Reflective flare is a phenomenon that occurs when light reflects inside lenses, causing bright spots or a "ghosting effect" in photos, which can impact their quality. Eliminating reflective flare is highly desirable but challenging. Many existing methods rely on manually designed features to detect these bright spots, but they often fail to identify reflective flares created by various types of light and may even mistakenly remove the light sources in scenarios with multiple light sources. To address these challenges, we propose an optical center symmetry prior, which suggests that the reflective flare and light source are always symmetrical around the lens's optical center. This prior helps to locate the reflective flare's proposal region more accurately and can be applied to most smartphone cameras. Building on this prior, we create the first reflective flare removal dataset called BracketFlare, which contains diverse and realistic reflective flare patterns. We use continuous bracketing to capture the reflective flare pattern in the underexposed image and combine it with a normally exposed image to synthesize a pair of flare-corrupted and flare-free images. With the dataset, neural networks can be trained to remove the reflective flares effectively. Extensive experiments demonstrate the effectiveness of our method on both synthetic and real-world datasets.
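
The optical center symmetry prior reduces to a one-line coordinate reflection; a minimal sketch:

```python
def flare_proposal(light_source, optical_center):
    """Optical center symmetry prior: the reflective flare lies at the point
    symmetric to the light source about the lens's optical center."""
    (x, y), (cx, cy) = light_source, optical_center
    return (2 * cx - x, 2 * cy - y)

# For an image whose optical center is at (50, 50), a light source
# detected at (10, 20) predicts a flare proposal region around (90, 80).
proposal = flare_proposal((10, 20), (50, 50))
```

In practice the detected light-source position and the optical center are only approximate, so the returned point would seed a proposal region rather than a single pixel.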

Overlooked Factors in Concept-Based Explanations: Dataset Choice, Concept Learnability, and Human Capability
Ramaswamy, Vikram V. and Kim, Sunnie S. Y. and Fong, Ruth and Russakovsky, Olga



Research question: This paper addresses three overlooked factors in concept-based explanation methods for deep neural network models.
Motivation: Although concept-based explanation methods are widely used to interpret deep learning models, they have limitations that are not well understood or articulated.
Method: Evaluate trained models on a new "probe" dataset and correlate the models' outputs with the concepts labeled in that dataset, identifying and analyzing the three commonly overlooked factors.
Results: The study finds that different probe datasets lead to very different explanations, so the generated explanations do not generalize beyond the probe dataset; concepts in the probe dataset are often harder to learn than the target classes they are used to explain, calling the correctness of the explanations into question; and human studies show that concept-based explanations are practically useful only with 32 or fewer concepts, beyond which their utility drops sharply.

Concept-based interpretability methods aim to explain a deep neural network model's components and predictions using a pre-defined set of semantic concepts. These methods evaluate a trained model on a new, "probe" dataset and correlate the model's outputs with concepts labeled in that dataset. Despite their popularity, they suffer from limitations that are not well-understood and articulated in the literature. In this work, we identify and analyze three commonly overlooked factors in concept-based explanations. First, we find that the choice of the probe dataset has a profound impact on the generated explanations. Our analysis reveals that different probe datasets lead to very different explanations, suggesting that the generated explanations are not generalizable outside the probe dataset. Second, we find that concepts in the probe dataset are often harder to learn than the target classes they are used to explain, calling into question the correctness of the explanations. We argue that only easily learnable concepts should be used in concept-based explanations. Finally, while existing methods use hundreds or even thousands of concepts, our human studies reveal a much stricter upper bound of 32 concepts or less, beyond which the explanations are much less practically useful. We discuss the implications of our findings and provide suggestions for future development of concept-based interpretability methods. Code for our analysis and user interface can be found at https://github.com/princetonvisualai/OverlookedFactors.

Jedi: Entropy-Based Localization and Removal of Adversarial Patches
Tarchoun, Bilel and Ben Khalifa, Anouar and Mahjoub, Mohamed Ali and Abu-Ghazaleh, Nael and Alouani, Ihsen



Research question: How to effectively defend against real-world adversarial physical patch attacks and improve a model's detection and recovery capability.
Motivation: The strongest existing defenses, based on input gradients or feature analysis, have been compromised by recent GAN-based adaptive attacks that generate realistic/naturalistic patches.
Method: This paper proposes Jedi, a new defense against adversarial patches that is resilient to realistic patch attacks and improves detection and recovery over the state of the art. Jedi leverages two new ideas: (1) improving the identification of potential patch regions through entropy analysis; and (2) improving the localization of adversarial patches with an autoencoder that can complete patch regions and filter out high-entropy normal regions that are not part of a patch.
Results: Jedi achieves high-precision adversarial patch localization, which is critical to successfully repairing images. Because Jedi relies on input entropy analysis, it is model-agnostic and can be applied to pre-trained off-the-shelf models without changing the training or inference of the protected model. Jedi detects on average 90% of adversarial patches across various benchmarks and recovers up to 94% of successful patch attacks (compared to 75% and 65% for LGS and Jujutsu, respectively). Jedi can also continue detecting even in the presence of adaptive realistic patches that other defenses fail to identify.

Real-world adversarial physical patches were recently shown to be successful in compromising state-of-the-art models in a variety of computer vision applications. The most promising defenses, based on either input gradient or feature analyses, have been shown to be compromised by recent GAN-based adaptive attacks that generate realistic/naturalistic patches. In this paper, we propose Jedi, a new defense against adversarial patches that is resilient to realistic patch attacks, and also improves detection and recovery compared to the state of the art. Jedi leverages two new ideas: (1) it improves the identification of potential patch regions using entropy analysis: we show that the entropy of adversarial patches is high, even in naturalistic patches; and (2) it improves the localization of adversarial patches, using an autoencoder that is able to complete patch regions and filter out normal regions with high entropy that are not part of a patch. Jedi achieves high-precision adversarial patch localization, which we show is critical to successfully repairing the images. Since Jedi relies on an input entropy analysis, it is model-agnostic, and can be applied to pre-trained off-the-shelf models without changes to the training or inference of the protected models. Jedi detects on average 90% of adversarial patches across different benchmarks and recovers up to 94% of successful patch attacks (compared to 75% and 65% for LGS and Jujutsu, respectively). Jedi is also able to continue detection even in the presence of adaptive realistic patches that are able to fool other defenses.
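
The entropy analysis in idea (1) can be sketched as a windowed Shannon-entropy map over the input (a simplification; Jedi's actual pipeline adds the autoencoder stage on top of this):

```python
import numpy as np

def local_entropy(img, window=8):
    """Shannon entropy of pixel intensities in non-overlapping windows.
    High-entropy windows are candidate adversarial-patch regions."""
    h, w = img.shape
    out = np.zeros((h // window, w // window))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            block = img[i*window:(i+1)*window, j*window:(j+1)*window]
            counts = np.bincount(block.ravel(), minlength=256)
            p = counts[counts > 0] / block.size
            out[i, j] = -(p * np.log2(p)).sum()
    return out

rng = np.random.default_rng(0)
img = np.full((16, 16), 128, dtype=np.uint8)  # low-entropy background
img[:8, :8] = rng.integers(0, 256, (8, 8))    # noisy "patch"-like region
emap = local_entropy(img, window=8)           # top-left window stands out
```

A constant background window has zero entropy, while the high-frequency region scores much higher, matching the paper's observation that adversarial patches, even naturalistic ones, are high-entropy.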

Spatial-Temporal Concept Based Explanation of 3D ConvNets
Ji, Ying and Wang, Yu and Kato, Jien



Research question: This paper addresses the interpretability and transparency of convolutional neural networks (CNNs) in video recognition tasks.
Motivation: Although CNNs perform well on various tasks, their decision procedures lack transparency and interpretability, which limits further performance improvement. There has therefore been great recent interest in providing explanations and interpretability for CNNs.
Method: This paper proposes a Spatial-Temporal Concept-based Explanation (STCE) framework for interpreting 3D ConvNets. In this framework, (1) videos are represented with high-level supervoxels, and similar supervoxels are clustered into a concept that is easy for humans to understand; and (2) the explanation framework computes a score for each concept, reflecting the concept's importance in the ConvNet's decision procedure.
Results: Experiments show that the method identifies concepts with different importance levels, enabling in-depth investigation of their impact on target tasks such as action recognition.

Convolutional neural networks (CNNs) have shown remarkable performance on various tasks. Despite their widespread adoption, the decision procedure of the network still lacks transparency and interpretability, making it difficult to enhance the performance further. Hence, there has been considerable interest in providing explanation and interpretability for CNNs over the last few years. Explainable artificial intelligence (XAI) investigates the relationship between input images or videos and output predictions. Recent studies have achieved outstanding success in explaining 2D image classification ConvNets. On the other hand, due to the high computation cost and complexity of video data, the explanation of 3D video recognition ConvNets is relatively less studied, and none of the existing methods are able to produce a high-level explanation. In this paper, we propose a STCE (Spatial-temporal Concept-based Explanation) framework for interpreting 3D ConvNets. In our approach: (1) videos are represented with high-level supervoxels, and similar supervoxels are clustered as a concept, which is straightforward for humans to understand; and (2) the interpreting framework calculates a score for each concept, which reflects its significance in the ConvNet decision procedure. Experiments on diverse 3D ConvNets demonstrate that our method can identify global concepts with different importance levels, allowing us to investigate the impact of the concepts on a target task, such as action recognition, in depth. The source codes are publicly available at https://github.com/yingji425/STCE.

Metadata-Based RAW Reconstruction via Implicit Neural Functions
Li, Leyi and Qiao, Huijie and Ye, Qi and Yang, Qinmin



Research question: How to make full use of metadata to reconstruct RAW images from sRGB images.
Motivation: Although existing work can embed RAW image samples into sRGB images, limitations prevent full use of the metadata.
Method: This paper proposes a new formulation that maps the 2D coordinates of the metadata to their corresponding RAW values and reconstructs the RAW image with an implicit neural function.
Results: Experiments show a significant performance improvement (more than 10 dB average PSNR) with uniform sampling alone, without pre-training on different camera ISPs. The method is also suitable for the guided super-resolution task.

Many low-level computer vision tasks benefit from using the unprocessed RAW image as input, since it retains the linear relationship between pixel values and scene radiance. Recent works advocate embedding RAW image samples into sRGB images at capture time, and reconstructing the RAW from the sRGB using this metadata when needed. However, there still exist some limitations on making full use of the metadata. In this paper, instead of following the perspective of sRGB-to-RAW mapping, we reformulate the problem as mapping the 2D coordinates of the metadata to their RAW values conditioned on the corresponding sRGB values. With this novel formulation, we propose to reconstruct the RAW image with an implicit neural function, which achieves significant performance improvement (more than 10dB average PSNR) with only uniform sampling. Compared with most deep learning-based approaches, our method is trained in a self-supervised way, requiring no pre-training on different camera ISPs. We perform further experiments to demonstrate the effectiveness of our method, and show that our framework is also suitable for the task of guided super-resolution.
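
The coordinate-to-RAW formulation can be illustrated with a toy stand-in: treat the sampled metadata as (coordinate, sRGB) → RAW pairs and fit a model to them. A least-squares fit on synthetic linear data replaces the implicit neural function here, purely as a sketch of the data layout:

```python
import numpy as np

# Toy stand-in for the paper's formulation: each metadata sample provides a
# 2-D coordinate, the sRGB value there, and the RAW value to be predicted.
rng = np.random.default_rng(0)
n = 200
coords = rng.random((n, 2))                  # normalized 2-D coordinates
srgb = rng.random((n, 3))                    # sRGB values at those coordinates
features = np.hstack([coords, srgb, np.ones((n, 1))])  # inputs + bias term

true_w = np.array([0.1, -0.2, 0.5, 0.3, 0.2, 0.05])  # synthetic ground truth
raw = features @ true_w                       # synthetic RAW samples

# Fit the coordinate-conditioned model; an implicit neural function would be
# trained here instead, with the same inputs and targets.
w, *_ = np.linalg.lstsq(features, raw, rcond=None)
recon = features @ w
err = float(np.abs(recon - raw).max())
```

On exactly linear synthetic data the fit recovers the RAW samples to numerical precision; a real camera ISP is nonlinear, which is why the paper uses an implicit neural function rather than this linear model.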

Towards Effective Adversarial Textured 3D Meshes on Physical Face Recognition
Yang, Xiao and Liu, Chang and Xu, Longlong and Wang, Yikai and Dong, Yinpeng and Chen, Ning and Su, Hang and Zhu, Jun



Research question: This paper aims to develop a more reliable technique for end-to-end evaluation of the adversarial robustness of commercial systems.
Motivation: Existing physical attacks are either easily detected or ineffective against commercial recognition systems.
Method: Design adversarial textured 3D meshes (AT3D) with elaborate topology on a human face, which can be printed and pasted on the attacker's face to evade defenses. In addition, propose perturbing the low-dimensional coefficient space of a 3D Morphable Model, which significantly improves black-box transferability while enjoying faster search efficiency and better visual quality.
Results: Extensive digital and physical experiments show that the method effectively exposes security vulnerabilities of multiple popular commercial services, including three recognition APIs, four anti-spoofing APIs, two prevailing mobile phones, and two automated access control systems.

Face recognition is a prevailing authentication solution in numerous biometric applications. Physical adversarial attacks, as an important surrogate, can identify the weaknesses of face recognition systems and evaluate their robustness before they are deployed. However, most existing physical attacks are either readily detectable or ineffective against commercial recognition systems. The goal of this work is to develop a more reliable technique that can carry out an end-to-end evaluation of adversarial robustness for commercial systems. It requires that this technique can simultaneously deceive black-box recognition models and evade defensive mechanisms. To fulfill this, we design adversarial textured 3D meshes (AT3D) with an elaborate topology on a human face, which can be 3D-printed and pasted on the attacker's face to evade the defenses. However, the mesh-based optimization regime calculates gradients in a high-dimensional mesh space, and can be trapped into local optima with unsatisfactory transferability. To deviate from the mesh-based space, we propose to perturb the low-dimensional coefficient space based on a 3D Morphable Model, which significantly improves black-box transferability while enjoying faster search efficiency and better visual quality. Extensive experiments in digital and physical scenarios show that our method effectively explores the security vulnerabilities of multiple popular commercial services, including three recognition APIs, four anti-spoofing APIs, two prevailing mobile phones and two automated access control systems.

Effective Ambiguity Attack Against Passport-Based DNN Intellectual Property Protection Schemes Through Fully Connected Layer Substitution
Chen, Yiming and Tian, Jinyu and Chen, Xiangyu and Zhou, Jiantao



Research question: Evaluate the security of passport-based intellectual property (IP) protection for deep models.
Motivation: Training a deep neural network is costly, so well-trained models are regarded as valuable IP assets; IP protection for deep models has been drawing growing attention in recent years.
Method: Propose a novel and effective ambiguity attack that successfully forges multiple valid passports with less than 10% of the training data by inserting a specially designed accessory block ahead of the passport parameters.
Results: With a forged passport, the model's performance is almost indistinguishable from that with the authorized passport (less than 2% difference). It is further shown that the attack strategy readily generalizes to other IP protection methods based on watermark embedding. Directions for potential remedies are also given.

Since training a deep neural network (DNN) is costly, well-trained deep models can be regarded as valuable intellectual property (IP) assets. The IP protection associated with deep models has been receiving increasing attention in recent years. The passport-based method, which replaces normalization layers with passport layers, has been one of the few protection solutions that are claimed to be secure against advanced attacks. In this work, we tackle the issue of evaluating the security of passport-based IP protection methods. We propose a novel and effective ambiguity attack against the passport-based method, capable of successfully forging multiple valid passports with a small training dataset. This is accomplished by inserting a specially designed accessory block ahead of the passport parameters. Using less than 10% of the training data, with the forged passport, the model exhibits an almost indistinguishable performance difference (less than 2%) compared with that of the authorized passport. In addition, it is shown that our attack strategy can be readily generalized to attack other IP protection methods based on watermark embedding. Directions for potential remedy solutions are also given.

Sibling-Attack: Rethinking Transferable Adversarial Attacks Against Face Recognition
Li, Zexin and Yin, Bangjie and Yao, Taiping and Guo, Junfeng and Ding, Shouhong and Chen, Simin and Liu, Cong



Research problem: A major challenge in developing practical face recognition (FR) attacks is the black-box nature of the target FR model, i.e., attackers cannot access its gradients or parameters.
Motivation: Although recent work has made important progress in attacking black-box FR models by leveraging transferability, performance remains limited, especially against online commercial FR systems, where results can be pessimistic (e.g., an average attack success rate (ASR) below 50%).
Method: Motivated by this, we propose Sibling-Attack, a new FR attack technique that, for the first time, explores a multi-task perspective (i.e., leveraging extra information from correlated tasks to improve attack transferability). Specifically, Sibling-Attack selects a set of tasks correlated with FR and, based on theoretical and quantitative analysis, picks the attribute recognition (AR) task as the sibling task. It then develops an optimization framework that fuses adversarial gradient information by (1) constraining cross-task features to lie in the same space; (2) a joint-task meta-optimization framework that enhances gradient compatibility among tasks; and (3) a cross-task gradient stabilization method that mitigates oscillation during the attack.
Results: Extensive experiments show that Sibling-Attack outperforms state-of-the-art FR attack techniques, boosting the average ASR against two well-known and widely used commercial FR systems by 12.61% and 55.77%, respectively.

A hard challenge in developing practical face recognition (FR) attacks is the black-box nature of the target FR model, i.e., gradient and parameter information is inaccessible to attackers. While recent research has taken an important step towards attacking black-box FR models by leveraging transferability, performance is still limited, especially against online commercial FR systems, where results can be pessimistic (e.g., less than a 50% attack success rate (ASR) on average). Motivated by this, we present Sibling-Attack, a new FR attack technique that, for the first time, explores a novel multi-task perspective (i.e., leveraging extra information from multi-correlated tasks to boost attacking transferability). Intuitively, Sibling-Attack selects a set of tasks correlated with FR and picks the Attribute Recognition (AR) task as the sibling task based on theoretical and quantitative analysis. Sibling-Attack then develops an optimization framework that fuses adversarial gradient information through (1) constraining the cross-task features to lie in the same space, (2) a joint-task meta-optimization framework that enhances the gradient compatibility among tasks, and (3) a cross-task gradient stabilization method that mitigates the oscillation effect during attacking. Extensive experiments demonstrate that Sibling-Attack outperforms state-of-the-art FR attack techniques by a non-trivial margin, boosting ASR by 12.61% and 55.77% on average on state-of-the-art pre-trained FR models and two well-known, widely used commercial FR systems.

Breaching FedMD: Image Recovery via Paired-Logits Inversion Attack
Takahashi, Hideaki and Liu, Jingjing and Liu, Yang



Research problem: This paper addresses data exposure risks in federated learning, particularly in Federated Learning with Model Distillation (FedMD).
Motivation: Although sharing only the output logits of public datasets is safer than directly sharing private model parameters, which are vulnerable to gradient inversion attacks, carefully designed malicious attacks can still cause data exposure.
Method: By training an inversion neural network that exploits the confidence gap between the server and client models, a malicious server can mount a Paired-Logits Inversion (PLI) attack against FedMD and its variants.
Results: Experiments show that under FedMD-like schemes, a malicious server can reconstruct private images on all tested benchmarks with a high success rate, using only the paired server-client logits of public datasets.

Federated Learning with Model Distillation (FedMD) is a nascent collaborative learning paradigm, where only output logits of public datasets are transmitted as distilled knowledge, instead of passing on private model parameters that are susceptible to gradient inversion attacks, a known privacy risk in federated learning. In this paper, we found that even though sharing output logits of public datasets is safer than directly sharing gradients, there still exists a substantial risk of data exposure caused by carefully designed malicious attacks. Our study shows that a malicious server can inject a PLI (Paired-Logits Inversion) attack against FedMD and its variants by training an inversion neural network that exploits the confidence gap between the server and client models. Experiments on multiple facial recognition datasets validate that under FedMD-like schemes, by using paired server-client logits of public datasets only, the malicious server is able to reconstruct private images on all tested benchmarks with a high success rate.

Megahertz Light Steering Without Moving Parts
Pediredla, Adithya and Narasimhan, Srinivasa G. and Chamanzar, Maysamreza and Gkioulekas, Ioannis



Research problem: How to achieve high-speed, low-cost light steering for the many projection and imaging systems that require fast, reliable, low-cost, and wavelength-independent light steering.
Motivation: Current light steering technologies mostly require expensive equipment and complex operation, and are slow. We aim to develop a new light steering technology that resolves these issues.
Method: We introduce a light steering technology that operates at megahertz frequencies, has no moving parts, and costs less than a hundred dollars. Ultrasound waves generate a spatiotemporally varying refractive index field inside a compressible medium such as water, turning the medium into a dynamically moving lens. By controlling the electrical input of the ultrasound transducers that generate the waves, we can change the lens and thus steer light at the speed of sound (1.5 km/s in water).
Results: We build a physical prototype of this technology, use it to realize different scanning techniques at megahertz rates (three orders of magnitude faster than commercial alternatives such as galvo mirror scanners), and demonstrate proof-of-concept projector and LiDAR applications. We also derive the theory of its fundamental limits and develop a physically accurate simulator for virtual design. This technology offers a promising solution for high-speed, low-cost light steering across a variety of applications.

We introduce a light steering technology that operates at megahertz frequencies, has no moving parts, and costs less than a hundred dollars. Our technology can benefit many projector and imaging systems that critically rely on high-speed, reliable, low-cost, and wavelength-independent light steering, including laser scanning projectors, LiDAR sensors, and fluorescence microscopes. Our technology uses ultrasound waves to generate a spatiotemporally-varying refractive index field inside a compressible medium, such as water, turning the medium into a dynamic traveling lens. By controlling the electrical input of the ultrasound transducers that generate the waves, we can change the lens, and thus steer light, at the speed of sound (1.5 km/s in water). We build a physical prototype of this technology, use it to realize different scanning techniques at megahertz rates (three orders of magnitude faster than commercial alternatives such as galvo mirror scanners), and demonstrate proof-of-concept projector and LiDAR applications. To encourage further innovation towards this new technology, we derive the theory for its fundamental limits and develop a physically-accurate simulator for virtual design. Our technology offers a promising solution for achieving high-speed and low-cost light steering in a variety of applications.

Learning Bottleneck Concepts in Image Classification
Wang, Bowen and Li, Liangzhi and Nakashima, Yuta and Nagahara, Hajime



Research problem: How to interpret and explain the behavior of deep neural networks and improve their interpretability.
Motivation: Existing explainable AI methods mainly provide per-pixel relevance, but interpreting such explanations may require expert knowledge.
Method: This paper proposes the Bottleneck Concept Learner (BotCL), which represents an image solely by the presence/absence of concepts learned through training on the target task, without explicit supervision over the concepts. Self-supervision and tailored regularizers make the learned concepts human-understandable.
Results: Tests on several image classification tasks demonstrate BotCL's potential to rebuild neural networks for better interpretability.

Interpreting and explaining the behavior of deep neural networks is critical for many tasks. Explainable AI provides a way to address this challenge, mostly by providing per-pixel relevance to the decision. Yet, interpreting such explanations may require expert knowledge. Some recent attempts toward interpretability adopt a concept-based framework, giving a higher-level relationship between some concepts and model decisions. This paper proposes Bottleneck Concept Learner (BotCL), which represents an image solely by the presence/absence of concepts learned through training over the target task without explicit supervision over the concepts. It uses self-supervision and tailored regularizers so that learned concepts can be human-understandable. Using some image classification tasks as our testbed, we demonstrate BotCL's potential to rebuild neural networks for better interpretability.

Physically Adversarial Infrared Patches With Learnable Shapes and Locations
Wei, Xingxing and Yu, Jie and Huang, Yao



Research problem: How to evaluate the robustness of infrared object detectors against adversarial examples in the real world.
Motivation: Given the wide use of infrared object detectors in safety-critical tasks, it is necessary to evaluate their real-world robustness against adversarial examples. However, few physical infrared attacks have been implemented in practice because of the complex transformation from the digital world to the physical world.
Method: This paper proposes a physically feasible infrared attack method called "adversarial infrared patches". Considering that infrared cameras image objects by capturing their thermal radiation, adversarial infrared patches attack by attaching a patch of thermal insulation material to the target object to manipulate its thermal distribution. To enhance the attack, we propose a novel aggregation regularization to guide the simultaneous learning of the patch's shape and location on the target object, so that a simple gradient-based optimization can solve for them.
Results: We verify adversarial infrared patches on different object detection tasks with various detectors. Experiments show more than a 90% attack success rate (ASR) against pedestrian and vehicle detectors in physical environments, with objects captured at different angles, distances, postures, and scenes. More importantly, the patches are easy to implement, requiring only 0.5 hours to construct in the physical world, which verifies their effectiveness and efficiency.

Owing to the extensive application of infrared object detectors in safety-critical tasks, it is necessary to evaluate their robustness against adversarial examples in the real world. However, the few existing physical infrared attacks are complicated to implement in practice because of their complex transformation from the digital world to the physical world. To address this issue, in this paper we propose a physically feasible infrared attack method called "adversarial infrared patches". Considering that infrared cameras image objects by capturing their thermal radiation, adversarial infrared patches conduct attacks by attaching a patch of thermal insulation material to the target object to manipulate its thermal distribution. To enhance the adversarial attacks, we present a novel aggregation regularization to guide the simultaneous learning of the patch's shape and location on the target object. Thus, a simple gradient-based optimization can be adapted to solve for them. We verify adversarial infrared patches on different object detection tasks with various object detectors. Experimental results show that our method achieves more than a 90% Attack Success Rate (ASR) against pedestrian and vehicle detectors in the physical environment, where the objects are captured at different angles, distances, postures, and scenes. More importantly, adversarial infrared patches are easy to implement, needing only 0.5 hours to construct in the physical world, which verifies their effectiveness and efficiency.

Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective
Zhang, Weixia and Zhai, Guangtao and Wei, Ying and Yang, Xiaokang and Ma, Kede



Research problem: This paper aims to advance blind image quality assessment (BIQA), which predicts human perception of image quality without any reference information.
Motivation: Current BIQA methods draw auxiliary knowledge from other tasks, but model parameter sharing and loss weighting usually must be set manually.
Method: We develop a general and automated multitask learning scheme that mines auxiliary knowledge from other tasks while determining parameter sharing and loss weighting automatically. Specifically, we first describe all candidate label combinations (from multiple tasks) with textual templates, then compute the joint probability from the cosine similarities of the visual-textual embeddings. Predictions for each task can be inferred from the joint distribution and optimized by carefully designed loss functions.
Results: Comprehensive experiments on three tasks (scene classification, distortion type identification, and BIQA) verify that our BIQA method 1) benefits from the scene classification and distortion type identification tasks and outperforms state-of-the-art methods on multiple IQA datasets; 2) is more robust in the group maximum differentiation competition; and 3) realigns quality annotations from different IQA datasets more effectively.

We aim at advancing blind image quality assessment (BIQA), which predicts the human perception of image quality without any reference information. We develop a general and automated multitask learning scheme for BIQA to exploit auxiliary knowledge from other tasks, in a way that the model parameter sharing and the loss weighting are determined automatically. Specifically, we first describe all candidate label combinations (from multiple tasks) using a textual template, and compute the joint probability from the cosine similarities of the visual-textual embeddings. Predictions of each task can be inferred from the joint distribution, and optimized by carefully designed loss functions. Through comprehensive experiments on learning three tasks - BIQA, scene classification, and distortion type identification, we verify that the proposed BIQA method 1) benefits from the scene classification and distortion type identification tasks and outperforms the state-of-the-art on multiple IQA datasets, 2) is more robust in the group maximum differentiation competition, and 3) realigns the quality annotations from different IQA datasets more effectively. The source code is available at https://github.com/zwx8981/LIQE.
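
The joint-distribution step described above can be sketched numerically. A minimal numpy illustration follows, where the embedding sizes, the CLIP-style temperature, and all variable names are assumptions for illustration rather than the authors' code: one text embedding per candidate label combination, a softmax over cosine similarities with the image embedding, and per-task predictions as marginals of that joint.

```python
import numpy as np

# Hypothetical sketch of the multitask inference: (scene s, distortion d,
# quality level q) combinations each get one text embedding.
rng = np.random.default_rng(0)
n_scene, n_dist, n_qual, dim = 3, 4, 5, 16

image_emb = rng.normal(size=dim)
text_emb = rng.normal(size=(n_scene, n_dist, n_qual, dim))  # one per combination

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sims = np.array([[[cosine(image_emb, text_emb[s, d, q])
                   for q in range(n_qual)]
                  for d in range(n_dist)]
                 for s in range(n_scene)])

logits = sims / 0.07                      # CLIP-style temperature (assumed)
joint = np.exp(logits - logits.max())
joint /= joint.sum()                      # joint probability over all combinations

p_scene = joint.sum(axis=(1, 2))          # marginal for scene classification
p_dist = joint.sum(axis=(0, 2))           # marginal for distortion identification
p_qual = joint.sum(axis=(0, 1))           # marginal over quality levels
mos = (np.arange(1, n_qual + 1) * p_qual).sum()  # expected quality score
```

Each task's prediction is then the argmax (or expectation) of its marginal, which is what lets one joint vision-language model serve all three tasks at once.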

SplineCam: Exact Visualization and Characterization of Deep Network Geometry and Decision Boundaries
Humayun, Ahmed Imtiaz and Balestriero, Randall and Balakrishnan, Guha and Baraniuk, Richard G.



Research problem: Current deep network visualization and interpretability methods rely mainly on data-space visualizations, such as scoring which dimensions of the data are responsible for their associated prediction, or generating new data features or samples that best match a given deep network unit or representation.
Motivation: This paper develops the first provably exact method for computing the geometry of a deep network's mapping, including its decision boundary, over a specified region of the data space.
Method: By leveraging the theory of continuous piecewise linear (CPWL) spline deep networks, SplineCam exactly computes a deep network's geometry without approximations such as sampling or architecture simplification. SplineCam applies to any deep network architecture based on CPWL activation nonlinearities, including (leaky) ReLU, absolute value, maxout, and max-pooling, and can also be applied to regression networks such as implicit neural representations.
Results: Beyond decision boundary visualization and characterization, SplineCam enables comparing architectures, measuring generalizability, and sampling from the decision boundary on or off the data manifold.

Current Deep Network (DN) visualization and interpretability methods rely heavily on data space visualizations such as scoring which dimensions of the data are responsible for their associated prediction or generating new data features or samples that best match a given DN unit or representation. In this paper, we go one step further by developing the first provably exact method for computing the geometry of a DN's mapping -- including its decision boundary -- over a specified region of the data space. By leveraging the theory of Continuous Piecewise Linear (CPWL) spline DNs, SplineCam exactly computes a DN's geometry without resorting to approximations such as sampling or architecture simplification. SplineCam applies to any DN architecture based on CPWL activation nonlinearities, including (leaky) ReLU, absolute value, maxout, and max-pooling and can also be applied to regression DNs such as implicit neural representations. Beyond decision boundary visualization and characterization, SplineCam enables one to compare architectures, measure generalizability, and sample from the decision boundary on or off the data manifold. Project website: https://bit.ly/splinecam
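
A toy of the underlying CPWL fact, not SplineCam itself (which operates over 2D regions of input space): for a one-hidden-layer ReLU network on an interval, the mapping is continuous piecewise linear, and its knots can be computed exactly as the points where a pre-activation crosses zero, with no sampling. The weights below are arbitrary illustrative values.

```python
import numpy as np

W = np.array([1.0, -2.0, 0.5])   # hidden weights (toy values)
b = np.array([-0.5, 1.0, 0.25])  # hidden biases
v = np.array([1.0, 1.0, -1.0])   # output weights

# Each ReLU unit switches on/off exactly at x = -b/W; these are the only
# places where the piecewise-linear map can change slope.
knots = sorted({x for x in (-b / W) if -2.0 < x < 2.0})

def f(x):
    return v @ np.maximum(W * x + b, 0.0)

# Verify exactness: between consecutive knots the map is affine, so every
# segment midpoint must lie on the chord between its endpoints.
pts = [-2.0] + knots + [2.0]
for lo, hi in zip(pts[:-1], pts[1:]):
    mid = 0.5 * (lo + hi)
    assert abs(f(mid) - 0.5 * (f(lo) + f(hi))) < 1e-9
```

In higher dimensions the same zero-crossing sets become hyperplanes whose arrangement partitions the input region, which is the geometry SplineCam computes exactly.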

Physical-World Optical Adversarial Attacks on 3D Face Recognition
Li, Yanjie and Li, Yiquan and Dai, Xuelong and Guo, Songtao and Xiao, Bin



Research problem: Existing adversarial attacks achieve low success rates on real-world 3D face recognition because 3D-printing attacks require the generated points to be adjacent to the surface, which limits the adversarial examples' search space.
Motivation: To address these real-world challenges, we propose a novel structured-light-based adversarial attack.
Method: We incorporate the 3D reconstruction process and the skin's reflectance into the optimization to obtain an end-to-end attack, and propose a 3D transform invariant loss and sensitivity maps to improve robustness.
Results: Experiments show that our new method can attack point-cloud-based and depth-image-based 3D face recognition systems with a high success rate, using fewer perturbations.

The success rate of current adversarial attacks remains low on real-world 3D face recognition tasks because 3D-printing attacks must satisfy the requirement that the generated points be adjacent to the surface, which limits the adversarial examples' search space. Additionally, they have not considered unpredictable head movements or the non-homogeneous nature of skin reflectance in the real world. To address these real-world challenges, we propose a novel structured-light attack against structured-light-based 3D face recognition. We incorporate the 3D reconstruction process and the skin's reflectance into the optimization process to obtain an end-to-end attack, and present a 3D transform invariant loss and sensitivity maps to improve robustness. Our attack enables adversarial points to be placed in any position and is resilient to random head movements while keeping the perturbation unnoticeable. Experiments show that our new method can attack point-cloud-based and depth-image-based 3D face recognition systems with a high success rate, using fewer perturbations than previous physical 3D adversarial attacks.

Adversarial Counterfactual Visual Explanations
Jeanneret, Guillaume and Simon, Loïc



Research problem: How can adversarial attacks be turned into semantically meaningful perturbations for counterfactual explanation?
Motivation: Current adversarial attacks are perceived as noise in the counterfactual-explanation setting and cannot be used directly.
Method: Propose a method that generates adversarial attacks through a Denoising Diffusion Probabilistic Model, using the diffusion model to polish the attacks into semantically meaningful perturbations.
Results: Experiments show that the method outperforms the current state of the art on multiple testbeds.

Counterfactual explanations and adversarial attacks have a related goal: flipping output labels with minimal perturbations regardless of their characteristics. Yet, adversarial attacks cannot be used directly in a counterfactual explanation perspective, as such perturbations are perceived as noise and not as actionable and understandable image modifications. Building on the robust learning literature, this paper proposes an elegant method to turn adversarial attacks into semantically meaningful perturbations, without modifying the classifiers to explain. The proposed approach hypothesizes that Denoising Diffusion Probabilistic Models are excellent regularizers for avoiding high-frequency and out-of-distribution perturbations when generating adversarial attacks. The paper's key idea is to build attacks through a diffusion model to polish them. This allows studying the target model regardless of its robustification level. Extensive experimentation shows the advantages of our counterfactual explanation approach over current State-of-the-Art in multiple testbeds.

Superclass Learning With Representation Enhancement
Gan, Zeyu and Zhao, Suyun and Kang, Jinlong and Shang, Liyuan and Chen, Hong and Li, Cuiping



Research problem: In many real scenarios, data are divided into a handful of artificial super categories based on expert knowledge rather than image representations. Due to the lack of common semantic features, existing classification techniques cannot recognize superclasses without raw class labels, so they suffer severe performance damage or require huge annotation costs.
Motivation: To narrow this gap, this paper proposes SuperClass Learning with Representation Enhancement (SCLRE), a superclass learning framework that recognizes super categories by leveraging enhanced representations.
Method: Specifically, SCLRE exploits a self-attention technique across the batch to collapse the boundaries of the raw categories and enhance the representation of each superclass. A superclass-aware decision boundary is then reconstructed on the enhanced representation space.
Results: Experiments show that SCLRE outperforms the baseline and other contrastive methods on the CIFAR-100 dataset and four high-resolution datasets.

In many real scenarios, data are divided into a handful of artificial super categories according to expert knowledge rather than the representations of the images. Concretely, a superclass may contain massive and varied raw categories, as in refuse sorting. Due to the lack of common semantic features, existing classification techniques struggle to recognize superclasses without raw class labels, so they either suffer severe performance damage or require huge annotation costs. To narrow this gap, this paper proposes a superclass learning framework, called SuperClass Learning with Representation Enhancement (SCLRE), to recognize super categories by leveraging enhanced representations. Specifically, by exploiting a self-attention technique across the batch, SCLRE collapses the boundaries of the raw categories and enhances the representation of each superclass. On the enhanced representation space, a superclass-aware decision boundary is then reconstructed. Theoretically, we prove that by leveraging attention techniques the generalization error of SCLRE can be bounded under superclass scenarios. Experimentally, extensive results demonstrate that SCLRE outperforms the baseline and other contrastive-based methods on the CIFAR-100 dataset and four high-resolution datasets.
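
The cross-batch self-attention step can be sketched as follows. This is a schematic single-head form with assumed dimensions, not the paper's architecture: every sample in the batch attends to every other sample, so instances of the same superclass can pool evidence and blur their raw-category boundaries.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim = 8, 16

X = rng.normal(size=(batch, dim))               # per-sample representations
Wq, Wk, Wv = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(dim)                 # (batch, batch) sample affinities
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)               # each row is a distribution

enhanced = X + A @ V                            # residual cross-batch update
```

A superclass-aware classifier would then be trained on `enhanced` rather than on the raw per-sample features.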

Shortcomings of Top-Down Randomization-Based Sanity Checks for Evaluations of Deep Neural Network Explanations
Binder, Alexander and Weber, Leander and Lapuschkin, Sebastian and Montavon, Grégoire and Müller, Klaus-Robert and Samek, Wojciech



Research problem: How can a model's explanations be evaluated accurately?
Motivation: Current evaluation methods for model explanations may be flawed and call for more rigorous assessment.
Method: Through experiments, we show that randomization tests have limitations when evaluating explanation methods; we observe an experimental gap between evaluation schemes and identify the causes of these limitations.
Results: We show that model-randomization-based sanity checks cannot serve as a primary criterion for selecting or discarding explanation methods, revealing their inadequacy for ranking explanation methods.

While the evaluation of explanations is an important step towards trustworthy models, it needs to be done carefully, and the employed metrics need to be well understood. Specifically, model randomization testing can be over-interpreted if regarded as a primary criterion for selecting or discarding explanation methods. To address the shortcomings of this test, we start by observing an experimental gap in the ranking of explanation methods between randomization-based sanity checks [1] and model output faithfulness measures (e.g. [20]). We identify limitations of model-randomization-based sanity checks for the purpose of evaluating explanations. Firstly, we show that uninformative attribution maps created with zero pixel-wise covariance easily achieve high scores in this type of check. Secondly, we show that top-down model randomization preserves the scales of forward-pass activations with high probability. That is, channels with large activations have a high probability of contributing strongly to the output, even after randomization of the network above them. Hence, explanations after randomization can only be expected to differ to a certain extent. This explains the observed experimental gap. In summary, these results demonstrate the inadequacy of model-randomization-based sanity checks as a criterion for ranking attribution methods.
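
The first limitation has a short numeric illustration. The setup below is a stand-in (pure noise vectors in place of real attribution maps): a randomization sanity check rewards maps that change under model randomization, i.e., low similarity, and two independent noise maps have near-zero covariance by construction, so an entirely uninformative method "passes".

```python
import numpy as np

rng = np.random.default_rng(1)

# An uninformative "attribution method": per-pixel noise independent of the
# model, drawn once for the trained model and once after randomization.
attr_noise_a = rng.normal(size=1000)
attr_noise_b = rng.normal(size=1000)

def cov(a, b):
    return float(np.mean((a - a.mean()) * (b - b.mean())))

# Near-zero similarity between the two maps, exactly what the sanity check
# rewards, despite the maps carrying no information about the model at all.
score_noise = abs(cov(attr_noise_a, attr_noise_b))
assert score_noise < 0.2
```

This is why the paper argues low similarity under randomization is necessary-at-best evidence, not a ranking criterion on its own.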

Towards Trustable Skin Cancer Diagnosis via Rewriting Model's Decision
Yan, Siyuan and Yu, Zhen and Zhang, Xuelin and Mahapatra, Dwarikanath and Chandra, Shekhar S. and Janda, Monika and Soyer, Peter and Ge, Zongyuan



Research problem: Deep neural networks perform well on image recognition tasks but may over-rely on confounding factors in the dataset, leading to untrustworthy decisions and catastrophic outcomes in real-world scenarios.
Motivation: Explore and resolve the confounding behavior of deep neural networks in the context of skin cancer diagnosis.
Method: Introduce a human-in-the-loop training framework that lets users observe and correct the model's decision logic. Confounding factors are discovered automatically by analyzing sample co-occurrence behavior, and confounding concepts are learned from easily obtained concept exemplars. By mapping the black-box model's feature representation onto an interpretable concept space, users can understand the concepts and intervene via first-order-logic instructions.
Results: Systematic evaluation on a newly crafted skin lesion dataset and several public skin lesion datasets shows that the method effectively detects and removes confounding factors without prior knowledge of the category distribution or fully annotated concept labels. It also makes the model focus on clinically relevant concepts, improving performance and trustworthiness during inference.

Deep neural networks have demonstrated promising performance on image recognition tasks. However, they may heavily rely on confounding factors, using irrelevant artifacts or bias within the dataset as cues to improve performance. When a model makes decisions based on these spurious correlations, it becomes untrustworthy and can lead to catastrophic outcomes when deployed in real-world scenes. In this paper, we explore and try to solve this problem in the context of skin cancer diagnosis. We introduce a human-in-the-loop framework into the model training process such that users can observe and correct the model's decision logic when confounding behaviors occur. Specifically, our method can automatically discover confounding factors by analyzing the co-occurrence behavior of the samples. It is capable of learning confounding concepts using easily obtained concept exemplars. By mapping the black-box model's feature representation onto an explainable concept space, human users can interpret the concepts and intervene via first-order-logic instructions. We systematically evaluate our method on our newly crafted, well-controlled skin lesion dataset and several public skin lesion datasets. Experiments show that our method can effectively detect and remove confounding factors from datasets without any prior knowledge about the category distribution and does not require fully annotated concept labels. We also show that our method enables the model to focus on clinically related concepts, improving the model's performance and trustworthiness during inference.

Visibility Constrained Wide-Band Illumination Spectrum Design for Seeing-in-the-Dark
Niu, Muyao and Li, Zhuoxiao and Zhong, Zhihang and Zheng, Yinqiang



Research problem: Seeing in the dark is one of the most important and challenging computer vision tasks, owing to its wide applications and the extreme complexity of in-the-wild scenarios.
Motivation: Existing methods fall into two categories: 1) RGB-dependent methods, which restore information using only degraded RGB inputs (e.g., low-light enhancement); 2) RGB-independent methods, which translate images captured under auxiliary near-infrared (NIR) illumination into the RGB domain (e.g., NIR2RGB translation). The latter works in complete darkness with illuminants friendly to naked eyes, but tends to be unstable due to intrinsic ambiguities.
Method: This paper robustifies NIR2RGB translation by designing the optimal spectrum of auxiliary illumination over the VIS-NIR range while keeping visual friendliness. The core idea is to quantify the visibility constraint implied by the human vision system and incorporate it into the design pipeline. By modeling the image formation process over the VIS-NIR range, the optimal multiplexing of LEDs is automatically designed in a fully differentiable manner within the feasible region defined by the visibility constraint. We also collect a substantially expanded VIS-NIR hyperspectral image dataset using a customized 50-band filter wheel.
Results: Experiments show that task performance is significantly improved by the optimized wide-band illumination compared with NIR only.

Seeing-in-the-dark is one of the most important and challenging computer vision tasks due to its wide applications and the extreme complexity of in-the-wild scenarios. Existing approaches can be mainly divided into two threads: 1) RGB-dependent methods restore information using degraded RGB inputs only (e.g., low-light enhancement); 2) RGB-independent methods translate images captured under auxiliary near-infrared (NIR) illuminants into the RGB domain (e.g., NIR2RGB translation). The latter is very attractive since it works in complete darkness and the illuminants are visually friendly to naked eyes, but it tends to be unstable due to its intrinsic ambiguities. In this paper, we try to robustify NIR2RGB translation by designing the optimal spectrum of auxiliary illumination in the wide-band VIS-NIR range, while keeping visual friendliness. Our core idea is to quantify the visibility constraint implied by the human vision system and incorporate it into the design pipeline. By modeling the formation process of images in the VIS-NIR range, the optimal multiplexing of a wide range of LEDs is automatically designed in a fully differentiable manner, within the feasible region defined by the visibility constraint. We also collect a substantially expanded VIS-NIR hyperspectral image dataset for experiments using a customized 50-band filter wheel. Experimental results show that the task can be significantly improved by using the optimized wide-band illumination rather than NIR only. Codes available: https://github.com/MyNiuuu/VCSD.

GamutMLP: A Lightweight MLP for Color Loss Recovery
Le, Hoang M. and Price, Brian and Cohen, Scott and Brown, Michael S.



Research problem: How can color information lost during image encoding be recovered?
Motivation: When encoding images into the sRGB color space, existing image processing software clips away a large portion of visible colors, losing color information.
Method: Propose optimizing a lightweight multi-layer perceptron (MLP) during the gamut reduction step to predict the clipped values and thereby recover the lost colors.
Results: The method is effective and efficient: optimization takes only about 2 seconds, and the model requires only 23 KB of storage. Comparisons with pre-trained DNN-based gamut expansion networks and other implicit neural representation methods demonstrate its superiority.

Cameras and image-editing software often process images in the wide-gamut ProPhoto color space, encompassing 90% of all visible colors. However, when images are encoded for sharing, this color-rich representation is transformed and clipped to fit within the small-gamut standard RGB (sRGB) color space, representing only 30% of visible colors. Recovering the lost color information is challenging due to the clipping procedure. Inspired by neural implicit representations for 2D images, we propose a method that optimizes a lightweight multi-layer-perceptron (MLP) model during the gamut reduction step to predict the clipped values. GamutMLP takes approximately 2 seconds to optimize and requires only 23 KB of storage. The small memory footprint allows our GamutMLP model to be saved as metadata in the sRGB image---the model can be extracted when needed to restore wide-gamut color values. We demonstrate the effectiveness of our approach for color recovery and compare it with alternative strategies, including pre-trained DNN-based gamut expansion networks and other implicit neural representation methods. As part of this effort, we introduce a new color gamut dataset of 2200 wide-gamut/small-gamut images for training and testing.
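
A hedged 1D stand-in for the per-image optimization described above follows; the signal, network sizes, and training loop are illustrative assumptions, not the authors' code. The idea it demonstrates: fit a tiny MLP, by plain gradient descent at encoding time, to predict the residual that clipping removed, so the residual can be regenerated from the stored model at decoding time.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.linspace(0.0, 1.0, 256)                 # "pixel coordinate"
wide = 1.3 * np.sin(2 * np.pi * x)             # wide-gamut signal, exceeds [-1, 1]
clipped = np.clip(wide, -1.0, 1.0)             # small-gamut encoding
target = wide - clipped                        # residual the MLP must recover

inp = np.stack([x, clipped], axis=1)           # (N, 2) inputs
W1 = rng.normal(scale=0.3, size=(2, 32)); b1 = np.zeros(32)
W2 = rng.normal(scale=0.3, size=(32, 1)); b2 = np.zeros(1)  # 129 params in total

lr = 0.02
for _ in range(4000):                          # full-batch gradient descent (MSE)
    h = np.maximum(inp @ W1 + b1, 0.0)         # ReLU hidden layer
    pred = (h @ W2 + b2).ravel()
    g_pred = 2.0 * (pred - target)[:, None] / len(x)
    gW2, gb2 = h.T @ g_pred, g_pred.sum(0)
    g_h = (g_pred @ W2.T) * (h > 0)
    gW1, gb1 = inp.T @ g_h, g_h.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

restored = clipped + pred                      # reconstruction from the tiny model
mse = float(np.mean((restored - wide) ** 2))   # shrinks as the fit improves
```

The 129-parameter toy here mirrors the point of the paper's ~23 KB model: the network is small enough to ship as metadata inside the sRGB file.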

RIATIG: Reliable and Imperceptible Adversarial Text-to-Image Generation With Natural Prompts
Liu, Han and Wu, Yuhao and Zhai, Shixuan and Yuan, Bo and Zhang, Ning



Research problem: This paper explores the robustness of text-to-image generation models, particularly under adversarial attack.
Motivation: As text-to-image generation advances and becomes widespread, its potential security risks have drawn attention. However, existing research focuses mainly on untargeted settings and lacks holistic consideration of reliability (attack success rate) and stealthiness (imperceptibility).
Method: This paper proposes RIATIG, a reliable and imperceptible adversarial attack against text-to-image models via inconspicuous examples. By formulating example crafting as an optimization process and solving it with a genetic-based method, the attack reliably generates imperceptible prompts for text-to-image generation models.
Results: Evaluation on six popular text-to-image generation models shows that RIATIG is both efficient and imperceptible in white-box and black-box settings. The authors have released their artifacts so the community can build on these findings.

The field of text-to-image generation has made remarkable strides in creating high-fidelity and photorealistic images. As this technology gains popularity, there is a growing concern about its potential security risks. However, there has been limited exploration into the robustness of these models from an adversarial perspective. Existing research has primarily focused on untargeted settings, and lacks holistic consideration for reliability (attack success rate) and stealthiness (imperceptibility). In this paper, we propose RIATIG, a reliable and imperceptible adversarial attack against text-to-image models via inconspicuous examples. By formulating the example crafting as an optimization process and solving it using a genetic-based method, our proposed attack can generate imperceptible prompts for text-to-image generation models in a reliable way. Evaluation of six popular text-to-image generation models demonstrates the efficiency and stealthiness of our attack in both white-box and black-box settings. To allow the community to build on top of our findings, we've made the artifacts available.
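
A minimal genetic-search skeleton in the spirit described above; the fitness function is a deliberate stand-in (character matches against a fixed target string), whereas RIATIG scores candidate prompts by the generated image's similarity to the target image while constraining imperceptibility. All names and hyperparameters are illustrative.

```python
import random
import string

random.seed(0)
TARGET = "a photo of a cat"
ALPHABET = string.ascii_lowercase + " "

def fitness(prompt):
    # Stand-in objective; RIATIG would query the text-to-image model here.
    return sum(a == b for a, b in zip(prompt, TARGET))

def mutate(prompt):
    i = random.randrange(len(prompt))
    return prompt[:i] + random.choice(ALPHABET) + prompt[i + 1:]

def crossover(a, b):
    i = random.randrange(len(a))
    return a[:i] + b[i:]

pop = ["".join(random.choice(ALPHABET) for _ in range(len(TARGET)))
       for _ in range(40)]
for _ in range(300):                             # generations
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                           # elitist selection
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(30)]
    pop = parents + children

best = max(pop, key=fitness)
```

Because the search only needs fitness evaluations, never gradients, the same loop applies in the black-box setting reported in the abstract.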

Proximal Splitting Adversarial Attack for Semantic Segmentation
Rony, Jérôme and Pesquet, Jean-Christophe and Ben Ayed, Ismail



Research problem: Existing adversarial attack methods mainly target classification; dense prediction tasks such as semantic segmentation are far less studied.
Motivation: Current adversarial attacks for semantic segmentation do not solve the problem accurately and therefore overestimate the perturbation size required to fool models.
Method: Propose a proximal-splitting-based white-box attack that handles large numbers of constraints within a nonconvex minimization framework via an augmented Lagrangian approach with adaptive constraint scaling and masking strategies, producing adversarial perturbations with much smaller l_infinity norms.
Results: Experiments show that the method significantly outperforms previously proposed attacks, as well as classification attacks adapted to segmentation, providing a first comprehensive benchmark for this dense task.

Classification has been the focal point of research on adversarial attacks, but only a few works investigate methods suited to denser prediction tasks, such as semantic segmentation. The methods proposed in these works do not accurately solve the adversarial segmentation problem and, therefore, overestimate the size of the perturbations required to fool models. Here, we propose a white-box attack for these models based on a proximal splitting to produce adversarial perturbations with much smaller l_infinity norms. Our attack can handle large numbers of constraints within a nonconvex minimization framework via an Augmented Lagrangian approach, coupled with adaptive constraint scaling and masking strategies. We demonstrate that our attack significantly outperforms previously proposed ones, as well as classification attacks that we adapted for segmentation, providing a first comprehensive benchmark for this dense task.
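
A generic toy of the constraint-handling machinery named above (an augmented Lagrangian with adaptive penalty scaling), not the segmentation attack itself: minimize ||x||^2 subject to a.x = 1, whose closed-form solution is (0.2, 0.4). The step sizes and schedules are illustrative.

```python
import numpy as np

a = np.array([1.0, 2.0])
x = np.zeros(2)
lam, rho = 0.0, 1.0                   # Lagrange multiplier and penalty weight

for _ in range(50):                   # outer augmented-Lagrangian iterations
    for _ in range(200):              # inner gradient descent on the AL function
        c = a @ x - 1.0               # constraint violation
        x -= 0.05 * (2.0 * x + (lam + rho * c) * a)
    c = a @ x - 1.0
    lam += rho * c                    # dual ascent on the multiplier
    if abs(c) > 1e-3:
        rho *= 1.5                    # adaptive constraint scaling
```

The attack applies the same pattern with one misclassification constraint per pixel (thousands of constraints) instead of a single linear equality, plus masking to drop already-satisfied ones.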

Towards Transferable Targeted Adversarial Examples
Wang, Zhibo and Yang, Hongshan and Feng, Yunhe and Sun, Peng and Guo, Hengchang and Zhang, Zhifei and Ren, Kui



Research problem: How to generate transferable targeted adversarial examples that mislead models into predicting a specific class.
Motivation: Existing transferable targeted adversarial attacks usually fail to sufficiently characterize the target class distribution and thus have limited transferability.
Method: Propose the Transferable Targeted Adversarial Attack (TTAA), which captures the target class's distribution information from both label-wise and feature-wise perspectives to generate highly transferable targeted adversarial examples. A generative adversarial training framework is designed, consisting of a generator that produces targeted adversarial examples and feature-label dual discriminators that distinguish the generated adversarial examples from target-class images.
Results: Experiments demonstrate excellent transferability of the targeted adversarial examples. The targeted fooling rate reaches 95.13% when transferring from VGG-19 to DenseNet-121, significantly outperforming state-of-the-art methods.

Transferability of adversarial examples is critical for black-box deep learning model attacks. While most existing studies focus on enhancing the transferability of untargeted adversarial attacks, few of them studied how to generate transferable targeted adversarial examples that can mislead models into predicting a specific class. Moreover, existing transferable targeted adversarial attacks usually fail to sufficiently characterize the target class distribution, thus suffering from limited transferability. In this paper, we propose the Transferable Targeted Adversarial Attack (TTAA), which can capture the distribution information of the target class from both label-wise and feature-wise perspectives, to generate highly transferable targeted adversarial examples. To this end, we design a generative adversarial training framework consisting of a generator to produce targeted adversarial examples, and feature-label dual discriminators to distinguish the generated adversarial examples from the target class images. Specifically, we design the label discriminator to guide the adversarial examples to learn label-related distribution information about the target class. Meanwhile, we design a feature discriminator, which extracts the feature-wise information with strong cross-model consistency, to enable the adversarial examples to learn the transferable distribution information. Furthermore, we introduce the random perturbation dropping to further enhance the transferability by augmenting the diversity of adversarial examples used in the training process. Experiments demonstrate that our method achieves excellent performance on the transferability of targeted adversarial examples. The targeted fooling rate reaches 95.13% when transferred from VGG-19 to DenseNet-121, which significantly outperforms the state-of-the-art methods.

Improving Robustness of Vision Transformers by Reducing Sensitivity To Patch Corruptions
Guo, Yong and Stutz, David and Schiele, Bernt



Research problem: Vision transformers remain sensitive to image corruptions such as noise or blur, mainly because of the unstable self-attention mechanism built on patch-based inputs.
Motivation: To improve the robustness of vision transformers by reducing their sensitivity to patch corruptions.
Method: Propose a new training method, Reducing Sensitivity to Patch Corruptions (RSPC). First identify and occlude/corrupt the most vulnerable patches, then explicitly reduce sensitivity to them by aligning the intermediate features between clean and corrupted examples.
Results: Experiments show that RSPC greatly improves the stability of attention layers and consistently yields better robustness on various benchmarks, including CIFAR-10/100-C, ImageNet-A, ImageNet-C, and ImageNet-P.

Despite their success, vision transformers still remain vulnerable to image corruptions, such as noise or blur. Indeed, we find that the vulnerability mainly stems from the unstable self-attention mechanism, which is inherently built upon patch-based inputs and often becomes overly sensitive to the corruptions across patches. For example, when we only occlude a small number of patches with random noise (e.g., 10%), these patch corruptions would lead to severe accuracy drops and greatly distract intermediate attention layers. To address this, we propose a new training method that improves the robustness of transformers from a new perspective -- reducing sensitivity to patch corruptions (RSPC). Specifically, we first identify and occlude/corrupt the most vulnerable patches and then explicitly reduce sensitivity to them by aligning the intermediate features between clean and corrupted examples. We highlight that the construction of patch corruptions is learned adversarially to the following feature alignment process, which is particularly effective and essentially different from existing methods. In experiments, our RSPC greatly improves the stability of attention layers and consistently yields better robustness on various benchmarks, including CIFAR-10/100-C, ImageNet-A, ImageNet-C, and ImageNet-P.
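
The select-then-align recipe above can be sketched schematically in numpy; the real method trains a ViT with an adversarially learned corruption process, so here a fixed random linear map stands in for the intermediate features and every name is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, patch_dim, feat_dim = 16, 48, 32

patches = rng.normal(size=(n_patches, patch_dim))     # tokenized image
W = rng.normal(scale=patch_dim ** -0.5, size=(patch_dim, feat_dim))

def features(p):
    return p @ W                                      # stand-in for a ViT block

clean = features(patches)

def occlude(p, i):
    q = p.copy()
    q[i] = 0.0
    return q

# 1) Vulnerability score: feature shift caused by zeroing each patch alone.
scores = np.array([np.linalg.norm(features(occlude(patches, i)) - clean)
                   for i in range(n_patches)])

# 2) Occlude/corrupt the most vulnerable ~10% of patches with noise.
k = max(1, n_patches // 10)
worst = np.argsort(scores)[-k:]
corrupted = patches.copy()
corrupted[worst] = rng.normal(size=(k, patch_dim))

# 3) Alignment loss between clean and corrupted intermediate features;
#    training would push this toward zero to reduce patch sensitivity.
align_loss = float(np.mean((features(corrupted) - clean) ** 2))
```

In the paper the corruption construction is learned adversarially against the alignment objective, rather than scored by a fixed occlusion probe as in this sketch.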

All-in-One Image Restoration for Unknown Degradations Using Adaptive Discriminative Filters for Specific Degradations
Park, Dongwon and Lee, Byung Hyun and Chun, Se Young



Research problem: Existing image restoration methods cannot cope effectively with unknown multiple degradations.
Motivation: To solve this problem, we propose an adaptive discriminative filter-based model for specific degradations (ADMS).
Method: The model classifies degradations so that the network dedicates only about 3% of its parameters to each degradation and applies these dedicated filters adaptively.
Results: Experiments show state-of-the-art performance on all-in-one image restoration benchmark datasets with multiple degradations, both Rain-Noise-Blur and Rain-Snow-Haze.

Image restoration for single degradations has been widely studied, demonstrating excellent performance for each degradation, but it cannot reflect unpredictable realistic environments with unknown multiple degradations, which may change over time. To mitigate this issue, image restoration for known and unknown multiple degradations has recently been investigated, showing promising results, but such methods require large networks or have sub-optimal architectures prone to interference among different degradations. Here, inspired by filter attribution integrated gradients (FAIG), we propose an adaptive discriminative filter-based model for specific degradations (ADMS) to restore images with unknown degradations. Our method allows the network to contain degradation-dedicated filters, amounting to only about 3% of all network parameters per degradation, and to apply them adaptively via degradation classification (DC) to explicitly disentangle the network for multiple degradations. Our proposed method has demonstrated its effectiveness in comparison studies and achieved state-of-the-art performance on all-in-one image restoration benchmark datasets of both Rain-Noise-Blur and Rain-Snow-Haze.
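
An illustrative sketch of the adaptive-filter idea above; the soft gating, the linear "filters", and all shapes are assumptions (and the toy does not preserve the paper's ~3%-parameters-per-degradation ratio): a shared filter plus small degradation-dedicated banks, blended according to a degradation classifier (DC).

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, n_degradations = 64, 3          # e.g. rain / noise / blur

shared = rng.normal(scale=0.1, size=(feat_dim, feat_dim))
dedicated = rng.normal(scale=0.1, size=(n_degradations, feat_dim, feat_dim))
dc_proj = rng.normal(size=(feat_dim, n_degradations))   # stand-in DC head

def degradation_classifier(x):
    logits = x @ dc_proj
    e = np.exp(logits - logits.max())
    return e / e.sum()                    # soft degradation label

def restore(x):
    p = degradation_classifier(x)
    W = shared + np.tensordot(p, dedicated, axes=1)     # adaptive filter mix
    return x @ W

x = rng.normal(size=feat_dim)
y = restore(x)
```

A hard argmax over the DC output would select one dedicated bank outright; the soft mixture shown here simply keeps the toy differentiable end to end.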

Turning Strengths Into Weaknesses: A Certified Robustness Inspired Attack Framework Against Graph Neural Networks
Wang, Binghui and Pang, Meng and Dong, Yun



Research question: This paper aims to design an attack framework that significantly enhances existing evasion and poisoning attacks.
Motivation: Although existing attack methods already show promising performance, graph neural networks (GNNs) are highly sensitive to perturbations of the graph structure at both test and training time.
Method: Inspired by certified robustness, our attack framework first derives nodes' certified perturbation sizes via randomized smoothing, then uses this property to focus the attack on nodes that are easier to compromise after graph perturbations.
Results: We apply the framework to existing attacks, and experiments show it significantly improves their performance.

Graph neural networks (GNNs) have achieved state-of-the-art performance in many graph-related tasks such as node classification. However, recent studies show that GNNs are vulnerable to both test-time and training-time attacks that perturb the graph structure. While the existing attack methods have shown promising attack performance, we would like to design an attack framework that can significantly enhance both the existing evasion and poisoning attacks. In particular, our attack framework is inspired by certified robustness. Certified robustness was originally used by defenders to defend against adversarial attacks. We are the first, from the attacker perspective, to leverage its properties to better attack GNNs. Specifically, we first leverage and derive nodes' certified perturbation sizes against evasion and poisoning attacks based on randomized smoothing. A larger certified perturbation size of a node indicates this node is theoretically more robust to graph perturbations. Such a property motivates us to focus more on nodes with smaller certified perturbation sizes, as they are easier to be attacked after graph perturbations. Accordingly, we design a certified robustness inspired attack loss, when incorporated into (any) existing attacks, produces our certified robustness inspired attack framework. We apply our attack framework to the existing attacks and results show it can significantly enhance the existing attacks' performance.
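The core reweighting idea — concentrate the attack on nodes whose certified perturbation size is small, since they are theoretically less robust — can be sketched as a softmax over negated sizes. This is an illustrative sketch under that reading of the abstract, not the paper's actual loss; `certified_robustness_weights` and its `temperature` parameter are hypothetical names.

```python
import math

def certified_robustness_weights(cert_sizes, temperature=1.0):
    """Turn per-node certified perturbation sizes into attack weights.
    Nodes with smaller certified sizes get larger weights, so an attack
    loss weighted this way focuses on the theoretically weakest nodes."""
    scores = [math.exp(-s / temperature) for s in cert_sizes]
    total = sum(scores)
    return [s / total for s in scores]
```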

Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust Network by Adversarial Instrumental Variable Regression
Kim, Junho and Lee, Byung-Kwan and Ro, YongMan



Research question: This paper investigates the source of the unexpected vulnerability of adversarially trained networks from a causal perspective.
Motivation: Despite extensive study, the origin of adversarial examples remains unexplained and is argued from various viewpoints; we therefore explore the problem through a causal lens.
Method: We propose adversarial instrumental variable (IV) regression, which estimates the causal relation of adversarial prediction in an unbiased environment to uncover its inherent causal features.
Results: Experiments show the estimated causal features are highly related to correct predictions for adversarial robustness, while worst-case counterfactuals exhibit features that deviate significantly from correct predictions. We further show how to effectively inoculate CAusal FEatures (CAFE) into defense networks to improve adversarial robustness.

The origin of adversarial examples is still inexplicable in research fields, and it arouses arguments from various viewpoints despite comprehensive investigations. In this paper, we propose a way of delving into the unexpected vulnerability in adversarially trained networks from a causal perspective, namely adversarial instrumental variable (IV) regression. By deploying it, we estimate the causal relation of adversarial prediction under an unbiased environment dissociated from unknown confounders. Our approach aims to demystify inherent causal features on adversarial examples by leveraging a zero-sum optimization game between a causal feature estimator (i.e., hypothesis model) and worst-case counterfactuals (i.e., test function) disturbing to find causal features. Through extensive analyses, we demonstrate that the estimated causal features are highly related to the correct prediction for adversarial robustness, and the counterfactuals exhibit extreme features significantly deviating from the correct prediction. In addition, we present how to effectively inoculate CAusal FEatures (CAFE) into defense networks for improving adversarial robustness.

MEDIC: Remove Model Backdoors via Importance Driven Cloning
Xu, Qiuling and Tao, Guanhong and Honorio, Jean and Liu, Yingqi and An, Shengwei and Shen, Guangyu and Cheng, Siyuan and Zhang, Xiangyu



Research question: How can injected backdoors be removed from deep learning models?
Motivation: Existing backdoor removal techniques based on fine-tuning, knowledge distillation, and neuron pruning leave room for improvement; we prove that cloning can be more effective at removing backdoors than fine-tuning.
Method: We develop a novel method that removes backdoors by cloning the benign behaviors of a trojaned model into a new model of the same structure, guided by the activations of important neurons.
Results: Experiments show the technique effectively removes nine different types of backdoors with minor benign accuracy degradation, outperforming state-of-the-art backdoor removal techniques.

We develop a novel method to remove injected backdoors in deep learning models. It works by cloning the benign behaviors of a trojaned model to a new model of the same structure. It trains the clone model from scratch on a very small subset of samples and aims to minimize a cloning loss that denotes the differences between the activations of important neurons across the two models. The set of important neurons varies for each input, depending on their magnitude of activations and their impact on the classification result. We theoretically show our method can better recover benign functions of the backdoor model. Meanwhile, we prove our method can be more effective in removing backdoors compared with fine-tuning. Our experiments show that our technique can effectively remove nine different types of backdoors with minor benign accuracy degradation, outperforming the state-of-the-art backdoor removal techniques that are based on fine-tuning, knowledge distillation, and neuron pruning.

Transferable Adversarial Attacks on Vision Transformers With Token Gradient Regularization
Zhang, Jianping and Huang, Yizhan and Wu, Weibin and Lyu, Michael R.



Research question: Vision transformers (ViTs) perform well on computer vision tasks but are vulnerable to adversarial samples.
Motivation: Transfer-based attacks use a local model to generate adversarial samples and transfer them directly to attack a black-box target model; their efficiency poses a serious security threat to ViT-based applications.
Method: We propose Token Gradient Regularization (TGR), which, exploiting the structural characteristics of ViTs, reduces the variance of the back-propagated gradient in each internal block in a token-wise manner and uses the regularized gradient to generate adversarial samples.
Results: Attack experiments on both ViTs and CNNs demonstrate the superiority of TGR; compared with state-of-the-art transfer-based attacks, it improves performance by 8.8% on average.

Vision transformers (ViTs) have been successfully deployed in a variety of computer vision tasks, but they are still vulnerable to adversarial samples. Transfer-based attacks use a local model to generate adversarial samples and directly transfer them to attack a target black-box model. The high efficiency of transfer-based attacks makes it a severe security threat to ViT-based applications. Therefore, it is vital to design effective transfer-based attacks to identify the deficiencies of ViTs beforehand in security-sensitive scenarios. Existing efforts generally focus on regularizing the input gradients to stabilize the updated direction of adversarial samples. However, the variance of the back-propagated gradients in intermediate blocks of ViTs may still be large, which may make the generated adversarial samples focus on some model-specific features and get stuck in poor local optima. To overcome the shortcomings of existing approaches, we propose the Token Gradient Regularization (TGR) method. According to the structural characteristics of ViTs, TGR reduces the variance of the back-propagated gradient in each internal block of ViTs in a token-wise manner and utilizes the regularized gradient to generate adversarial samples. Extensive experiments on attacking both ViTs and CNNs confirm the superiority of our approach. Notably, compared to the state-of-the-art transfer-based attacks, our TGR offers a performance improvement of 8.8 % on average.
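One way to picture token-wise gradient regularization is suppressing the tokens whose back-propagated gradients are extreme outliers, which shrinks the token-wise gradient variance. The numpy sketch below illustrates that idea on a standalone gradient array; it is a simplification, not the paper's exact rule, which operates inside each ViT block during attack optimization.

```python
import numpy as np

def token_gradient_regularization(token_grads: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Zero the gradients of the top_k tokens with the largest gradient
    magnitudes, reducing the variance across token gradients.

    token_grads: array of shape (num_tokens, dim), one gradient row per token.
    """
    norms = np.linalg.norm(token_grads, axis=1)   # per-token gradient magnitude
    extreme = np.argsort(norms)[-top_k:]          # indices of the most extreme tokens
    regularized = token_grads.copy()
    regularized[extreme] = 0.0                    # suppress extreme-gradient tokens
    return regularized
```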

Architectural Backdoors in Neural Networks
Bober-Irizar, Mikel and Shumailov, Ilia and Zhao, Yiren and Mullins, Robert and Papernot, Nicolas



Research question: Machine learning is vulnerable to adversarial manipulation; in particular, attackers can control model behavior at training time by manipulating data and sampling procedures.
Motivation: We introduce a new class of backdoor attacks that hide inside the model architecture itself, i.e., in the inductive bias of the functions used for training. Such attacks are simple to mount, for instance by publishing a backdoored open-source model architecture that others reuse unknowingly.
Method: We propose an architecture-based backdoor attack that works by establishing a connection between the input and the output, and we describe possible protections against it.
Results: We evaluate the attack on computer vision benchmarks of different scales and find that the underlying vulnerability is pervasive across a variety of common training settings.

Machine learning is vulnerable to adversarial manipulation. Previous literature has demonstrated that at the training stage attackers can manipulate data (Gu et al.) and data sampling procedures (Shumailov et al.) to control model behaviour. A common attack goal is to plant backdoors i.e. force the victim model to learn to recognise a trigger known only by the adversary. In this paper, we introduce a new class of backdoor attacks that hide inside model architectures i.e. in the inductive bias of the functions used to train. These backdoors are simple to implement, for instance by publishing open-source code for a backdoored model architecture that others will reuse unknowingly. We demonstrate that model architectural backdoors represent a real threat and, unlike other approaches, can survive a complete re-training from scratch. We formalise the main construction principles behind architectural backdoors, such as a connection between the input and the output, and describe some possible protections against them. We evaluate our attacks on computer vision benchmarks of different scales and demonstrate the underlying vulnerability is pervasive in a variety of common training settings.

3D Semantic Segmentation in the Wild: Learning Generalized Models for Adverse-Condition Point Clouds
Xiao, Aoran and Huang, Jiaxing and Xuan, Weihao and Ren, Ruijie and Liu, Kangcheng and Guan, Dayan and El Saddik, Abdulmotaleb and Lu, Shijian and Xing, Eric P.



Research question: How to learn generalized 3D semantic segmentation models for autonomous driving under various adverse weather conditions.
Motivation: Existing benchmarks are dominated by point clouds captured in normal weather, neglecting the importance of learning generalized 3D semantic segmentation models for adverse weather.
Method: We introduce SemanticSTF, an adverse-weather point cloud dataset with dense point-level annotations that enables the study of 3D semantic segmentation under various adverse conditions. We investigate generalized 3D semantic segmentation via two tasks: 1) domain adaptive 3D semantic segmentation that adapts from normal-weather to adverse-weather data; 2) domain generalized 3D semantic segmentation that learns a generalizable model from normal-weather data alone.
Results: Our studies reveal the challenges existing 3D semantic segmentation methods face on adverse-weather data, showing the great value of SemanticSTF for this meaningful research direction. We further design a domain randomization technique that alternately randomizes the geometric styles of point clouds and aggregates their encoded embeddings, yielding a generalizable model that effectively improves 3D semantic segmentation under various adverse weather.

Robust point cloud parsing under all-weather conditions is crucial to level-5 autonomy in autonomous driving. However, how to learn a universal 3D semantic segmentation (3DSS) model is largely neglected as most existing benchmarks are dominated by point clouds captured under normal weather. We introduce SemanticSTF, an adverse-weather point cloud dataset that provides dense point-level annotations and allows studying 3DSS under various adverse weather conditions. We investigate universal 3DSS modeling with two tasks: 1) domain adaptive 3DSS that adapts from normal-weather data to adverse-weather data; 2) domain generalized 3DSS that learns a generalizable model from normal-weather data. Our studies reveal the challenges existing 3DSS methods face when encountering adverse-weather data, showing the great value of SemanticSTF in steering future work along this meaningful research direction. In addition, we design a domain randomization technique that alternately randomizes the geometry styles of point clouds and aggregates their encoded embeddings, ultimately leading to a generalizable model that effectively improves 3DSS under various adverse weather. The SemanticSTF and related codes are available at https://github.com/xiaoaoran/SemanticSTF.

Robust Single Image Reflection Removal Against Adversarial Attacks
Song, Zhenbo and Zhang, Zhenyuan and Zhang, Kaihao and Luo, Wenhan and Fan, Zhaoxin and Ren, Wenqi and Lu, Jianfeng



Research question: This paper addresses the robustness of deep single-image reflection removal (SIRR) under adversarial attacks.
Motivation: Current deep-learning-based SIRR methods degrade significantly under subtle distortions and perturbations of the input image.
Method: We first conduct diverse adversarial attacks tailored to the SIRR problem, then propose a robust SIRR model that integrates a cross-scale attention module, a multi-scale fusion module, and an adversarial image discriminator. The multi-scale mechanism narrows the gap between features of clean and adversarial images, while the discriminator adaptively distinguishes clean from noisy inputs, yielding reliable robustness.
Results: Extensive experiments on the Nature, SIR^2, and Real datasets show that our model remarkably improves the robustness of SIRR across disparate scenes.

This paper addresses the problem of robust deep single-image reflection removal (SIRR) against adversarial attacks. Current deep learning based SIRR methods have shown significant performance degradation due to unnoticeable distortions and perturbations on input images. For a comprehensive robustness study, we first conduct diverse adversarial attacks specifically for the SIRR problem, i.e. towards different attacking targets and regions. Then we propose a robust SIRR model, which integrates the cross-scale attention module, the multi-scale fusion module, and the adversarial image discriminator. By exploiting the multi-scale mechanism, the model narrows the gap between features from clean and adversarial images. The image discriminator adaptively distinguishes clean or noisy inputs, and thus further gains reliable robustness. Extensive experiments on Nature, SIR^2, and Real datasets demonstrate that our model remarkably improves the robustness of SIRR across disparate scenes.

TrojDiff: Trojan Attacks on Diffusion Models With Diverse Targets
Chen, Weixin and Song, Dawn and Li, Bo



Research question: Diffusion models have achieved great success across tasks, but the trustworthiness of their training data is hard to control or audit. We explore their vulnerability to potential training-data manipulation and ask: how hard is it to mount a Trojan attack on a well-trained diffusion model, and what adversarial targets can such an attack achieve?
Motivation: Because the success of diffusion models hinges on large-scale training data collected from diverse sources, the trustworthiness of the collected data is difficult to control or audit.
Method: We propose TrojDiff, an effective Trojan attack on diffusion models that optimizes the Trojan diffusion and generative processes during training. We design novel transitions that diffuse adversarial targets into a biased Gaussian distribution, and a new parameterization of the Trojan generative process that provides an effective training objective for the attack.
Results: We evaluate TrojDiff on CIFAR-10 and CelebA against both DDPM and DDIM diffusion models. TrojDiff consistently achieves high attack performance under different adversarial targets and trigger types while preserving performance in benign settings.

Diffusion models have achieved great success in a range of tasks, such as image synthesis and molecule design. As such successes hinge on large-scale training data collected from diverse sources, the trustworthiness of these collected data is hard to control or audit. In this work, we aim to explore the vulnerabilities of diffusion models under potential training data manipulations and try to answer: How hard is it to perform Trojan attacks on well-trained diffusion models? What are the adversarial targets that such Trojan attacks can achieve? To answer these questions, we propose an effective Trojan attack against diffusion models, TrojDiff, which optimizes the Trojan diffusion and generative processes during training. In particular, we design novel transitions during the Trojan diffusion process to diffuse adversarial targets into a biased Gaussian distribution and propose a new parameterization of the Trojan generative process that leads to an effective training objective for the attack. In addition, we consider three types of adversarial targets: the Trojaned diffusion models will always output instances belonging to a certain class from the in-domain distribution (In-D2D attack), out-of-domain distribution (Out-D2D-attack), and one specific instance (D2I attack). We evaluate TrojDiff on CIFAR-10 and CelebA datasets against both DDPM and DDIM diffusion models. We show that TrojDiff always achieves high attack performance under different adversarial targets using different types of triggers, while the performance in benign environments is preserved. The code is available at https://github.com/chenweixin107/TrojDiff.

Cooperation or Competition: Avoiding Player Domination for Multi-Target Robustness via Adaptive Budgets
Wang, Yimu and Zhang, Dinghuai and Wu, Yihan and Huang, Heng and Zhang, Hongyang



Research question: Despite remarkable progress, deep learning is susceptible to adversarial attacks. Most existing defenses resist only a single attack type, while recent work moves toward defending against multiple attacks.
Motivation: To understand multi-target robustness, we view the problem as a bargaining game in which different players (adversaries) negotiate a joint direction of parameter updates. We identify a phenomenon in this game called player domination and show that it causes some existing max-based approaches, such as MAX and MSD, to fail to converge.
Method: Based on our theoretical results, we design a novel framework that adjusts the budgets of different adversaries to avoid player domination.
Results: Experiments on two benchmarks show that applying the proposed framework to existing approaches significantly improves multi-target robustness.

Despite incredible advances, deep learning has been shown to be susceptible to adversarial attacks. Numerous approaches were proposed to train robust networks both empirically and certifiably. However, most of them defend against only a single type of attack, while recent work steps forward at defending against multiple attacks. In this paper, to understand multi-target robustness, we view this problem as a bargaining game in which different players (adversaries) negotiate to reach an agreement on a joint direction of parameter updating. We identify a phenomenon named player domination in the bargaining game, and show that with this phenomenon, some of the existing max-based approaches such as MAX and MSD do not converge. Based on our theoretical results, we design a novel framework that adjusts the budgets of different adversaries to avoid player domination. Experiments on two benchmarks show that applying the proposed framework to the existing approaches significantly improves multi-target robustness.
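A toy version of the adaptive-budget idea: when one adversary dominates the joint update, shrink its perturbation budget and grow the others'. This sketch is only illustrative; the paper derives its update rule from a bargaining-game analysis, and `adjust_budgets`, its domination criterion (largest loss), and `step` are hypothetical choices.

```python
def adjust_budgets(budgets, losses, step=0.1):
    """Shrink the budget of the currently dominating adversary (here taken to
    be the one with the largest loss) and grow the others, so that no single
    player dominates the joint direction of parameter updates."""
    dominant = max(range(len(losses)), key=lambda i: losses[i])
    return [b * (1 - step) if i == dominant else b * (1 + step)
            for i, b in enumerate(budgets)]
```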

Quality-Aware Pre-Trained Models for Blind Image Quality Assessment
Zhao, Kai and Yuan, Kun and Sun, Ming and Li, Mading and Wen, Xing



Research question: This paper addresses the problem that the scarcity of labeled data keeps deep-learning-based blind image quality assessment (BIQA) from reaching its full potential.
Motivation: The paucity of labeled data restrains the application of deep learning to BIQA.
Method: We propose a pretext task customized for BIQA, trained in a self-supervised manner so that the model can learn representations from far more data, together with a quality-aware contrastive loss to constrain the learning process.
Results: Experiments show the method obtains remarkable improvements on popular BIQA datasets.

Blind image quality assessment (BIQA) aims to automatically evaluate the perceived quality of a single image, whose performance has been improved by deep learning-based methods in recent years. However, the paucity of labeled data somewhat restrains deep learning-based BIQA methods from unleashing their full potential. In this paper, we propose to solve the problem by a pretext task customized for BIQA in a self-supervised learning manner, which enables learning representations from orders of magnitude more data. To constrain the learning process, we propose a quality-aware contrastive loss based on a simple assumption: the quality of patches from a distorted image should be similar, but vary from patches from the same image with different degradations and patches from different images. Further, we improve the existing degradation process and form a degradation space with the size of roughly 2x10^7. After pre-trained on ImageNet using our method, models are more sensitive to image quality and perform significantly better on downstream BIQA tasks. Experimental results show that our method obtains remarkable improvements on popular BIQA datasets.
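The quality-aware contrastive loss rests on the stated assumption that two patches from the same distorted image share similar quality, while patches under different degradations or from different images do not. A minimal InfoNCE-style numpy sketch under that assumption (illustrative only, not the paper's implementation; `tau` is an assumed temperature):

```python
import numpy as np

def quality_aware_contrastive_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss: anchor and positive are embeddings of two patches
    from the same distorted image; negatives come from other images or the
    same image under different degradations."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                      # minimized when the positive wins
```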

Privacy-Preserving Adversarial Facial Features
Wang, Zhibo and Wang, He and Jin, Shuaifan and Zhang, Wenwen and Hu, Jiahui and Wang, Yan and Sun, Peng and Yuan, Wei and Liu, Kaixin and Ren, Kui



Research question: How to protect face privacy by extracting compact, discriminative facial features while preventing reconstruction networks from exploiting those features to recover the original face.
Motivation: Face recognition service providers protect privacy by extracting compact, discriminative facial features from images and storing them for real-time recognition, yet such features can still be exploited to reconstruct the original face via a reconstruction network. Although several privacy-preserving methods have been proposed, they enhance face privacy at the cost of accuracy.
Method: We propose AdvFace, an adversarial-features-based face privacy protection approach that generates privacy-preserving adversarial features, disrupting the mapping from adversarial features to facial images to defend against reconstruction attacks. To this end, a shadow model simulating the attacker's behavior captures the feature-to-image mapping and generates adversarial latent noise to disrupt it. The adversarial features, rather than the original features, are stored in the server's database, so leaked features do not expose facial information. Moreover, AdvFace requires no changes to the face recognition network and can be deployed as a privacy-enhancing plugin in existing face recognition systems.
Results: Extensive experiments show that AdvFace outperforms state-of-the-art face privacy-preserving methods in defending against reconstruction attacks while maintaining face recognition accuracy.

Face recognition service providers protect face privacy by extracting compact and discriminative facial features (representations) from images, and storing the facial features for real-time recognition. However, such features can still be exploited to recover the appearance of the original face by building a reconstruction network. Although several privacy-preserving methods have been proposed, the enhancement of face privacy protection is at the expense of accuracy degradation. In this paper, we propose an adversarial features-based face privacy protection (AdvFace) approach to generate privacy-preserving adversarial features, which can disrupt the mapping from adversarial features to facial images to defend against reconstruction attacks. To this end, we design a shadow model which simulates the attackers' behavior to capture the mapping function from facial features to images and generate adversarial latent noise to disrupt the mapping. The adversarial features rather than the original features are stored in the server's database to prevent leaked features from exposing facial information. Moreover, the AdvFace requires no changes to the face recognition network and can be implemented as a privacy-enhancing plugin in deployed face recognition systems. Extensive experimental results demonstrate that AdvFace outperforms the state-of-the-art face privacy-preserving methods in defending against reconstruction attacks while maintaining face recognition accuracy.

Physics-Guided ISO-Dependent Sensor Noise Modeling for Extreme Low-Light Photography
Cao, Yue and Liu, Ming and Liu, Shuai and Wang, Xiaotao and Lei, Lei and Zuo, Wangmeng



Research question: Although deep neural networks achieve astonishing performance on many vision tasks, existing learning-based methods remain far inferior to physical models for sensor noise modeling in extreme low light.
Motivation: To tap the potential of learning-based sensor noise modeling, we investigate noise formation in a typical imaging process and propose a novel physics-guided, ISO-dependent sensor noise modeling approach.
Method: We build a normalizing-flow-based framework to represent the complex noise characteristics of CMOS camera sensors. Each component of the noise model targets a specific type of noise under the guidance of physical models. We also account for the ISO dependence of the noise, which existing learning-based methods do not fully consider.
Results: Benefiting from its flexible structure and accurate modeling capability, the proposed noise model achieves better denoising performance in extreme low-light scenes than existing methods. The source code and collected dataset will be publicly released.

Although deep neural networks have achieved astonishing performance in many vision tasks, existing learning-based methods are far inferior to the physical model-based solutions in extreme low-light sensor noise modeling. To tap the potential of learning-based sensor noise modeling, we investigate the noise formation in a typical imaging process and propose a novel physics-guided ISO-dependent sensor noise modeling approach. Specifically, we build a normalizing flow-based framework to represent the complex noise characteristics of CMOS camera sensors. Each component of the noise model is dedicated to a particular kind of noise under the guidance of physical models. Moreover, we take into consideration the ISO dependence in the noise model, which is not completely considered by the existing learning-based methods. For training the proposed noise model, a new dataset is further collected with paired noisy-clean images, as well as flat-field and bias frames covering a wide range of ISO settings. Compared to existing methods, the proposed noise model benefits from the flexible structure and accurate modeling capabilities, which can help achieve better denoising performance in extreme low-light scenes. The source code and collected dataset will be publicly available.

CAP: Robust Point Cloud Classification via Semantic and Structural Modeling
Ding, Daizong and Jiang, Erling and Huang, Yuanmin and Zhang, Mi and Li, Wenxuan and Yang, Min



Research question: How to improve the robustness of 3D point cloud classification models against adversarial attacks.
Motivation: Deep neural networks have succeeded at 3D point cloud classification, but this success raises adversarial-attack concerns that severely harm real-world applications.
Method: We design a defense framework based on attention-based pooling and dynamic contrastive learning that improves the robustness of existing classification models.
Results: Extensive experiments on two datasets and three classification models demonstrate strong defense against various attacks; for example, the attack success rate against PointNet on ModelNet40 drops from 70.2% to 2.7%.

Recently, deep neural networks have shown great success on 3D point cloud classification tasks, which simultaneously raises the concern of adversarial attacks that cause severe damage to real-world applications. Moreover, defending against adversarial examples in point cloud data is extremely difficult due to the emergence of various attack strategies. In this work, based on the insight that the adversarial examples in this task still preserve the same semantic and structural information as the original input, we design a novel defense framework for improving the robustness of existing classification models, which consists of two main modules: the attention-based pooling and the dynamic contrastive learning. In addition, we also develop an algorithm to theoretically certify the robustness of the proposed framework. Extensive empirical results on two datasets and three classification models show the robustness of our approach against various attacks, e.g., the averaged attack success rate of PointNet decreases from 70.2% to 2.7% on the ModelNet40 dataset under 9 common attacks.

StyLess: Boosting the Transferability of Adversarial Examples
Liang, Kaisheng and Xiao, Bin



Research question: Adversarial attacks mislead deep neural networks by adding imperceptible perturbations to benign examples; their transferability lets adversarial examples attack black-box DNNs with unknown architectures or parameters, threatening many real-world applications.
Motivation: Existing transferable attacks do not distinguish style from content features during optimization, which limits their transferability.
Method: We propose a new attack method called style-less perturbation (StyLess). Specifically, instead of using a vanilla network as the surrogate model, we use stylized networks that encode different style features by perturbing adaptive instance normalization.
Results: Experiments show that the method significantly improves the transferability of adversarial examples. Moreover, it is generic and can outperform state-of-the-art transferable attacks when combined with other attack techniques.

Adversarial attacks can mislead deep neural networks (DNNs) by adding imperceptible perturbations to benign examples. The attack transferability enables adversarial examples to attack black-box DNNs with unknown architectures or parameters, which poses threats to many real-world applications. We find that existing transferable attacks do not distinguish between style and content features during optimization, limiting their attack transferability. To improve attack transferability, we propose a novel attack method called style-less perturbation (StyLess). Specifically, instead of using a vanilla network as the surrogate model, we advocate using stylized networks, which encode different style features by perturbing an adaptive instance normalization. Our method can prevent adversarial examples from using non-robust style features and help generate transferable perturbations. Comprehensive experiments show that our method can significantly improve the transferability of adversarial examples. Furthermore, our approach is generic and can outperform state-of-the-art transferable attacks when combined with other attack techniques.
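The mechanism StyLess perturbs, adaptive instance normalization (AdaIN), re-scales normalized content features with style statistics; jittering those statistics yields surrogates that vary in style while keeping content. The numpy sketch below shows AdaIN with perturbed style statistics; it is a sketch of the mechanism only, not the paper's training code, and `eps` is an assumed jitter scale.

```python
import numpy as np

def perturbed_adain(content, style_mean, style_std, eps=0.1, rng=None):
    """Adaptive instance normalization with jittered style statistics:
    normalize the content features, then re-scale/shift with a perturbed
    style mean and std, producing a differently-styled feature map."""
    rng = rng or np.random.default_rng(0)
    mu, sigma = content.mean(), content.std() + 1e-8
    normalized = (content - mu) / sigma
    s_std = style_std * (1 + eps * rng.standard_normal())
    s_mean = style_mean + eps * rng.standard_normal()
    return s_std * normalized + s_mean
```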

Non-Contrastive Unsupervised Learning of Physiological Signals From Video
Speth, Jeremy and Vance, Nathan and Flynn, Patrick and Czajka, Adam



Research question: How to extract subtle periodic signals such as blood volume pulse and respiration from RGB video, enabling low-cost non-contact health monitoring?
Motivation: Existing remote pulse estimation (rPPG) methods are driven by deep learning but depend on labels from contact PPG sensors for training and evaluation.
Method: We propose a non-contrastive unsupervised learning framework for signal regression that reduces the need for labeled video data. With minimal assumptions of periodicity and finite bandwidth, it discovers the blood volume pulse directly from unlabeled video.
Results: Encouraging sparse power spectra within normal physiological bandlimits, together with variance over batches of power spectra, suffices to learn visual features of periodic signals. We run the first experiments training robust pulse-rate estimators on unlabeled video data not specifically created for rPPG. Given its limited inductive biases and impressive empirical results, the approach is theoretically capable of discovering other periodic signals from video, enabling multiple physiological measurements without ground-truth signals.

Subtle periodic signals such as blood volume pulse and respiration can be extracted from RGB video, enabling noncontact health monitoring at low cost. Advancements in remote pulse estimation -- or remote photoplethysmography (rPPG) -- are currently driven by deep learning solutions. However, modern approaches are trained and evaluated on benchmark datasets with ground truth from contact-PPG sensors. We present the first non-contrastive unsupervised learning framework for signal regression to mitigate the need for labelled video data. With minimal assumptions of periodicity and finite bandwidth, our approach discovers the blood volume pulse directly from unlabelled videos. We find that encouraging sparse power spectra within normal physiological bandlimits and variance over batches of power spectra is sufficient for learning visual features of periodic signals. We perform the first experiments utilizing unlabelled video data not specifically created for rPPG to train robust pulse rate estimators. Given the limited inductive biases and impressive empirical results, the approach is theoretically capable of discovering other periodic signals from video, enabling multiple physiological measurements without the need for ground truth signals.
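The two unsupervised criteria — energy confined to the physiological band, and a sparse (peaked) in-band spectrum — can be expressed directly on the FFT power spectrum. Below is a simplified numpy sketch of such a loss (the paper's full objective also includes a variance term over batches of spectra, omitted here); the band edges and function name are illustrative.

```python
import numpy as np

def bandlimited_sparsity_loss(signal, fs=30.0, low_hz=0.66, high_hz=3.0):
    """Penalize spectral energy outside the physiological band and reward a
    peaked in-band spectrum. Low values indicate a clean periodic signal
    whose frequency lies within [low_hz, high_hz]."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    in_band = (freqs >= low_hz) & (freqs <= high_hz)
    total = spectrum.sum() + 1e-8
    out_of_band = spectrum[~in_band].sum() / total           # bandwidth penalty
    band = spectrum[in_band] / (spectrum[in_band].sum() + 1e-8)
    peakiness = 1.0 - band.max()        # 0 when all in-band energy sits in one bin
    return out_of_band + peakiness
```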

Adversarially Robust Neural Architecture Search for Graph Neural Networks
Xie, Beini and Chang, Heng and Zhang, Ziwei and Wang, Xin and Wang, Daixin and Zhang, Zhiqiang and Ying, Rex and Zhu, Wenwu



Research question: How to improve the robustness of graph neural networks (GNNs) under adversarial attacks.
Motivation: Existing defenses neither guarantee performance on new data/tasks or under adversarial attacks nor offer insight into GNN robustness from an architectural perspective.
Method: We propose G-RNA, a novel robust neural architecture search framework for GNNs. We design a robust message-passing search space by adding graph-structure mask operations, comprising various defensive operation candidates, which allows us to search for defensive GNNs. We further define a robustness metric to guide the search procedure and filter out robust architectures.
Results: Experiments show that G-RNA outperforms manually designed robust GNNs and vanilla graph NAS baselines on benchmark datasets by 12.1% to 23.4% under adversarial attacks.

Graph Neural Networks (GNNs) obtain tremendous success in modeling relational data. Still, they are prone to adversarial attacks, which are massive threats to applying GNNs to risk-sensitive domains. Existing defensive methods neither guarantee performance facing new data/tasks or adversarial attacks nor provide insights to understand GNN robustness from an architectural perspective. Neural Architecture Search (NAS) has the potential to solve this problem by automating GNN architecture designs. Nevertheless, current graph NAS approaches lack robust design and are vulnerable to adversarial attacks. To tackle these challenges, we propose a novel Robust Neural Architecture search framework for GNNs (G-RNA). Specifically, we design a robust search space for the message-passing mechanism by adding graph structure mask operations into the search space, which comprises various defensive operation candidates and allows us to search for defensive GNNs. Furthermore, we define a robustness metric to guide the search procedure, which helps to filter robust architectures. In this way, G-RNA helps understand GNN robustness from an architectural perspective and effectively searches for optimal adversarial robust GNNs. Extensive experimental results on benchmark datasets show that G-RNA significantly outperforms manually designed robust GNNs and vanilla graph NAS baselines by 12.1% to 23.4% under adversarial attacks.

DR2: Diffusion-Based Robust Degradation Remover for Blind Face Restoration
Wang, Zhixin and Zhang, Ziying and Zhang, Xiaoyun and Zheng, Huangjie and Zhou, Mingyuan and Zhang, Ya and Wang, Yanfeng



Research question: This paper addresses the mismatch in blind face restoration between training data and real-world conditions, i.e., the gap between the assumed degradation model and actual degradation.
Motivation: In blind face restoration, the gap between the degradation model used to synthesize training data and real-world degradation hurts restoration quality, with artifacts often appearing in the output. However, including every type of degradation in the training data to cover real-world cases is expensive and infeasible.
Method: We propose the Diffusion-based Robust Degradation Remover (DR2), which first transforms the degraded image into a coarse but degradation-invariant prediction, then restores the coarse prediction to a high-quality image with an enhancement module. Leveraging a well-performing denoising diffusion probabilistic model, DR2 diffuses the input image to a noisy state where various types of degradation give way to Gaussian noise, and captures semantic information through iterative denoising steps.
Results: Experiments show DR2 is robust to common degradations (e.g., blur, resizing, noise, and compression) and compatible with different enhancement-module designs. It outperforms state-of-the-art methods on heavily degraded synthetic and real-world datasets.

Blind face restoration usually synthesizes degraded low-quality data with a pre-defined degradation model for training, while more complex cases could happen in the real world. This gap between the assumed and actual degradation hurts the restoration performance where artifacts are often observed in the output. However, it is expensive and infeasible to include every type of degradation to cover real-world cases in the training data. To tackle this robustness issue, we propose Diffusion-based Robust Degradation Remover (DR2) to first transform the degraded image to a coarse but degradation-invariant prediction, then employ an enhancement module to restore the coarse prediction to a high-quality image. By leveraging a well-performing denoising diffusion probabilistic model, our DR2 diffuses input images to a noisy status where various types of degradation give way to Gaussian noise, and then captures semantic information through iterative denoising steps. As a result, DR2 is robust against common degradation (e.g. blur, resize, noise and compression) and compatible with different designs of enhancement modules. Experiments in various settings show that our framework outperforms state-of-the-art methods on heavily degraded synthetic and real-world datasets.
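The property DR2 exploits is that the DDPM forward process q(x_t | x_0) eventually drowns degradation-specific detail in Gaussian noise: at a large enough step t, differently degraded versions of the same face become nearly indistinguishable. A minimal numpy sketch of that forward step (standard DDPM formula; `alpha_bar` is the cumulative product of (1 - beta), and the fixed seed is only for reproducibility of the illustration):

```python
import numpy as np

def diffuse_to_invariance(image, t, alpha_bar):
    """DDPM forward diffusion: x_t = sqrt(alpha_bar_t) * x_0
    + sqrt(1 - alpha_bar_t) * noise. As alpha_bar_t shrinks, the noise term
    dominates and degradation-specific differences are washed out."""
    rng = np.random.default_rng(0)
    noise = rng.standard_normal(image.shape)
    return np.sqrt(alpha_bar[t]) * image + np.sqrt(1.0 - alpha_bar[t]) * noise
```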

T-SEA: Transfer-Based Self-Ensemble Attack on Object Detection
Huang, Hao and Chen, Ziyan and Chen, Huanran and Wang, Yongtao and Zhang, Kevin



Research question: How to improve the transferability of black-box attacks while reducing time and resource costs.
Motivation: Existing transfer-based black-box attacks rely on ensembling multiple models, which is both time- and resource-intensive.
Method: We propose a transfer-based black-box attack that uses only a single model, adjusting its training strategies and exploiting the limited available information via self-ensembling to prevent the attack patch from overfitting.
Results: Experiments show the method greatly improves the black-box transferability of the attack patch on multiple mainstream detectors while also boosting white-box performance.

Compared to query-based black-box attacks, transfer-based black-box attacks do not require any information of the attacked models, which ensures their secrecy. However, most existing transfer-based approaches rely on ensembling multiple models to boost the attack transferability, which is time- and resource-intensive, not to mention the difficulty of obtaining diverse models on the same task. To address this limitation, in this work, we focus on the single-model transfer-based black-box attack on object detection, utilizing only one model to achieve a high-transferability adversarial attack on multiple black-box detectors. Specifically, we first make observations on the patch optimization process of the existing method and propose an enhanced attack framework by slightly adjusting its training strategies. Then, we analogize patch optimization with regular model optimization, proposing a series of self-ensemble approaches on the input data, the attacked model, and the adversarial patch to efficiently make use of the limited information and prevent the patch from overfitting. The experimental results show that the proposed framework can be applied with multiple classical base attack methods (e.g., PGD and MIM) to greatly improve the black-box transferability of the well-optimized patch on multiple mainstream detectors, meanwhile boosting white-box performance.

Dual-Bridging With Adversarial Noise Generation for Domain Adaptive rPPG Estimation
Du, Jingda and Liu, Si-Qi and Zhang, Bochao and Yuen, Pong C.



Research question: How to improve the generalization of remote photoplethysmography (rPPG) under unseen noise and distortion.
Motivation: Although the latest deep rPPG methods can handle in-distribution noise from head motion, video compression, etc., they may generalize poorly to target test domains with unseen noise and distortions.
Method: We propose a dual-bridging network that reduces domain discrepancy by aligning intermediate domains and synthesizing target noise in the source domain for better noise reduction. We further propose a novel adversarial noise generation scheme in which the noise generator indirectly competes with the noise reducer, improving the reducer's robustness.
Results: We evaluate the method on three public datasets with different types of interference; comprehensive results across cross-domain scenarios show its effectiveness.

The remote photoplethysmography (rPPG) technique can estimate pulse-related metrics (e.g. heart rate and respiratory rate) from facial videos and has a high potential for health monitoring. The latest deep rPPG methods can model in-distribution noise due to head motion, video compression, etc., and estimate high-quality rPPG signals under similar scenarios. However, deep rPPG models may not generalize well to the target test domain with unseen noise and distortions. In this paper, to improve the generalization ability of rPPG models, we propose a dual-bridging network to reduce the domain discrepancy by aligning intermediate domains and synthesizing the target noise in the source domain for better noise reduction. To comprehensively explore the target domain noise, we propose a novel adversarial noise generation in which the noise generator indirectly competes with the noise reducer. To further improve the robustness of the noise reducer, we propose hard noise pattern mining to encourage the generator to learn hard noise patterns contained in the target domain features. We evaluated the proposed method on three public datasets with different types of interferences. Under different cross-domain scenarios, the comprehensive results show the effectiveness of our method.

Trade-Off Between Robustness and Accuracy of Vision Transformers
Li, Yanxi and Xu, Chang



Research question: Deep neural networks excel at computer vision tasks but are sensitive to small input perturbations, and there is a trade-off between natural accuracy and robustness to such perturbations.
Motivation: Although Vision Transformers (ViTs) have been shown to be inherently robust to various kinds of perturbations, the above trade-off still exists for them.
Method: A method named Trade-off between Robustness and Accuracy of Vision Transformers (TORA-ViTs) is proposed, which extracts predictive and robust features through a pair of accuracy and robustness adapters and adjusts the trade-off through a gated fusion module.
Results: Experiments on ImageNet show that TORA-ViTs can efficiently improve the robustness of naturally pretrained ViTs while maintaining competitive natural accuracy.

Although deep neural networks (DNNs) have shown great successes in computer vision tasks, they are vulnerable to perturbations on inputs, and there exists a trade-off between the natural accuracy and robustness to such perturbations, which is mainly caused by the existence of robust non-predictive features and non-robust predictive features. Recent empirical analyses find Vision Transformers (ViTs) are inherently robust to various kinds of perturbations, but the aforementioned trade-off still exists for them. In this work, we propose Trade-off between Robustness and Accuracy of Vision Transformers (TORA-ViTs), which aims to efficiently transfer ViT models pretrained on natural tasks for both accuracy and robustness. TORA-ViTs consist of two major components, including a pair of accuracy and robustness adapters to extract predictive and robust features, respectively, and a gated fusion module to adjust the trade-off. The gated fusion module takes outputs of a pretrained ViT block as queries and outputs of our adapters as keys and values, and tokens from different adapters at different spatial locations are compared with each other to generate attention scores for a balanced mixing of predictive and robust features. Experiments on ImageNet with various robust benchmarks show that our TORA-ViTs can efficiently improve the robustness of naturally pretrained ViTs while maintaining competitive natural accuracy. Our most balanced setting (TORA-ViTs with lambda = 0.5) can maintain 83.7% accuracy on clean ImageNet and reach 54.7% and 38.0% accuracy under FGSM and PGD white-box attacks, respectively. In terms of various ImageNet variants, it can reach 39.2% and 56.3% accuracy on ImageNet-A and ImageNet-R and reach 34.4% mCE on ImageNet-C.

Rate Gradient Approximation Attack Threats Deep Spiking Neural Networks
Bu, Tong and Ding, Jianhao and Hao, Zecheng and Yu, Zhaofei



Research question: The robustness of deep spiking neural networks (SNNs) has not yet been fully uncovered.
Motivation: SNNs have attracted significant attention due to their energy-efficient properties and potential application on neuromorphic hardware.
Method: Based on the rate-coding property of SNNs, a novel SNN-specific attack method named Rate Gradient Approximation Attack (RGA) is developed.
Results: Experiments show that the proposed RGA attack is more effective than previous attacks and is less sensitive to neuron hyperparameters. They also show that rate-coded SNNs composed of LIF neurons are not secure, which calls for exploring training methods for SNNs composed of complex neurons and other neuronal codings.

Spiking Neural Networks (SNNs) have attracted significant attention due to their energy-efficient properties and potential application on neuromorphic hardware. State-of-the-art SNNs are typically composed of simple Leaky Integrate-and-Fire (LIF) neurons and have become comparable to ANNs in image classification tasks on large-scale datasets. However, the robustness of these deep SNNs has not yet been fully uncovered. In this paper, we first experimentally observe that layers in these SNNs mostly communicate by rate coding. Based on this rate coding property, we develop a novel rate coding SNN-specified attack method, Rate Gradient Approximation Attack (RGA). We generalize the RGA attack to SNNs composed of LIF neurons with different leaky parameters and input encoding by designing surrogate gradients. In addition, we develop the time-extended enhancement to generate more effective adversarial examples. The experiment results indicate that our proposed RGA attack is more effective than the previous attack and is less sensitive to neuron hyperparameters. We also conclude from the experiment that rate-coded SNN composed of LIF neurons is not secure, which calls for exploring training methods for SNNs composed of complex neurons and other neuronal codings. Code is available at https://github.com/putshua/SNN_attack_RGA
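Attacks on SNNs hinge on surrogate gradients for the non-differentiable spike function: the forward pass fires a hard threshold, while the backward pass substitutes a smooth derivative. A minimal illustration — the sigmoid surrogate and its sharpness `alpha` are common choices in the SNN literature, not necessarily the paper's exact design:

```python
import numpy as np

def spike(v, thresh=1.0):
    # forward pass: Heaviside step, fires when membrane potential crosses threshold
    return (v >= thresh).astype(float)

def spike_surrogate_grad(v, thresh=1.0, alpha=4.0):
    # backward pass: derivative of a sigmoid centered at the threshold,
    # used in place of the Heaviside's zero-almost-everywhere gradient
    s = 1.0 / (1.0 + np.exp(-alpha * (v - thresh)))
    return alpha * s * (1.0 - s)

v = np.array([0.2, 1.0, 1.8])       # toy membrane potentials
out = spike(v)                      # -> [0., 1., 1.]
g = spike_surrogate_grad(v)         # largest exactly at v == thresh
```

An attacker can backpropagate through `spike_surrogate_grad` to craft adversarial inputs even though `spike` itself has no useful gradient.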

Enhancing the Self-Universality for Transferable Targeted Attacks
Wei, Zhipeng and Chen, Jingjing and Wu, Zuxuan and Jiang, Yu-Gang



Research question: How to improve the transferability of transfer-based targeted adversarial attacks without any extra training efforts for auxiliary networks on training data.
Motivation: Highly universal adversarial perturbations tend to be more transferable for targeted attacks, which motivates making the perturbation agnostic to different local regions within one image, termed self-universality.
Method: The Self-Universality (SU) attack is proposed, which introduces a feature similarity loss that encourages the learned perturbation to be universal by maximizing the feature similarity between adversarially perturbed global images and randomly cropped local regions.
Results: Extensive experiments demonstrate that SU achieves high success rates for transfer-based targeted attacks; on the ImageNet-compatible dataset, it yields a 12% improvement over existing state-of-the-art methods.

In this paper, we propose a novel transfer-based targeted attack method that optimizes the adversarial perturbations without any extra training efforts for auxiliary networks on training data. Our new attack method is proposed based on the observation that highly universal adversarial perturbations tend to be more transferable for targeted attacks. Therefore, we propose to make the perturbation to be agnostic to different local regions within one image, which we called as self-universality. Instead of optimizing the perturbations on different images, optimizing on different regions to achieve self-universality can get rid of using extra data. Specifically, we introduce a feature similarity loss that encourages the learned perturbations to be universal by maximizing the feature similarity between adversarial perturbed global images and randomly cropped local regions. With the feature similarity loss, our method makes the features from adversarial perturbations to be more dominant than that of benign images, hence improving targeted transferability. We name the proposed attack method as Self-Universality (SU) attack. Extensive experiments demonstrate that SU can achieve high success rates for transfer-based targeted attacks. On ImageNet-compatible dataset, SU yields an improvement of 12% compared with existing state-of-the-art methods. Code is available at https://github.com/zhipeng-wei/Self-Universality.
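The core of the SU objective is a similarity term between features of the perturbed global image and a random local crop of it. A toy sketch — the channel-mean `features` function is a hypothetical stand-in for a real deep feature extractor, and the crop size is illustrative:

```python
import numpy as np

def features(img):
    # hypothetical stand-in for a deep feature extractor: per-channel means
    return img.mean(axis=(0, 1))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def self_universality_loss(perturbed_img, crop_size, rng):
    h, w = perturbed_img.shape[:2]
    y = rng.integers(0, h - crop_size + 1)
    x = rng.integers(0, w - crop_size + 1)
    crop = perturbed_img[y:y + crop_size, x:x + crop_size]
    # maximize global/local feature agreement -> minimize negative cosine
    return -cosine(features(perturbed_img), features(crop))

rng = np.random.default_rng(0)
img = np.full((16, 16, 3), 0.25)    # toy perturbed image
loss = self_universality_loss(img, crop_size=8, rng=rng)
# for a constant image, global and crop features agree, so loss ≈ -1
```

During the attack this term would be minimized jointly with the targeted classification loss, pushing the perturbation to dominate regardless of which region is visible.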

Randomized Adversarial Training via Taylor Expansion
Jin, Gaojie and Yi, Xinping and Wu, Dengyu and Mu, Ronghui and Huang, Xiaowei



Research question: How to simultaneously improve a neural network's robustness to adversarial examples and its accuracy on clean examples.
Motivation: Current adversarial training methods can effectively improve robustness, but often at the cost of some clean accuracy.
Method: A new adversarial training method is proposed by adding random noise to deterministic weights during training. The method is designed via a Taylor expansion of a small Gaussian noise, and it can flatten the loss landscape and find flat minima.
Results: Experiments show that the method enhances state-of-the-art adversarial training methods, boosting both robustness and clean accuracy.

In recent years, there has been an explosion of research into developing more robust deep neural networks against adversarial examples. Adversarial training appears as one of the most successful methods. To deal with both the robustness against adversarial examples and the accuracy over clean examples, many works develop enhanced adversarial training methods to achieve various trade-offs between them. Leveraging over the studies that smoothed update on weights during training may help find flat minima and improve generalization, we suggest reconciling the robustness-accuracy trade-off from another perspective, i.e., by adding random noise into deterministic weights. The randomized weights enable our design of a novel adversarial training method via Taylor expansion of a small Gaussian noise, and we show that the new adversarial training method can flatten loss landscape and find flat minima. With PGD, CW, and Auto Attacks, an extensive set of experiments demonstrate that our method enhances the state-of-the-art adversarial training methods, boosting both robustness and clean accuracy. The code is available at https://github.com/Alexkael/Randomized-Adversarial-Training.
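The identity behind expanding over small Gaussian weight noise is E[L(w + ε)] ≈ L(w) + (σ²/2)·L''(w) for ε ~ N(0, σ²). For a quadratic toy loss the second-order expansion is exact, which a quick Monte Carlo check confirms (the toy loss and σ below are illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
loss = lambda w: w ** 2                      # toy loss, so L''(w) = 2
w, sigma = 1.5, 0.1

# Monte Carlo estimate of the smoothed loss E[L(w + eps)], eps ~ N(0, sigma^2)
mc = loss(w + sigma * rng.standard_normal(200_000)).mean()

# second-order Taylor approximation: L(w) + 0.5 * sigma^2 * L''(w)
taylor = loss(w) + 0.5 * sigma ** 2 * 2.0    # = 2.26
```

The σ²-weighted curvature term is what penalizes sharp minima, which is why optimizing the smoothed objective tends toward flatter solutions.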

Explaining Image Classifiers With Multiscale Directional Image Representation
Kolek, Stefan and Windesheim, Robert and Andrade-Loarca, Hector and Kutyniok, Gitta and Levie, Ron



Research question: Image classifiers are difficult to interpret and require explanation methods to understand their decisions.
Motivation: Existing mask explanation methods are regularized by smoothness constraints that protect against undesirable fine-grained explanation artifacts, but this limits a mask's ability to separate the fine-detail patterns that are relevant to the classifier from nearby nuisance patterns.
Method: ShearletX is proposed, a novel mask explanation method for image classifiers based on the shearlet transform, which avoids smoothness constraints altogether and replaces them with shearlet sparsity constraints.
Results: ShearletX outperforms previous mask-based explanation methods in a variety of situations and demonstrates that separating fine-detail patterns can explain phenomena that were previously unexplainable.

Image classifiers are known to be difficult to interpret and therefore require explanation methods to understand their decisions. We present ShearletX, a novel mask explanation method for image classifiers based on the shearlet transform -- a multiscale directional image representation. Current mask explanation methods are regularized by smoothness constraints that protect against undesirable fine-grained explanation artifacts. However, the smoothness of a mask limits its ability to separate fine-detail patterns that are relevant for the classifier from nearby nuisance patterns that do not affect the classifier. ShearletX solves this problem by avoiding smoothness regularization altogether, replacing it by shearlet sparsity constraints. The resulting explanations consist of a few edges, textures, and smooth parts of the original image, that are the most relevant for the decision of the classifier. To support our method, we propose a mathematical definition for explanation artifacts and an information theoretic score to evaluate the quality of mask explanations. We demonstrate the superiority of ShearletX over previous mask based explanation methods using these new metrics, and present exemplary situations where separating fine-detail patterns allows explaining phenomena that were not explainable before.

Causally-Aware Intraoperative Imputation for Overall Survival Time Prediction
Li, Xiang and Qian, Xuelin and Liang, Litian and Kong, Lingjie and Dong, Qiaole and Chen, Jiejun and Liu, Dingxia and Yao, Xiuzhong and Fu, Yanwei



Research question: This paper addresses the challenge of predicting overall survival (OS) time in early-stage primary liver cancers, where unobvious image patterns make early OS prediction difficult.
Motivation: To bridge the gap between images and the final OS time, a causal-inference-based system is proposed that leverages intraoperative attributes and the correlations among them as intermediate supervision.
Method: A causal graph is built, and images are trained to estimate the intraoperative attributes for final OS prediction. A novel Causally-aware Intraoperative Imputation Model (CAWIM) is proposed, which sequentially predicts each attribute using its parent nodes in the estimated causal graph. To determine causal directions, a splitting-voting mechanism is proposed, which votes for the direction of each pair of adjacent nodes among multiple predictions obtained via causal discovery from heterogeneity.
Results: Experiments on a dataset of 361 liver cancer patients with long-term observations demonstrate the practicability and effectiveness of the method.

Previous efforts in the vision community have mostly been made on learning good representations from visual patterns. Beyond this, this paper emphasizes the high-level ability of causal reasoning. We thus present a case study of solving the challenging task of Overall Survival (OS) time prediction in primary liver cancers. Critically, the prediction of OS time at the early stage remains challenging, due to the unobvious image patterns reflecting the OS. To this end, we propose a causal inference system by leveraging the intraoperative attributes and the correlation among them, as an intermediate supervision to bridge the gap between the images and the final OS. Particularly, we build a causal graph, and train the images to estimate the intraoperative attributes for final OS prediction. We present a novel Causally-aware Intraoperative Imputation Model (CAWIM) that can sequentially predict each attribute using its parent nodes in the estimated causal graph. To determine the causal directions, we propose a splitting-voting mechanism, which votes for the direction for each pair of adjacent nodes among multiple predictions obtained via causal discovery from heterogeneity. The practicability and effectiveness of our method are demonstrated by the promising result on a liver cancer dataset of 361 patients with long-term observations.

K3DN: Disparity-Aware Kernel Estimation for Dual-Pixel Defocus Deblurring
Yang, Yan and Pan, Liyuan and Liu, Liu and Liu, Miaomiao



Research question: How to deblur image pairs captured by a dual-pixel (DP) sensor.
Motivation: A DP sensor captures a two-view image pair in a single snapshot; exploiting this property, a K3DN framework is proposed for DP pair deblurring.
Method: The framework contains three modules: i) a disparity-aware deblur module, which estimates a disparity feature map and uses it to query a trainable kernel set for the blur kernel that best describes the spatially-varying blur; ii) a reblurring regularization module, which reuses the blur kernel, performs a simple convolution for reblurring, and regularizes the estimated kernel and disparity feature unsupervisedly during training; iii) a sharp region preservation module, which identifies in-focus regions with zero disparity between the DP images, avoiding the introduction of noise during deblurring and improving image restoration performance.
Results: Experiments on four standard DP datasets show that the proposed K3DN outperforms state-of-the-art methods, with fewer parameters and FLOPs.

The dual-pixel (DP) sensor captures a two-view image pair in a single snapshot by splitting each pixel in half. The disparity occurs in defocus blurred regions between the two views of the DP pair, while the in-focus sharp regions have zero disparity. This motivates us to propose a K3DN framework for DP pair deblurring, and it has three modules: i) a disparity-aware deblur module. It estimates a disparity feature map, which is used to query a trainable kernel set to estimate a blur kernel that best describes the spatially-varying blur. The kernel is constrained to be symmetrical per the DP formulation. A simple Fourier transform is performed for deblurring that follows the blur model; ii) a reblurring regularization module. It reuses the blur kernel, performs a simple convolution for reblurring, and regularizes the estimated kernel and disparity feature unsupervisedly, in the training stage; iii) a sharp region preservation module. It identifies in-focus regions that correspond to areas with zero disparity between DP images, aims to avoid the introduction of noises during the deblurring process, and improves image restoration performance. Experiments on four standard DP datasets show that the proposed K3DN outperforms state-of-the-art methods, with fewer parameters and flops at the same time.

DartBlur: Privacy Preservation With Detection Artifact Suppression
Jiang, Baowei and Bai, Bing and Lin, Haozhe and Wang, Yu and Guo, Yuchen and Fang, Lu



Research question: How to protect facial privacy while reducing the training artifacts that harm downstream task performance.
Motivation: With the development of AI algorithms, effectively protecting personal privacy, especially facial information, has become a major concern. Existing blur-based and replacement-based methods each have advantages and shortcomings in practice.
Method: A novel De-artifact Blurring (DartBlur) privacy-preserving method is proposed, which uses a deep neural network to generate blurred face images while suppressing detection artifacts. Four training objectives are designed to improve review convenience and maximize detection artifact suppression.
Results: Experiments show that DartBlur outperforms the existing replacement-based method in both review convenience and accessibility, and suppresses training artifacts better than traditional blur-based methods.

Nowadays, privacy issue has become a top priority when training AI algorithms. Machine learning algorithms are expected to benefit our daily life, while personal information must also be carefully protected from exposure. Facial information is particularly sensitive in this regard. Multiple datasets containing facial information have been taken offline, and the community is actively seeking solutions to remedy the privacy issues. Existing methods for privacy preservation can be divided into blur-based and face replacement-based methods. Owing to the advantages of review convenience and good accessibility, blur-based methods have become a dominant choice in practice. However, blur-based methods would inevitably introduce training artifacts harmful to the performance of downstream tasks. In this paper, we propose a novel De-artifact Blurring (DartBlur) privacy-preserving method, which capitalizes on a DNN architecture to generate blurred faces. DartBlur can effectively hide facial privacy information while detection artifacts are simultaneously suppressed. We have designed four training objectives that particularly aim to improve review convenience and maximize detection artifact suppression. We associate the algorithm with an adversarial training strategy with a second-order optimization pipeline. Experimental results demonstrate that DartBlur outperforms the existing face-replacement method from both perspectives of review convenience and accessibility, and also shows an exclusive advantage in suppressing the training artifact compared to traditional blur-based methods. Our implementation is available at https://github.com/JaNg2333/DartBlur.

IDGI: A Framework To Eliminate Explanation Noise From Integrated Gradients
Yang, Ruo and Wang, Binghui and Bilgic, Mustafa



Research question: How to reduce the noise in explanations of deep neural network decisions and improve their interpretability.
Motivation: Although Integrated Gradients (IG)-based methods achieve state-of-the-art performance in explaining deep neural network decisions, their explanation saliency maps often contain noise, which reduces interpretability.
Method: By analyzing the source of the noise, a new noise-reduction approach, the Important Direction Gradient Integration (IDGI) framework, is proposed. It can be easily incorporated into any IG-based method that uses Riemann integration to compute integrated gradients.
Results: Extensive experiments on three IG-based methods show that IDGI improves them drastically on numerous interpretability metrics.

Integrated Gradients (IG) as well as its variants are well-known techniques for interpreting the decisions of deep neural networks. While IG-based approaches attain state-of-the-art performance, they often integrate noise into their explanation saliency maps, which reduces their interpretability. To minimize the noise, we examine the source of the noise analytically and propose a new approach to reduce the explanation noise based on our analytical findings. We propose the Important Direction Gradient Integration (IDGI) framework, which can be easily incorporated into any IG-based method that uses the Riemann Integration for integrated gradient computation. Extensive experiments with three IG-based methods show that IDGI improves them drastically on numerous interpretability metrics.
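IDGI plugs into methods that compute IG via a Riemann sum. For reference, plain integrated gradients with a midpoint Riemann rule on an analytically differentiable toy model (the quadratic model and step count here are illustrative) looks like:

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    # midpoint Riemann approximation of the IG path integral
    # IG_i = (x_i - baseline_i) * ∫_0^1 dF/dx_i(baseline + a*(x - baseline)) da
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# toy model f(x) = sum(w * x^2), so its gradient is 2 * w * x
w = np.array([1.0, 2.0, 3.0])
grad_fn = lambda z: 2.0 * w * z
x, baseline = np.ones(3), np.zeros(3)
attr = integrated_gradients(grad_fn, x, baseline, steps=200)
# completeness axiom: attributions sum to f(x) - f(baseline) = 6
```

IDGI's contribution is to reduce the noise each Riemann step injects by keeping only the gradient component along the "important direction"; the sketch above is the unmodified IG baseline it builds on.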

PCT-Net: Full Resolution Image Harmonization Using Pixel-Wise Color Transformations
Guerreiro, Julian Jorge Andrade and Nakazawa, Mitsuru and Stenger, Björn



Research question: This paper presents a simple and general method for high-resolution image harmonization.
Motivation: Current image harmonization methods usually operate at low resolution, whereas this work aims to apply harmonization directly to full-resolution images.
Method: A parameter network is proposed to predict pixel-wise color transform (PCT) parameters for every pixel of the full-resolution image. Experiments show that affine color transforms are both efficient and effective; both CNNs and Transformers are explored as the parameter network, with Transformers yielding better results.
Results: On the public full-resolution iHarmony4 dataset, the method reduces foreground MSE (fMSE) and MSE by more than 20% and increases PSNR by 1.4 dB while keeping the architecture lightweight. In a user study with 20 people, it achieves a higher B-T score than two other recent methods.

In this paper, we present PCT-Net, a simple and general image harmonization method that can be easily applied to images at full resolution. The key idea is to learn a parameter network that uses downsampled input images to predict the parameters for pixel-wise color transforms (PCTs) which are applied to each pixel in the full-resolution image. We show that affine color transforms are both efficient and effective, resulting in state-of-the-art harmonization results. Moreover, we explore both CNNs and Transformers as the parameter network and show that Transformers lead to better results. We evaluate the proposed method on the public full-resolution iHarmony4 dataset, which is comprised of four datasets, and show a reduction of the foreground MSE (fMSE) and MSE values by more than 20% and an increase of the PSNR value by 1.4dB while keeping the architecture light-weight. In a user study with 20 people, we show that the method achieves a higher B-T score than two other recent methods.
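The core operation is cheap: color-transform parameters predicted at low resolution are upsampled and applied independently at every full-resolution pixel. A minimal sketch — nearest-neighbor upsampling stands in for whatever interpolation the network actually uses, and the shapes and parameter values are illustrative:

```python
import numpy as np

def upsample_params(params_lr, factor):
    # nearest-neighbor upsampling of low-resolution PCT parameters
    return params_lr.repeat(factor, axis=0).repeat(factor, axis=1)

def apply_pct(image, scale, shift):
    # per-pixel affine color transform: out = scale * rgb + shift
    return scale * image + shift

img = np.full((8, 8, 3), 0.5)        # full-resolution composite (toy)
scale_lr = np.full((2, 2, 3), 1.2)   # parameters predicted at low resolution
shift_lr = np.full((2, 2, 3), -0.1)
out = apply_pct(img, upsample_params(scale_lr, 4), upsample_params(shift_lr, 4))
# each pixel: 1.2 * 0.5 - 0.1 = 0.5
```

Because the heavy network only ever sees the downsampled input, the cost of harmonization stays nearly independent of the output resolution.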

Rawgment: Noise-Accounted RAW Augmentation Enables Recognition in a Wide Variety of Environments
Yoshimura, Masakazu and Otsuka, Junji and Irie, Atsushi and Ohashi, Takeshi



Research question: How to train effective image recognition models for challenging environments (e.g., extremely dark, blurry, or high-dynamic-range conditions) without hard-to-obtain data.
Motivation: Creating training datasets for such environments is expensive and hard, so a robust model that does not require hard-to-obtain datasets is desirable.
Method: A noise-accounted RAW image augmentation method is proposed: color jitter and blur augmentation are applied to a RAW image before the non-linear ISP, producing realistic intensities. In addition, a noise amount alignment method is introduced to calibrate the domain gap in noise properties caused by the augmentation.
Results: Experiments show that, with only simple training data, the proposed noise-accounted RAW augmentation doubles image recognition accuracy in challenging environments.

Image recognition models that work in challenging environments (e.g., extremely dark, blurry, or high dynamic range conditions) must be useful. However, creating training datasets for such environments is expensive and hard due to the difficulties of data collection and annotation. It is desirable if we could get a robust model without the need for hard-to-obtain datasets. One simple approach is to apply data augmentation such as color jitter and blur to standard RGB (sRGB) images in simple scenes. Unfortunately, this approach struggles to yield realistic images in terms of pixel intensity and noise distribution due to not considering the non-linearity of Image Signal Processors (ISPs) and noise characteristics of image sensors. Instead, we propose a noise-accounted RAW image augmentation method. In essence, color jitter and blur augmentation are applied to a RAW image before applying non-linear ISP, resulting in realistic intensity. Furthermore, we introduce a noise amount alignment method that calibrates the domain gap in the noise property caused by the augmentation. We show that our proposed noise-accounted RAW augmentation method doubles the image recognition accuracy in challenging environments only with simple training data.

The Dark Side of Dynamic Routing Neural Networks: Towards Efficiency Backdoor Injection
Chen, Simin and Chen, Hanlin and Haque, Mirazul and Liu, Cong and Yang, Wei



Research question: Whether dynamic neural networks deployed on resource-constrained devices are vulnerable to having their efficiency manipulated.
Motivation: Recent progress has enabled deploying deep neural networks on resource-constrained devices, but these networks may contain an efficiency vulnerability.
Method: An adversarial attack named EfficFrog is proposed, which manipulates the computational cost of dynamic neural networks by injecting universal efficiency backdoors into them.
Results: Experiments show that EfficFrog effectively reduces the efficiency of backdoored dynamic neural networks on triggered input samples while keeping the efficiency on clean samples almost unchanged.

Recent advancements in deploying deep neural networks (DNNs) on resource-constrained devices have generated interest in input-adaptive dynamic neural networks (DyNNs). DyNNs offer more efficient inferences and enable the deployment of DNNs on devices with limited resources, such as mobile devices. However, we have discovered a new vulnerability in DyNNs that could potentially compromise their efficiency. Specifically, we investigate whether adversaries can manipulate DyNNs' computational costs to create a false sense of efficiency. To address this question, we propose EfficFrog, an adversarial attack that injects universal efficiency backdoors in DyNNs. To inject a backdoor trigger into DyNNs, EfficFrog poisons only a minimal percentage of the DyNNs' training data. During the inference phase, EfficFrog can slow down the backdoored DyNNs and abuse the computational resources of systems running DyNNs by adding the trigger to any input. To evaluate EfficFrog, we tested it on three DNN backbone architectures (based on VGG16, MobileNet, and ResNet56) using two popular datasets (CIFAR-10 and Tiny ImageNet). Our results demonstrate that EfficFrog reduces the efficiency of DyNNs on triggered input samples while keeping the efficiency of clean samples almost the same.

Better "CMOS" Produces Clearer Images: Learning Space-Variant Blur Estimation for Blind Image Super-Resolution
Chen, Xuhai and Zhang, Jiangning and Xu, Chao and Wang, Yabiao and Wang, Chengjie and Liu, Yong



Research question: Most existing blind image super-resolution (SR) methods assume that blur kernels are spatially invariant, but real-world blur is usually spatially variant due to object motion, defocus, etc., causing severe performance drops for advanced SR methods.
Motivation: To address this problem, two new datasets with out-of-focus blur, NYUv2-BSR and Cityscapes-BSR, are first introduced to support further research on blind SR with spatially variant blur.
Method: Based on these datasets, a novel Cross-MOdal fuSion network (CMOS) is designed that estimates blur and semantics simultaneously, leading to improved SR results. It contains a feature Grouping Interactive Attention (GIA) module that makes the two modalities interact more effectively and avoids inconsistency; the structure of GIA is general and can also be used for the interaction of other features.
Results: Qualitative and quantitative comparisons with state-of-the-art methods on the above datasets and real-world images, e.g., PSNR/SSIM gains of +1.91/+0.0048 over MANet on NYUv2-BSR, demonstrate the superiority of the method.

Most of the existing blind image Super-Resolution (SR) methods assume that the blur kernels are space-invariant. However, the blur involved in real applications is usually space-variant due to object motion, out-of-focus, etc., resulting in a severe performance drop of the advanced SR methods. To address this problem, we firstly introduce two new datasets with out-of-focus blur, i.e., NYUv2-BSR and Cityscapes-BSR, to support further research of blind SR with space-variant blur. Based on the datasets, we design a novel Cross-MOdal fuSion network (CMOS) that estimates both blur and semantics simultaneously, which leads to improved SR results. It involves a feature Grouping Interactive Attention (GIA) module to make the two modalities interact more effectively and avoid inconsistency. GIA can also be used for the interaction of other features because of the universality of its structure. Qualitative and quantitative experiments compared with state-of-the-art methods on above datasets and real-world images demonstrate the superiority of our method, e.g., obtaining PSNR/SSIM gains of +1.91/+0.0048 on NYUv2-BSR over MANet.

Backdoor Defense via Adaptively Splitting Poisoned Dataset
Gao, Kuofeng and Bai, Yang and Gu, Jindong and Yang, Yong and Xia, Shu-Tao



Research question: Deep neural networks (DNNs) are vulnerable to backdoor attacks; an effective training-stage defense strategy is needed.
Motivation: Since DNNs usually adopt external training data from untrusted third parties, a robust backdoor defense strategy during the training stage is of importance.
Method: An adaptively splitting dataset-based defense (ASD) is proposed. Concretely, loss-guided splitting and meta-learning-inspired splitting are applied to dynamically update two data pools. With the split clean data pool and polluted data pool, ASD successfully defends against backdoor attacks during training.
Results: Extensive experiments on multiple benchmark datasets and DNN models demonstrate the superiority of ASD against six state-of-the-art backdoor attacks.

Backdoor defenses have been studied to alleviate the threat of deep neural networks (DNNs) being backdoor attacked and thus maliciously altered. Since DNNs usually adopt some external training data from an untrusted third party, a robust backdoor defense strategy during the training stage is of importance. We argue that the core of training-time defense is to select poisoned samples and to handle them properly. In this work, we summarize the training-time defenses from a unified framework as splitting the poisoned dataset into two data pools. Under our framework, we propose an adaptively splitting dataset-based defense (ASD). Concretely, we apply loss-guided split and meta-learning-inspired split to dynamically update two data pools. With the split clean data pool and polluted data pool, ASD successfully defends against backdoor attacks during training. Extensive experiments on multiple benchmark datasets and DNN models against six state-of-the-art backdoor attacks demonstrate the superiority of our ASD.

Wide-Angle Rectification via Content-Aware Conformal Mapping
Zhang, Qi and Li, Hongdong and Wang, Qing



Research question: Ultra-wide-angle lenses on smartphone cameras often produce severe image distortion, such as curved linear structures and unnaturally skewed faces.
Motivation: Although most existing rectification methods adopt a global warping transformation to undistort the input wide-angle image, their results are not entirely satisfactory: many unwanted residual distortions remain uncorrected, or correction comes at the sacrifice of the intended wide field of view.
Method: This paper proposes a new method to tackle these challenges. Specifically, a locally-adaptive polar-domain conformal mapping is derived to rectify wide-angle images, with the mapping parameters found automatically by analyzing image content via deep neural networks.
Results: Experiments on a large number of photos confirm the superior performance of the proposed method compared with all existing methods.

Despite the proliferation of ultra wide-angle lenses on smartphone cameras, such lenses often come with severe image distortion (e.g. curved linear structure, unnaturally skewed faces). Most existing rectification methods adopt a global warping transformation to undistort the input wide-angle image, yet their performances are not entirely satisfactory, leaving many unwanted residue distortions uncorrected or at the sacrifice of the intended wide FoV (field-of-view). This paper proposes a new method to tackle these challenges. Specifically, we derive a locally-adaptive polar-domain conformal mapping to rectify a wide-angle image. Parameters of the mapping are found automatically by analyzing image contents via deep neural networks. Experiments on a large number of photos have confirmed the superior performance of the proposed method compared with all available previous methods.

Zero-Shot Noise2Noise: Efficient Image Denoising Without Any Data
Mansour, Youssef and Heckel, Reinhard



Research question: How to achieve high-quality image denoising with unsupervised neural networks at low computational cost.
Motivation: Existing unsupervised denoising methods are either computationally expensive, require knowledge of the noise distribution, or fall short in image quality.
Method: A simple two-layer network, ZS-N2N (Zero-Shot Noise2Noise), is proposed, which achieves high-quality image denoising at low computational cost without any training data or knowledge of the noise distribution.
Results: In experiments on artificial, real-world camera, and microscope noise, ZS-N2N often outperforms existing unsupervised denoising methods at a reduced cost, making it suitable for applications with scarce data and limited compute.

Recently, self-supervised neural networks have shown excellent image denoising performance. However, current dataset free methods are either computationally expensive, require a noise model, or have inadequate image quality. In this work we show that a simple 2-layer network, without any training data or knowledge of the noise distribution, can enable high-quality image denoising at low computational cost. Our approach is motivated by Noise2Noise and Neighbor2Neighbor and works well for denoising pixel-wise independent noise. Our experiments on artificial, real-world camera, and microscope noise show that our method termed ZS-N2N (Zero Shot Noise2Noise) often outperforms existing dataset-free methods at a reduced cost, making it suitable for use cases with scarce data availability and limited compute.
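In the Noise2Noise/Neighbor2Neighbor spirit the method builds on, a single noisy image can be split into two half-resolution sub-images whose noise realizations are (roughly) independent, yielding a training pair for free. A sketch of one such diagonal pair-downsampling — the exact sampling scheme here is an assumption for illustration, not necessarily ZS-N2N's fixed downsampling kernels:

```python
import numpy as np

def pair_downsample(img):
    # average the two diagonals of each 2x2 block -> two half-resolution
    # views of the same scene with pixel-wise independent noise
    a = (img[0::2, 0::2] + img[1::2, 1::2]) / 2.0
    b = (img[0::2, 1::2] + img[1::2, 0::2]) / 2.0
    return a, b

img = np.full((8, 8), 3.0)          # toy noise-free image
a, b = pair_downsample(img)
# a clean constant image yields two identical constant halves
```

A small network can then be trained so that denoising one half predicts the other; because both halves share the signal but not the noise, the noise cannot be memorized.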

Generating Anomalies for Video Anomaly Detection With Prompt-Based Feature Mapping
Liu, Zuhao and Wu, Xiao-Ming and Zheng, Dian and Lin, Kun-Yu and Zheng, Wei-Shi



Research question: Anomaly detection in surveillance videos is a challenging computer vision task because only normal videos are available during training.
Motivation: Although recent work released the first virtual anomaly detection dataset to assist real-world detection, an anomaly gap exists: anomalies are bounded in the virtual dataset but unbounded in the real world, which reduces the virtual dataset's generalization ability. There is also a scene gap between virtual and real scenarios, including scene-specific anomalies (events abnormal in one scene but normal in another) and scene-specific attributes, such as the surveillance camera's viewpoint.
Method: This paper proposes a prompt-based feature mapping framework (PFMF), which contains a mapping network guided by an anomaly prompt to generate unseen anomalies of unbounded types in real scenarios, and a mapping adaptation branch that narrows the scene gap by applying a domain classifier and an anomaly classifier.
Results: The proposed framework outperforms the state-of-the-art on three benchmark datasets, and extensive ablation experiments demonstrate the effectiveness of its design.

Anomaly detection in surveillance videos is a challenging computer vision task where only normal videos are available during training. Recent work released the first virtual anomaly detection dataset to assist real-world detection. However, an anomaly gap exists because the anomalies are bounded in the virtual dataset but unbounded in the real world, so it reduces the generalization ability of the virtual dataset. There also exists a scene gap between virtual and real scenarios, including scene-specific anomalies (events that are abnormal in one scene but normal in another) and scene-specific attributes, such as the viewpoint of the surveillance camera. In this paper, we aim to solve the problem of the anomaly gap and scene gap by proposing a prompt-based feature mapping framework (PFMF). The PFMF contains a mapping network guided by an anomaly prompt to generate unseen anomalies with unbounded types in the real scenario, and a mapping adaptation branch to narrow the scene gap by applying domain classifier and anomaly classifier. The proposed framework outperforms the state-of-the-art on three benchmark datasets. Extensive ablation experiments also show the effectiveness of our framework design.

RIDCP: Revitalizing Real Image Dehazing via High-Quality Codebook Priors
Wu, Rui-Qi and Duan, Zheng-Peng and Guo, Chun-Le and Chai, Zhi and Li, Chongyi



Research question: Existing dehazing methods struggle to handle real-world hazy images due to the lack of paired real data and robust priors.
Motivation: A new real image dehazing approach is proposed that synthesizes more realistic hazy data and introduces more robust priors into the network.
Method: (1) The degradation of real hazy images is rethought, and a phenomenological pipeline considering diverse degradation types is proposed. (2) A Real Image Dehazing network via high-quality Codebook Priors (RIDCP) is proposed: a VQGAN is first pre-trained on a large-scale high-quality dataset to obtain a discrete codebook encapsulating high-quality priors (HQPs), and a novel normalized feature alignment module is then used to effectively utilize high-quality features and produce clean results.
Results: Although the degradation pipeline drastically mitigates the domain gap between synthetic and real data, the gap cannot be avoided entirely, which challenges HQP matching in the wild; the distance used when matching features to HQPs is therefore re-calculated via a controllable matching operation to find better counterparts. A recommendation based on an explainable solution is provided, and users can flexibly adjust the enhancement degree to their preference. Extensive experiments verify the effectiveness of the data synthesis pipeline and the superior performance of RIDCP in real image dehazing.

Existing dehazing approaches struggle to process real-world hazy images owing to the lack of paired real data and robust priors. In this work, we present a new paradigm for real image dehazing from the perspectives of synthesizing more realistic hazy data and introducing more robust priors into the network. Specifically, (1) instead of adopting the de facto physical scattering model, we rethink the degradation of real hazy images and propose a phenomenological pipeline considering diverse degradation types. (2) We propose a Real Image Dehazing network via high-quality Codebook Priors (RIDCP). Firstly, a VQGAN is pre-trained on a large-scale high-quality dataset to obtain the discrete codebook, encapsulating high-quality priors (HQPs). After replacing the negative effects brought by haze with HQPs, the decoder equipped with a novel normalized feature alignment module can effectively utilize high-quality features and produce clean results. However, although our degradation pipeline drastically mitigates the domain gap between synthetic and real data, it is still intractable to avoid it, which challenges HQPs matching in the wild. Thus, we re-calculate the distance when matching the features to the HQPs by a controllable matching operation, which facilitates finding better counterparts. We provide a recommendation to control the matching based on an explainable solution. Users can also flexibly adjust the enhancement degree as per their preference. Extensive experiments verify the effectiveness of our data synthesis pipeline and the superior performance of RIDCP in real image dehazing. Code and data will be released.

Backdoor Attacks Against Deep Image Compression via Adaptive Frequency Trigger
Yu, Yi and Wang, Yufei and Yang, Wenhan and Lu, Shijian and Tan, Yap-Peng and Kot, Alex C.



Research question: This paper proposes a novel multi-trigger backdoor attack against learned image compression models.
Motivation: Motivated by the discrete cosine transform (DCT) widely used in existing compression systems and standards, a frequency-based trigger injection model is proposed that adds triggers in the DCT domain.
Method: Several attack objectives are designed for various attacking scenarios, including: 1) attacking compression quality in terms of bit-rate and reconstruction quality; 2) attacking task-driven measures, such as downstream face recognition and semantic segmentation. A novel simple dynamic loss is also designed to adaptively balance the influence of different loss terms, enabling more efficient training.
Results: Experiments show that, with the trained trigger injection model and a simple modification of the compression model's encoder parameters, the proposed attack can successfully inject multiple backdoors with corresponding triggers into a single image compression model.

Recent deep-learning-based compression methods have achieved superior performance compared with traditional approaches. However, deep learning models have proven to be vulnerable to backdoor attacks, where some specific trigger patterns added to the input can lead to malicious behavior of the models. In this paper, we present a novel backdoor attack with multiple triggers against learned image compression models. Motivated by the widely used discrete cosine transform (DCT) in existing compression systems and standards, we propose a frequency-based trigger injection model that adds triggers in the DCT domain. In particular, we design several attack objectives for various attacking scenarios, including: 1) attacking compression quality in terms of bit-rate and reconstruction quality; 2) attacking task-driven measures, such as down-stream face recognition and semantic segmentation. Moreover, a novel simple dynamic loss is designed to balance the influence of different loss terms adaptively, which helps achieve more efficient training. Extensive experiments show that with our trained trigger injection models and simple modification of encoder parameters (of the compression model), the proposed attack can successfully inject several backdoors with corresponding triggers in a single image compression model.
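The trigger lives in the frequency domain: transform a patch with the DCT, add the trigger to the coefficients, and transform back. A self-contained sketch using an orthonormal DCT-II basis — the trigger pattern and patch size are illustrative, and this is the generic DCT-domain injection idea rather than the paper's learned injection model:

```python
import numpy as np

def dct_matrix(n):
    # orthonormal DCT-II basis matrix (rows = frequencies)
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m

def add_dct_trigger(patch, trigger):
    D = dct_matrix(patch.shape[0])
    coeffs = D @ patch @ D.T              # forward 2-D DCT
    return D.T @ (coeffs + trigger) @ D   # inject trigger, inverse DCT

rng = np.random.default_rng(0)
patch = rng.random((8, 8))
clean = add_dct_trigger(patch, np.zeros((8, 8)))
# with a zero trigger the round trip is lossless (orthonormal basis)
```

Because the basis is orthonormal, a small coefficient-space trigger maps to a structured, low-amplitude pixel pattern that is hard to spot yet consistent across images.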

Ensemble-Based Blackbox Attacks on Dense Prediction
Cai, Zikui and Tan, Yaoteng and Asif, M. Salman



Research question: How to conduct adversarial attacks on dense prediction models such as object detectors and segmentation models.
Motivation: With existing adversarial attack methods, attacks generated by a single surrogate model do not transfer to arbitrary (blackbox) victim models.
Method: A carefully designed ensemble approach is proposed that creates effective attacks for a number of victim models. In particular, normalizing the weights of individual models plays a critical role in the success of the attacks, and attack performance is further improved by adjusting the ensemble weights according to the victim model.
Results: A series of experiments on object detection and segmentation highlight the significance of the proposed methods: the ensemble-based method outperforms existing blackbox attack methods for object detection and segmentation, and a single perturbation can be generated that fools multiple blackbox detection and segmentation models simultaneously.

We propose an approach for adversarial attacks on dense prediction models (such as object detectors and segmentation models). It is well known that the attacks generated by a single surrogate model do not transfer to arbitrary (blackbox) victim models. Furthermore, targeted attacks are often more challenging than untargeted attacks. In this paper, we show that a carefully designed ensemble can create effective attacks for a number of victim models. In particular, we show that normalization of the weights for individual models plays a critical role in the success of the attacks. We then demonstrate that adjusting the weights of the ensemble according to the victim model can further improve the performance of the attacks. We performed a number of experiments on object detectors and segmentation to highlight the significance of our proposed methods. Our proposed ensemble-based method outperforms existing blackbox attack methods for object detection and segmentation. Finally, we show that our proposed method can also generate a single perturbation that can fool multiple blackbox detection and segmentation models simultaneously.
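The weight-normalization idea can be sketched as combining per-surrogate attack gradients only after scaling each to unit norm, so that no single surrogate's gradient magnitude dominates the ensemble direction. This is a minimal illustration under that assumption, not the paper's exact update rule.

```python
def normalized_ensemble_gradient(gradients, weights=None):
    """Combine per-model attack gradients after normalizing each to unit
    L2 norm, so no single surrogate dominates the ensemble direction."""
    if weights is None:
        weights = [1.0 / len(gradients)] * len(gradients)
    combined = [0.0] * len(gradients[0])
    for g, w in zip(gradients, weights):
        norm = sum(v * v for v in g) ** 0.5 or 1.0
        for i, v in enumerate(g):
            combined[i] += w * v / norm
    return combined

# Two surrogate models, one with a much larger raw gradient magnitude:
g_small = [0.1, 0.0]
g_large = [0.0, 100.0]
print(normalized_ensemble_gradient([g_small, g_large]))  # → [0.5, 0.5]
```

Without the normalization, the second surrogate's gradient would contribute a thousand times more than the first; with it, both surrogates steer the perturbation equally, and the `weights` can then be tuned per victim model.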

sRGB Real Noise Synthesizing With Neighboring Correlation-Aware Noise Model
Fu, Zixuan and Guo, Lanqing and Wen, Bihan



Research question: Modeling and synthesizing real noise in the standard RGB (sRGB) domain is challenging due to its complicated distribution.
Motivation: While most deep noise generators synthesize sRGB real noise with end-to-end trained models, the lack of explicit noise modeling degrades the quality of their synthesized noise.
Method: We propose to model real noise as not only dependent on the underlying clean-image pixel intensity, but also highly correlated with neighboring noise realizations within a local region. Accordingly, we propose a novel noise synthesis framework that learns this neighboring correlation on top of the signal dependency.
Results: With the proposed noise model, our framework greatly narrows the distribution gap between synthetic and real noise. Experiments show that our generated "real" noisy sRGB images can be used to train supervised deep denoisers, improving their real-world denoising results by a large margin compared to popular classic denoisers or deep denoisers trained on other sRGB noise generators.

Modeling and synthesizing real noise in the standard RGB (sRGB) domain is challenging due to the complicated noise distribution. While most deep noise generators synthesize sRGB real noise using end-to-end trained models, the lack of explicit noise modeling degrades the quality of their synthesized noise. In this work, we propose to model the real noise as not only dependent on the underlying clean image pixel intensity, but also highly correlated to its neighboring noise realization within the local region. Correspondingly, we propose a novel noise synthesizing framework by explicitly learning its neighboring correlation on top of the signal dependency. With the proposed noise model, our framework greatly bridges the distribution gap between synthetic noise and real noise. We show that our generated "real" sRGB noisy images can be used to train supervised deep denoisers, improving their real denoising results by a large margin compared to popular classic denoisers or deep denoisers trained on other sRGB noise generators. The code will be available at https://github.com/xuan611/sRGB-Real-Noise-Synthesizing.
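The two ingredients of the noise model, signal dependency and neighboring correlation, can be sketched with a toy 1-D generator: noise variance scales with pixel intensity, and each noise value mixes fresh white noise with the previous pixel's noise. The AR(1)-style mixing and the constants are assumptions for illustration; the paper learns these relationships from data.

```python
import random

def synthesize_noise(clean, k_signal=0.05, corr=0.5, seed=0):
    """Signal-dependent noise with neighboring correlation (toy 1-D sketch):
    variance grows with pixel intensity, and each noise value is a mix of a
    fresh white-noise sample and the previous pixel's noise realization."""
    rng = random.Random(seed)
    noise, prev = [], 0.0
    for pixel in clean:
        var = k_signal * pixel                       # signal-dependent variance
        white = rng.gauss(0.0, 1.0) * (var ** 0.5 if var > 0 else 0.0)
        n = corr * prev + (1.0 - corr) * white       # correlate with the neighbor
        noise.append(n)
        prev = n
    return [c + n for c, n in zip(clean, noise)]

clean = [100.0] * 64
noisy = synthesize_noise(clean)
print(len(noisy))  # 64
```

Setting `corr=0` recovers pixelwise-independent (signal-dependent only) noise, which is the over-simplified model the paper argues against.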

BlackVIP: Black-Box Visual Prompting for Robust Transfer Learning
Oh, Changdae and Hwang, Hyeji and Lee, Hee-young and Lim, YongTaek and Jung, Geunyoung and Jung, Jiyoung and Choi, Hosik and Song, Kyungwoo



Research question: How to perform parameter-efficient transfer learning of large-scale pre-trained models, especially when model parameters cannot be accessed directly and memory is limited.
Motivation: Existing transfer learning methods require the full model parameters and large memory, but in practice pre-trained models are often served as black-box APIs or proprietary software without parameter access, and their memory requirements are hard to meet.
Method: We propose BlackVIP, which designs input-dependent, image-shaped visual prompts to improve few-shot adaptation and robustness to distribution/location shift, and uses SPSA-GC to efficiently estimate the gradient of the target model for updating the Coordinator.
Results: Extensive experiments on 16 datasets show that BlackVIP enables robust adaptation to diverse domains without access to model parameters and with minimal memory requirements.

With the surge of large-scale pre-trained models (PTMs), fine-tuning these models for numerous downstream tasks becomes a crucial problem. Consequently, parameter efficient transfer learning (PETL) of large models has attracted considerable attention. While recent PETL methods showcase impressive performance, they rely on optimistic assumptions: 1) the entire parameter set of a PTM is available, and 2) sufficiently large memory for fine-tuning is available. However, in most real-world applications, PTMs are served as a black-box API or proprietary software without explicit parameter accessibility. Besides, it is hard to meet a large memory requirement for modern PTMs. In this work, we propose black-box visual prompting (BlackVIP), which efficiently adapts the PTMs without knowledge about model architectures and parameters. BlackVIP has two components: 1) Coordinator and 2) simultaneous perturbation stochastic approximation with gradient correction (SPSA-GC). The Coordinator designs input-dependent image-shaped visual prompts, which improve few-shot adaptation and robustness to distribution/location shift. SPSA-GC efficiently estimates the gradient of a target model to update the Coordinator. Extensive experiments on 16 datasets demonstrate that BlackVIP enables robust adaptation to diverse domains without accessing PTMs' parameters, with minimal memory requirements. Code: https://github.com/changdaeoh/BlackVIP
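The core of SPSA is a zeroth-order gradient estimate that perturbs all parameters at once, needing only two black-box queries per sample regardless of dimension. The sketch below shows plain two-sided SPSA; the gradient-correction (GC) refinement, which the paper adds on top, is omitted, and the step size and sample count are arbitrary illustration values.

```python
import random

def spsa_gradient(f, theta, c=0.01, n_samples=200, seed=0):
    """Two-sided SPSA estimate: perturb every coordinate at once with a random
    Rademacher vector, so each sample costs only two queries to the black box."""
    rng = random.Random(seed)
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + c * d for t, d in zip(theta, delta)]
        minus = [t - c * d for t, d in zip(theta, delta)]
        scale = (f(plus) - f(minus)) / (2.0 * c)
        for i, d in enumerate(delta):
            grad[i] += scale * d / n_samples  # d in {-1, 1}, so 1/d == d
    return grad

# Quadratic with known gradient 2*theta; the estimate should match its signs.
f = lambda x: sum(v * v for v in x)
est = spsa_gradient(f, [1.0, -2.0])
print(est[0] > 0, est[1] < 0)
```

Because only hard function values are queried, the same loop works when `f` is a remote inference API, which is exactly the black-box setting BlackVIP targets.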

Joint HDR Denoising and Fusion: A Real-World Mobile HDR Image Dataset
Liu, Shuaizheng and Zhang, Xindong and Sun, Lingchen and Liang, Zhetong and Zeng, Hui and Zhang, Lei



Research question: How to improve the quality of HDR images captured by mobile phones.
Motivation: Existing HDR image datasets are mostly captured by DSLR cameras in the daytime, limiting their applicability to mobile HDR imaging research.
Method: We construct Mobile-HDR, the first HDR image dataset captured with mobile phone cameras, and design a transformer-based model with a pyramid cross-attention alignment module that aggregates highly correlated features across exposure frames for joint HDR denoising and fusion.
Results: Experiments validate the advantages of the method for mobile HDR imaging and the usefulness of the dataset.

Mobile phones have become a ubiquitous and indispensable photographing device in our daily life, while the small aperture and sensor size make mobile phones more susceptible to noise and over-saturation, resulting in low dynamic range (LDR) and low image quality. It is thus crucial to develop high dynamic range (HDR) imaging techniques for mobile phones. Unfortunately, the existing HDR image datasets are mostly constructed by DSLR cameras in daytime, limiting their applicability to the study of HDR imaging for mobile phones. In this work, we develop, for the first time to our best knowledge, an HDR image dataset by using mobile phone cameras, namely Mobile-HDR dataset. Specifically, we utilize three mobile phone cameras to collect paired LDR-HDR images in the raw image domain, covering both daytime and nighttime scenes with different noise levels. We then propose a transformer based model with a pyramid cross-attention alignment module to aggregate highly correlated features from different exposure frames to perform joint HDR denoising and fusion. Experiments validate the advantages of our dataset and our method on mobile HDR imaging. Dataset and codes are available at https://github.com/shuaizhengliu/Joint-HDRDN.

Detecting Backdoors During the Inference Stage Based on Corruption Robustness Consistency
Liu, Xiaogeng and Li, Minghui and Wang, Haoyu and Hu, Shengshan and Ye, Dengpan and Jin, Hai and Wu, Libing and Xiao, Chaowei



Research question: Deep neural networks are vulnerable to backdoor attacks; effectively detecting trigger samples at the inference stage is an important problem.
Motivation: Existing detection methods require high accessibility to the victim model, extra clean data, or knowledge about the appearance of backdoor triggers, which limits their practicality.
Method: This paper proposes TeCo, a novel test-time trigger sample detection method that only needs the hard-label outputs of the victim model, without any extra information. It evaluates test-time robustness consistency by calculating the deviation of the corruption severity that causes prediction transitions across different corruptions.
Results: Experiments show that, compared with state-of-the-art defenses, TeCo performs better across different backdoor attacks, datasets, and model architectures, with a 10% higher AUROC and 5x better stability.

Deep neural networks are proven to be vulnerable to backdoor attacks. Detecting the trigger samples during the inference stage, i.e., the test-time trigger sample detection, can prevent the backdoor from being triggered. However, existing detection methods often require the defenders to have high accessibility to victim models, extra clean data, or knowledge about the appearance of backdoor triggers, limiting their practicality. In this paper, we propose the test-time corruption robustness consistency evaluation (TeCo), a novel test-time trigger sample detection method that only needs the hard-label outputs of the victim models without any extra information. Our journey begins with the intriguing observation that the backdoor-infected models have similar performance across different image corruptions for the clean images, but perform discrepantly for the trigger samples. Based on this phenomenon, we design TeCo to evaluate test-time robustness consistency by calculating the deviation of severity that leads to predictions' transition across different corruptions. Extensive experiments demonstrate that compared with state-of-the-art defenses, which require either certain information about the trigger types or access to clean data, TeCo outperforms them on different backdoor attacks, datasets, and model architectures, achieving a 10% higher AUROC and 5 times better stability. The code is available at https://github.com/CGCL-codes/TeCo
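The consistency score can be sketched from hard labels alone: for each corruption type, record the smallest severity at which the prediction flips away from the clean label, then measure how much these transition severities disagree across corruptions. The scoring details below (std-deviation aggregation, the sentinel for "never flips") are simplifying assumptions, not TeCo's exact formulation.

```python
def transition_severity(predictions, max_severity=5):
    """Smallest corruption severity at which the prediction flips away from
    the severity-0 (clean) label; max_severity + 1 if it never flips."""
    clean = predictions[0]
    for severity, label in enumerate(predictions[1:], start=1):
        if label != clean:
            return severity
    return max_severity + 1

def robustness_consistency(per_corruption_preds):
    """Deviation of transition severities across corruption types. Clean
    inputs degrade consistently (low deviation); trigger samples behave
    erratically across corruptions (high deviation)."""
    sev = [transition_severity(p) for p in per_corruption_preds.values()]
    mean = sum(sev) / len(sev)
    return (sum((s - mean) ** 2 for s in sev) / len(sev)) ** 0.5

# Hypothetical hard-label outputs at severities 0..5 for three corruptions.
clean_sample = {"blur": [3, 3, 3, 7, 7, 7], "noise": [3, 3, 3, 7, 7, 7], "jpeg": [3, 3, 7, 7, 7, 7]}
trigger_sample = {"blur": [9, 9, 9, 9, 9, 2], "noise": [9, 2, 2, 2, 2, 2], "jpeg": [9, 9, 9, 9, 9, 9]}
print(robustness_consistency(clean_sample) < robustness_consistency(trigger_sample))  # True
```

A threshold on this deviation then separates trigger samples from clean ones using nothing but hard-label queries.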

Black-Box Sparse Adversarial Attack via Multi-Objective Optimisation
Williams, Phoenix Neale and Li, Ke



Research question: Deep neural networks are susceptible to adversarial images, raising concerns about their reliability in safety-critical tasks.
Motivation: Existing sparse adversarial attacks often struggle to simultaneously minimize the number and the magnitude of modified pixels, require a large number of queries, and assume unrestricted access to the target network.
Method: We propose a novel multi-objective sparse attack algorithm that efficiently minimizes both the number and the size of pixel modifications during the attack. The algorithm draws inspiration from evolutionary computation and incorporates a mechanism for prioritizing objectives aligned with the attacker's goals.
Results: The method outperforms existing sparse attacks on DNN classifiers trained on CIFAR-10 and ImageNet while requiring only a small query budget, achieving competitive attack success rates while perturbing fewer pixels. Overall, the algorithm addresses the limitations of current sparse attacks by jointly minimizing the number and size of pixel modifications; the results demonstrate its effectiveness in restricted scenarios and highlight its potential to enhance DNN security.

Deep neural networks (DNNs) are susceptible to adversarial images, raising concerns about their reliability in safety-critical tasks. Sparse adversarial attacks, which limit the number of modified pixels, have shown to be highly effective in causing DNNs to misclassify. However, existing methods often struggle to simultaneously minimize the number of modified pixels and the size of the modifications, often requiring a large number of queries and assuming unrestricted access to the targeted DNN. In contrast, other methods that limit the number of modified pixels often permit unbounded modifications, making them easily detectable. To address these limitations, we propose a novel multi-objective sparse attack algorithm that efficiently minimizes the number of modified pixels and their size during the attack process. Our algorithm draws inspiration from evolutionary computation and incorporates a mechanism for prioritizing objectives that aligns with an attacker's goals. Our approach outperforms existing sparse attacks on CIFAR-10 and ImageNet trained DNN classifiers while requiring only a small query budget, attaining competitive attack success rates while perturbing fewer pixels. Overall, our proposed attack algorithm provides a solution to the limitations of current sparse attack methods by jointly minimizing the number of modified pixels and their size. Our results demonstrate the effectiveness of our approach in restricted scenarios, highlighting its potential to enhance DNN security.

HDR Imaging With Spatially Varying Signal-to-Noise Ratios
Chi, Yiheng and Zhang, Xingguang and Chan, Stanley H.



Research question: Existing HDR fusion and denoising algorithms cannot handle low-light conditions where the dynamic range within one exposure is enormous and the noise is spatially varying.
Motivation: In low-light environments, the dynamic range of a single exposure can be very large and the noise varies spatially, so neither existing image denoising algorithms nor HDR fusion algorithms handle this situation well.
Method: We propose a new method, the spatially varying high dynamic range (SV-HDR) fusion network, to simultaneously denoise and fuse images, introducing a new exposure-shared block within a custom-designed multi-scale transformer framework.
Results: Under a variety of testing conditions, the proposed SV-HDR outperforms existing methods.

While today's high dynamic range (HDR) image fusion algorithms are capable of blending multiple exposures, the acquisition is often controlled so that the dynamic range within one exposure is narrow. For HDR imaging in photon-limited situations, the dynamic range can be enormous and the noise within one exposure is spatially varying. Existing image denoising algorithms and HDR fusion algorithms both fail to handle this situation, leading to severe limitations in low-light HDR imaging. This paper presents two contributions. Firstly, we identify the source of the problem. We find that the issue is associated with the co-existence of (1) spatially varying signal-to-noise ratio, especially the excessive noise due to very dark regions, and (2) a wide luminance range within each exposure. We show that while the issue can be handled by a bank of denoisers, the complexity is high. Secondly, we propose a new method called the spatially varying high dynamic range (SV-HDR) fusion network to simultaneously denoise and fuse images. We introduce a new exposure-shared block within our custom-designed multi-scale transformer framework. In a variety of testing conditions, the performance of the proposed SV-HDR is better than the existing methods.

Progressive Backdoor Erasing via Connecting Backdoor and Adversarial Attacks
Mu, Bingxu and Niu, Zhenxing and Wang, Le and Wang, Xue and Miao, Qiguang and Jin, Rong and Hua, Gang



Research question: Deep neural networks (DNNs) are vulnerable to both backdoor attacks and adversarial attacks, which are usually treated as distinct problems and solved separately.
Motivation: This paper finds an intriguing connection between the two: for a model planted with backdoors, its adversarial examples behave similarly to its triggered samples, both activating the same subset of DNN neurons.
Method: Based on this observation, a novel Progressive Backdoor Erasing (PBE) algorithm is proposed to progressively purify the infected model by leveraging untargeted adversarial attacks.
Results: Against 5 state-of-the-art backdoor attacks, experiments show that the method can effectively erase backdoor triggers without obvious performance degradation on clean samples and significantly outperforms existing defense methods.

Deep neural networks (DNNs) are known to be vulnerable to both backdoor attacks as well as adversarial attacks. In the literature, these two types of attacks are commonly treated as distinct problems and solved separately, since they belong to training-time and inference-time attacks respectively. However, in this paper we find an intriguing connection between them: for a model planted with backdoors, we observe that its adversarial examples have similar behaviors as its triggered samples, i.e., both activate the same subset of DNN neurons. It indicates that planting a backdoor into a model will significantly affect the model's adversarial examples. Based on this observation, a novel Progressive Backdoor Erasing (PBE) algorithm is proposed to progressively purify the infected model by leveraging untargeted adversarial attacks. Different from previous backdoor defense methods, one significant advantage of our approach is that it can erase backdoor even when the additional clean dataset is unavailable. We empirically show that, against 5 state-of-the-art backdoor attacks, our PBE can effectively erase the backdoor triggers without obvious performance degradation on clean samples and significantly outperforms existing defense methods.

DAA: A Delta Age AdaIN Operation for Age Estimation via Binary Code Transformer
Chen, Ping and Zhang, Xingpeng and Li, Ye and Tao, Ju and Xiao, Bin and Wang, Bing and Jiang, Zongjie



Research question: How to realize age recognition by comparison, as the naked eye does, in a computer vision task.
Motivation: Because representative contrast images for each age are hard to obtain, computer vision approaches usually ignore the idea of comparing against the ages of others.
Method: We design the Delta Age AdaIN (DAA) operation to obtain the feature difference with each age, producing a style map for each age through learned values representing the mean and standard deviation. We use binary codes of age natural numbers as the transfer-learning input to obtain continuous age feature information.
Results: Compared with state-of-the-art methods, our method achieves better performance with fewer parameters on multiple facial age datasets.

Naked eye recognition of age is usually based on comparison with the age of others. However, this idea is ignored by computer tasks because it is difficult to obtain representative contrast images of each age. Inspired by transfer learning, we design the Delta Age AdaIN (DAA) operation to obtain the feature difference with each age, which obtains the style map of each age through the learned values representing the mean and standard deviation. We use the binary code of the age, a natural number, as the transfer-learning input to obtain continuous age feature information. The two groups of values learned in the binary code mapping correspond to the mean and standard deviation of the comparison ages. In summary, our method consists of four parts: FaceEncoder, DAA operation, Binary code mapping, and AgeDecoder modules. After getting the delta age via AgeDecoder, we take the average value of all comparison ages and delta ages as the predicted age. Compared with state-of-the-art methods, our method achieves better performance with fewer parameters on multiple facial age datasets. Code is available at https://github.com/redcping/Delta_Age_AdaIN
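The AdaIN operation at the heart of DAA is the standard adaptive instance normalization: normalize the content features, then re-style them with a target mean and standard deviation. In DAA, the (mean, std) pair decoded from an age's binary code plays the style role; the sketch below shows the AdaIN formula itself on a toy 1-D feature vector.

```python
def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive instance normalization: strip the content feature's own
    statistics, then impose the style's mean and standard deviation."""
    n = len(content)
    mean = sum(content) / n
    var = sum((v - mean) ** 2 for v in content) / n
    std = (var + eps) ** 0.5
    return [style_std * (v - mean) / std + style_mean for v in content]

feat = [1.0, 2.0, 3.0, 4.0]
# Hypothetical style statistics, standing in for values decoded from an
# age's binary code in DAA.
styled = adain(feat, style_mean=10.0, style_std=2.0)
restyled_mean = sum(styled) / len(styled)
print(round(restyled_mean, 3))  # → 10.0
```

By construction the output's mean equals `style_mean` and its standard deviation approaches `style_std`, which is what lets one learned (mean, std) pair per age restyle a shared face feature into that age's style map.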

Can't Steal? Cont-Steal! Contrastive Stealing Attacks Against Image Encoders
Sha, Zeyang and He, Xinlei and Yu, Ning and Backes, Michael and Zhang, Yang



Research question: Conventional model stealing attacks only target supervised classifiers given their predicted labels and/or posteriors, leaving the vulnerability of unsupervised image encoders unexplored.
Motivation: Self-supervised image encoders require dedicated model designs and massive computation resources, which exposes them to model stealing attacks: a cheap way to mimic a well-trained encoder's performance while circumventing these demanding requirements.
Method: The authors first instantiate conventional stealing attacks against encoders, and further propose Cont-Steal, a contrastive-learning-based attack that better leverages the rich representations of encoders.
Results: Experiments show that encoders are more vulnerable to stealing than downstream classifiers, and Cont-Steal improves stealing effectiveness in various settings, highlighting the need for intellectual property protection of representation learning techniques and defenses against encoder stealing attacks.

Self-supervised representation learning techniques have been developing rapidly to make full use of unlabeled images. They encode images into rich features that are oblivious to downstream tasks. Behind their revolutionary representation power, the requirements for dedicated model designs and a massive amount of computation resources expose image encoders to the risks of potential model stealing attacks - a cheap way to mimic the well-trained encoder performance while circumventing the demanding requirements. Yet conventional attacks only target supervised classifiers given their predicted labels and/or posteriors, which leaves the vulnerability of unsupervised encoders unexplored. In this paper, we first instantiate the conventional stealing attacks against encoders and demonstrate that they are more vulnerable than downstream classifiers. To better leverage the rich representation of encoders, we further propose Cont-Steal, a contrastive-learning-based attack, and validate its improved stealing effectiveness in various experiment settings. As a takeaway, we appeal to our community's attention to the intellectual property protection of representation learning techniques, especially to the defenses against encoder stealing attacks like ours.

Edges to Shapes to Concepts: Adversarial Augmentation for Robust Vision
Tripathi, Aditay and Singh, Rishubh and Chakraborty, Anirban and Shenoy, Pradeep



Research question: Deep vision models rely excessively on low-level texture features, leading to poor generalization.
Motivation: We propose a simple, lightweight adversarial augmentation technique to address the texture bias in deep neural networks.
Method: Augmented images are generated by superposing the edge map of one image onto another image with shuffled patches, mixed in a randomly determined proportion. The model must then classify the augmented image, forcing it to learn holistic shapes for accurate prediction.
Results: Experiments show that this augmentation significantly improves classification accuracy and robustness across a range of datasets and architectures. For example, ViT-S gains up to 6% in classification accuracy, with gains of up to 28% and 8.5% on natural adversarial and out-of-distribution datasets such as ImageNet-A and ImageNet-R, respectively.

Recent work has shown that deep vision models tend to be overly dependent on low-level or "texture" features, leading to poor generalization. Various data augmentation strategies have been proposed to overcome this so-called texture bias in DNNs. We propose a simple, lightweight adversarial augmentation technique that explicitly incentivizes the network to learn holistic shapes for accurate prediction in an object classification setting. Our augmentations superpose edgemaps from one image onto another image with shuffled patches, using a randomly determined mixing proportion, and assign the image label of the edgemap image. To classify these augmented images, the model needs not only to detect and focus on edges but also to distinguish between relevant and spurious edges. We show that our augmentations significantly improve classification accuracy and robustness measures on a range of datasets and neural architectures. As an example, for ViT-S, we obtain absolute classification accuracy gains of up to 6%. We also obtain gains of up to 28% and 8.5% on natural adversarial and out-of-distribution datasets like ImageNet-A (for ViT-B) and ImageNet-R (for ViT-S), respectively. Analysis using a range of probe datasets shows substantially increased shape sensitivity in our trained models, explaining the observed improvement in robustness and classification accuracy.
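The augmentation recipe (edge map of image A, patch-shuffled image B, random mix, label from A) can be sketched on tiny 2-D arrays. The crude horizontal-gradient edge detector and the fixed patch grid below are simplifying assumptions; the paper's actual edge extraction and shuffling details may differ.

```python
import random

def edge_map(img):
    """Crude horizontal-gradient edge map of a 2-D grayscale image."""
    return [[abs(row[x + 1] - row[x]) if x + 1 < len(row) else 0.0
             for x in range(len(row))] for row in img]

def shuffle_patches(img, patch, seed=0):
    """Permute non-overlapping square patches to destroy global shape."""
    rng = random.Random(seed)
    h, w = len(img), len(img[0])
    coords = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    shuffled_coords = coords[:]
    rng.shuffle(shuffled_coords)
    out = [[0.0] * w for _ in range(h)]
    for (r, c), (sr, sc) in zip(coords, shuffled_coords):
        for i in range(patch):
            for j in range(patch):
                out[r + i][c + j] = img[sr + i][sc + j]
    return out

def augment(img_a, img_b, patch=2, mix=0.5, seed=0):
    """Superpose img_a's edge map onto patch-shuffled img_b; the augmented
    image keeps img_a's label, so only img_a's edges predict the class."""
    edges = edge_map(img_a)
    shuffled = shuffle_patches(img_b, patch, seed)
    return [[mix * e + (1.0 - mix) * s for e, s in zip(er, sr)]
            for er, sr in zip(edges, shuffled)]

a = [[float(x) for x in range(4)] for _ in range(4)]
b = [[float(x * y) for x in range(4)] for y in range(4)]
aug = augment(a, b)
print(len(aug), len(aug[0]))  # 4 4
```

Since the shuffled background carries plausible local texture but no coherent shape, a model can only earn the label of image A by reading its global edge structure.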

Feature Separation and Recalibration for Adversarial Robustness
Kim, Woo Jae and Cho, Yoonki and Jung, Junsik and Yoon, Sung-Eui



Research question: Deep neural networks are vulnerable to adversarial attacks due to the accumulation of perturbations at the feature level. Current methods improve robustness by deactivating the non-robust feature activations that cause mispredictions, but the authors argue that these malicious activations still contain discriminative cues, and with recalibration they can capture additional useful information for correct predictions.
Motivation: Existing methods improve robustness but leave room for improvement. The authors propose a novel, easy-to-plug-in approach, Feature Separation and Recalibration (FSR), which separates and recalibrates the malicious, non-robust activations to produce more robust feature maps.
Method: FSR first disentangles the input feature map into robust features, whose activations help the model make correct predictions, and non-robust features, whose activations cause mispredictions under adversarial attack. It then adjusts the non-robust activations to restore cues potentially useful for model predictions.
Results: Extensive experiments show that FSR outperforms traditional deactivation techniques and improves the robustness of existing adversarial training methods by up to 8.57% with small computational overhead. Code is available at https://github.com/wkim97/FSR.

Deep neural networks are susceptible to adversarial attacks due to the accumulation of perturbations in the feature level, and numerous works have boosted model robustness by deactivating the non-robust feature activations that cause model mispredictions. However, we claim that these malicious activations still contain discriminative cues and that with recalibration, they can capture additional useful information for correct model predictions. To this end, we propose a novel, easy-to-plugin approach named Feature Separation and Recalibration (FSR) that recalibrates the malicious, non-robust activations for more robust feature maps through Separation and Recalibration. The Separation part disentangles the input feature map into the robust feature with activations that help the model make correct predictions and the non-robust feature with activations that are responsible for model mispredictions upon adversarial attack. The Recalibration part then adjusts the non-robust activations to restore the potentially useful cues for model predictions. Extensive experiments verify the superiority of FSR compared to traditional deactivation techniques and demonstrate that it improves the robustness of existing adversarial training methods by up to 8.57% with small computational overhead. Codes are available at https://github.com/wkim97/FSR.

The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training
Kang, Gi-Cheon and Kim, Sungdong and Kim, Jin-Hwa and Kwak, Donghyun and Zhang, Byoung-Tak



Research question: This paper addresses the visual dialog (VisDial) task: answering a sequence of questions grounded in an image, using the dialog history as context.
Motivation: Current visual dialog models are trained via supervised learning or pre-training on related vision-and-language datasets, which requires large amounts of annotated data. This paper therefore proposes a semi-supervised approach that leverages unlabeled images on the Web.
Method: The proposed Generative Self-Training (GST) first retrieves in-domain images through out-of-distribution detection, then generates synthetic dialogs about these images via multimodal conditional text generation, and finally trains the dialog agent on the original VisDial data together with the synthetic dialogs.
Results: Experiments show that GST achieves new state-of-the-art results on the VisDial v1.0 and v0.9 datasets. GST is also robust to visual and textual adversarial attacks and performs strongly in the low-data regime.

Visual dialog (VisDial) is a task of answering a sequence of questions grounded in an image, using the dialog history as context. Prior work has trained the dialog agents solely on VisDial data via supervised learning or leveraged pre-training on related vision-and-language datasets. This paper presents a semi-supervised learning approach for visually-grounded dialog, called Generative Self-Training (GST), to leverage unlabeled images on the Web. Specifically, GST first retrieves in-domain images through out-of-distribution detection and generates synthetic dialogs regarding the images via multimodal conditional text generation. GST then trains a dialog agent on the synthetic and the original VisDial data. As a result, GST scales the amount of training data up to an order of magnitude that of VisDial (1.2M to 12.9M QA data). For robust training of the synthetic dialogs, we also propose perplexity-based data selection and multimodal consistency regularization. Evaluation on VisDial v1.0 and v0.9 datasets shows that GST achieves new state-of-the-art results on both datasets. We further observe the robustness of GST against both visual and textual adversarial attacks. Finally, GST yields strong performance gains in the low-data regime. Code is available at https://github.com/gicheonkang/gst-visdial.
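Perplexity-based data selection can be sketched in a few lines: score each synthetic dialog by the exponentiated average negative log-probability of its tokens under the generator, and keep only dialogs below a threshold. The threshold and the toy log-probabilities below are illustrative assumptions, not values from the paper.

```python
import math

def perplexity(token_log_probs):
    """Perplexity of a generated sequence from its per-token log-probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def select_dialogs(dialogs, threshold):
    """Keep only synthetic dialogs whose perplexity under the generator is
    below a threshold -- a proxy for fluent, self-consistent QA pairs."""
    return [d for d, lp in dialogs if perplexity(lp) < threshold]

# Hypothetical (dialog, per-token log-prob) pairs from a generator.
dialogs = [
    ("is there a dog? yes", [-0.1, -0.2, -0.1, -0.1]),  # fluent: low perplexity
    ("is there a dog? qzx", [-0.1, -0.2, -0.1, -4.0]),  # degenerate tail token
]
print(select_dialogs(dialogs, threshold=2.0))
```

Filtering this way discards the noisiest machine-generated dialogs before they enter training, which is one of the two regularizers (alongside multimodal consistency) that the paper uses to make self-training robust.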

Towards Benchmarking and Assessing Visual Naturalness of Physical World Adversarial Attacks
Li, Simin and Zhang, Shuning and Chen, Gujun and Wang, Dong and Feng, Pu and Wang, Jiakai and Liu, Aishan and Yi, Xin and Liu, Xianglong



Research question: Existing methods for evaluating physical-world adversarial attacks suffer from errors, bias, and inconsistency.
Motivation: To address these problems, this paper proposes a new approach for assessing the visual naturalness of physical-world adversarial attacks.
Method: First, the authors create the first Physical Attack Naturalness (PAN) dataset with human ratings and gaze. Second, they introduce the Dual Prior Alignment (DPA) network, which embeds human knowledge into the model's reasoning process to imitate how humans assess naturalness and to mimic human gaze behavior.
Results: Experiments show that the method assesses the naturalness of physical-world adversarial attacks more accurately and fosters research on improving and automatically assessing naturalness.

Physical world adversarial attack is a highly practical and threatening attack, which fools real world deep learning systems by generating conspicuous and maliciously crafted real world artifacts. In physical world attacks, evaluating naturalness is highly emphasized since humans can easily detect and remove unnatural attacks. However, current studies evaluate naturalness in a case-by-case fashion, which suffers from errors, bias and inconsistencies. In this paper, we take the first step to benchmark and assess visual naturalness of physical world attacks, taking the autonomous driving scenario as the first attempt. First, to benchmark attack naturalness, we contribute the first Physical Attack Naturalness (PAN) dataset with human rating and gaze. PAN verifies several insights for the first time: naturalness is (disparately) affected by contextual features (i.e., environmental and semantic variations) and correlates with behavioral features (i.e., gaze signals). Second, to automatically assess attack naturalness in a way that aligns with human ratings, we further introduce the Dual Prior Alignment (DPA) network, which aims to embed human knowledge into the model reasoning process. Specifically, DPA imitates human reasoning in naturalness assessment by rating prior alignment and mimics human gaze behavior by attentive prior alignment. We hope our work fosters research on improving and automatically assessing the naturalness of physical world attacks. Our code and exemplar data can be found at https://github.com/zhangsn-19/PAN.

ImageNet-E: Benchmarking Neural Network Robustness via Attribute Editing
Li, Xiaodan and Chen, Yuefeng and Zhu, Yao and Wang, Shuhui and Zhang, Rong and Xue, Hui



Research question: Existing deep models are highly sensitive to changes in image attributes; how can model robustness be improved?
Motivation: Most current research focuses on robustness to out-of-distribution data while overlooking potential problems on in-distribution data. This work performs model debugging on in-distribution data to explore which object attributes a model may be sensitive to.
Method: The authors create an object editing toolkit with controls over background, size, position, and direction, and build a rigorous benchmark named ImageNet-E for evaluating model robustness to object attributes. With this benchmark, they evaluate current deep models, including convolutional neural networks and vision transformers.
Results: Most models turn out to be quite sensitive to attribute changes: an imperceptible background change causes an average 9.23% drop in top-1 accuracy. Some supposedly robust models, such as adversarially trained ones, actually perform worse than vanilla models under attribute changes. Based on these findings, the authors propose ways to enhance attribute robustness via preprocessing, architecture design, and training strategies.

Recent studies have shown that higher accuracy on ImageNet usually leads to better robustness against different corruptions. In this paper, instead of following the traditional research paradigm that investigates new out-of-distribution corruptions or perturbations deep models may encounter, we conduct model debugging in in-distribution data to explore which object attributes a model may be sensitive to. To achieve this goal, we create a toolkit for object editing with controls of backgrounds, sizes, positions, and directions, and create a rigorous benchmark named ImageNet-E(diting) for evaluating the image classifier robustness in terms of object attributes. With our ImageNet-E, we evaluate the performance of current deep learning models, including both convolutional neural networks and vision transformers. We find that most models are quite sensitive to attribute changes. An imperceptible change in the background can lead to an average of 9.23% drop on top-1 accuracy. We also evaluate some robust models including both adversarially trained models and other robust trained models and find that some models show worse robustness against attribute changes than vanilla models. Based on these findings, we discover ways to enhance attribute robustness with preprocessing, architecture designs, and training strategies. We hope this work can provide some insights to the community and open up a new avenue for research in robust computer vision. The code and dataset will be publicly available.

Real-Time Controllable Denoising for Image and Video
Zhang, Zhaoyang and Jiang, Yitong and Shao, Wenqi and Wang, Xiaogang and Luo, Ping and Lin, Kaimo and Gu, Jinwei



Research question: This paper addresses the problem that neural denoisers require a full network inference every time the denoising strength is adjusted, making arbitrary denoising-level adjustment infeasible for real-time user interaction.
Motivation: Traditional filter-based denoising methods can adjust the denoising level simply by tuning the filter strength, but for neural network models each strength adjustment requires a new network inference, which is nearly impossible to support in real-time interaction.
Method: The paper proposes Real-time Controllable Denoising (RCD), a deep image and video denoising pipeline that replaces the last output layer of an existing CNN model (which usually outputs a single noise map) with a lightweight module that outputs multiple noise maps, enabling real-time editing of arbitrary denoising levels.
Results: Experiments show that RCD enables real-time editable image and video denoising for various existing heavyweight models without sacrificing their original performance and without further network inference.

Controllable image denoising aims to generate clean samples with human perceptual priors and balance sharpness and smoothness. In traditional filter-based denoising methods, this can be easily achieved by adjusting the filtering strength. However, for NN (Neural Network)-based models, adjusting the final denoising strength requires performing network inference each time, making it almost impossible for real-time user interaction. In this paper, we introduce Real-time Controllable Denoising (RCD), the first deep image and video denoising pipeline that provides a fully controllable user interface to edit arbitrary denoising levels in real-time with only one-time network inference. Unlike existing controllable denoising methods that require multiple denoisers and training stages, RCD replaces the last output layer (which usually outputs a single noise map) of an existing CNN-based model with a lightweight module that outputs multiple noise maps. We propose a novel Noise Decorrelation process to enforce the orthogonality of the noise feature maps, allowing arbitrary noise level control through noise map interpolation. This process is network-free and does not require network inference. Our experiments show that RCD can enable real-time editable image and video denoising for various existing heavy-weight models without sacrificing their original performance.
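The control mechanism can be sketched in two pieces: decorrelate the predicted noise maps so they are mutually orthogonal, then denoise by subtracting a user-weighted interpolation of them. Classical Gram-Schmidt is used below as an illustrative stand-in for the paper's Noise Decorrelation process; the exact operation in RCD may differ.

```python
def gram_schmidt(noise_maps):
    """Orthogonalize flattened noise maps (stand-in for Noise Decorrelation),
    so interpolating between them gives predictable denoising levels."""
    ortho = []
    for m in noise_maps:
        v = m[:]
        for b in ortho:
            dot = sum(x * y for x, y in zip(v, b))
            norm2 = sum(x * x for x in b)
            v = [x - dot / norm2 * y for x, y in zip(v, b)]
        ortho.append(v)
    return ortho

def denoise(noisy, noise_maps, level_weights):
    """Controllable denoising: subtract an interpolation of the noise maps.
    Changing level_weights is pure arithmetic -- no new network inference."""
    combo = [sum(w * m[i] for w, m in zip(level_weights, noise_maps))
             for i in range(len(noisy))]
    return [x - n for x, n in zip(noisy, combo)]

maps = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]])
dot = sum(a * b for a, b in zip(maps[0], maps[1]))
print(round(dot, 6))  # → 0.0
```

Since the network runs once to produce the maps and every subsequent strength edit is just a weighted subtraction, the denoising level becomes a real-time slider.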

SQUID: Deep Feature In-Painting for Unsupervised Anomaly Detection
Xiang, Tiange and Zhang, Yixiao and Lu, Yongyi and Yuille, Alan L. and Zhang, Chaoyi and Cai, Weidong and Zhou, Zongwei



Research question: How to exploit the structured information produced by radiography imaging protocols for image in-painting and anomaly detection.
Motivation: Radiography imaging protocols produce highly similar images with recurrent anatomical structures across patients. To exploit this structured information, we propose Space-aware Memory Queues for In-painting and Detecting anomalies from radiography images (SQUID).
Method: We use space-aware memory queues to taxonomize the ingrained anatomical structures into recurrent patterns; at inference, the model identifies anomalies (unseen/modified patterns) in the image.
Results: SQUID surpasses 13 state-of-the-art methods in unsupervised anomaly detection by at least 5 points of Area Under the Curve (AUC) on two chest X-ray benchmark datasets. We also create a new dataset (DigitAnatomy) that synthesizes the spatial correlation and consistent shape of chest anatomy, which we hope will prompt the development, evaluation, and interpretability of anomaly detection methods.

Radiography imaging protocols focus on particular body regions, therefore producing images of great similarity and yielding recurrent anatomical structures across patients. To exploit this structured information, we propose the use of Space-aware Memory Queues for In-painting and Detecting anomalies from radiography images (abbreviated as SQUID). We show that SQUID can taxonomize the ingrained anatomical structures into recurrent patterns; and in the inference, it can identify anomalies (unseen/modified patterns) in the image. SQUID surpasses 13 state-of-the-art methods in unsupervised anomaly detection by at least 5 points on two chest X-ray benchmark datasets measured by the Area Under the Curve (AUC). Additionally, we have created a new dataset (DigitAnatomy), which synthesizes the spatial correlation and consistent shape in chest anatomy. We hope DigitAnatomy can prompt the development, evaluation, and interpretability of anomaly detection methods.

Visual Recognition-Driven Image Restoration for Multiple Degradation With Intrinsic Semantics Recovery
Yang, Zizheng and Huang, Jie and Chang, Jiahao and Zhou, Man and Yu, Hu and Zhang, Jinghao and Zhao, Feng



Research question: Deep image recognition models suffer a significant performance drop when applied to low-quality images.
Motivation: Current image restoration and domain adaptation methods either focus on visual quality rather than recognition quality, or require task-specific semantic annotations for training.
Method: We propose VRD-IR, a visual recognition-driven image restoration network for multiple degradations that recovers high-quality images from various unknown corruption types, from the perspective of visual recognition, within a single model.
Results: Experiments show that VRD-IR surpasses existing image restoration methods and achieves superior performance on high-level tasks such as classification, detection, and person re-identification.

Deep image recognition models suffer a significant performance drop when applied to low-quality images since they are trained on high-quality images. Although many studies have investigated to solve the issue through image restoration or domain adaptation, the former focuses on visual quality rather than recognition quality, while the latter requires semantic annotations for task-specific training. In this paper, to address more practical scenarios, we propose a Visual Recognition-Driven Image Restoration network for multiple degradation, dubbed VRD-IR, to recover high-quality images from various unknown corruption types from the perspective of visual recognition within one model. Concretely, we harmonize the semantic representations of diverse degraded images into a unified space in a dynamic manner, and then optimize them towards intrinsic semantics recovery. Moreover, a prior-ascribing optimization strategy is introduced to encourage VRD-IR to couple with various downstream recognition tasks better. Our VRD-IR is corruption- and recognition-agnostic, and can be inserted into various recognition tasks directly as an image enhancement module. Extensive experiments on multiple image distortions demonstrate that our VRD-IR surpasses existing image restoration methods and shows superior performance on diverse high-level tasks, including classification, detection, and person re-identification.

You Are Catching My Attention: Are Vision Transformers Bad Learners Under Backdoor Attacks?
Yuan, Zenghui and Zhou, Pan and Zou, Kai and Cheng, Yu



Research question: This paper addresses a security challenge facing vision transformers (ViTs) during industrialization: backdoor attacks.
Motivation: Although ViTs have achieved remarkable success in computer vision, their sensitivity to patch-wise triggers makes them vulnerable to backdoor attacks.
Method: The authors design BadViT, a novel backdoor attack framework that uses a universal patch-wise trigger to catch the model's attention, manipulating the self-attention mechanism that ViTs rely on in order to confuse the model. They also propose invisible variants of BadViT that increase the stealth of the attack by limiting the strength of the trigger perturbation.
Results: Experiments show that BadViT is an effective backdoor attack against ViTs: it depends on few poisoned samples, converges well, and transfers to downstream tasks. The authors also explore the backdoor risks within ViTs from the perspective of existing advanced defense schemes.

Vision Transformers (ViTs), which made a splash in the field of computer vision (CV), have shaken the dominance of convolutional neural networks (CNNs). However, in the process of industrializing ViTs, backdoor attacks have brought severe challenges to security. The success of ViTs benefits from the self-attention mechanism. However, compared with CNNs, we find that this mechanism of capturing global information within patches makes ViTs more sensitive to patch-wise triggers. Under such observations, we delicately design a novel backdoor attack framework for ViTs, dubbed BadViT, which utilizes a universal patch-wise trigger to catch the model's attention from patches beneficial for classification to those with triggers, thereby manipulating the mechanism on which ViTs survive to confuse itself. Furthermore, we propose invisible variants of BadViT to increase the stealth of the attack by limiting the strength of the trigger perturbation. Through a large number of experiments, it is proved that BadViT is an efficient backdoor attack method against ViTs, which is less dependent on the number of poisons, with satisfactory convergence, and is transferable for downstream tasks. Furthermore, the risks inside of ViTs to backdoor attacks are also explored from the perspective of existing advanced defense schemes.

STDLens: Model Hijacking-Resilient Federated Learning for Object Detection
Chow, Ka-Ho and Liu, Ling and Wei, Wenqi and Ilhan, Fatih and Wu, Yanzhao



Research question: How to defend federated learning against model hijacking attacks.
Motivation: Despite its advantages, federated learning is vulnerable to model hijacking: an attacker can control the behavior of an object detection system by implanting Trojaned gradients during the collaborative learning process.
Method: This paper proposes STDLens, a principled approach to safeguarding federated learning against such attacks. The authors first investigate existing mitigation mechanisms and analyze why they fail due to inherent errors in spatial clustering analysis on gradients. Based on these insights, they introduce a three-tier forensic framework to identify and expel Trojaned gradients and reclaim performance over the course of federated learning.
Results: Experiments show that STDLens protects federated learning against different types of model hijacking attacks and outperforms existing methods in identifying and removing Trojaned gradients, with significantly higher precision and much lower false-positive rates.

Federated Learning (FL) has been gaining popularity as a collaborative learning framework to train deep learning-based object detection models over a distributed population of clients. Despite its advantages, FL is vulnerable to model hijacking. The attacker can control how the object detection system should misbehave by implanting Trojaned gradients using only a small number of compromised clients in the collaborative learning process. This paper introduces STDLens, a principled approach to safeguarding FL against such attacks. We first investigate existing mitigation mechanisms and analyze their failures caused by the inherent errors in spatial clustering analysis on gradients. Based on the insights, we introduce a three-tier forensic framework to identify and expel Trojaned gradients and reclaim the performance over the course of FL. We consider three types of adaptive attacks and demonstrate the robustness of STDLens against advanced adversaries. Extensive experiments show that STDLens can protect FL against different model hijacking attacks and outperform existing methods in identifying and removing Trojaned gradients with significantly higher precision and much lower false-positive rates. The source code is available at https://github.com/git-disl/STDLens.

Multispectral Video Semantic Segmentation: A Benchmark Dataset and Baseline
Ji, Wei and Li, Jingjing and Bian, Cheng and Zhou, Zongwei and Zhao, Jiaying and Yuille, Alan L. and Cheng, Li



Research question: How to achieve robust and reliable semantic segmentation in complex scenes and under adverse conditions.
Motivation: Existing methods mainly rely on RGB inputs and perform poorly in adverse weather or lighting. Researchers have therefore begun to explore multispectral semantic segmentation that takes both RGB and thermal infrared (RGBT) images as input.
Method: This paper introduces a new task, Multispectral Video Semantic Segmentation (MVSS), and curates the MVSeg dataset of 738 calibrated RGB and thermal videos with 3,545 fine-grained pixel-level semantic annotations over 26 categories. The authors also propose MVNet, an effective MVSS baseline and, to their knowledge, the first model to jointly learn semantic representations from multispectral and temporal contexts.
Results: Experiments show that multispectral video input significantly improves semantic segmentation, and the effectiveness of the MVNet baseline is verified.

Robust and reliable semantic segmentation in complex scenes is crucial for many real-life applications such as autonomous safe driving and nighttime rescue. In most approaches, it is typical to make use of RGB images as input. They however work well only in preferred weather conditions; when facing adverse conditions such as rainy, overexposure, or low-light, they often fail to deliver satisfactory results. This has led to the recent investigation into multispectral semantic segmentation, where RGB and thermal infrared (RGBT) images are both utilized as input. This gives rise to significantly more robust segmentation of image objects in complex scenes and under adverse conditions. Nevertheless, the present focus in single RGBT image input restricts existing methods from well addressing dynamic real-world scenes. Motivated by the above observations, in this paper, we set out to address a relatively new task of semantic segmentation of multispectral video input, which we refer to as Multispectral Video Semantic Segmentation, or MVSS in short. An in-house MVSeg dataset is thus curated, consisting of 738 calibrated RGB and thermal videos, accompanied by 3,545 fine-grained pixel-level semantic annotations of 26 categories. Our dataset contains a wide range of challenging urban scenes in both daytime and nighttime. Moreover, we propose an effective MVSS baseline, dubbed MVNet, which is to our knowledge the first model to jointly learn semantic representations from multispectral and temporal contexts. Comprehensive experiments are conducted using various semantic segmentation models on the MVSeg dataset. Empirically, the engagement of multispectral video input is shown to lead to significant improvement in semantic segmentation; the effectiveness of our MVNet baseline has also been verified.

An In-Depth Exploration of Person Re-Identification and Gait Recognition in Cloth-Changing Conditions
Li, Weijia and Hou, Saihui and Zhang, Chunjie and Cao, Chunshui and Liu, Xu and Huang, Yongzhen and Zhao, Yao



Research question: This paper addresses clothing changes of target pedestrians under surveillance cameras, a setting in which person re-identification and gait recognition share the same goal.
Motivation: Due to the lack of a suitable cloth-changing benchmark, video-based person re-identification has rarely been studied under clothing changes, while gait recognition is often researched under controlled conditions.
Method: This paper proposes CCPG, a cloth-changing benchmark with several highlights: (1) 200 identities and over 16K sequences captured both indoors and outdoors; (2) seven different cloth-changing statuses per identity, rarely seen in previous datasets; (3) both RGB and silhouette versions of the data, released for research purposes. To investigate the cloth-changing problem systematically, comprehensive experiments are conducted on video-based re-identification and gait recognition methods.
Results: The experiments demonstrate the respective strengths of re-identification and gait recognition under different cloth-changing conditions, and suggest that gait recognition is a potential solution to the cloth-changing problem. The dataset will be available at https://github.com/BNU-IVC/CCPG.

The target of person re-identification (ReID) and gait recognition is consistent, that is to match the target pedestrian under surveillance cameras. For the cloth-changing problem, video-based ReID is rarely studied due to the lack of a suitable cloth-changing benchmark, and gait recognition is often researched under controlled conditions. To tackle this problem, we propose a Cloth-Changing benchmark for Person re-identification and Gait recognition (CCPG). It is a cloth-changing dataset, and there are several highlights in CCPG, (1) it provides 200 identities and over 16K sequences are captured indoors and outdoors, (2) each identity has seven different cloth-changing statuses, which is hardly seen in previous datasets, (3) RGB and silhouettes version data are both available for research purposes. Moreover, aiming to investigate the cloth-changing problem systematically, comprehensive experiments are conducted on video-based ReID and gait recognition methods. The experimental results demonstrate the superiority of ReID and gait recognition separately in different cloth-changing conditions and suggest that gait recognition is a potential solution for addressing the cloth-changing problem. Our dataset will be available at https://github.com/BNU-IVC/CCPG.

Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions
Zhu, Yurui and Wang, Tianyu and Fu, Xueyang and Yang, Xuanyu and Guo, Xin and Dai, Jifeng and Qiao, Yu and Hu, Xiaowei



Research question: How to remove image artifacts caused by multiple adverse weather conditions using a single set of network parameters.
Motivation: Distorted images under different weather conditions contain both general characteristics and weather-specific ones.
Method: An efficient unified framework with a two-stage training strategy that learns weather-general and weather-specific features separately: the first stage learns the general features, and the second adaptively expands specific parameters for each weather type.
Results: Experiments show that the method achieves superior performance on all synthetic and real-world benchmark datasets.

Image restoration under multiple adverse weather conditions aims to remove weather-related artifacts by using the single set of network parameters. In this paper, we find that distorted images under different weather conditions contain general characteristics as well as their specific characteristics. Inspired by this observation, we design an efficient unified framework with a two-stage training strategy to explore the weather-general and weather-specific features. The first training stage aims to learn the weather-general features by taking the images under various weather conditions as the inputs and outputting the coarsely restored results. The second training stage aims to learn to adaptively expand the specific parameters for each weather type in the deep model, where requisite positions for expansion of weather-specific parameters are learned automatically. Hence, we can obtain an efficient and unified model for image restoration under multiple adverse weather conditions. Moreover, we build the first real-world benchmark dataset with multiple weather conditions to better deal with real-world weather scenarios. Experimental results show that our method achieves superior performance on all the synthetic and real-world benchmark datasets.

Private Image Generation With Dual-Purpose Auxiliary Classifier
Chen, Chen and Liu, Daochang and Ma, Siqi and Nepal, Surya and Xu, Chang



Research question: How to improve the quality and utility of generated images while guaranteeing privacy in image generation?
Motivation: Privacy-preserving image generation matters for domains with sensitive and limited data, such as medicine. However, privacy budget constraints can degrade the quality and utility of the generated images.
Method: A novel private image generation method that incorporates a dual-purpose auxiliary classifier, which alternates between learning from real data and fake data, into the training of differentially private GANs. Deliberate training strategies such as sequential training accelerate the generator's convergence and further boost performance.
Results: The method achieves new state-of-the-art results on all metrics across three benchmarks: MNIST, Fashion-MNIST, and CelebA.

Privacy-preserving image generation has been important for segments such as medical domains that have sensitive and limited data. The benefits of guaranteed privacy come at the costs of generated images' quality and utility due to the privacy budget constraints. The utility is currently measured by the gen2real accuracy (g2r%), i.e., the accuracy on real data of a downstream classifier trained using generated data. However, apart from this standard utility, we identify the "reversed utility" as another crucial aspect, which computes the accuracy on generated data of a classifier trained using real data, dubbed as real2gen accuracy (r2g%). Jointly considering these two views of utility, the standard and the reversed, could help the generation model better improve transferability between fake and real data. Therefore, we propose a novel private image generation method that incorporates a dual-purpose auxiliary classifier, which alternates between learning from real data and fake data, into the training of differentially private GANs. Additionally, our deliberate training strategies such as sequential training contribute to accelerating the generator's convergence and further boosting the performance upon exhausting the privacy budget. Our results achieve new state-of-the-art performance on all metrics across three benchmarks: MNIST, Fashion-MNIST, and CelebA.

Generating Aligned Pseudo-Supervision From Non-Aligned Data for Image Restoration in Under-Display Camera
Feng, Ruicheng and Li, Chongyi and Chen, Huaijin and Li, Shuai and Gu, Jinwei and Loy, Chen Change



Research question: Because it is difficult to collect large-scale, perfectly aligned paired training data for Under-Display Camera (UDC) image restoration, previous methods resort to monitor-based image systems or simulation-based approaches, sacrificing data realness and introducing domain gaps.
Motivation: This paper revisits the classic stereo setup for training data collection: capturing two images of the same scene with one UDC and one standard camera. The key idea is to "copy" details from a high-quality reference image and "paste" them onto the UDC image. While able to generate real training pairs, this setup is susceptible to spatial misalignment caused by perspective and depth-of-field changes, further compounded by the large domain discrepancy between UDC and normal images, which is unique to UDC restoration.
Method: The paper mitigates the non-trivial domain discrepancy and spatial misalignment through a novel Transformer-based framework that generates well-aligned, high-quality target data for the corresponding UDC input. This is achieved with two carefully designed components, the Domain Alignment Module (DAM) and the Geometric Alignment Module (GAM), which encourage robust and accurate discovery of correspondence between the UDC and normal views.
Results: Extensive experiments show that high-quality, well-aligned pseudo UDC training pairs are beneficial for training a robust restoration network. Code and the dataset are available at https://github.com/jnjaby/AlignFormer.

Due to the difficulty in collecting large-scale and perfectly aligned paired training data for Under-Display Camera (UDC) image restoration, previous methods resort to monitor-based image systems or simulation-based methods, sacrificing the realness of the data and introducing domain gaps. In this work, we revisit the classic stereo setup for training data collection -- capturing two images of the same scene with one UDC and one standard camera. The key idea is to "copy" details from a high-quality reference image and "paste" them on the UDC image. While being able to generate real training pairs, this setting is susceptible to spatial misalignment due to perspective and depth of field changes. The problem is further compounded by the large domain discrepancy between the UDC and normal images, which is unique to UDC restoration. In this paper, we mitigate the non-trivial domain discrepancy and spatial misalignment through a novel Transformer-based framework that generates well-aligned yet high-quality target data for the corresponding UDC input. This is made possible through two carefully designed components, namely, the Domain Alignment Module (DAM) and Geometric Alignment Module (GAM), which encourage robust and accurate discovery of correspondence between the UDC and normal views. Extensive experiments show that high-quality and well-aligned pseudo UDC training pairs are beneficial for training a robust restoration network. Code and the dataset are available at https://github.com/jnjaby/AlignFormer.

CRAFT: Concept Recursive Activation FacTorization for Explainability
Fel, Thomas and Picard, Agustin and Béthune, Louis and Boissin, Thibaut and Vigouroux, David and Colin, Julien and Cadène, Rémi and Serre, Thomas



Research question: Existing attribution methods mainly highlight the image regions most important to a model's decision, but cannot convey what information the model saw at those locations.
Motivation: To fill this gap, this paper proposes Craft, which identifies both "what" and "where" by generating concept-based explanations.
Method: Craft introduces three new ingredients: (i) a recursive strategy to detect and decompose concepts across layers; (ii) a novel method for more faithful estimation of concept importance using Sobol indices; and (iii) the use of implicit differentiation to unlock Concept Attribution Maps.
Results: Experiments show that the recursive decomposition produces meaningful and accurate concepts, and that the proposed concept importance estimation is more faithful to the model than previous methods. In evaluating usefulness for human experimenters, Craft significantly improves on two of the three test scenarios (while no current method, including Craft, helps on the third). Overall, although much work remains toward general explainability methods for practical scenarios, identifying meaningful concepts at the proper level of granularity yields useful information complementary to attribution methods.

Attribution methods are a popular class of explainability methods that use heatmaps to depict the most important areas of an image that drive a model decision. Nevertheless, recent work has shown that these methods have limited utility in practice, presumably because they only highlight the most salient parts of an image (i.e., "where" the model looked) and do not communicate any information about "what" the model saw at those locations. In this work, we try to fill in this gap with Craft -- a novel approach to identify both "what" and "where" by generating concept-based explanations. We introduce 3 new ingredients to the automatic concept extraction literature: (i) a recursive strategy to detect and decompose concepts across layers, (ii) a novel method for a more faithful estimation of concept importance using Sobol indices, and (iii) the use of implicit differentiation to unlock Concept Attribution Maps. We conduct both human and computer vision experiments to demonstrate the benefits of the proposed approach. We show that our recursive decomposition generates meaningful and accurate concepts and that the proposed concept importance estimation technique is more faithful to the model than previous methods. When evaluating the usefulness of the method for human experimenters on the utility benchmark, we find that our approach significantly improves on two of the three test scenarios (while none of the current methods including ours help on the third). Overall, our study suggests that, while much work remains toward the development of general explainability methods that are useful in practical scenarios, the identification of meaningful concepts at the proper level of granularity yields useful and complementary information beyond that afforded by attribution methods.
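As a concrete illustration of ingredient (ii), a first-order Sobol index can be estimated with a standard pick-freeze Monte Carlo scheme. The sketch below is our own toy example, not the Craft implementation: `f` merely stands in for "model output as a function of concept activations", and the dimensions are arbitrary.

```python
import numpy as np

# Pick-freeze estimator of first-order Sobol indices (illustrative only).
rng = np.random.default_rng(0)

def f(x):
    # Toy "model": concept 0 dominates the output, concept 1 barely matters.
    return 3.0 * x[:, 0] + 0.1 * x[:, 1]

n = 100_000
A = rng.uniform(size=(n, 2))   # base sample of concept activations
B = rng.uniform(size=(n, 2))   # independent resample
fA, fB = f(A), f(B)

def sobol_first_order(i):
    AB = A.copy()
    AB[:, i] = B[:, i]         # pick-freeze: replace only coordinate i
    # V_i estimate divided by total variance gives the first-order index.
    return np.mean(fB * (f(AB) - fA)) / np.var(fA)

s = [sobol_first_order(i) for i in range(2)]
print(s)  # concept 0 explains almost all of the output variance
```

A large `s[i]` marks a concept whose variation alone drives most of the model's output variance, which is exactly the ranking signal the abstract describes.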

All-in-Focus Imaging From Event Focal Stack
Lou, Hanyue and Teng, Minggui and Yang, Yixin and Shi, Boxin



Research question: How to generate a high-quality all-in-focus image from a single shot.
Motivation: Traditional focal stack methods require multiple shots of the same scene focused at different distances and do not handle dynamic scenes well. Because single-image defocus and deblurring is highly ill-posed, generating a high-quality all-in-focus image from a single shot is challenging.
Method: This paper proposes the event focal stack, defined as the event streams captured during a continuous focal sweep. Given an RGB image focused at an arbitrary distance, the method exploits the high temporal resolution of event streams to automatically select refocusing timestamps and reconstruct the corresponding refocused images, forming a focal stack. Guided by the neighbouring events around the selected timestamps, the focal stack is merged with proper weights to restore a sharp all-in-focus image.
Results: Experimental results on both synthetic and real datasets show superior performance over state-of-the-art methods.

Traditional focal stack methods require multiple shots to capture images focused at different distances of the same scene, which cannot be applied to dynamic scenes well. Generating a high-quality all-in-focus image from a single shot is challenging, due to the highly ill-posed nature of the single-image defocus and deblurring problem. In this paper, to restore an all-in-focus image, we propose the event focal stack which is defined as event streams captured during a continuous focal sweep. Given an RGB image focused at an arbitrary distance, we explore the high temporal resolution of event streams, from which we automatically select refocusing timestamps and reconstruct corresponding refocused images with events to form a focal stack. Guided by the neighbouring events around the selected timestamps, we can merge the focal stack with proper weights and restore a sharp all-in-focus image. Experimental results on both synthetic and real datasets show superior performance over state-of-the-art methods.

Label-Free Liver Tumor Segmentation
Hu, Qixin and Chen, Yixiong and Xiao, Junfei and Sun, Shuwen and Chen, Jieneng and Yuille, Alan L. and Zhou, Zongwei



Research question: This paper aims to demonstrate that AI models can accurately segment liver tumors without manual annotation.
Motivation: Existing liver tumor segmentation methods require extensive manual annotation, which is time-consuming and laborious.
Method: Training on synthetic tumors inserted into CT scans; the synthetic tumors are realistic in shape and texture and are effective for training AI models for liver tumor segmentation.
Results: Experiments show that models trained on synthetic tumors perform comparably to models trained on real tumors. The approach can automatically generate many examples of small tumors, which may improve the success rate of detecting early-stage cancers, while greatly reducing the need for manual annotation.

We demonstrate that AI models can accurately segment liver tumors without the need for manual annotation by using synthetic tumors in CT scans. Our synthetic tumors have two intriguing advantages: (I) realistic in shape and texture, which even medical professionals can confuse with real tumors; (II) effective for training AI models, which can perform liver tumor segmentation similarly to the model trained on real tumors--this result is exciting because no existing work, using synthetic tumors only, has thus far reached a similar or even close performance to real tumors. This result also implies that manual efforts for annotating tumors voxel by voxel (which took years to create) can be significantly reduced in the future. Moreover, our synthetic tumors can automatically generate many examples of small (or even tiny) synthetic tumors and have the potential to improve the success rate of detecting small liver tumors, which is critical for detecting the early stages of cancer. In addition to enriching the training data, our synthesizing strategy also enables us to rigorously assess the AI robustness.

Defining and Quantifying the Emergence of Sparse Concepts in DNNs
Ren, Jie and Li, Mingjie and Chen, Qirui and Deng, Huiqi and Zhang, Quanshi



Research question: This paper aims to illustrate the concept-emerging phenomenon in a trained deep neural network (DNN).
Motivation: The authors find that a DNN's inference score can be disentangled into the effects of a few interactive concepts, which can be understood as inference patterns in a sparse, symbolic graphical model that explains the DNN.
Method: The graphical model is used to explain the DNN, and the authors prove that it can well mimic the DNN's outputs on a large number of different masked samples. The graphical model can be further simplified and rewritten as an And-Or graph (AOG) without losing much explanation accuracy.
Results: Experiments show that the approach effectively explains the DNN's inference process, and the resulting AOG retains good explanation accuracy.

This paper aims to illustrate the concept-emerging phenomenon in a trained DNN. Specifically, we find that the inference score of a DNN can be disentangled into the effects of a few interactive concepts. These concepts can be understood as inference patterns in a sparse, symbolic graphical model, which explains the DNN. The faithfulness of using such a graphical model to explain the DNN is theoretically guaranteed, because we prove that the graphical model can well mimic the DNN's outputs on an exponential number of different masked samples. Besides, such a graphical model can be further simplified and re-written as an And-Or graph (AOG), without losing much explanation accuracy. The code is released at https://github.com/sjtu-xai-lab/aog.

Adversarial Robustness via Random Projection Filters
Dong, Minjing and Xu, Chang



Research question: Deep neural networks excel at various tasks but are vulnerable to adversarial attacks. Most defense strategies focus on adversarial training, yet in the white-box setting, perturbations that increase the loss can always be found via gradient ascent, making satisfactory robustness hard to achieve with traditional adversarial training alone.
Motivation: To address this, the authors leverage the properties of random projection, proposing to replace part of the convolutional filters with random projection filters and theoretically exploring the geometric representation preservation of the synthesized filters.
Method: Part of the convolutional filters are replaced with random projection filters, and the geometric representation preservation of the synthesized filters is analyzed via the Johnson-Lindenstrauss lemma.
Results: Thorough evaluation on multiple networks and datasets shows that the proposed random projection filters outperform state-of-the-art baselines. Code is available at https://github.com/UniSerj/Random-Projection-Filters.

Deep Neural Networks show superior performance in various tasks but are vulnerable to adversarial attacks. Most defense techniques are devoted to the adversarial training strategies, however, it is difficult to achieve satisfactory robust performance only with traditional adversarial training. We mainly attribute it to that aggressive perturbations which lead to the loss increment can always be found via gradient ascent in white-box setting. Although some noises can be involved to prevent attacks from deriving precise gradients on inputs, there exist trade-offs between the defense capability and natural generalization. Taking advantage of the properties of random projection, we propose to replace part of convolutional filters with random projection filters, and theoretically explore the geometric representation preservation of proposed synthesized filters via Johnson-Lindenstrauss lemma. We conduct sufficient evaluation on multiple networks and datasets. The experimental results showcase the superiority of proposed random projection filters to state-of-the-art baselines. The code is available at https://github.com/UniSerj/Random-Projection-Filters.
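The geometric argument behind the defense rests on the Johnson-Lindenstrauss lemma: a random Gaussian projection approximately preserves pairwise distances. The snippet below is only a numeric illustration of that property on plain vectors with arbitrary dimensions, not the paper's filter construction:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 1000, 400, 50          # ambient dim, projected dim, num points
X = rng.normal(size=(n, d))

# Random projection: i.i.d. Gaussian entries, scaled by 1/sqrt(k) so that
# squared lengths are preserved in expectation.
P = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ P

def pairwise_dists(Z):
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

iu = np.triu_indices(n, k=1)     # distinct pairs only
ratio = pairwise_dists(Y)[iu] / pairwise_dists(X)[iu]
print(ratio.min(), ratio.max())  # both close to 1: distances are preserved
```

The concentration tightens as `k` grows, which is the sense in which randomly projected filters can retain the geometry of the original representation while denying the attacker precise gradients.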

Model-Agnostic Gender Debiased Image Captioning
Hirota, Yusuke and Nakashima, Yuta and Garcia, Noa



Research question: This paper addresses the gender bias present in the training sets of image captioning models.
Motivation: While prior work reduces gender misclassification by forcing models to focus on people, this conversely produces gender-stereotypical words at the expense of predicting the correct gender.
Method: The authors propose LIBRA, a framework that learns from synthetically biased samples to reduce both types of gender bias, correcting gender misclassification and changing gender-stereotypical words to more neutral ones.
Results: Experiments show that LIBRA effectively reduces gender bias in image captioning models, improves gender prediction accuracy, and changes the stereotypical words that are generated.

Image captioning models are known to perpetuate and amplify harmful societal bias in the training set. In this work, we aim to mitigate such gender bias in image captioning models. While prior work has addressed this problem by forcing models to focus on people to reduce gender misclassification, it conversely generates gender-stereotypical words at the expense of predicting the correct gender. From this observation, we hypothesize that there are two types of gender bias affecting image captioning models: 1) bias that exploits context to predict gender, and 2) bias in the probability of generating certain (often stereotypical) words because of gender. To mitigate both types of gender biases, we propose a framework, called LIBRA, that learns from synthetically biased samples to decrease both types of biases, correcting gender misclassification and changing gender-stereotypical words to more neutral ones.

OpenGait: Revisiting Gait Recognition Towards Better Practicality
Fan, Chao and Liang, Junhao and Shen, Chuanfu and Hou, Saihui and Huang, Yongzhen and Yu, Shiqi



Research question: Despite significant progress on indoor datasets, gait recognition performs poorly in the wild, and some conclusions drawn from indoor datasets do not generalize to real applications.
Motivation: To improve the practicality of gait recognition, this paper aims to provide a comprehensive benchmark study rather than merely optimizing the performance of one particular model.
Method: The authors first develop OpenGait, a flexible and efficient gait recognition codebase, then revisit recent developments in gait recognition on top of OpenGait by re-conducting the ablative experiments. Based on these findings, they develop GaitBase, a structurally simple, empirically powerful, and practically robust baseline model.
Results: GaitBase is comprehensively compared with many current gait recognition methods on multiple public datasets; the results show that GaitBase achieves significantly strong performance in most cases, indoors and outdoors alike.

Gait recognition is one of the most critical long-distance identification technologies and increasingly gains popularity in both research and industry communities. Despite the significant progress made in indoor datasets, much evidence shows that gait recognition techniques perform poorly in the wild. More importantly, we also find that some conclusions drawn from indoor datasets cannot be generalized to real applications. Therefore, the primary goal of this paper is to present a comprehensive benchmark study for better practicality rather than only a particular model for better performance. To this end, we first develop a flexible and efficient gait recognition codebase named OpenGait. Based on OpenGait, we deeply revisit the recent development of gait recognition by re-conducting the ablative experiments. Encouragingly, we detect some imperfect parts of certain prior works, as well as new insights. Inspired by these discoveries, we develop a structurally simple, empirically powerful, and practically robust baseline model, GaitBase. Experimentally, we comprehensively compare GaitBase with many current gait recognition methods on multiple public datasets, and the results reflect that GaitBase achieves significantly strong performance in most cases regardless of indoor or outdoor situations. Code is available at https://github.com/ShiqiYu/OpenGait.

The Best Defense Is a Good Offense: Adversarial Augmentation Against Adversarial Attacks
Frosio, Iuri and Kautz, Jan



Research question: How to preemptively defend against adversarial attacks?
Motivation: Most defenses against adversarial attacks act only after the attack has been crafted; this paper introduces a new perspective and framework, A^5 (Adversarial Augmentation Against Adversarial Attacks).
Method: Using automatic neural network perturbation analysis tools, a defensive perturbation is crafted to guarantee that any attack on the input (up to a given magnitude) will fail.
Results: Experiments show that A^5 outperforms state-of-the-art certified defenses on MNIST, CIFAR10, FashionMNIST, and Tinyimagenet. A^5 can also be used to create certifiably robust physical objects.

Many defenses against adversarial attacks (e.g. robust classifiers, randomization, or image purification) use countermeasures put to work only after the attack has been crafted. We adopt a different perspective to introduce A^5 (Adversarial Augmentation Against Adversarial Attacks), a novel framework including the first certified preemptive defense against adversarial attacks. The main idea is to craft a defensive perturbation to guarantee that any attack (up to a given magnitude) towards the input in hand will fail. To this aim, we leverage existing automatic perturbation analysis tools for neural networks. We study the conditions to apply A^5 effectively, analyze the importance of the robustness of the to-be-defended classifier, and inspect the appearance of the robustified images. We show effective on-the-fly defensive augmentation with a robustifier network that ignores the ground truth label, and demonstrate the benefits of robustifier and classifier co-training. In our tests, A^5 consistently beats state of the art certified defenses on MNIST, CIFAR10, FashionMNIST and Tinyimagenet. We also show how to apply A^5 to create certifiably robust physical objects. The released code at https://github.com/NVlabs/A5 allows experimenting on a wide range of scenarios beyond the man-in-the-middle attack tested here, including the case of physical attacks.

GaitGCI: Generative Counterfactual Intervention for Gait Recognition
Dou, Huanzhang and Zhang, Pengyi and Su, Wei and Yu, Yunlong and Lin, Yining and Li, Xi



Research question: Existing gait recognition methods are susceptible to confounders and struggle to focus on the regions that reflect effective walking patterns.
Motivation: To address this fundamental problem in gait recognition, a generative counterfactual intervention framework, GaitGCI, is proposed.
Method: GaitGCI consists of Counterfactual Intervention Learning (CIL) and Diversity-Constrained Dynamic Convolution (DCDC). CIL leverages causal inference to alleviate the impact of confounders; DCDC adaptively generates sample-wise factual/counterfactual attention to perceive sample properties.
Results: Experiments show that GaitGCI effectively focuses on discriminative and interpretable regions that reflect gait patterns; it is model-agnostic and can be plugged into existing models to improve performance at nearly no extra cost; and it efficiently achieves state-of-the-art performance in arbitrary scenarios (in-the-lab and in-the-wild).

Gait is one of the most promising biometrics that aims to identify pedestrians from their walking patterns. However, prevailing methods are susceptible to confounders, resulting in the networks hardly focusing on the regions that reflect effective walking patterns. To address this fundamental problem in gait recognition, we propose a Generative Counterfactual Intervention framework, dubbed GaitGCI, consisting of Counterfactual Intervention Learning (CIL) and Diversity-Constrained Dynamic Convolution (DCDC). CIL leverages causal inference to alleviate the impact of confounders by maximizing the likelihood difference between factual/counterfactual attention. DCDC adaptively generates sample-wise factual/counterfactual attention to perceive the sample properties. With matrix decomposition and diversity constraint, DCDC guarantees the model's efficiency and effectiveness. Extensive experiments indicate that proposed GaitGCI: 1) could effectively focus on the discriminative and interpretable regions that reflect gait patterns; 2) is model-agnostic and could be plugged into existing models to improve performance with nearly no extra cost; 3) efficiently achieves state-of-the-art performance on arbitrary scenarios (in-the-lab and in-the-wild).

Adversarially Masking Synthetic To Mimic Real: Adaptive Noise Injection for Point Cloud Segmentation Adaptation
Li, Guangrui and Kang, Guoliang and Wang, Xiaohan and Wei, Yunchao and Yang, Yi



Research question: This paper addresses the domain gap between synthetically labeled data and real-world point clouds for semantic segmentation, particularly in the presence of noise.
Motivation: Because real-world sensors are affected by various environmental conditions, the collected point clouds typically contain unexpected and irregular noise, so models trained on ideal synthetic data fail to achieve satisfactory segmentation results on real data.
Method: A novel learnable masking module is designed that learns to mask source points during adaptation, narrowing the domain gap caused by target noise. The Gumbel-Softmax operation is incorporated into the masking module so that it can generate binary masks and be trained end-to-end via gradient back-propagation.
Results: Experiments show that the method effectively narrows the domain gap between synthetic and real-world point clouds and improves semantic segmentation accuracy.

This paper considers the synthetic-to-real adaptation of point cloud semantic segmentation, which aims to segment the real-world point clouds with only synthetic labels available. Contrary to synthetic data which is integral and clean, point clouds collected by real-world sensors typically contain unexpected and irregular noise because the sensors may be impacted by various environmental conditions. Consequently, the model trained on ideal synthetic data may fail to achieve satisfactory segmentation results on real data. Influenced by such noise, previous adversarial training methods, which are conventional for 2D adaptation tasks, become less effective. In this paper, we aim to mitigate the domain gap caused by target noise via learning to mask the source points during the adaptation procedure. To this end, we design a novel learnable masking module, which takes source features and 3D coordinates as inputs. We incorporate Gumbel-Softmax operation into the masking module so that it can generate binary masks and be trained end-to-end via gradient back-propagation. With the help of adversarial training, the masking module can learn to generate source masks to mimic the pattern of irregular target noise, thereby narrowing the domain gap. We name our method "Adversarial Masking" as adversarial training and learnable masking module depend on each other and cooperate with each other to mitigate the domain gap. Experiments on two synthetic-to-real adaptation benchmarks verify the effectiveness of the proposed method.
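The Gumbel-Softmax step that makes the binary masks trainable can be sketched as follows. This is a hedged illustration of the standard trick with shapes and a [drop, keep] convention of our own choosing, not the authors' released code; the straight-through backward pass is only noted in a comment, since plain NumPy has no autodiff:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_mask(logits, tau=0.5):
    """logits: (num_points, 2) scores for [drop, keep] per source point."""
    # Gumbel(0, 1) noise makes the hard argmax a sample from the softmax.
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)            # numerically stable softmax
    soft = np.exp(y) / np.exp(y).sum(axis=-1, keepdims=True)
    hard = (soft == soft.max(axis=-1, keepdims=True)).astype(float)
    # Straight-through idea: the forward pass uses `hard`, while gradients
    # would flow through `soft` in an autodiff framework.
    return hard[:, 1]  # 1 = keep the point, 0 = mask it out

keep_logit = np.array([10., 10., 10., 10., -10., -10., -10., -10.])
logits = np.stack([np.zeros(8), keep_logit], axis=1)
mask = gumbel_softmax_mask(logits)
print(mask)  # first four points almost surely kept, last four dropped
```

Because the mask is exactly binary in the forward pass yet differentiable through the relaxation, the module can learn which source points to drop end-to-end, as the abstract describes.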

Seasoning Model Soups for Robustness to Adversarial and Natural Distribution Shifts
Croce, Francesco and Rebuffi, Sylvestre-Alvise and Shelhamer, Evan and Gowal, Sven



Research question: How to train a classifier robust to multiple threats without requiring knowledge of all attacks during training, and without remaining vulnerable to unseen distribution shifts.
Motivation: Existing methods require knowledge of all attacks for training and remain vulnerable to unseen distribution shifts.
Method: Obtain adversarially-robust model soups (i.e., linear combinations of parameters) that smoothly trade off robustness to different l_p-norm bounded adversaries, yielding models robust to all threats.
Results: Experiments show that such soups achieve robustness to all threats, and in some cases are more robust to a given l_p-norm adversary than models specialized against that same adversary. Finally, adversarially-robust model soups are shown to be a viable tool for adapting to distribution shifts from a few examples.

Adversarial training is widely used to make classifiers robust to a specific threat or adversary, such as l_p-norm bounded perturbations of a given p-norm. However, existing methods for training classifiers robust to multiple threats require knowledge of all attacks during training and remain vulnerable to unseen distribution shifts. In this work, we describe how to obtain adversarially-robust model soups (i.e., linear combinations of parameters) that smoothly trade-off robustness to different l_p-norm bounded adversaries. We demonstrate that such soups allow us to control the type and level of robustness, and can achieve robustness to all threats without jointly training on all of them. In some cases, the resulting model soups are more robust to a given l_p-norm adversary than the constituent model specialized against that same adversary. Finally, we show that adversarially-robust model soups can be a viable tool to adapt to distribution shifts from a few examples.
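The underlying "soup" operation is simply an element-wise convex combination of the parameters of independently trained models. A minimal sketch, with toy layer names, shapes, and two stand-in models (e.g. one per l_p adversary) that are our own illustrations rather than anything from the paper:

```python
import numpy as np

def soup(params_list, weights):
    """Convex combination of parameter dicts {layer_name: ndarray}."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    keys = params_list[0].keys()
    return {k: sum(w * p[k] for w, p in zip(weights, params_list)) for k in keys}

# Two toy "models", standing in for networks fine-tuned against different
# l_p-norm adversaries; sliding the weights trades off their behaviors.
m_linf = {"w": np.array([1.0, 0.0]), "b": np.array([0.5])}
m_l2   = {"w": np.array([0.0, 1.0]), "b": np.array([-0.5])}

mixed = soup([m_linf, m_l2], [0.7, 0.3])
print(mixed["w"], mixed["b"])  # w = [0.7, 0.3], b = 0.7*0.5 + 0.3*(-0.5)
```

In the paper, varying the soup weights is what lets one model smoothly interpolate between robustness profiles without retraining on every threat.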

Introducing Competition To Boost the Transferability of Targeted Adversarial Examples Through Clean Feature Mixup
Byun, Junyoung and Kwon, Myung-Joon and Cho, Seungju and Kim, Yoonji and Kim, Changick



Research question: Deep neural networks are susceptible to adversarial examples, where subtle input modifications can cause incorrect predictions.
Motivation: Adversarial examples are transferable between models, but targeted attacks have lower success rates due to significant differences in decision boundaries. To improve the transferability of targeted adversarial examples, competition is introduced into the optimization process.
Method: Adversarial perturbations are crafted in the presence of two new types of competitor noise: adversarial perturbations toward different target classes and friendly perturbations toward the correct class. With these competitors, even if an adversarial example deceives a network into extracting specific features leading to the target class, this disturbance can be suppressed by the other competitors. Within this competition, adversarial examples must therefore leverage more diverse features to overwhelm the interference, improving their transferability across models.
Results: Extensive experiments on the ImageNet-Compatible and CIFAR-10 datasets show that the method outperforms existing baselines with low computational complexity.

Deep neural networks are widely known to be susceptible to adversarial examples, which can cause incorrect predictions through subtle input modifications. These adversarial examples tend to be transferable between models, but targeted attacks still have lower attack success rates due to significant variations in decision boundaries. To enhance the transferability of targeted adversarial examples, we propose introducing competition into the optimization process. Our idea is to craft adversarial perturbations in the presence of two new types of competitor noises: adversarial perturbations towards different target classes and friendly perturbations towards the correct class. With these competitors, even if an adversarial example deceives a network to extract specific features leading to the target class, this disturbance can be suppressed by other competitors. Therefore, within this competition, adversarial examples should take different attack strategies by leveraging more diverse features to overwhelm their interference, leading to improving their transferability to different models. Considering the computational complexity, we efficiently simulate various interference from these two types of competitors in feature space by randomly mixing up stored clean features in the model inference and named this method Clean Feature Mixup (CFM). Our extensive experimental results on the ImageNet-Compatible and CIFAR-10 datasets show that the proposed method outperforms the existing baselines with a clear margin. Our code is available at https://github.com/dreamflake/CFM.
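The Clean Feature Mixup step, as described in the abstract, can be sketched roughly as below. The mixing rule, names, and shapes are our assumptions for illustration, not the released CFM code:

```python
import numpy as np

rng = np.random.default_rng(0)

def clean_feature_mixup(feats, clean_bank, alpha=0.1):
    """Mix current features with randomly drawn stored clean features.

    feats:      (batch, dim) features of the current adversarial batch.
    clean_bank: (n_stored, dim) features stored from clean images.
    alpha:      mixing strength (assumed form of the rule).
    """
    idx = rng.integers(0, clean_bank.shape[0], size=feats.shape[0])
    return (1 - alpha) * feats + alpha * clean_bank[idx]

feats = np.ones((4, 8))        # stand-in adversarial-batch features
bank = np.zeros((16, 8))       # stand-in stored clean features
mixed = clean_feature_mixup(feats, bank, alpha=0.25)
print(mixed)                   # every entry shrinks toward the clean features
```

Perturbing the surrogate's intermediate features this way cheaply simulates the competitor noises in feature space, which is the efficiency argument the abstract makes.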

A Whac-a-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others
Li, ZhihengandEvtimov, IvanandGordo, AlbertandHazirbas, CanerandHassner, TalandFerrer, CristianCantonandXu, ChenliangandIbrahim, Mark



Research question: Machine learning models learn shortcuts that fail to generalize, which undermines their reliability.
Motivation: Real-world images are rife with visual cues, from background to texture. Key to advancing the reliability of vision systems is understanding whether existing methods can overcome multiple shortcuts, or struggle in a Whac-A-Mole game where mitigating one shortcut amplifies reliance on others.
Method: Two benchmarks are proposed: 1) UrbanCars, a dataset with precisely controlled spurious cues, and 2) ImageNet-W, an ImageNet-based evaluation set for the watermark shortcut, which affects nearly every modern vision model. Along with texture and background, ImageNet-W makes it possible to study multiple shortcuts that emerge from training on natural images.
Results: Computer vision models, including large foundation models, struggle when multiple shortcuts are present, regardless of training set, architecture, or supervision. Even methods explicitly designed to combat shortcuts struggle in the Whac-A-Mole dilemma. To tackle this challenge, Last Layer Ensemble is proposed, a simple yet effective method that mitigates multiple shortcuts without Whac-A-Mole behavior. The results surface multi-shortcut mitigation as an overlooked challenge critical to advancing the reliability of vision systems.

Machine learning models have been found to learn shortcuts---unintended decision rules that are unable to generalize---undermining models' reliability. Previous works address this problem under the tenuous assumption that only a single shortcut exists in the training data. Real-world images are rife with multiple visual cues from background to texture. Key to advancing the reliability of vision systems is understanding whether existing methods can overcome multiple shortcuts or struggle in a Whac-A-Mole game, i.e., where mitigating one shortcut amplifies reliance on others. To address this shortcoming, we propose two benchmarks: 1) UrbanCars, a dataset with precisely controlled spurious cues, and 2) ImageNet-W, an evaluation set based on ImageNet for watermark, a shortcut we discovered affects nearly every modern vision model. Along with texture and background, ImageNet-W allows us to study multiple shortcuts emerging from training on natural images. We find computer vision models, including large foundation models---regardless of training set, architecture, and supervision---struggle when multiple shortcuts are present. Even methods explicitly designed to combat shortcuts struggle in a Whac-A-Mole dilemma. To tackle this challenge, we propose Last Layer Ensemble, a simple-yet-effective method to mitigate multiple shortcuts without Whac-A-Mole behavior. Our results surface multi-shortcut mitigation as an overlooked challenge critical to advancing the reliability of vision systems. The datasets and code are released: https://github.com/facebookresearch/Whac-A-Mole.
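Based on the description above, a last-layer ensemble can be pictured as a shared backbone followed by several independent classification heads whose logits are averaged, so that no single head's shortcut reliance dominates. The head-per-augmentation pairing and all names here are assumptions for illustration, not the paper's exact design:

```python
import numpy as np

def last_layer_ensemble(features, heads):
    """Average the logits of several independent last-layer heads applied
    to a shared backbone feature.

    features: (B, D) backbone features.
    heads: list of (D, C) weight matrices, e.g. one trained per
    shortcut-targeting augmentation (biases omitted for brevity).
    """
    logits = np.stack([features @ W for W in heads])  # (H, B, C)
    return logits.mean(axis=0)                        # (B, C)

rng = np.random.default_rng(0)
feats = rng.normal(size=(2, 6))
heads = [rng.normal(size=(6, 3)) for _ in range(4)]
avg_logits = last_layer_ensemble(feats, heads)
```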

Neumann Network With Recursive Kernels for Single Image Defocus Deblurring
Quan, YuhuiandWu, ZicongandJi, Hui



Research question: How to recover a sharp, all-in-focus image from a defocused, blurry one.
Motivation: This is a challenging restoration task due to spatially-varying defocus blur with significant size variation.
Method: A learnable recursive kernel representation (RKR) is proposed that expresses a defocus kernel as a linear combination of recursive, separable, and positive atom kernels, yielding a compact yet effective, physics-encoded parametrization of the spatially-varying defocus blurring process. A physics-driven and efficient deep model with a cross-scale fusion structure is then designed for SIDD, and a reblurring loss is introduced to regularize RKR learning.
Results: Experiments show that the approach significantly outperforms existing methods, with a model size comparable to that of the top methods.

Single image defocus deblurring (SIDD) refers to recovering an all-in-focus image from a defocused blurry one. It is a challenging recovery task due to the spatially-varying defocus blurring effects with significant size variation. Motivated by the strong correlation among defocus kernels of different sizes and the blob-type structure of defocus kernels, we propose a learnable recursive kernel representation (RKR) for defocus kernels that expresses a defocus kernel by a linear combination of recursive, separable and positive atom kernels, leading to a compact yet effective and physics-encoded parametrization of the spatially-varying defocus blurring process. Afterwards, a physics-driven and efficient deep model with a cross-scale fusion structure is presented for SIDD, with inspirations from the truncated Neumann series for approximating the matrix inversion of the RKR-based blurring operator. In addition, a reblurring loss is proposed to regularize the RKR learning. Extensive experiments show that, our proposed approach significantly outperforms existing ones, with a model size comparable to that of the top methods.
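The truncated Neumann series referenced above approximates the inverse of a blurring operator A by A^{-1} y ≈ sum_{k=0}^{K} (I - A)^k y, which converges when the spectral radius of (I - A) is below 1. A generic NumPy sketch follows, with a small dense well-conditioned matrix standing in for the paper's RKR-based blurring operator:

```python
import numpy as np

def neumann_inverse_apply(A, y, n_terms=50):
    """Approximate x = A^{-1} y via the truncated Neumann series
    x ~= sum_{k=0}^{K} (I - A)^k y, valid when the spectral radius
    of (I - A) is below 1 (e.g. for a mild, near-identity blur)."""
    x = np.zeros_like(y)
    term = y.copy()
    for _ in range(n_terms):
        x += term
        term = term - A @ term  # term <- (I - A) term
    return x

# Well-conditioned test operator close to identity (mimicking a mild blur).
rng = np.random.default_rng(1)
A = np.eye(5) + 0.1 * rng.normal(size=(5, 5))
y = rng.normal(size=5)
x_approx = neumann_inverse_apply(A, y, n_terms=50)
x_exact = np.linalg.solve(A, y)
```

In the deep model, each series term corresponds to another application of the learned blurring operator, which is why the truncation maps naturally onto a feed-forward network with a fixed number of stages.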

RWSC-Fusion: Region-Wise Style-Controlled Fusion Network for the Prohibited X-Ray Security Image Synthesis
Duan, LuwenandWu, MinandMao, LijianandYin, JunandXiong, JianpingandLi, Xi



Research question: How to effectively detect prohibited items in security-inspection X-ray images.
Motivation: Abundant and diverse prohibited X-ray security images are essential for training detection models, so the data-insufficiency problem must be addressed.
Method: A Region-Wise Style-Controlled Fusion (RWSC-Fusion) network is proposed that synthesizes prohibited X-ray security images by superimposing prohibited items onto normal X-ray security images, with innovations in both network structure and loss functions to generate more realistic X-ray security images.
Results: Evaluation on a private dataset and the public SIXray dataset confirms that the synthetic X-ray security images are reliable enough to effectively augment the prohibited X-ray security images.

Automatic prohibited item detection in security inspection X-ray images is necessary for transportation. The abundance and diversity of the X-ray security images with prohibited items, termed prohibited X-ray security images, are essential for training the detection model. In order to solve the data insufficiency, we propose a Region-Wise Style-Controlled Fusion (RWSC-Fusion) network, which superimposes the prohibited items onto normal X-ray security images, to synthesize the prohibited X-ray security images. The proposed RWSC-Fusion innovates both network structure and loss functions to generate more realistic X-ray security images. Specifically, a RWSC-Fusion module is designed to enable the region-wise fusion by controlling the appearance of the overlapping region with novel modulation parameters. In addition, an Edge-Attention (EA) module is proposed to effectively improve the sharpness of the synthetic images. As for the unsupervised loss function, we propose the Luminance loss in Logarithmic form (LL) and Correlation loss of Saturation Difference (CSD), to optimize the fused X-ray security images in terms of luminance and saturation. We evaluate the authenticity and the training effect of the synthetic X-ray security images on a private dataset and the public SIXray dataset. The results confirm that our synthetic images are reliable enough to augment the prohibited X-ray security images.

Self-Supervised Blind Motion Deblurring With Deep Expectation Maximization
Li, JiandWang, WeixiandNan, YuesongandJi, Hui



Research question: How to recover a sharp image from a blurred one, in particular when the blur is caused by camera shake.
Motivation: Existing deep learning methods require large numbers of blurred/latent image pairs for training; this paper presents a dataset-free deep learning method for removing uniform and non-uniform blur from images of static scenes.
Method: The latent image is re-parametrized by a deep neural network (DNN), and a Monte Carlo Expectation Maximization (MCEM) approach is proposed to train the DNN without any latent images. The Monte Carlo simulation is implemented via Langevin dynamics.
Results: Experiments show that the method significantly outperforms existing approaches at removing motion blur from images of static scenes.

When taking a picture, any camera shake during the shutter time can result in a blurred image. Recovering a sharp image from the one blurred by camera shake is a challenging yet important problem. Most existing deep learning methods use supervised learning to train a deep neural network (DNN) on a dataset of many pairs of blurred/latent images. In contrast, this paper presents a dataset-free deep learning method for removing uniform and non-uniform blur effects from images of static scenes. Our method involves a DNN-based re-parametrization of the latent image, and we propose a Monte Carlo Expectation Maximization (MCEM) approach to train the DNN without requiring any latent images. The Monte Carlo simulation is implemented via Langevin dynamics. Experiments showed that the proposed method outperforms existing methods significantly in removing motion blur from images of static scenes.
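Langevin dynamics, used here for the Monte Carlo simulation, iterates x ← x − η ∇E(x) + sqrt(2η) ξ with Gaussian noise ξ, producing approximate samples from p(x) ∝ exp(−E(x)). A toy sketch sampling a 1D standard normal follows; this is illustrative only, as the paper applies the idea to the deblurring posterior rather than a toy energy:

```python
import numpy as np

def langevin_sample(grad_energy, x0, step=1e-2, n_steps=500, rng=None):
    """Unadjusted Langevin dynamics: x <- x - step * grad E(x)
    + sqrt(2 * step) * gaussian noise, drawing an approximate sample
    from p(x) proportional to exp(-E(x))."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = x0.copy()
    for _ in range(n_steps):
        noise = rng.normal(size=x.shape)
        x = x - step * grad_energy(x) + np.sqrt(2.0 * step) * noise
    return x

# Sample from a 1D standard normal: E(x) = x^2 / 2, so grad E(x) = x.
rng = np.random.default_rng(0)
samples = np.array([langevin_sample(lambda x: x, np.zeros(1), rng=rng)[0]
                    for _ in range(200)])
```

With a small step size and enough iterations, the empirical mean and standard deviation of the samples approach those of the target distribution (0 and 1 here).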

Dynamic Generative Targeted Attacks With Pattern Injection
Feng, WeiweiandXu, NanqingandZhang, TianzhuandZhang, Yongdong



Research question: How to improve the effectiveness and generality of targeted attacks.
Motivation: Existing targeted attacks rely on either the specific instance or the global dataset, ignoring the realistic distribution of the target class, which limits their attack performance.
Method: A generative attack model based on a causal-graph analysis is proposed, composed of a cross-attention guided convolution module and a pattern injection module. The former adopts a dynamic convolution kernel for the specific instance and a static convolution kernel for the global dataset, while the latter uses a pattern prototype to encode target patterns and guide the generation of targeted adversarial examples.
Results: Experiments show that the method outperforms 10 existing attacks against 13 models.

Adversarial attacks can evaluate model robustness and have been of great concern in recent years. Among various attacks, targeted attacks aim at misleading victim models to output adversary-desired predictions, which are more challenging and threatening than untargeted ones. Existing targeted attacks can be roughly divided into instance-specific and instance-agnostic attacks. Instance-specific attacks craft adversarial examples via iterative gradient updating on the specific instance. In contrast, instance-agnostic attacks learn a universal perturbation or a generative model on the global dataset to perform attacks. However, they rely too much on the classification boundary of substitute models, ignoring the realistic distribution of the target class, which may result in limited targeted attack performance. And there has been no attempt to simultaneously combine the information of the specific instance and the global dataset. To deal with these limitations, we first conduct an analysis via a causal graph and propose to craft transferable targeted adversarial examples by injecting target patterns. Based on this analysis, we introduce a generative attack model composed of a cross-attention guided convolution module and a pattern injection module. Concretely, the former adopts a dynamic convolution kernel and a static convolution kernel for the specific instance and the global dataset, respectively, which can inherit the advantages of both instance-specific and instance-agnostic attacks. And the pattern injection module utilizes a pattern prototype to encode target patterns, which can guide the generation of targeted adversarial examples. Besides, we also provide rigorous theoretical analysis to guarantee the effectiveness of our method. Extensive experiments demonstrate that our method shows superior performance over 10 existing adversarial attacks against 13 models.

PointCert: Point Cloud Classification With Deterministic Certified Robustness Guarantees
Zhang, JinghuaiandJia, JinyuanandLiu, HongbinandGong, NeilZhenqiang



Research question: Point cloud classifiers are vulnerable to adversarial perturbations, and the robustness guarantees of existing defenses are only probabilistic.
Motivation: Point cloud classification plays an important role in security-critical applications such as autonomous driving and augmented reality, yet existing certified defenses provide only probabilistic robustness guarantees.
Method: A general framework named PointCert is proposed that can transform an arbitrary point cloud classifier into one that is certifiably robust against adversarial point clouds with deterministic guarantees.
Results: Systematic evaluation on the ModelNet and ScanObjectNN benchmark datasets shows that PointCert substantially outperforms state-of-the-art defenses, even those whose robustness guarantees are probabilistic.

Point cloud classification is an essential component in many security-critical applications such as autonomous driving and augmented reality. However, point cloud classifiers are vulnerable to adversarially perturbed point clouds. Existing certified defenses against adversarial point clouds suffer from a key limitation: their certified robustness guarantees are probabilistic, i.e., they produce an incorrect certified robustness guarantee with some probability. In this work, we propose a general framework, namely PointCert, that can transform an arbitrary point cloud classifier to be certifiably robust against adversarial point clouds with deterministic guarantees. PointCert certifiably predicts the same label for a point cloud when the number of arbitrarily added, deleted, and/or modified points is less than a threshold. Moreover, we propose multiple methods to optimize the certified robustness guarantees of PointCert in three application scenarios. We systematically evaluate PointCert on ModelNet and ScanObjectNN benchmark datasets. Our results show that PointCert substantially outperforms state-of-the-art certified defenses even though their robustness guarantees are probabilistic.
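One common route to deterministic certification is divide-and-vote: deterministically hash points into disjoint sub-clouds, classify each sub-cloud, and majority-vote, so a bounded number of perturbed points can flip only a bounded number of votes. The sketch below shows that general idea under simplified assumptions (a toy stand-in classifier and a deliberately conservative vote-gap bound); PointCert's actual construction and certified bound differ in detail:

```python
import hashlib

def certify_point_cloud(points, classify, n_groups=16):
    """Simplified divide-and-vote certification sketch. Each point is
    hashed into one of n_groups sub-clouds; every sub-cloud is classified
    and the sub-cloud labels are majority-voted. An added/deleted/modified
    point can change at most two sub-clouds, so the vote gap yields a
    deterministic robustness certificate. `classify` maps a list of
    points to an integer label (a stand-in for a real classifier)."""
    groups = [[] for _ in range(n_groups)]
    for p in points:
        h = hashlib.sha256(repr(p).encode()).digest()
        groups[h[0] % n_groups].append(p)
    votes = {}
    for g in groups:
        label = classify(g)
        votes[label] = votes.get(label, 0) + 1
    ranked = sorted(votes.items(), key=lambda kv: -kv[1])
    top_label, top = ranked[0]
    runner = ranked[1][1] if len(ranked) > 1 else 0
    # Conservative bound: each perturbed point touches at most 2 groups,
    # and each flipped group shifts the gap by 2, so the prediction is
    # stable while the number of perturbed points is below (top-runner)/4.
    certified_radius = (top - runner) // 4
    return top_label, certified_radius

# Toy classifier: label 1 if the sub-cloud's summed x-coordinate is positive.
pts = [(float(i), float(-i)) for i in range(1, 65)]
label, radius = certify_point_cloud(
    pts, lambda g: int(sum(p[0] for p in g) > 0) if g else 0)
```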

Don't Lie to Me! Robust and Efficient Explainability With Verified Perturbation Analysis
Fel, ThomasandDucoffe, MelanieandVigouroux, DavidandCad\`ene, R\'emiandCapelle, Mika\"elandNicod\`eme, ClaireandSerre, Thomas



Research question: How to effectively explain the decision process of deep neural networks.
Motivation: Existing sampling methods introduce biases and other artifacts when estimating the importance of individual pixels, making current explainability methods unreliable.
Method: EVA (Explaining using Verified perturbation Analysis) is proposed, the first explainability method guaranteed to explore a perturbation space exhaustively. Specifically, the beneficial properties of verified perturbation analysis, namely time efficiency, tractability, and guaranteed complete coverage of a manifold, are leveraged to efficiently characterize the input variables most likely to drive the model's decision.
Results: Systematic evaluation demonstrates state-of-the-art results on multiple benchmarks.

A variety of methods have been proposed to try to explain how deep neural networks make their decisions. Key to those approaches is the need to sample the pixel space efficiently in order to derive importance maps. However, it has been shown that the sampling methods used to date introduce biases and other artifacts, leading to inaccurate estimates of the importance of individual pixels and severely limiting the reliability of current explainability methods. Unfortunately, the alternative -- exhaustively sampling the image space -- is computationally prohibitive. In this paper, we introduce EVA (Explaining using Verified perturbation Analysis) -- the first explainability method guaranteed to provide an exhaustive exploration of a perturbation space. Specifically, we leverage the beneficial properties of verified perturbation analysis -- time efficiency, tractability and guaranteed complete coverage of a manifold -- to efficiently characterize the input variables that are most likely to drive the model decision. We evaluate the approach systematically and demonstrate state-of-the-art results on multiple benchmarks.

Defending Against Patch-Based Backdoor Attacks on Self-Supervised Learning
Tejankar, AjinkyaandSanjabi, MaziarandWang, QifanandWang, SinongandFirooz, HamedandPirsiavash, HamedandTan, Liang



Research question: How to defend self-supervised learning against patch-based data-poisoning backdoor attacks.
Motivation: Self-supervised learning is vulnerable to patch-based data-poisoning backdoor attacks: an adversary can poison a small portion of unlabeled data so that the victim's trained model contains a backdoor.
Method: A three-step defense pipeline is proposed: first train a model on the poisoned data, then use the trained model to search the training data for poisoned samples and remove them, and finally train the final model on the cleaned dataset.
Results: Experimental results show that the method effectively defends against such attacks, improving model accuracy on trigger-bearing images and outperforming baselines and state-of-the-art defense methods.

Recently, self-supervised learning (SSL) was shown to be vulnerable to patch-based data poisoning backdoor attacks. It was shown that an adversary can poison a small part of the unlabeled data so that when a victim trains an SSL model on it, the final model will have a backdoor that the adversary can exploit. This work aims to defend self-supervised learning against such attacks. We use a three-step defense pipeline, where we first train a model on the poisoned data. In the second step, our proposed defense algorithm (PatchSearch) uses the trained model to search the training data for poisoned samples and removes them from the training set. In the third step, a final model is trained on the cleaned-up training set. Our results show that PatchSearch is an effective defense. As an example, it improves a model's accuracy on images containing the trigger from 38.2% to 63.7% which is very close to the clean model's accuracy, 64.6%. Moreover, we show that PatchSearch outperforms baselines and state-of-the-art defense approaches including those using additional clean, trusted data. Our code is available at https://github.com/UCDvision/PatchSearch

AGAIN: Adversarial Training With Attribution Span Enlargement and Hybrid Feature Fusion
Yin, ShenglinandYao, KeluandShi, ShengandDu, YangzhouandXiao, Zhen



Research question: Adversarially trained deep neural networks exhibit a significant robustness gap between training and testing.
Motivation: Adversarially trained DNNs achieve high robustness at training time but perform poorly at test time, which is caused by their small attribution span on the input image.
Method: A generic method is proposed to boost the robust generalization of adversarial training by enlarging the learned attribution span and using hybrid feature statistics for feature fusion.
Results: Experiments show that the method effectively improves the robustness of adversarially trained DNNs, outperforming previous state-of-the-art methods, and a theoretical analysis is provided to prove its effectiveness.

The deep neural networks (DNNs) trained by adversarial training (AT) usually suffer from a significant robust generalization gap, i.e., DNNs achieve high training robustness but low test robustness. In this paper, we propose a generic method to boost the robust generalization of AT methods from the novel perspective of attribution span. To this end, compared with standard DNNs, we discover that the generalization gap of adversarially trained DNNs is caused by the smaller attribution span on the input image. In other words, adversarially trained DNNs tend to focus on specific visual concepts on training images, limiting their test robustness. In this way, to enhance the robustness, we propose an effective method to enlarge the learned attribution span. Besides, we use hybrid feature statistics for feature fusion to enrich the diversity of features. Extensive experiments show that our method can effectively improve the robustness of adversarially trained DNNs, outperforming previous SOTA methods. Furthermore, we provide a theoretical analysis of our method to prove its effectiveness.

Adversarial Normalization: I Can Visualize Everything (ICE)
Choi, HoyoungandJin, SeungwanandHan, Kyungsik



Research question: How to improve explainability visualization for vision transformers.
Motivation: Existing explainability visualization methods for vision transformers face challenges such as structural dependence, instability of non-linearities during learning, and the limited reflection of relevance in self-attention scores.
Method: A novel method named ICE is proposed that enables a model to directly predict a class for each patch in an image, advancing effective visualization for vision transformers. It distinguishes background from foreground regions by predicting a background class for patches that do not determine the image class.
Results: On the ImageNet-Segmentation dataset, ICE outperformed the other explainability visualization methods in all four cases. On the CUB-200-2011 and PASCAL VOC07/12 datasets, ICE achieved performance comparable to state-of-the-art methods. Incorporating ICE into the DeiT-S encoder improved efficiency by 44.01% on the ImageNet dataset, with performance and efficiency comparable to the state-of-the-art pruning model EViT, demonstrating the effectiveness of ICE.

Vision transformers use [CLS] tokens to predict image classes. Their explainability visualization has been studied using relevant information from [CLS] tokens or focusing on attention scores during self-attention. Such visualization, however, is challenging because of the dependence of the structure of a vision transformer on skip connections and attention operators, the instability of non-linearities in the learning process, and the limited reflection of self-attention scores on relevance. We argue that the output vectors for each input patch token in a vision transformer retain the image information of each patch location, which can facilitate the prediction of an image class. In this paper, we propose ICE (Adversarial Normalization: I Can visualize Everything), a novel method that enables a model to directly predict a class for each patch in an image; thus, advancing the effective visualization of the explainability of a vision transformer. Our method distinguishes background from foreground regions by predicting background classes for patches that do not determine image classes. We used the DeiT-S model, the most representative model employed in studies on the explainability visualization of vision transformers. On the ImageNet-Segmentation dataset, ICE outperformed all explainability visualization methods for four cases depending on the model size. We also conducted quantitative and qualitative analyses on the tasks of weakly-supervised object localization and unsupervised object discovery. On the CUB-200-2011 and PASCAL VOC07/12 datasets, ICE achieved comparable performance to the state-of-the-art methods. We incorporated ICE into the encoder of DeiT-S and improved efficiency by 44.01% on the ImageNet dataset over that achieved by the original DeiT-S model. We showed performance on the accuracy and efficiency comparable to EViT, the state-of-the-art pruning model, demonstrating the effectiveness of ICE.
The code is available at https://github.com/Hanyang-HCC-Lab/ICE.

Reinforcement Learning-Based Black-Box Model Inversion Attacks
Han, GyojinandChoi, JaehyunandLee, HaeilandKim, Junmo



Research question: Existing black-box model inversion attacks either cannot guarantee completing the attack process within a predetermined number of queries or cannot reach the same performance level as white-box attacks.
Motivation: To overcome these limitations, a reinforcement-learning-based black-box model inversion attack is proposed.
Method: The latent space search is formulated as a Markov Decision Process (MDP) problem and solved with reinforcement learning. The method uses the confidence scores of the generated images to provide rewards to the agent.
Results: Experimental results on various datasets and models show that the attack successfully recovers the private information of the target model, achieving state-of-the-art attack performance.

Model inversion attacks are a type of privacy attack that reconstructs private data used to train a machine learning model, solely by accessing the model. Recently, white-box model inversion attacks leveraging Generative Adversarial Networks (GANs) to distill knowledge from public datasets have been receiving great attention because of their excellent attack performance. On the other hand, current black-box model inversion attacks that utilize GANs suffer from issues such as being unable to guarantee the completion of the attack process within a predetermined number of query accesses or achieve the same level of performance as white-box attacks. To overcome these limitations, we propose a reinforcement learning-based black-box model inversion attack. We formulate the latent space search as a Markov Decision Process (MDP) problem and solve it with reinforcement learning. Our method utilizes the confidence scores of the generated images to provide rewards to an agent. Finally, the private data can be reconstructed using the latent vectors found by the agent trained in the MDP. The experiment results on various datasets and models demonstrate that our attack successfully recovers the private information of the target model by achieving state-of-the-art attack performance. We emphasize the importance of studies on privacy-preserving machine learning by proposing a more advanced black-box model inversion attack.

Learning a Practical SDR-to-HDRTV Up-Conversion Using New Dataset and Degradation Models
Guo, ChengandFan, LeidongandXue, ZiyuandJiang, Xiuhua



Research question: In the media industry, how to up-convert SDR video to HDRTV for users who own HDR-WCG TVs.
Motivation: Since most off-the-shelf footage is still SDR while users own HDR-WCG TVs, an effective SDR-to-HDRTV up-conversion method is needed.
Method: A new HDRTV dataset (dubbed HDRTV4K) and new HDR-to-SDR degradation models are proposed. These data are used to train a luminance-segmented network (LSN) consisting of a global mapping trunk and two Transformer branches that handle the bright and dark luminance ranges, respectively. The assessment criteria are also updated with tailored metrics and a subjective experiment.
Results: Experimental results show that the approach significantly improves the viewing experience, and its effectiveness is demonstrated by ablation studies.

In the media industry, the demand for SDR-to-HDRTV up-conversion arises when users possess HDR-WCG (high dynamic range-wide color gamut) TVs while most off-the-shelf footage is still in SDR (standard dynamic range). The research community has started tackling this low-level vision task with learning-based approaches. Yet, when applied to real SDR, current methods tend to produce dim and desaturated results, making nearly no improvement on viewing experience. Different from other network-oriented methods, we attribute such deficiency to the training set (HDR-SDR pairs). Consequently, we propose a new HDRTV dataset (dubbed HDRTV4K) and new HDR-to-SDR degradation models. These are then used to train a luminance-segmented network (LSN) consisting of a global mapping trunk, and two Transformer branches on the bright and dark luminance ranges. We also update assessment criteria with tailored metrics and a subjective experiment. Finally, ablation studies are conducted to prove the effectiveness. Our work is available at: https://github.com/AndreGuo/HDRTVDM.

Patch-Craft Self-Supervised Training for Correlated Image Denoising
Vaksman, GregoryandElad, Michael



Research question: How to train an image denoising model that handles unknown correlated noise without knowledge of the noise model or access to ground-truth targets.
Motivation: Existing image denoising methods require pairs of corrupted images and corresponding ground-truth targets, but in many applications such data does not exist.
Method: A novel self-supervised training technique is proposed that captures easily acquired noisy bursts and constructs artificial patch-craft images as training targets.
Results: Extensive experiments with synthetic and real image noise demonstrate the method's effectiveness on unknown correlated noise.

Supervised neural networks are known to achieve excellent results in various image restoration tasks. However, such training requires datasets composed of pairs of corrupted images and their corresponding ground truth targets. Unfortunately, such data is not available in many applications. For the task of image denoising in which the noise statistics is unknown, several self-supervised training methods have been proposed for overcoming this difficulty. Some of these require knowledge of the noise model, while others assume that the contaminating noise is uncorrelated, both assumptions are too limiting for many practical needs. This work proposes a novel self-supervised training technique suitable for the removal of unknown correlated noise. The proposed approach neither requires knowledge of the noise model nor access to ground truth targets. The input to our algorithm consists of easily captured bursts of noisy shots. Our algorithm constructs artificial patch-craft images from these bursts by patch matching and stitching, and the obtained crafted images are used as targets for the training. Our method does not require registration of the different images within the burst. We evaluate the proposed framework through extensive experiments with synthetic and real image noise.
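The patch matching and stitching step can be sketched as a brute-force nearest-patch search between two shots of a burst: each patch of one shot is replaced by its most similar patch from the other shot, so the target's noise is independent of the input's. The non-overlapping tiling and all names below are illustrative simplifications of the paper's procedure:

```python
import numpy as np

def nearest_patch_target(img, other, patch=4):
    """Build a 'patch-craft' training target for `img` by stitching, for
    each non-overlapping patch, the most similar patch found anywhere in
    `other` (another noisy shot of the same burst). Brute-force O(n^2)
    search for clarity; assumes H and W are divisible by `patch`."""
    H, W = img.shape
    # Collect all candidate patches (all positions) from the other shot.
    cands = np.stack([other[i:i + patch, j:j + patch]
                      for i in range(H - patch + 1)
                      for j in range(W - patch + 1)])   # (K, p, p)
    flat = cands.reshape(len(cands), -1)
    out = np.zeros_like(img)
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            q = img[i:i + patch, j:j + patch].reshape(-1)
            k = np.argmin(((flat - q) ** 2).sum(axis=1))
            out[i:i + patch, j:j + patch] = cands[k]
    return out

# Two noisy shots of the same underlying scene (a horizontal gradient).
rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0, 1, 16), (16, 1))
shot_a = clean + 0.05 * rng.normal(size=clean.shape)
shot_b = clean + 0.05 * rng.normal(size=clean.shape)
target = nearest_patch_target(shot_a, shot_b)
```

Training a denoiser with `shot_a` as input and `target` as label then behaves like noise-to-noise supervision, without requiring registration between the shots.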

Single Image Backdoor Inversion via Robust Smoothed Classifiers
Sun, MingjieandKolter, Zico



Research question: How to recover a backdoor trigger inserted into a machine learning model through an optimization process that flips a set of clean images into the target class.
Motivation: Although backdoor inversion has become the pillar of many backdoor detection and defense methods, little is known about how many clean images are needed to successfully recover a backdoor.
Method: A method named SmoothInv is proposed, which first constructs a robust smoothed version of the backdoored classifier and then performs guided image synthesis toward the target class to reveal the backdoor pattern.
Results: Experimental results show that, compared with existing methods, SmoothInv reliably recovers successful backdoors from a single image while maintaining high fidelity to the original backdoor. It is also shown how to identify the backdoored target class from the backdoored classifier. Finally, two countermeasures to the method are proposed, and SmoothInv is shown to remain robust in the face of an adaptive attacker.

Backdoor inversion, the process of finding a backdoor trigger inserted into a machine learning model, has become the pillar of many backdoor detection and defense methods. Previous works on backdoor inversion often recover the backdoor through an optimization process to flip a support set of clean images into the target class. However, it is rarely studied and understood how large this support set should be to recover a successful backdoor. In this work, we show that one can reliably recover the backdoor trigger with as few as a single image. Specifically, we propose the SmoothInv method, which first constructs a robust smoothed version of the backdoored classifier and then performs guided image synthesis towards the target class to reveal the backdoor pattern. SmoothInv requires neither an explicit modeling of the backdoor via a mask variable, nor any complex regularization schemes, which have become standard practice in backdoor inversion methods. We perform both quantitative and qualitative studies on backdoored classifiers from previously published backdoor attacks. We demonstrate that compared to existing methods, SmoothInv is able to recover successful backdoors from single images, while maintaining high fidelity to the original backdoor. We also show how we identify the target backdoored class from the backdoored classifier. Last, we propose and analyze two countermeasures to our approach and show that SmoothInv remains robust in the face of an adaptive attacker. Our code is available at https://github.com/locuslab/smoothinv.
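The "robust smoothed version" in the first step is in the spirit of randomized smoothing: classify by majority vote over noise-perturbed copies of the input. A generic sketch with a toy stand-in base classifier follows (not the paper's exact construction; names and parameters are illustrative):

```python
import numpy as np

def smoothed_predict(classify, x, sigma=0.25, n_samples=200, rng=None):
    """Smoothed classifier g(x) = argmax_c P(f(x + noise) = c): vote over
    Gaussian-perturbed copies of x. `classify` is any base classifier
    mapping an array to an integer label."""
    rng = rng if rng is not None else np.random.default_rng(0)
    votes = {}
    for _ in range(n_samples):
        label = classify(x + sigma * rng.normal(size=x.shape))
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Toy base classifier: thresholds the mean pixel value.
base = lambda z: int(z.mean() > 0.5)
x = np.full((8, 8), 0.8)
pred = smoothed_predict(base, x)
```

Because the smoothed model's prediction is stable under small perturbations, gradient-guided synthesis against it tends to surface the backdoor pattern rather than brittle adversarial noise.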

Evading DeepFake Detectors via Adversarial Statistical Consistency
Hou, YangandGuo, QingandHuang, YihaoandXie, XiaofeiandMa, LeiandZhao, Jianjun



Research question: With the rapid development of DeepFake techniques, effectively detecting forged images has become an important problem.
Motivation: Current DeepFake detection methods mainly rely on statistical differences between real and generated images in the spatial and frequency domains, but these methods are often easily fooled by the latest DeepFake techniques.
Method: A statistical consistency attack (StatAttack) against DeepFake detectors is proposed, which adds statistically sensitive natural degradations (such as exposure, blur, and noise) to generated fake images in an adversarial way and uses a distribution-aware loss to optimize these degradations, so that the feature distribution of the generated adversarial examples approaches that of real natural images. The method is further extended to a more powerful multi-layer degradation version (MStatAttack).
Results: Extensive experiments on four spatial-based detectors and two frequency-based detectors show that the proposed method effectively fools existing DeepFake detectors in both white-box and black-box settings.

In recent years, as various realistic face forgery techniques known as DeepFake improve by leaps and bounds, more and more DeepFake detection techniques have been proposed. These methods typically rely on detecting statistical differences between natural (i.e., real) and DeepFake-generated images in both spatial and frequency domains. In this work, we propose to explicitly minimize the statistical differences to evade state-of-the-art DeepFake detectors. To this end, we propose a statistical consistency attack (StatAttack) against DeepFake detectors, which contains two main parts. First, we select several statistical-sensitive natural degradations (i.e., exposure, blur, and noise) and add them to the fake images in an adversarial way. Second, we find that the statistical differences between natural and DeepFake images are positively associated with the distribution shifting between the two kinds of images, and we propose to use a distribution-aware loss to guide the optimization of different degradations. As a result, the feature distributions of generated adversarial examples are close to those of natural images. Furthermore, we extend the StatAttack to a more powerful version, MStatAttack, where we extend the single-layer degradation to multi-layer degradations sequentially and use the loss to tune the combination weights jointly. Comprehensive experimental results on four spatial-based detectors and two frequency-based detectors with four datasets demonstrate the effectiveness of our proposed attack method in both white-box and black-box settings.

OCTET: Object-Aware Counterfactual Explanations
Zemni, MehdiandChen, Micka\"elandZablocki, \'EloiandBen-Younes, H\'ediandP\'erez, PatrickandCord, Matthieu



Research question: How to provide explainability for deep vision models, especially in complex urban scenes.
Motivation: Deep vision models are widely deployed in safety-critical applications such as autonomous driving, making their explainability a pressing concern.
Method: An object-centric framework for counterfactual explanation generation is proposed, which encodes the query image into a latent space amenable to object-level manipulation, giving users control over which search directions to explore during counterfactual generation.
Results: A series of experiments on counterfactual explanation benchmarks for driving scenes shows that the method applies beyond classification, for example to explaining semantic segmentation models, and a user study measures the usefulness of counterfactual explanations for understanding a decision model.

Nowadays, deep vision models are being widely deployed in safety-critical applications, e.g., autonomous driving, and explainability of such models is becoming a pressing concern. Among explanation methods, counterfactual explanations aim to find minimal and interpretable changes to the input image that would also change the output of the model to be explained. Such explanations point end-users at the main factors that impact the decision of the model. However, previous methods struggle to explain decision models trained on images with many objects, e.g., urban scenes, which are more difficult to work with but also arguably more critical to explain. In this work, we propose to tackle this issue with an object-centric framework for counterfactual explanation generation. Our method, inspired by recent generative modeling works, encodes the query image into a latent space that is structured in a way to ease object-level manipulations. Doing so, it provides the end-user with control over which search directions (e.g., spatial displacement of objects, style modification, etc.) are to be explored during the counterfactual generation. We conduct a set of experiments on counterfactual explanation benchmarks for driving scenes, and we show that our method can be adapted beyond classification, e.g., to explain semantic segmentation models. To complete our analysis, we design and run a user study that measures the usefulness of counterfactual explanations in understanding a decision model. Code is available at https://github.com/valeoai/OCTET.

Polarized Color Image Denoising
Li, ZhuoxiaoandJiang, HaiyangandCao, MingdengandZheng, Yinqiang



Research question: Single-chip polarized color photography provides both visual texture and object surface information in one snapshot, but the additional directional polarizing filter array lowers the photon count and signal-to-noise ratio, producing noisy images that degrade polarization analysis.
Motivation: Traditional image processing pipelines struggle with this challenge because the physical constraints exerted implicitly in the channels are excessively complicated.
Method: A noise modeling method for realistic data synthesis is proposed, together with a powerful network structure inspired by the vision Transformer.
Results: A real-world polarized color image dataset of paired short-exposure noisy raw images and long-exposure reference images is captured for experimental evaluation, demonstrating the effectiveness of the data synthesis and polarized color image denoising approaches.

Single-chip polarized color photography provides both visual textures and object surface information in one snapshot. However, the use of an additional directional polarizing filter array tends to lower photon count and SNR, when compared to conventional color imaging. As a result, such a bilayer structure usually leads to unpleasant noisy images and undermines performance of polarization analysis, especially in low-light conditions. It is a challenge for traditional image processing pipelines owing to the fact that the physical constraints exerted implicitly in the channels are excessively complicated. In this paper, we propose to tackle this issue through a noise modeling method for realistic data synthesis and a powerful network structure inspired by vision Transformer. A real-world polarized color image dataset of paired raw short-exposed noisy images and long-exposed reference images is captured for experimental evaluation, which has demonstrated the effectiveness of our approaches for data synthesis and polarized color image denoising.

A Unified HDR Imaging Method With Pixel and Patch Level
Yan, QingsenandChen, WeiyeandZhang, SongandZhu, YuandSun, JinqiuandZhang, Yanning



Research question: How to effectively map low dynamic range (LDR) images to high dynamic range (HDR), particularly in dynamic scenes, while avoiding ghosting caused by object motion or camera jitter.
Motivation: Although several deep-neural-network (DNN) based methods have been proposed to alleviate ghosting, they cannot generate satisfactory results when motion and saturation occur.
Method: A hybrid HDR deghosting network, called HyHDRNet, is proposed to learn the complicated relationship between reference and non-reference images and generate visually pleasing HDR images. The network consists of a content-alignment subnetwork and a Transformer-based fusion subnetwork.
Results: The method is tested on four widely used public HDR image deghosting datasets. Experiments show that HyHDRNet outperforms state-of-the-art methods both quantitatively and qualitatively, producing appealing HDR visualizations with unified textures and colors.

Mapping Low Dynamic Range (LDR) images with different exposures to High Dynamic Range (HDR) remains nontrivial and challenging on dynamic scenes due to ghosting caused by object motion or camera jittering. With the success of Deep Neural Networks (DNNs), several DNN-based methods have been proposed to alleviate ghosting, but they cannot generate satisfactory results when motion and saturation occur. To generate visually pleasing HDR images in various cases, we propose a hybrid HDR deghosting network, called HyHDRNet, to learn the complicated relationship between reference and non-reference images. The proposed HyHDRNet consists of a content alignment subnetwork and a Transformer-based fusion subnetwork. Specifically, to effectively avoid ghosting from the source, the content alignment subnetwork uses patch aggregation and ghost attention to integrate similar content from other non-reference images at the patch level and suppress undesired components at the pixel level. To achieve mutual guidance between patch-level and pixel-level, we leverage a gating module to sufficiently swap useful information both in ghosted and saturated regions. Furthermore, to obtain a high-quality HDR image, the Transformer-based fusion subnetwork uses a Residual Deformable Transformer Block (RDTB) to adaptively merge information for different exposed regions. We examined the proposed method on four widely used public HDR image deghosting datasets. Experiments demonstrate that HyHDRNet outperforms state-of-the-art methods both quantitatively and qualitatively, achieving appealing HDR visualization with unified textures and colors.

Zero-Shot Dual-Lens Super-Resolution
Xu, Ruikang and Yao, Mingde and Xiong, Zhiwei



Research question: How to exploit the asymmetric dual-lens configuration readily available on mobile devices for super-resolution of the same scene.
Motivation: Due to the unknown acquisition process (e.g., tiny camera motion), the degradation for modeling realistic super-resolution is image-specific even on the same device.
Method: This paper proposes a zero-shot dual-lens super-resolution solution (ZeDuSR), which learns an image-specific SR model using only the dual-lens pair available at test time.
Results: With the proposed degradation-invariant alignment method and degradation-aware training strategy, ZeDuSR fully exploits the information within a single dual-lens pair; experiments show it outperforms existing solutions on both synthesized and real-world dual-lens datasets.

The asymmetric dual-lens configuration is commonly available on mobile devices nowadays, which naturally stores a pair of wide-angle and telephoto images of the same scene to support realistic super-resolution (SR). Even on the same device, however, the degradation for modeling realistic SR is image-specific due to the unknown acquisition process (e.g., tiny camera motion). In this paper, we propose a zero-shot solution for dual-lens SR (ZeDuSR), where only the dual-lens pair at test time is used to learn an image-specific SR model. As such, ZeDuSR adapts itself to the current scene without using external training data, and thus gets rid of generalization difficulty. However, there are two major challenges to achieving this goal: 1) dual-lens alignment while keeping the realistic degradation, and 2) effective usage of highly limited training data. To overcome these two challenges, we propose a degradation-invariant alignment method and a degradation-aware training strategy to fully exploit the information within a single dual-lens pair. Extensive experiments validate the superiority of ZeDuSR over existing solutions on both synthesized and real-world dual-lens datasets.

Fantastic Breaks: A Dataset of Paired 3D Scans of Real-World Broken Objects and Their Complete Counterparts
Lamb, Nikolas and Palmer, Cameron and Molloy, Benjamin and Banerjee, Sean and Banerjee, Natasha Kholgade



Research question: Automated shape repair approaches lack datasets that describe real-world broken geometry.
Motivation: We present Fantastic Breaks (and Where to Find Them), a dataset containing scanned, waterproofed, and cleaned 3D meshes of 150 broken objects, each paired and geometrically aligned with its complete counterpart.
Method: Through a detailed analysis of fracture geometry, we reveal differences between Fantastic Breaks and synthetic fracture datasets generated using geometric and physics-based methods. We evaluate shape repair using multiple learning-based approaches pre-trained on synthetic datasets and re-trained on a subset of Fantastic Breaks.
Results: Experimental results show that Fantastic Breaks can substantially improve shape repair.

Automated shape repair approaches currently lack access to datasets that describe real-world damaged geometry. We present Fantastic Breaks (and Where to Find Them: https://terascale-all-sensing-research-studio.github.io/FantasticBreaks), a dataset containing scanned, waterproofed, and cleaned 3D meshes for 150 broken objects, paired and geometrically aligned with complete counterparts. Fantastic Breaks contains class and material labels, proxy repair parts that join to broken meshes to generate complete meshes, and manually annotated fracture boundaries. Through a detailed analysis of fracture geometry, we reveal differences between Fantastic Breaks and synthetic fracture datasets generated using geometric and physics-based methods. We show experimental shape repair evaluation with Fantastic Breaks using multiple learning-based approaches pre-trained with synthetic datasets and re-trained with a subset of Fantastic Breaks.

topic-7

Topic words :  performance,  time,  training,  neural,  efficient,  network,  memory,  parameters

Frame Flexible Network
Zhang, Yitian and Bai, Yue and Liu, Chang and Wang, Huan and Li, Sheng and Fu, Yun



Research question: Existing video recognition algorithms use separate training pipelines for inputs with different frame counts, which requires repetitive training operations and multiplied storage costs.
Motivation: If a model is evaluated with frame counts not used during training, its performance drops significantly (see Fig. 1), a phenomenon we summarize as Temporal Frequency Deviation. To fix this, we propose a general framework, Frame Flexible Network (FFN), which not only enables a model to be evaluated at different frame counts to adjust its computation, but also significantly reduces the memory cost of storing multiple models.
Method: Concretely, FFN integrates several sets of training sequences, learns temporal-frequency-invariant representations via Multi-Frequency Alignment (MFAL), and further strengthens representation ability with Multi-Frequency Adaptation (MFAD).
Results: Comprehensive empirical validation across various architectures and popular benchmarks demonstrates the effectiveness and generalization of FFN (e.g., a 7.08/5.15/2.17% performance gain at Frame 4/8/16 over Uniformer on the Something-Something V1 dataset). Code is available at https://github.com/BeSpontaneous/FFN.

Existing video recognition algorithms always conduct different training pipelines for inputs with different frame numbers, which requires repetitive training operations and multiplying storage costs. If we evaluate the model using other frames which are not used in training, we observe the performance will drop significantly (see Fig. 1), which is summarized as the Temporal Frequency Deviation phenomenon. To fix this issue, we propose a general framework, named Frame Flexible Network (FFN), which not only enables the model to be evaluated at different frames to adjust its computation, but also reduces the memory costs of storing multiple models significantly. Concretely, FFN integrates several sets of training sequences, involves Multi-Frequency Alignment (MFAL) to learn temporal frequency invariant representations, and leverages Multi-Frequency Adaptation (MFAD) to further strengthen the representation abilities. Comprehensive empirical validations using various architectures and popular benchmarks solidly demonstrate the effectiveness and generalization of FFN (e.g., 7.08/5.15/2.17% performance gain at Frame 4/8/16 on Something-Something V1 dataset over Uniformer). Code is available at https://github.com/BeSpontaneous/FFN.

Minimizing the Accumulated Trajectory Error To Improve Dataset Distillation
Du, Jiawei and Jiang, Yidi and Tan, Vincent Y. F. and Zhou, Joey Tianyi and Li, Haizhou



Research question: Processing large-scale real-world data demands substantial computation, storage, and training, as well as the search for good neural architectures.
Motivation: Dataset distillation addresses this by distilling the information of large real-world datasets into small, compact synthetic datasets, reducing the data-processing burden.
Method: Existing methods mainly learn the synthetic dataset by matching the gradients obtained during training on real versus synthetic data. However, these gradient-matching methods suffer from an accumulated trajectory error caused by the discrepancy between distillation and subsequent evaluation. We therefore propose a novel approach that encourages the optimization algorithm to seek a flat trajectory.
Results: Our method, Flat Trajectory Distillation (FTD), regularizes toward a flat trajectory so that weights trained on synthetic data are robust to accumulated-error perturbations. Experiments show that FTD boosts gradient-matching methods by up to 4.7% on a higher-resolution subset of the ImageNet dataset. We also validate the method's effectiveness and generalizability on datasets of different resolutions and demonstrate its applicability to neural architecture search.

Model-based deep learning has achieved astounding successes due in part to the availability of large-scale real-world data. However, processing such massive amounts of data comes at a considerable cost in terms of computations, storage, training and the search for good neural architectures. Dataset distillation has thus recently come to the fore. This paradigm involves distilling information from large real-world datasets into tiny and compact synthetic datasets such that processing the latter yields similar performances as the former. State-of-the-art methods primarily rely on learning the synthetic dataset by matching the gradients obtained during training between the real and synthetic data. However, these gradient-matching methods suffer from the accumulated trajectory error caused by the discrepancy between the distillation and subsequent evaluation. To alleviate the adverse impact of this accumulated trajectory error, we propose a novel approach that encourages the optimization algorithm to seek a flat trajectory. We show that the weights trained on synthetic data are robust against perturbations from the accumulated errors, thanks to the regularization towards a flat trajectory. Our method, called Flat Trajectory Distillation (FTD), is shown to boost the performance of gradient-matching methods by up to 4.7% on a subset of images of the ImageNet dataset with higher resolution images. We also validate the effectiveness and generalizability of our method with datasets of different resolutions and demonstrate its applicability to neural architecture search.
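For context, the gradient-matching objective that FTD regularizes can be sketched as a layer-wise cosine distance between gradients on real and synthetic data (the flat-trajectory term itself is omitted here; all names are illustrative, not the authors' code):

```python
import numpy as np

def gradient_matching_loss(grads_real, grads_syn):
    """Mean layer-wise cosine distance between gradients computed on real
    data and on the synthetic dataset being distilled."""
    total = 0.0
    for g_r, g_s in zip(grads_real, grads_syn):
        g_r, g_s = g_r.ravel(), g_s.ravel()
        cos = g_r @ g_s / (np.linalg.norm(g_r) * np.linalg.norm(g_s) + 1e-12)
        total += 1.0 - cos  # 0 when gradients align, 2 when opposite
    return total / len(grads_real)
```

Minimizing this over the synthetic images pulls their training gradients toward the real ones; FTD's contribution is to additionally bias the expert trajectory being matched toward flat regions of the loss landscape.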

Fast Point Cloud Generation With Straight Flows
Wu, Lemeng and Wang, Dilin and Gong, Chengyue and Liu, Xingchao and Xiong, Yunyang and Ranjan, Rakesh and Krishnamoorthi, Raghuraman and Chandra, Vikas and Liu, Qiang



Research question: Existing diffusion models require thousands of iterative denoising steps to generate high-quality point cloud samples, which limits their use in many real-time 3D applications.
Motivation: To address this, we propose Point Straight Flow (PSF), a model that achieves excellent performance in a single step by optimizing the curvy learned trajectory into a straight path.
Method: We reformulate the standard diffusion model so that its learned path is straightened from a curve into a line. We further develop a distillation strategy that shortens the straight path to a single step without performance loss, enabling latency-constrained real-time 3D applications.
Results: Evaluations on multiple 3D tasks show that PSF performs comparably to the standard diffusion model and outperforms other efficient 3D point cloud generation methods. On real-world applications such as point cloud completion and training-free text-guided generation in low-latency settings, PSF performs favorably.

Diffusion models have emerged as a powerful tool for point cloud generation. A key component that drives the impressive performance for generating high-quality samples from noise is the iterative denoising over thousands of steps. While beneficial, the complexity of learning steps has limited its applications to many real-world 3D tasks. To address this limitation, we propose Point Straight Flow (PSF), a model that exhibits impressive performance using one step. Our idea is based on the reformulation of the standard diffusion model, which optimizes the curvy learning trajectory into a straight path. Further, we develop a distillation strategy to shorten the straight path into one step without a performance loss, enabling applications to real-world 3D tasks with latency constraints. We perform evaluations on multiple 3D tasks and find that our PSF performs comparably to the standard diffusion model, outperforming other efficient 3D point cloud generation methods. On real-world applications such as point cloud completion and training-free text-guided generation in a low-latency setup, PSF performs favorably.
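The straight-flow idea can be illustrated on a toy problem where the constant velocity v = x1 - x0 of the straight path is exactly linear in the noise, so a single Euler step reproduces the paired data (a rectified-flow-style sketch with illustrative names, not the paper's network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "dataset": each data point is a known transform of its paired noise,
# so the straight-line velocity v = x1 - x0 is exactly linear in x0.
x0 = rng.normal(size=(500, 2))   # noise samples
x1 = 2.0 * x0 + 1.0              # paired data samples

# For a straight path x_t = (1 - t) x0 + t x1, the velocity is constant:
v_target = x1 - x0

# Fit a linear velocity model v(x) = x @ A + b by least squares.
X = np.hstack([x0, np.ones((len(x0), 1))])
W, *_ = np.linalg.lstsq(X, v_target, rcond=None)

def sample_one_step(z):
    """One Euler step along the learned straight flow: x1 ~ x0 + v(x0)."""
    v = np.hstack([z, np.ones((len(z), 1))]) @ W
    return z + v

z = rng.normal(size=(4, 2))
x_gen = sample_one_step(z)
```

In the actual method a neural velocity field replaces the linear map, and a distillation stage compresses the (already nearly straight) multi-step trajectory into the single step shown here.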

Achieving a Better Stability-Plasticity Trade-Off via Auxiliary Networks in Continual Learning
Kim, Sanghwan and Noci, Lorenzo and Orvieto, Antonio and Hofmann, Thomas



Research question: In contrast to the natural human ability to learn new tasks sequentially, neural networks suffer from catastrophic forgetting: performance on old tasks drops dramatically after the model is optimized for a new task.
Motivation: To address catastrophic forgetting, the continual learning community has proposed several solutions that aim to give networks the ability to learn the current task (plasticity) while retaining high accuracy on previous tasks (stability). However, the plasticity-stability trade-off is still far from solved, and its underlying mechanism is poorly understood.
Method: This paper proposes Auxiliary Network Continual Learning (ANCL), a novel method that adds a plasticity-promoting auxiliary network to a continually learned model that mainly focuses on stability. Concretely, the proposed framework materializes in a regularizer that naturally interpolates between plasticity and stability.
Results: Experiments show that ANCL surpasses strong baselines in both task-incremental and class-incremental scenarios. Through extensive analyses of ANCL solutions, we identify essential principles underlying the stability-plasticity trade-off.

In contrast to the natural capabilities of humans to learn new tasks in a sequential fashion, neural networks are known to suffer from catastrophic forgetting, where the model's performances on old tasks drop dramatically after being optimized for a new task. To address this, the continual learning (CL) community has proposed several solutions aiming to equip the neural network with the ability to learn the current task (plasticity) while still achieving high accuracy on the previous tasks (stability). Despite remarkable improvements, the plasticity-stability trade-off is still far from being solved, and its underlying mechanism is poorly understood. In this work, we propose Auxiliary Network Continual Learning (ANCL), a novel method that applies an additional auxiliary network which promotes plasticity to the continually learned model which mainly focuses on stability. More concretely, the proposed framework materializes in a regularizer that naturally interpolates between plasticity and stability, surpassing strong baselines on task incremental and class incremental scenarios. Through extensive analyses on ANCL solutions, we identify some essential principles beneath the stability-plasticity trade-off.
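With the task loss set aside, an interpolating regularizer of this kind can be sketched as two quadratic pulls, one toward the old (stable) weights and one toward the auxiliary network's (plastic) weights; its minimizer is their convex combination. This is a toy sketch with illustrative names, not the paper's exact formulation:

```python
import numpy as np

def ancl_regularizer(theta, theta_old, theta_aux, lam_stab, lam_plast):
    """Quadratic penalty pulling the weights toward the old (stability)
    model and toward the auxiliary (plasticity) model at the same time."""
    return (lam_stab * np.sum((theta - theta_old) ** 2)
            + lam_plast * np.sum((theta - theta_aux) ** 2))

def closed_form_minimizer(theta_old, theta_aux, lam_stab, lam_plast):
    # With no task loss, the penalty is minimized by a convex combination:
    # the regularizer naturally interpolates stability and plasticity.
    return (lam_stab * theta_old + lam_plast * theta_aux) / (lam_stab + lam_plast)
```

Sweeping the ratio lam_stab / lam_plast traces out the stability-plasticity trade-off that the paper analyzes.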

Power Bundle Adjustment for Large-Scale 3D Reconstruction
Weber, Simon and Demmel, Nikolaus and Chan, Tin Chon and Cremers, Daniel



Research question: This paper proposes Power Bundle Adjustment, an expansion-type algorithm for solving large-scale bundle adjustment problems.
Motivation: Existing iterative methods for large-scale bundle adjustment are slow and limited in accuracy.
Method: Based on the power series expansion of the inverse Schur complement, we introduce a new family of solvers called inverse expansion methods and prove their convergence.
Results: Experiments show that the method outperforms existing iterative methods in both the speed and accuracy of solving the normal equation, and that it can serve as a sub-problem solver within a distributed bundle adjustment framework, significantly improving the speed and accuracy of distributed optimization.

We introduce Power Bundle Adjustment as an expansion type algorithm for solving large-scale bundle adjustment problems. It is based on the power series expansion of the inverse Schur complement and constitutes a new family of solvers that we call inverse expansion methods. We theoretically justify the use of power series and we prove the convergence of our approach. Using the real-world BAL dataset we show that the proposed solver challenges the state-of-the-art iterative methods and significantly accelerates the solution of the normal equation, even for reaching a very high accuracy. This easy-to-implement solver can also complement a recently presented distributed bundle adjustment framework. We demonstrate that employing the proposed Power Bundle Adjustment as a sub-problem solver significantly improves speed and accuracy of the distributed optimization.
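The power-series idea can be sketched on a toy linear system: writing the (Schur-complement-like) matrix as S = I - R, the Neumann series sum_k R^k b approximates S^{-1} b whenever the spectral radius of R is below 1. This is an illustrative sketch of the expansion principle, not the paper's solver:

```python
import numpy as np

def power_series_solve(S, b, order=30):
    """Approximate S^{-1} b via the truncated Neumann series.

    With S = I - R and spectral radius of R below 1,
    S^{-1} = I + R + R^2 + ...; Power Bundle Adjustment applies the same
    expansion to the inverse Schur complement of the BA normal equations.
    """
    n = S.shape[0]
    R = np.eye(n) - S
    x = b.astype(float).copy()
    term = b.astype(float).copy()
    for _ in range(order):
        term = R @ term   # accumulates R^k b without forming R^k
        x += term
    return x
```

Each series term only needs a matrix-vector product, which is why expansion methods scale to very large systems where direct factorization is infeasible.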

Boosting Verified Training for Robust Image Classifications via Abstraction
Zhang, Zhaodi and Xue, Zhiyi and Chen, Yang and Liu, Si and Zhang, Yueling and Liu, Jing and Zhang, Min



Research question: How to improve the robustness of image classifiers.
Motivation: Existing image classifiers are sensitive to small input perturbations, which makes their predictions unstable.
Method: We propose a novel abstraction-based certified training method: perturbed images are mapped into intervals before training, which reduces the variance of the training set and smooths the model's loss landscape, thereby improving robustness. The method also enables a sound and complete black-box verification approach that is independent of the network's type and size.
Results: On a wide range of benchmarks, the method outperforms the state of the art, reducing the verified errors of trained models by up to 95.64%, achieving up to a 602.50x speedup, and scaling to larger models with up to 138 million trainable parameters.

This paper proposes a novel, abstraction-based, certified training method for robust image classifiers. Via abstraction, all perturbed images are mapped into intervals before feeding into neural networks for training. By training on intervals, all the perturbed images that are mapped to the same interval are classified as the same label, rendering the variance of training sets to be small and the loss landscape of the models to be smooth. Consequently, our approach significantly improves the robustness of trained models. Owing to the abstraction, our training method also enables a sound and complete black-box verification approach, which is orthogonal and scalable to arbitrary types of neural networks regardless of their sizes and architectures. We evaluate our method on a wide range of benchmarks in different scales. The experimental results show that our method outperforms the state of the art by (i) reducing the verified errors of trained models up to 95.64%; (ii) totally achieving up to 602.50x speedup; and (iii) scaling up to larger models with up to 138 million trainable parameters. The demo is available at https://github.com/zhangzhaodi233/ABSCERT.git.
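A minimal sketch of the interval idea, assuming an l-infinity perturbation ball and a single affine layer (this is generic interval bound propagation, not the paper's exact training procedure):

```python
import numpy as np

def abstract_input(x, eps):
    """Map every perturbation ||x' - x||_inf <= eps to one interval, so all
    perturbed images in the ball are trained (and classified) identically."""
    return x - eps, x + eps

def interval_affine(lo, hi, W, bias):
    """Propagate an input interval [lo, hi] through x @ W + bias.

    Uses the center/radius form: the output radius is radius @ |W|,
    which soundly bounds every concrete input in the interval.
    """
    center = (hi + lo) / 2.0
    radius = (hi - lo) / 2.0
    c = center @ W + bias
    r = radius @ np.abs(W)
    return c - r, c + r
```

Stacking such layers (with a monotone rule for ReLU) yields certified output bounds for the whole ball, which is what makes the black-box verification sound.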

Post-Training Quantization on Diffusion Models
Shang, Yuzhang and Yuan, Zhihang and Xie, Bin and Wu, Bingzhe and Yan, Yan



Research question: The generation process of existing denoising diffusion models is slow because the iterative noise estimation relies on cumbersome neural networks.
Motivation: To address this, the paper accelerates generation from the perspective of compressing the noise estimation network.
Method: Post-training quantization (PTQ) is introduced to accelerate the generation process of denoising diffusion models. Targeting the multi-time-step structure of diffusion models, the quantized operations, calibration dataset, and calibration metric are explored and improved.
Results: Experiments show that the method can directly quantize full-precision denoising diffusion models into 8-bit models while maintaining or even improving their performance, in a training-free manner. The method can also serve as a plug-and-play module for other fast-sampling methods such as DDIM.

Denoising diffusion (score-based) generative models have recently achieved significant accomplishments in generating realistic and diverse data. These approaches define a forward diffusion process for transforming data into noise and a backward denoising process for sampling data from noise. Unfortunately, the generation process of current denoising diffusion models is notoriously slow due to the lengthy iterative noise estimations, which rely on cumbersome neural networks. It prevents the diffusion models from being widely deployed, especially on edge devices. Previous works accelerate the generation process of diffusion model (DM) via finding shorter yet effective sampling trajectories. However, they overlook the cost of noise estimation with a heavy network in every iteration. In this work, we accelerate generation from the perspective of compressing the noise estimation network. Due to the difficulty of retraining DMs, we exclude mainstream training-aware compression paradigms and introduce post-training quantization (PTQ) into DM acceleration. However, the output distributions of noise estimation networks change with time-step, making previous PTQ methods fail in DMs since they are designed for single-time step scenarios. To devise a DM-specific PTQ method, we explore PTQ on DM in three aspects: quantized operations, calibration dataset, and calibration metric. We summarize and use several observations derived from all-inclusive investigations to formulate our method, which especially targets the unique multi-time-step structure of DMs. Experimentally, our method can directly quantize full-precision DMs into 8-bit models while maintaining or even improving their performance in a training-free manner. Importantly, our method can serve as a plug-and-play module on other fast-sampling methods, such as DDIM.
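For reference, the core PTQ primitive, mapping a full-precision tensor to 8-bit integers with a calibration-derived scale, looks roughly like this (a generic uniform symmetric quantizer, not the paper's DM-specific calibration):

```python
import numpy as np

def quantize_tensor(x, num_bits=8):
    """Uniform symmetric post-training quantization: no retraining, just a
    scale derived from the tensor's (calibration) range."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor; error is at most scale / 2."""
    return q.astype(np.float32) * scale
```

The paper's contribution is what happens around this primitive: because the noise-estimation network's output distribution shifts across time-steps, the calibration data and metric must cover multiple time-steps rather than the single-step setting earlier PTQ methods assume.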

X-Pruner: eXplainable Pruning for Vision Transformers
Yu, Lu and Xiang, Wei



Research question: Vision transformer models dominate a range of tasks but incur high computational costs and heavy memory requirements, making them impractical for deployment on edge platforms.
Motivation: Existing pruning methods cannot explain the relationship between a model's internal units and the target classes, which leads to inferior performance.
Method: We propose a novel explainable pruning framework, X-Pruner, which designs an explainability-aware pruning criterion that measures each prunable unit's contribution to predicting each target class.
Results: Experimental results show that X-Pruner reduces computational costs with only slight performance degradation, outperforming existing black-box methods.

Recently vision transformer models have become prominent models for a range of tasks. These models, however, usually suffer from intensive computational costs and heavy memory requirements, making them impractical for deployment on edge platforms. Recent studies have proposed to prune transformers in an unexplainable manner, which overlook the relationship between internal units of the model and the target class, thereby leading to inferior performance. To alleviate this problem, we propose a novel explainable pruning framework dubbed X-Pruner, which is designed by considering the explainability of the pruning criterion. Specifically, to measure each prunable unit's contribution to predicting each target class, a novel explainability-aware mask is proposed and learned in an end-to-end manner. Then, to preserve the most informative units and learn the layer-wise pruning rate, we adaptively search the layer-wise threshold that differentiates between unpruned and pruned units based on their explainability-aware mask values. To verify and evaluate our method, we apply the X-Pruner on representative transformer models including the DeiT and Swin Transformer. Comprehensive simulation results demonstrate that the proposed X-Pruner outperforms the state-of-the-art black-box methods with significantly reduced computational costs and slight performance degradation.

Hard Sample Matters a Lot in Zero-Shot Quantization
Li, Huantong and Wu, Xiangmiao and Lv, Fanbing and Liao, Daihai and Li, Thomas H. and Zhang, Yonggang and Han, Bo and Tan, Mingkui



Research question: How to use zero-shot quantization (ZSQ) to compress and accelerate deep neural networks when the training data of the full-precision model is inaccessible.
Motivation: In existing ZSQ methods, network quantization is performed with synthetic samples, so the quantized model's performance depends heavily on the quality of those samples. We find, however, that the synthetic samples constructed by these methods are easily fitted by models, leading to significant performance degradation on hard samples.
Method: We propose HArd sample Synthesizing and Training (HAST). Specifically, HAST pays more attention to hard samples when synthesizing them and makes the synthetic samples hard to fit when training quantized models. HAST also aligns the features extracted by the full-precision and quantized models to ensure their similarity.
Results: Extensive experiments show that HAST significantly outperforms existing ZSQ methods, achieving performance comparable to models quantized with real data.

Zero-shot quantization (ZSQ) is promising for compressing and accelerating deep neural networks when the data for training full-precision models are inaccessible. In ZSQ, network quantization is performed using synthetic samples, thus, the performance of quantized models depends heavily on the quality of synthetic samples. Nonetheless, we find that the synthetic samples constructed in existing ZSQ methods can be easily fitted by models. Accordingly, quantized models obtained by these methods suffer from significant performance degradation on hard samples. To address this issue, we propose HArd sample Synthesizing and Training (HAST). Specifically, HAST pays more attention to hard samples when synthesizing samples and makes synthetic samples hard to fit when training quantized models. HAST aligns features extracted by full-precision and quantized models to ensure the similarity between features extracted by these two models. Extensive experiments show that HAST significantly outperforms existing ZSQ methods, achieving performance comparable to models that are quantized with real data.

Neural Rate Estimator and Unsupervised Learning for Efficient Distributed Image Analytics in Split-DNN Models
Ahuja, Nilesh and Datta, Parual and Kanzariya, Bhavya and Somayazulu, V. Srinivasa and Tickoo, Omesh



Research question: How to optimize a Split-DNN pipeline for both compression and task performance in image analytics.
Motivation: Conventional codecs introduce artifacts that degrade the performance of downstream analytic tasks, and Split-DNN computing has emerged as a paradigm to address this.
Method: We propose a high-quality "neural rate estimator" that interprets the low-dimensional bottleneck output as a latent representation of the intermediate feature and casts the rate-distortion optimization problem as training an equivalent variational autoencoder.
Results: Replacing supervised loss terms (such as cross-entropy) with distillation-based losses enables unsupervised training of the bottleneck units without explicit training labels, yielding better task accuracy at lower bitrates on image classification and semantic segmentation tasks.

Thanks to advances in computer vision and AI, there has been a large growth in the demand for cloud-based visual analytics in which images captured by a low-powered edge device are transmitted to the cloud for analytics. Use of conventional codecs (JPEG, MPEG, HEVC, etc.) for compressing such data introduces artifacts that can seriously degrade the performance of the downstream analytic tasks. Split-DNN computing has emerged as a paradigm to address such usages, in which a DNN is partitioned into a client-side portion and a server-side portion. Low-complexity neural networks called 'bottleneck units' are introduced at the split point to transform the intermediate layer features into a lower-dimensional representation better suited for compression and transmission. Optimizing the pipeline for both compression and task-performance requires high-quality estimates of the information-theoretic rate of the intermediate features. Most works on compression for image analytics use heuristic approaches to estimate the rate, leading to suboptimal performance. We propose a high-quality 'neural rate-estimator' to address this gap. We interpret the lower-dimensional bottleneck output as a latent representation of the intermediate feature and cast the rate-distortion optimization problem as one of training an equivalent variational auto-encoder with an appropriate loss function. We show that this leads to improved rate-distortion outcomes. We further show that replacing supervised loss terms (such as cross-entropy loss) by distillation-based losses in a teacher-student framework allows for unsupervised training of bottleneck units without the need for explicit training labels. This makes our method very attractive for real-world deployments where access to labeled training data is difficult or expensive. We demonstrate that our method outperforms several state-of-the-art methods by obtaining improved task accuracy at lower bitrates on image classification and semantic segmentation tasks.

1% VS 100%: Parameter-Efficient Low Rank Adapter for Dense Predictions
Yin, Dongshuo and Yang, Yiran and Wang, Zhechao and Yu, Hongfeng and Wei, Kaiwen and Sun, Xian



Research question: How to efficiently fine-tune large-scale pre-trained vision models for downstream tasks while reducing the number of trainable parameters.
Motivation: Although fine-tuning the whole large model achieves state-of-the-art performance, it is inefficient and requires storing a same-sized new model copy for each task.
Method: We propose LoRand, which generates tiny adapter structures via low-rank synthesis while keeping the original backbone parameters fixed, achieving high parameter sharing by training only 1% to 3% of the pre-trained backbone parameters.
Results: Extensive experiments on object detection, semantic segmentation, and instance segmentation show that LoRand matches standard fine-tuning on COCO and ADE20K and outperforms it on the low-resource PASCAL VOC dataset.

Fine-tuning large-scale pre-trained vision models to downstream tasks is a standard technique for achieving state-of-the-art performance on computer vision benchmarks. However, fine-tuning the whole model with millions of parameters is inefficient as it requires storing a same-sized new model copy for each task. In this work, we propose LoRand, a method for fine-tuning large-scale vision models with a better trade-off between task performance and the number of trainable parameters. LoRand generates tiny adapter structures with low-rank synthesis while keeping the original backbone parameters fixed, resulting in high parameter sharing. To demonstrate LoRand's effectiveness, we implement extensive experiments on object detection, semantic segmentation, and instance segmentation tasks. By only training a small percentage (1% to 3%) of the pre-trained backbone parameters, LoRand achieves comparable performance to standard fine-tuning on COCO and ADE20K and outperforms fine-tuning in low-resource PASCAL VOC dataset.
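The low-rank-adapter idea behind the 1%-3% budget can be sketched as a frozen weight matrix plus a trainable rank-r residual (a generic LoRA-style sketch with illustrative names; LoRand's actual adapter synthesis is more involved):

```python
import numpy as np

class LowRankAdapter:
    """Frozen weight W plus a trainable low-rank residual B @ A.

    Only A and B (rank r) are trained, so the trainable parameters are a
    tiny fraction of the frozen backbone weight.
    """

    def __init__(self, W, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                      # frozen backbone weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))                # zero-init: no change at start

    def forward(self, x):
        return x @ (self.W + self.B @ self.A).T

    def trainable_fraction(self):
        return (self.A.size + self.B.size) / self.W.size
```

With B initialized to zero the adapted layer starts out identical to the pre-trained one, and for a 512x512 weight at rank 4 the trainable fraction is about 1.6%, in line with the paper's budget.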

ResFormer: Scaling ViTs With Multi-Resolution Training
Tian, Rui and Wu, Zuxuan and Dai, Qi and Hu, Han and Qiao, Yu and Jiang, Yu-Gang



Research question: Vision transformers degrade when presented with input resolutions unseen during training.
Motivation: We propose an improved framework based on multi-resolution training to achieve strong performance across a wide spectrum of (mostly unseen) testing resolutions.
Method: ResFormer operates on replicated images at different resolutions and enforces a scale consistency loss to engage interactive information across scales. We also propose a global-local positional embedding strategy that changes smoothly conditioned on input size.
Results: Extensive experiments on ImageNet show that ResFormer scales effectively to a wide range of resolutions. For example, ResFormer-B-MR achieves Top-1 accuracy of 75.86% and 81.72% at relatively low and high resolutions respectively, which is 48% and 7.49% better than DeiT-B. ResFormer is also flexible and extends readily to semantic segmentation, object detection, and video action recognition.

Vision Transformers (ViTs) have achieved overwhelming success, yet they suffer from vulnerable resolution scalability, i.e., the performance drops drastically when presented with input resolutions that are unseen during training. We introduce ResFormer, a framework that is built upon the seminal idea of multi-resolution training for improved performance on a wide spectrum of, mostly unseen, testing resolutions. In particular, ResFormer operates on replicated images of different resolutions and enforces a scale consistency loss to engage interactive information across different scales. More importantly, to alternate among varying resolutions effectively, especially novel ones in testing, we propose a global-local positional embedding strategy that changes smoothly conditioned on input sizes. We conduct extensive experiments for image classification on ImageNet. The results provide strong quantitative evidence that ResFormer has promising scaling abilities towards a wide range of resolutions. For instance, ResFormer-B-MR achieves a Top-1 accuracy of 75.86% and 81.72% when evaluated on relatively low and high resolutions respectively (i.e., 96 and 640), which are 48% and 7.49% better than DeiT-B. Moreover, we demonstrate that ResFormer is flexible and can be easily extended to semantic segmentation, object detection and video action recognition.
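The scale consistency idea, pulling together embeddings of the same image seen at different resolutions, can be sketched with a simple pairwise loss (illustrative, not ResFormer's implementation):

```python
import numpy as np

def scale_consistency_loss(features):
    """Mean pairwise squared distance between per-resolution embeddings.

    `features` is a list of same-dimensional global embeddings, one per
    input resolution of the replicated image; minimizing the loss makes the
    model behave consistently across scales.
    """
    loss, n_pairs = 0.0, 0
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            loss += np.mean((features[i] - features[j]) ** 2)
            n_pairs += 1
    return loss / max(n_pairs, 1)
```

In training this term is added to the per-resolution classification losses, so information learned at one scale regularizes the others.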

You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model
Tang, Shengkun and Wang, Yaqing and Kong, Zhenglun and Zhang, Tianchi and Li, Yao and Ding, Caiwen and Wang, Yanzhi and Liang, Yi and Xu, Dongkuan



Research question: Large transformer models achieve notable improvements on various downstream vision-language tasks, but their size leads to slow inference and increased serving cost.
Motivation: Not all inputs require the same amount of computation, which can waste computational resources; a method is needed to allocate computation dynamically and improve inference efficiency.
Method: We propose a novel early-exiting strategy, MuE, which dynamically skips layers in both the encoder and decoder based on layer-wise input similarity, allowing multiple early exits. By decomposing the image and text modalities in the encoder, MuE can flexibly skip different layers per modality, improving inference efficiency while minimizing the performance drop.
Results: Experimental results show that the method reduces inference time by up to 50% and 40% on the SNLI-VE and MS COCO datasets respectively, while retaining 99% and 96% of the performance.

Large-scale transformer models bring significant improvements for various downstream vision language tasks with a unified architecture. The performance improvements come with increasing model size, resulting in slow inference speed and increased serving cost. While certain predictions benefit from the full complexity of the large-scale model, not all inputs need the same amount of computation, potentially leading to wasted computational resources. To handle this challenge, early exiting is proposed to adaptively allocate computational power according to input complexity and improve inference efficiency. Existing early exiting strategies usually adopt the output confidence of intermediate layers as a proxy of input complexity to decide whether to skip the following layers. However, such strategies cannot be applied to the encoder in the widely used unified architecture with both encoder and decoder, due to the difficulty of estimating output confidence in the encoder. Ignoring early exiting in the encoder component is suboptimal in terms of saving computational power. To handle this challenge, we propose a novel early exiting strategy for unified visual language models, named MuE, which dynamically skips layers in both the encoder and decoder simultaneously based on layer-wise input similarities, with multiple early exits. By decomposing the image and text modalities in the encoder, MuE is flexible and can skip different layers per modality, advancing the inference efficiency while minimizing the performance drop. Experiments on the SNLI-VE and MS COCO datasets show that the proposed approach MuE can reduce inference time by up to 50% and 40% while maintaining 99% and 96% performance respectively.
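The layer-wise-similarity exit criterion can be sketched as follows; the toy rotate/identity layers and the 0.999 threshold are purely illustrative assumptions, not the paper's settings:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def forward_with_early_exit(layers, x, threshold=0.999):
    """Run layers sequentially, exiting once consecutive hidden states are
    nearly identical (layer-wise similarity as an input-complexity proxy,
    which works for encoders where output confidence is unavailable)."""
    h = x
    for depth, layer in enumerate(layers, start=1):
        h_next = layer(h)
        if cosine(h.ravel(), h_next.ravel()) >= threshold:
            return h_next, depth   # representation saturated: skip the rest
        h = h_next
    return h, len(layers)
```

Because the criterion only compares hidden states, it applies symmetrically to encoder and decoder layers, which is what enables exiting in both components.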

PD-Quant: Post-Training Quantization Based on Prediction Difference Metric
Liu, Jiawei and Niu, Lin and Yuan, Zhihang and Yang, Dawei and Wang, Xinggang and Liu, Wenyu



Research question: How to determine appropriate quantization parameters that reduce the size and computational cost of neural networks while preserving prediction accuracy.
Motivation: Existing post-training quantization methods consider only local information and may not yield optimal quantization parameters.
Method: We propose PD-Quant, which determines the quantization parameters using the difference between network predictions before and after quantization, and adjusts the distribution of activations to mitigate the overfitting caused by small calibration sets.
Results: Experiments show that PD-Quant produces better quantization parameters and improves the prediction accuracy of quantized models, especially in low-bit settings. For example, with 2-bit weights and 2-bit activations, PD-Quant pushes the accuracy of ResNet-18 up to 53.14% and RegNetX-600MF up to 40.67%.

Post-training quantization (PTQ) is a neural network compression technique that converts a full-precision model into a quantized model using lower-precision data types. Although it can help reduce the size and computational cost of deep neural networks, it can also introduce quantization noise and reduce prediction accuracy, especially in extremely low-bit settings. How to determine the appropriate quantization parameters (e.g., scaling factors and rounding of weights) is the main problem facing now. Existing methods attempt to determine these parameters by minimizing the distance between features before and after quantization, but such an approach only considers local information and may not yield optimal quantization parameters. We analyze this issue and propose PD-Quant, a method that addresses this limitation by considering global information. It determines the quantization parameters by using the information of differences between network prediction before and after quantization. In addition, PD-Quant can alleviate the overfitting problem in PTQ caused by the small number of calibration sets by adjusting the distribution of activations. Experiments show that PD-Quant leads to better quantization parameters and improves the prediction accuracy of quantized models, especially in low-bit settings. For example, PD-Quant pushes the accuracy of ResNet-18 up to 53.14% and RegNetX-600MF up to 40.67% in weight 2-bit activation 2-bit. The code is released at https://github.com/hustvl/PD-Quant.
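The prediction-difference idea, scoring candidate quantization scales by how much the model's output distribution moves rather than by local feature error, might be sketched like this (a toy linear "model" with illustrative names, not the paper's procedure):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fake_quant(w, scale, num_bits=4):
    """Simulated quantize-dequantize of the weights at a given scale."""
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def pd_search_scale(W, x, candidates, num_bits=4):
    """Pick the weight scale whose quantized model's *predictions* stay
    closest (in KL divergence) to the full-precision ones: a global
    criterion, unlike the local feature distance of earlier PTQ methods."""
    p = softmax(x @ W.T)
    best, best_kl = None, np.inf
    for s in candidates:
        q = softmax(x @ fake_quant(W, s, num_bits).T)
        kl = float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))
        if kl < best_kl:
            best, best_kl = s, kl
    return best, best_kl
```

A grossly oversized scale collapses all weights to zero and hence all predictions to uniform, so the prediction-difference criterion rejects it even though a purely local weight-error metric might not discriminate as sharply.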

Ultra-High Resolution Segmentation With Ultra-Rich Context: A Novel Benchmark
Ji, Deyi and Zhao, Feng and Lu, Hongtao and Tao, Mingyuan and Ye, Jieping



Research question: A large-scale, high-resolution dataset with fine-grained dense annotations is needed to advance Ultra-High Resolution (UHR) segmentation methods.
Motivation: Existing UHR datasets fall short in image count, scene complexity, and context richness, motivating a new dataset.
Method: The authors introduce the URUR dataset, which contains a large number of high-resolution images (3,008 images of size 5,120x5,120) covering complex scenes from 63 cities, rich context (1 million instances across 8 categories), and fine-grained annotation (about 80 billion manually annotated pixels). They also propose WSDNet, a framework that integrates multi-level Discrete Wavelet Transform (DWT) with a Wavelet Smooth Loss (WSL), effectively reducing the computational burden while preserving more spatial detail.
Results: Experiments on several UHR datasets demonstrate WSDNet's state-of-the-art performance and its efficient, strong segmentation results.

With the increasing interest and rapid development of methods for Ultra-High Resolution (UHR) segmentation, a large-scale benchmark covering a wide range of scenes with full fine-grained dense annotations is urgently needed to facilitate the field. To this end, the URUR dataset is introduced, in the meaning of Ultra-High Resolution dataset with Ultra-Rich Context. As the name suggests, URUR contains amounts of images with high enough resolution (3,008 images of size 5,120x5,120), a wide range of complex scenes (from 63 cities), rich-enough context (1 million instances with 8 categories) and fine-grained annotations (about 80 billion manually annotated pixels), which is far superior to all the existing UHR datasets including DeepGlobe, Inria Aerial, UDD, etc. Moreover, we also propose WSDNet, a more efficient and effective framework for UHR segmentation especially with ultra-rich context. Specifically, multi-level Discrete Wavelet Transform (DWT) is naturally integrated to release the computation burden while preserving more spatial details, along with a Wavelet Smooth Loss (WSL) to reconstruct original structured context and texture with a smoothness constraint. Experiments on several UHR datasets demonstrate its state-of-the-art performance. The dataset is available at https://github.com/jankyee/URUR.

Towards High-Quality and Efficient Video Super-Resolution via Spatial-Temporal Data Overfitting
Li, Gen and Ji, Jie and Qin, Minghai and Niu, Wei and Ren, Bin and Afghah, Fatemeh and Guo, Linke and Ma, Xiaolong



Research question: How to use deep learning for video resolution upscaling while reducing the storage and bandwidth resources consumed.
Motivation: Existing video delivery systems split a video into chunks and overfit each chunk with a super-resolution model to improve video quality, but a large number of chunks is needed to guarantee overfitting quality, which increases storage and data-transmission burden.
Method: We propose a novel method that leverages spatial-temporal information to divide a video into chunks accurately, keeping the number of chunks and the model size to a minimum. A data-aware joint training technique further consolidates the method into a single overfitting model, reducing storage requirements even more.
Results: Experiments show strong real-time video super-resolution: compared with the state of the art, the method achieves a 28 fps streaming speed with 41.60 dB PSNR on live video upscaling, which is 14 times faster and 2.29 dB better.

As deep convolutional neural networks (DNNs) are widely used in various fields of computer vision, leveraging the overfitting ability of the DNN to achieve video resolution upscaling has become a new trend in the modern video delivery system. By dividing videos into chunks and overfitting each chunk with a super-resolution model, the server encodes videos before transmitting them to the clients, thus achieving better video quality and transmission efficiency. However, a large number of chunks are expected to ensure good overfitting quality, which substantially increases the storage and consumes more bandwidth resources for data transmission. On the other hand, decreasing the number of chunks through training optimization techniques usually requires high model capacity, which significantly slows down execution speed. To reconcile these goals, we propose a novel method for high-quality and efficient video resolution upscaling tasks, which leverages the spatial-temporal information to accurately divide video into chunks, thus keeping the number of chunks as well as the model size to a minimum. Additionally, we advance our method into a single overfitting model by a data-aware joint training technique, which further reduces the storage requirement with negligible quality drop. We deploy our proposed overfitting models on an off-the-shelf mobile phone, and experimental results show that our method achieves real-time video super-resolution with high video quality. Compared with the state-of-the-art, our method achieves 28 fps streaming speed with 41.60 dB PSNR, which is 14 times faster and 2.29 dB better in the live video resolution upscaling tasks.

Class-Incremental Exemplar Compression for Class-Incremental Learning
Luo, Zilin and Liu, Yaoyao and Schiele, Bernt and Sun, Qianru



Research question: How to perform exemplar-based class-incremental learning (CIL) effectively.
Motivation: Current CIL methods store only a few exemplars of old classes in each incremental phase, which limits the number of available exemplars.
Method: An adaptive mask generation model called class-incremental masking (CIM) generates 0-1 masks from class activation maps (CAM) to compress exemplars by downsampling non-discriminative pixels, thereby breaking the "few-shot" limit.
Results: On high-resolution CIL benchmarks including Food-101, ImageNet-100, and ImageNet-1000, exemplars compressed by CIM achieve new state-of-the-art CIL accuracy, e.g., 4.8 percentage points higher than FOSTER on 10-phase ImageNet-1000.

Exemplar-based class-incremental learning (CIL) finetunes the model with all samples of new classes but few-shot exemplars of old classes in each incremental phase, where the "few-shot" abides by the limited memory budget. In this paper, we break this "few-shot" limit based on a simple yet surprisingly effective idea: compressing exemplars by downsampling non-discriminative pixels and saving "many-shot" compressed exemplars in the memory. Without needing any manual annotation, we achieve this compression by generating 0-1 masks on discriminative pixels from class activation maps (CAM). We propose an adaptive mask generation model called class-incremental masking (CIM) to explicitly resolve two difficulties of using CAM: 1) transforming the heatmaps of CAM to 0-1 masks with an arbitrary threshold leads to a trade-off between the coverage on discriminative pixels and the quantity of exemplars, as the total memory is fixed; and 2) optimal thresholds vary for different object classes, which is particularly obvious in the dynamic environment of CIL. We optimize the CIM model alternatively with the conventional CIL model through a bilevel optimization problem. We conduct extensive experiments on high-resolution CIL benchmarks including Food-101, ImageNet-100, and ImageNet-1000, and show that using the compressed exemplars by CIM can achieve a new state-of-the-art CIL accuracy, e.g., 4.8 percentage points higher than FOSTER on 10-Phase ImageNet-1000. Our code is available at https://github.com/xfflzl/CIM-CIL.

Boost Vision Transformer With GPU-Friendly Sparsity and Quantization
Yu, Chong and Chen, Tao and Gan, Zhongxue and Fan, Jiayuan



Research question: How to fully exploit GPU hardware to accelerate the deployment of vision transformers.
Motivation: The many stacked self-attention and cross-attention blocks in vision transformers involve numerous high-dimensional tensor multiplications, so GPU-accelerated deployment is challenging and rarely studied.
Method: A compression scheme is designed to maximally exploit GPU-friendly 2:4 fine-grained structured sparsity and quantization. The original large dense model is first sparsified by 2:4 structured pruning, and the floating-point sparse model is then quantized to fixed point via sparse-distillation-aware quantization-aware training.
Results: Experiments show the scheme reduces vision transformer model size by 6.4-12.7x and FLOPs by 30.3-62x with almost no accuracy drop on ImageNet classification, COCO detection, and ADE20K segmentation benchmarks. It also improves actual deployment performance by 1.39-1.79x in latency and 3.22-3.43x in throughput.

The transformer extends its success from the language to the vision domain. Because of the numerous stacked self-attention and cross-attention blocks in the transformer, which involve many high-dimensional tensor multiplication operations, the acceleration deployment of vision transformer on GPU hardware is challenging and also rarely studied. This paper thoroughly designs a compression scheme to maximally utilize the GPU-friendly 2:4 fine-grained structured sparsity and quantization. Specifically, an original large model with dense weight parameters is first pruned into a sparse one by 2:4 structured pruning, which considers the GPU's acceleration of 2:4 structured sparse pattern with FP16 data type, then the floating-point sparse model is further quantized into a fixed-point one by sparse-distillation-aware quantization aware training, which considers GPU can provide an extra speedup of 2:4 sparse calculation with integer tensors. A mixed-strategy knowledge distillation is used during the pruning and quantization process. The proposed compression scheme is flexible to support supervised and unsupervised learning styles. Experiment results show GPUSQ-ViT scheme achieves state-of-the-art compression by reducing vision transformer models 6.4-12.7 times on model size and 30.3-62 times on FLOPs with negligible accuracy degradation on ImageNet classification, COCO detection and ADE20K segmentation benchmarking tasks. Moreover, GPUSQ-ViT can boost actual deployment performance by 1.39-1.79 times and 3.22-3.43 times of latency and throughput on A100 GPU, and 1.57-1.69 times and 2.11-2.51 times improvement of latency and throughput on AGX Orin.
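The 2:4 fine-grained structured sparsity pattern keeps the two largest-magnitude weights in every group of four consecutive weights, which is the pattern GPU sparse tensor cores accelerate. A minimal numpy sketch of this pruning step (the helper name and the magnitude criterion are illustrative, not the paper's exact procedure):

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Zero out the two smallest-magnitude weights in every group of four."""
    flat = weights.reshape(-1, 4)                  # groups of 4 along the last axis
    keep = np.argsort(np.abs(flat), axis=1)[:, 2:] # indices of the two largest |w|
    mask = np.zeros_like(flat, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (flat * mask).reshape(weights.shape)

w = np.array([[0.9, -0.1, 0.05, -0.7],
              [0.2, 0.3, -0.4, 0.1]])
sparse_w = prune_2_4(w)
# Every group of four now has exactly two non-zeros, so the weight tensor
# matches the 2:4 pattern that the hardware can skip over.
```

In the paper's pipeline this pruning step is followed by distillation and quantization-aware training; the sketch only shows the sparsity pattern itself.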

Bilateral Memory Consolidation for Continual Learning
Nie, Xing and Xu, Shixiong and Liu, Xiyan and Meng, Gaofeng and Huo, Chunlei and Xiang, Shiming



Research question: How to improve a deep model's memory over long task sequences and prevent catastrophic forgetting.
Motivation: Humans can continuously acquire and integrate new knowledge, whereas deep models lose past memories when tackling long task sequences.
Method: A new Bilateral Memory Consolidation (BiMeCo) framework decouples model parameters into a short-term memory module and a long-term memory module, and strengthens the dynamic interaction between the two via knowledge distillation and momentum-based updating, forming generic knowledge that prevents forgetting.
Results: Experiments show BiMeCo significantly improves existing continual learning methods; e.g., combined with the state-of-the-art CwD method on CIFAR-100 with ResNet-18, BiMeCo brings gains of around 2% to 6% while using 2x fewer parameters.

Humans are proficient at continuously acquiring and integrating new knowledge. By contrast, deep models forget catastrophically, especially when tackling highly long task sequences. Inspired by the way our brains constantly rewrite and consolidate past recollections, we propose a novel Bilateral Memory Consolidation (BiMeCo) framework that focuses on enhancing memory interaction capabilities. Specifically, BiMeCo explicitly decouples model parameters into short-term memory module and long-term memory module, responsible for representation ability of the model and generalization over all learned tasks, respectively. BiMeCo encourages dynamic interactions between two memory modules by knowledge distillation and momentum-based updating for forming generic knowledge to prevent forgetting. The proposed BiMeCo is parameter-efficient and can be integrated into existing methods seamlessly. Extensive experiments on challenging benchmarks show that BiMeCo significantly improves the performance of existing continual learning methods. For example, combined with the state-of-the-art method CwD, BiMeCo brings in significant gains of around 2% to 6% while using 2x fewer parameters on CIFAR-100 under ResNet-18.

FlexiViT: One Model for All Patch Sizes
Beyer, Lucas and Izmailov, Pavel and Kolesnikov, Alexander and Caron, Mathilde and Kornblith, Simon and Zhai, Xiaohua and Minderer, Matthias and Tschannen, Michael and Alabdulmohsin, Ibrahim and Pavetic, Filip



Research question: How to train a Vision Transformer (ViT) that works across different compute budgets.
Motivation: Changing a ViT's patch size trades off speed and accuracy, but normally requires retraining the model.
Method: Randomize the patch size at training time, so that a single set of weights adapts to different compute budgets.
Results: Extensive evaluation shows this approach matches, and sometimes outperforms, standard ViT models trained at a single patch size, providing an easy way to add compute-adaptive capabilities to most models relying on a ViT backbone.

Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively evaluate the resulting model, which we call FlexiViT, on a wide range of tasks, including classification, image-text retrieval, open-world detection, panoptic segmentation, and semantic segmentation, concluding that it usually matches, and sometimes outperforms, standard ViT models trained at a single patch size in an otherwise identical setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that makes it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture. Code and pretrained models are available at github.com/google-research/big_vision.
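The core trick, sampling the patch size per training step so one set of weights serves many sequence lengths, can be sketched in a few lines. The `patchify` helper below is an illustrative stand-in for the paper's resizable patch embedding; it only shows how the patch size controls the token count (the compute knob):

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img: np.ndarray, p: int) -> np.ndarray:
    """Split an HxWxC image into flattened p x p patches (tokens)."""
    h, w, c = img.shape
    tokens = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return tokens.reshape(-1, p * p * c)

img = rng.standard_normal((32, 32, 3))
for _ in range(3):
    p = int(rng.choice([8, 16, 32]))  # sample a patch size per training step
    tokens = patchify(img, p)
    # sequence length is (32 // p) ** 2: fewer, bigger patches -> cheaper step
```

In FlexiViT the patch-embedding weights are resized to match the sampled patch size; here the per-step sampling is the only part shown.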

Wavelet Diffusion Models Are Fast and Scalable Image Generators
Phung, Hao and Dao, Quan and Tran, Anh



Research question: Diffusion models excel at high-fidelity image generation, but their slow training and inference are the main obstacle to real-time applications.
Motivation: To address the slow running speed of diffusion models, this paper proposes a wavelet-based diffusion scheme.
Method: Wavelet decomposition extracts low- and high-frequency components at both the image and feature levels, and these components are handled adaptively for faster processing while maintaining good generation quality. A reconstruction term is also proposed to effectively improve training convergence.
Results: Experiments on CelebA-HQ, CIFAR-10, LSUN-Church, and STL-10 show the method is an important step toward real-time, high-fidelity diffusion models.

Diffusion models are rising as a powerful solution for high-fidelity image generation, which exceeds GANs in quality in many circumstances. However, their slow training and inference speed is a huge bottleneck, blocking them from being used in real-time applications. A recent DiffusionGAN method significantly decreases the models' running time by reducing the number of sampling steps from thousands to several, but their speeds still largely lag behind the GAN counterparts. This paper aims to reduce the speed gap by proposing a novel wavelet-based diffusion scheme. We extract low-and-high frequency components from both image and feature levels via wavelet decomposition and adaptively handle these components for faster processing while maintaining good generation quality. Furthermore, we propose to use a reconstruction term, which effectively boosts the model training convergence. Experimental results on CelebA-HQ, CIFAR-10, LSUN-Church, and STL-10 datasets prove our solution is a stepping-stone to offering real-time and high-fidelity diffusion models. Our code and pre-trained checkpoints are available at https://github.com/VinAIResearch/WaveDiff.git.
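Wavelet decomposition splits an image into a half-resolution low-frequency band plus three detail bands, so the diffusion model can operate on 4x fewer spatial positions per subband. A single-level 2D Haar transform (a common choice, used here only to illustrate the kind of decomposition the paper builds on) looks like:

```python
import numpy as np

def haar_2d(x: np.ndarray):
    """One level of 2D Haar decomposition into low/high-frequency subbands."""
    a = x[0::2, 0::2]; b = x[0::2, 1::2]
    c = x[1::2, 0::2]; d = x[1::2, 1::2]
    ll = (a + b + c + d) / 2  # low-frequency approximation
    lh = (a - b + c - d) / 2  # horizontal detail
    hl = (a + b - c - d) / 2  # vertical detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh

x = np.arange(16.0).reshape(4, 4)
ll, lh, hl, hh = haar_2d(x)
# Each subband is half the resolution of the input; the transform is
# invertible, e.g. the top-left pixel of each 2x2 block is (ll+lh+hl+hh)/2.
```

The paper applies the decomposition at both the image and feature levels; this sketch shows only the image-level transform.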

NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers
Liu, Yijiang and Yang, Huanrui and Dong, Zhen and Keutzer, Kurt and Du, Li and Zhang, Shanghang



Research question: How to improve post-training quantization for vision transformers.
Motivation: The complicated architecture and high training cost of vision transformers motivate post-training quantization, but the heavy-tailed distribution of their activations hinders existing post-training quantization methods.
Method: This paper proposes NoisyQuant, a quantizer-agnostic enhancement for the post-training activation quantization of vision transformers: adding a fixed uniform noisy bias to the values being quantized can significantly reduce the quantization error under provable conditions.
Results: Extensive experiments show NoisyQuant greatly improves post-training quantization of vision transformers with minimal computation overhead. For linear uniform 6-bit activation quantization, it improves SOTA top-1 accuracy on ImageNet by up to 1.7%, 1.1%, and 0.5% for ViT, DeiT, and Swin Transformer respectively, matching or exceeding previous nonlinear, mixed-precision quantization.

The complicated architecture and high training cost of vision transformers urge the exploration of post-training quantization. However, the heavy-tailed distribution of vision transformer activations hinders the effectiveness of previous post-training quantization methods, even with advanced quantizer designs. Instead of tuning the quantizer to better fit the complicated activation distribution, this paper proposes NoisyQuant, a quantizer-agnostic enhancement for the post-training activation quantization performance of vision transformers. We make a surprising theoretical discovery that for a given quantizer, adding a fixed Uniform noisy bias to the values being quantized can significantly reduce the quantization error under provable conditions. Building on the theoretical insight, NoisyQuant achieves the first success on actively altering the heavy-tailed activation distribution with additive noisy bias to fit a given quantizer. Extensive experiments show NoisyQuant largely improves the post-training quantization performance of vision transformer with minimal computation overhead. For instance, on linear uniform 6-bit activation quantization, NoisyQuant improves SOTA top-1 accuracy on ImageNet by up to 1.7%, 1.1% and 0.5% for ViT, DeiT, and Swin Transformer respectively, achieving on-par or even higher performance than previous nonlinear, mixed-precision quantization.
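The mechanism is simple to state: add a fixed noisy bias before quantization and subtract it again after dequantization, so the bias cancels out while reshaping what the quantizer sees. A minimal sketch with a plain linear uniform quantizer (illustrative only; the paper derives the conditions under which this provably lowers the quantization error, which this toy does not reproduce):

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, scale):
    """Plain linear uniform quantizer (round to the nearest grid point)."""
    return np.round(x / scale) * scale

def noisy_quantize(x, scale, noise):
    """NoisyQuant-style pass: add a fixed noisy bias before quantization,
    subtract the same bias after dequantization."""
    return quantize(x + noise, scale) - noise

x = rng.standard_normal(10_000)       # stand-in for an activation tensor
scale = 0.25
# sampled once at calibration time and reused for every input ("fixed")
noise = rng.uniform(-scale / 2, scale / 2, size=x.shape)

err_plain = np.mean((quantize(x, scale) - x) ** 2)
err_noisy = np.mean((noisy_quantize(x, scale, noise) - x) ** 2)
```

Because the bias is removed after dequantization, the per-element error of the noisy pass stays within half a quantization step, just like the plain pass.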

Video Compression With Entropy-Constrained Neural Representations
Gomes, Carlos and Azevedo, Roberto and Schroers, Christopher



Research question: How to encode videos as neural networks to enable new forms of video processing.
Motivation: Traditional video compression techniques still outperform recent neural video representation (NVR) methods.
Method: A novel convolutional architecture for video representation that better captures spatio-temporal information, together with a training strategy that jointly optimizes rate and distortion.
Results: Experiments on the UVG dataset show new state-of-the-art results for NVR-based video compression, and the method is the first NVR-based approach to outperform the commonly used HEVC benchmark.

Encoding videos as neural networks is a recently proposed approach that allows new forms of video processing. However, traditional techniques still outperform such neural video representation (NVR) methods for the task of video compression. This performance gap can be explained by the fact that current NVR methods: i) use architectures that do not efficiently obtain a compact representation of temporal and spatial information; and ii) minimize rate and distortion disjointly (first overfitting a network on a video and then using heuristic techniques such as post-training quantization or weight pruning to compress the model). We propose a novel convolutional architecture for video representation that better represents spatio-temporal information and a training strategy capable of jointly optimizing rate and distortion. All network and quantization parameters are jointly learned end-to-end, and the post-training operations used in previous works are unnecessary. We evaluate our method on the UVG dataset, achieving new state-of-the-art results for video compression with NVRs. Moreover, we deliver the first NVR-based video compression method that improves over the typically adopted HEVC benchmark (x265, disabled b-frames, "medium" preset), closing the gap to autoencoder-based video compression techniques.

A General Regret Bound of Preconditioned Gradient Method for DNN Training
Yong, Hongwei and Sun, Ying and Zhang, Lei



Research question: Adaptive learning rate methods for DNNs consider only the diagonal elements of the full preconditioning matrix, while full-matrix preconditioned gradient methods, despite a theoretically lower regret bound, are impractical for DNN training due to their high complexity.
Motivation: Present a general regret bound for constrained full-matrix preconditioned gradients, and show that the preconditioner's update formula can be derived by solving a cone-constrained optimization problem.
Method: By minimizing the upper bound of a guide function, a new DNN optimizer called AdaBK is developed, together with a series of techniques, including statistics updating, dampening, efficient matrix inverse root computation, and gradient amplitude preservation, that make AdaBK effective and efficient to implement.
Results: AdaBK can be readily embedded into many existing DNN optimizers such as SGDM and AdamW; the resulting SGDM_BK and AdamW_BK algorithms show significant improvements over existing DNN optimizers on benchmark vision tasks including image classification, object detection, and segmentation.

While adaptive learning rate methods, such as Adam, have achieved remarkable improvement in optimizing Deep Neural Networks (DNNs), they consider only the diagonal elements of the full preconditioned matrix. Though the full-matrix preconditioned gradient methods theoretically have a lower regret bound, they are impractical for use to train DNNs because of the high complexity. In this paper, we present a general regret bound with a constrained full-matrix preconditioned gradient and show that the updating formula of the preconditioner can be derived by solving a cone-constrained optimization problem. With the block-diagonal and Kronecker-factorized constraints, a specific guide function can be obtained. By minimizing the upper bound of the guide function, we develop a new DNN optimizer, termed AdaBK. A series of techniques, including statistics updating, dampening, efficient matrix inverse root computation, and gradient amplitude preservation, are developed to make AdaBK effective and efficient to implement. The proposed AdaBK can be readily embedded into many existing DNN optimizers, e.g., SGDM and AdamW, and the corresponding SGDM_BK and AdamW_BK algorithms demonstrate significant improvements over existing DNN optimizers on benchmark vision tasks, including image classification, object detection and segmentation. The source code will be made publicly available.

Temporal Interpolation Is All You Need for Dynamic Neural Radiance Fields
Park, Sungheon and Son, Minjung and Jang, Seokhwan and Ahn, YoungChun and Kim, Ji-Yeon and Kang, Nahyup



Research question: How to train meaningful spatiotemporal representations of dynamic scenes.
Motivation: Temporal interpolation plays a key role in learning meaningful representations of dynamic scenes.
Method: A novel method trains spatiotemporal neural radiance fields of dynamic scenes based on temporal interpolation of feature vectors, with two feature interpolation variants depending on the underlying representation: neural networks or grids. In the neural representation, features are extracted from space-time inputs via multiple neural network modules and interpolated across time frames; the proposed multi-level feature interpolation network effectively captures both short-term and long-term temporal ranges. In the grid representation, space-time features are learned via four-dimensional hash grids, which dramatically reduces training time: more than 100x faster than previous neural-network-based methods while maintaining rendering quality. Concatenating static and dynamic features and adding a simple smoothness term further improve performance.
Results: Despite the simple model architectures, the method achieves state-of-the-art performance both in rendering quality for the neural representation and in training speed for the grid representation.

Temporal interpolation often plays a crucial role to learn meaningful representations in dynamic scenes. In this paper, we propose a novel method to train spatiotemporal neural radiance fields of dynamic scenes based on temporal interpolation of feature vectors. Two feature interpolation methods are suggested depending on underlying representations, neural networks or grids. In the neural representation, we extract features from space-time inputs via multiple neural network modules and interpolate them based on time frames. The proposed multi-level feature interpolation network effectively captures features of both short-term and long-term time ranges. In the grid representation, space-time features are learned via four-dimensional hash grids, which remarkably reduces training time. The grid representation shows more than 100 times faster training speed than the previous neural-net-based methods while maintaining the rendering quality. Concatenating static and dynamic features and adding a simple smoothness term further improve the performance of our proposed models. Despite the simplicity of the model architectures, our method achieved state-of-the-art performance both in rendering quality for the neural representation and in training speed for the grid representation.
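Feature-vector temporal interpolation can be illustrated in a few lines: features stored at keyframe times are blended linearly at the query time. This is a simplified sketch (the paper interpolates at multiple feature levels inside the network, not raw vectors at a single level):

```python
import numpy as np

def interpolate_feature(keyframe_feats: np.ndarray, key_times: np.ndarray,
                        t: float) -> np.ndarray:
    """Linearly blend the two keyframe feature vectors that bracket time t."""
    i = np.searchsorted(key_times, t, side="right") - 1
    i = int(np.clip(i, 0, len(key_times) - 2))
    t0, t1 = key_times[i], key_times[i + 1]
    w = (t - t0) / (t1 - t0)
    return (1 - w) * keyframe_feats[i] + w * keyframe_feats[i + 1]

key_times = np.array([0.0, 0.5, 1.0])
feats = np.array([[0.0, 0.0], [1.0, 2.0], [0.0, 4.0]])  # one vector per keyframe
f = interpolate_feature(feats, key_times, 0.25)  # halfway between keyframes 0 and 1
```

The interpolated feature is then decoded (here, it would feed the NeRF MLP) so that gradients shape the keyframe features for the whole time range between them.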

PlenVDB: Memory Efficient VDB-Based Radiance Fields for Fast Training and Rendering
Yan, Han and Liu, Celong and Ma, Chao and Mei, Xing



Research question: How to design a memory-efficient representation for neural radiance fields that accelerates both training and rendering.
Motivation: VDB, a hierarchical data structure for sparse volumes, combines the advantages of sparse and dense volumes for compact data representation and efficient data access, making it well suited to NeRF data interpolation and ray marching.
Method: The proposed Plenoptic VDB (PlenVDB) directly learns the VDB data structure from a set of posed images via a novel training strategy, and then uses it for real-time rendering.
Results: Experiments show PlenVDB converges faster in training, delivers a more compact data format for NeRF data, and renders more efficiently on commodity graphics hardware; the mobile demo achieves 30+ FPS at 1280x720 on an iPhone 12.

In this paper, we present a new representation for neural radiance fields that accelerates both the training and the inference processes with VDB, a hierarchical data structure for sparse volumes. VDB takes both the advantages of sparse and dense volumes for compact data representation and efficient data access, being a promising data structure for NeRF data interpolation and ray marching. Our method, Plenoptic VDB (PlenVDB), directly learns the VDB data structure from a set of posed images by means of a novel training strategy and then uses it for real-time rendering. Experimental results demonstrate the effectiveness and the efficiency of our method over previous arts: First, it converges faster in the training process. Second, it delivers a more compact data format for NeRF data presentation. Finally, it renders more efficiently on commodity graphics hardware. Our mobile PlenVDB demo achieves 30+ FPS, 1280x720 resolution on an iPhone12 mobile phone. Check plenvdb.github.io for details.

Tangentially Elongated Gaussian Belief Propagation for Event-Based Incremental Optical Flow Estimation
Nagata, Jun and Sekikawa, Yusuke



Research question: How to realize low-latency optical flow estimation.
Motivation: Existing methods cannot achieve low latency and full optical flow estimation at the same time.
Method: A tangentially elongated Gaussian (TEG) belief propagation (BP) method that incrementally estimates the full flow via message passing and Bayesian inference.
Results: On real-world datasets, TEGBP outperforms the state-of-the-art incremental quasi-full-flow method by a large margin; the code will be open-sourced upon acceptance.

Optical flow estimation is a fundamental functionality in computer vision. An event-based camera, which asynchronously detects sparse intensity changes, is an ideal device for realizing low-latency estimation of the optical flow owing to its low-latency sensing mechanism. An existing method using local plane fitting of events could utilize the sparsity to realize incremental updates for low-latency estimation; however, its output is merely a normal component of the full optical flow. An alternative approach using a frame-based deep neural network could estimate the full flow; however, its intensive non-incremental dense operation prohibits the low-latency estimation. We propose tangentially elongated Gaussian (TEG) belief propagation (BP) that realizes incremental full-flow estimation. We model the probability of full flow as the joint distribution of TEGs from the normal flow measurements, such that the marginal of this distribution with correct prior equals the full flow. We formulate the marginalization using a message-passing based on the BP to realize efficient incremental updates using sparse measurements. In addition to the theoretical justification, we evaluate the effectiveness of the TEGBP in real-world datasets; it outperforms SOTA incremental quasi-full flow method by a large margin. The code will be open-sourced upon acceptance.

Fair Scratch Tickets: Finding Fair Sparse Networks Without Weight Training
Tang, Pengwei and Yao, Wei and Li, Zhicong and Liu, Yong



Research question: This paper addresses potential fairness issues in computer vision models.
Motivation: Extensive work seeks to mitigate unfairness in computer vision via pre-processing, in-processing, and post-processing methods.
Method: This paper proposes a novel in-processing fairness-aware learning paradigm through the lens of the lottery ticket hypothesis (LTH) in the context of computer vision fairness: randomly initialize a dense neural network and find appropriate binary weight masks that yield fair sparse subnetworks, without any weight training.
Results: Experiments show that such sparse subnetworks with inborn fairness exist in randomly initialized networks, achieving an accuracy-fairness trade-off comparable to dense networks trained with existing fairness-aware in-processing approaches. These fair subnetworks are termed Fair Scratch Tickets (FSTs), with theoretical fairness and accuracy guarantees. Experiments investigate the existence of FSTs across various datasets, target attributes, random initialization methods, sparsity patterns, and fairness surrogates, and show that FSTs transfer across datasets.

Recent studies suggest that computer vision models come at the risk of compromising fairness. There are extensive works to alleviate unfairness in computer vision using pre-processing, in-processing, and post-processing methods. In this paper, we lead a novel fairness-aware learning paradigm for in-processing methods through the lens of the lottery ticket hypothesis (LTH) in the context of computer vision fairness. We randomly initialize a dense neural network and find appropriate binary masks for the weights to obtain fair sparse subnetworks without any weight training. Interestingly, to the best of our knowledge, we are the first to discover that such sparse subnetworks with inborn fairness exist in randomly initialized networks, achieving an accuracy-fairness trade-off comparable to that of dense neural networks trained with existing fairness-aware in-processing approaches. We term these fair subnetworks as Fair Scratch Tickets (FSTs). We also theoretically provide fairness and accuracy guarantees for them. In our experiments, we investigate the existence of FSTs on various datasets, target attributes, random initialization methods, sparsity patterns, and fairness surrogates. We also find that FSTs can transfer across datasets and investigate other properties of FSTs.

The Resource Problem of Using Linear Layer Leakage Attack in Federated Learning
Zhao, Joshua C. and Elkordy, Ahmed Roushdy and Sharma, Atul and Ezzeldin, Yahya H. and Avestimehr, Salman and Bagchi, Saurabh



Research question: How do linear layer leakage attacks behave under secure aggregation in federated learning, where the server only has access to the decrypted aggregate update?
Motivation: Linear layer leakage methods are the only data reconstruction attacks that scale and achieve a high leakage rate, but their resource overhead grows with the number of clients.
Method: The leakage rate is raised by increasing the size of the injected fully-connected layer; by viewing the aggregate as a combination of multiple individual updates, sparsity can be applied to alleviate the resource overhead.
Results: Compared with the state of the art, using sparsity reduces the model size overhead by over 327x and the computation time by 3.34x while maintaining the same total leakage rate, 77% even with 1000 clients in aggregation.

Secure aggregation promises a heightened level of privacy in federated learning, maintaining that a server only has access to a decrypted aggregate update. Within this setting, linear layer leakage methods are the only data reconstruction attacks able to scale and achieve a high leakage rate regardless of the number of clients or batch size. This is done through increasing the size of an injected fully-connected (FC) layer. We show that this results in a resource overhead which grows larger with an increasing number of clients. We show that this resource overhead is caused by an incorrect perspective in all prior work that treats an attack on an aggregate update in the same way as an individual update with a larger batch size. Instead, by attacking the update from the perspective that aggregation is combining multiple individual updates, this allows the application of sparsity to alleviate resource overhead. We show that the use of sparsity can decrease the model size overhead by over 327x and the computation time by 3.34x compared to SOTA while maintaining equivalent total leakage rate, 77% even with 1000 clients in aggregation.

Auto-CARD: Efficient and Robust Codec Avatar Driving for Real-Time Mobile Telepresence
Fu, Yonggan and Li, Yuecheng and Li, Chenghui and Saragih, Jason and Zhang, Peizhao and Dai, Xiaoliang and Lin, Yingyan (Celine)



Research question: How to enable real-time, robust, photorealistic avatar driving in AR/VR by reducing computational redundancy.
Motivation: Real-time, high-fidelity avatar driving in AR/VR is in strong demand, but the high computational cost and device constraints are the main bottleneck.
Method: A framework called Auto-CARD minimizes two sources of redundancy: first, a dedicated neural architecture search technique, AVE-NAS, is developed for avatar encoding in AR/VR; second, a mechanism called LATEX exploits the temporal redundancy of consecutively rendered images to skip the computation of redundant frames.
Results: In real-time Codec Avatar driving on a Meta Quest 2, Auto-CARD achieves a 5.05x speed-up while maintaining animation quality comparable to or better than state-of-the-art avatar encoder designs.

Real-time and robust photorealistic avatars for telepresence in AR/VR have been highly desired for enabling immersive photorealistic telepresence. However, there still exists one key bottleneck: the considerable computational expense needed to accurately infer facial expressions captured from headset-mounted cameras with a quality level that can match the realism of the avatar's human appearance. To this end, we propose a framework called Auto-CARD, which for the first time enables real-time and robust driving of Codec Avatars when exclusively using merely on-device computing resources. This is achieved by minimizing two sources of redundancy. First, we develop a dedicated neural architecture search technique called AVE-NAS for avatar encoding in AR/VR, which explicitly boosts both the searched architectures' robustness in the presence of extreme facial expressions and hardware friendliness on fast evolving AR/VR headsets. Second, we leverage the temporal redundancy in consecutively captured images during continuous rendering and develop a mechanism dubbed LATEX to skip the computation of redundant frames. Specifically, we first identify an opportunity from the linearity of the latent space derived by the avatar decoder and then propose to perform adaptive latent extrapolation for redundant frames. For evaluation, we demonstrate the efficacy of our Auto-CARD framework in real-time Codec Avatar driving settings, where we achieve a 5.05x speed-up on Meta Quest 2 while maintaining a comparable or even better animation quality than state-of-the-art avatar encoder designs.

Mod-Squad: Designing Mixtures of Experts As Modular Multi-Task Learners
Chen, Zitian and Shen, Yikang and Ding, Mingyu and Chen, Zhenfang and Zhao, Hengshuang and Learned-Miller, Erik G. and Gan, Chuang



Research question: Optimization in multi-task learning (MTL) is more challenging than in single-task learning (STL), because gradients from different tasks can be contradictory.
Motivation: When tasks are related, sharing some parameters can be beneficial (cooperation), but some tasks require additional parameters with expertise in a specific type of data or discrimination (specialization).
Method: We propose Mod-Squad, a new model modularized into groups of experts. This structure lets us formalize cooperation and specialization as the process of matching experts and tasks, optimized during the training of a single model. Specifically, mixture-of-experts (MoE) layers are incorporated into a transformer model, with a new loss that captures the mutual dependence between tasks and experts.
Results: Experiments show the superiority of the approach; for each task, the small set of activated experts can be extracted as a standalone model that maintains the same performance as the large model.

Optimization in multi-task learning (MTL) is more challenging than single-task learning (STL), as the gradient from different tasks can be contradictory. When tasks are related, it can be beneficial to share some parameters among them (cooperation). However, some tasks require additional parameters with expertise in a specific type of data or discrimination (specialization). To address the MTL challenge, we propose Mod-Squad, a new model that is Modularized into groups of experts (a 'Squad'). This structure allows us to formalize cooperation and specialization as the process of matching experts and tasks. We optimize this matching process during the training of a single model. Specifically, we incorporate mixture of experts (MoE) layers into a transformer model, with a new loss that incorporates the mutual dependence between tasks and experts. As a result, only a small set of experts are activated for each task. This prevents the sharing of the entire backbone model between all tasks, which strengthens the model, especially when the training set size and the number of tasks scale up. More interestingly, for each task, we can extract the small set of experts as a standalone model that maintains the same performance as the large model. Extensive experiments on the Taskonomy dataset with 13 vision tasks and the PASCALContext dataset with 5 vision tasks show the superiority of our approach. The project page can be accessed at https://vis-www.cs.umass.edu/mod-squad.

Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks
Chen, Jierun and Kao, Shiu-hong and He, Hao and Zhuo, Weipeng and Wen, Song and Lee, Chul-Ho and Chan, S.-H. Gary



Research question: How to design faster neural networks.
Motivation: Reducing the number of floating-point operations (FLOPs) does not necessarily reduce latency, because inefficient operators yield low floating-point operations per second (FLOPS).
Method: Revisiting popular operators shows that their low FLOPS mainly stems from frequent memory access, especially in depthwise convolution. A novel partial convolution (PConv) is therefore proposed to extract spatial features more efficiently by cutting down redundant computation and memory access simultaneously. Built on PConv, FasterNet is a new family of neural networks that runs substantially faster on a wide range of devices while maintaining accuracy on various vision tasks.
Results: For example, on ImageNet-1k, the tiny FasterNet-T0 is 2.8x, 3.3x, and 2.4x faster than MobileViT-XXS on GPU, CPU, and ARM processors respectively, while being 2.9% more accurate. The large FasterNet-L achieves an impressive 83.5% top-1 accuracy, on par with the emerging Swin-B, with 36% higher inference throughput on GPU and 37% less compute time on CPU. Code is available at https://github.com/JierunChen/FasterNet.

To design fast neural networks, many works have been focusing on reducing the number of floating-point operations (FLOPs). We observe that such reduction in FLOPs, however, does not necessarily lead to a similar level of reduction in latency. This mainly stems from inefficiently low floating-point operations per second (FLOPS). To achieve faster networks, we revisit popular operators and demonstrate that such low FLOPS is mainly due to frequent memory access of the operators, especially the depthwise convolution. We hence propose a novel partial convolution (PConv) that extracts spatial features more efficiently, by cutting down redundant computation and memory access simultaneously. Building upon our PConv, we further propose FasterNet, a new family of neural networks, which attains substantially higher running speed than others on a wide range of devices, without compromising on accuracy for various vision tasks. For example, on ImageNet-1k, our tiny FasterNet-T0 is 2.8x, 3.3x, and 2.4x faster than MobileViT-XXS on GPU, CPU, and ARM processors, respectively, while being 2.9% more accurate. Our large FasterNet-L achieves impressive 83.5% top-1 accuracy, on par with the emerging Swin-B, while having 36% higher inference throughput on GPU, as well as saving 37% compute time on CPU. Code is available at https://github.com/JierunChen/FasterNet.
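Partial convolution applies a regular convolution to only the first few channels and passes the rest through untouched, so both FLOPs and memory traffic scale with the convolved fraction. A minimal numpy sketch (the einsum-based 3x3 conv is an illustrative stand-in for an optimized kernel):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def pconv(x: np.ndarray, weight: np.ndarray, n_conv: int) -> np.ndarray:
    """Partial convolution: 3x3 conv on the first n_conv channels only,
    identity on the remaining channels."""
    pad = np.pad(x[:n_conv], ((0, 0), (1, 1), (1, 1)))       # same-size output
    win = sliding_window_view(pad, (3, 3), axis=(1, 2))      # (n_conv, H, W, 3, 3)
    out = np.einsum("ihwkl,oikl->ohw", win, weight)          # conv over the slice
    return np.concatenate([out, x[n_conv:]], axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8))         # (channels, H, W)
weight = rng.standard_normal((4, 4, 3, 3))  # conv 1/4 of the channels: 4 in -> 4 out
y = pconv(x, weight, n_conv=4)
# Only 4 of the 16 channels incur conv FLOPs and memory access.
```

In FasterNet a pointwise convolution afterwards mixes information across all channels; the sketch shows only the PConv step itself.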

Adaptive Plasticity Improvement for Continual Learning
Liang, Yan-Shuo and Li, Wu-Jun



Research question: This paper addresses catastrophic forgetting in continual learning while also evaluating and improving the model's plasticity on new tasks.
Motivation: Existing methods for overcoming forgetting may damage the model's plasticity for new tasks.
Method: A new method called adaptive plasticity improvement evaluates the model's plasticity and adaptively improves it for learning a new task.
Results: Experiments show the method outperforms other state-of-the-art baselines in both accuracy and memory usage.

Many works have tried to solve the catastrophic forgetting (CF) problem in continual learning (lifelong learning). However, pursuing non-forgetting on old tasks may damage the model's plasticity for new tasks. Although some methods have been proposed to achieve stability-plasticity trade-off, no methods have considered evaluating a model's plasticity and improving plasticity adaptively for a new task. In this work, we propose a new method, called adaptive plasticity improvement (API), for continual learning. Besides the ability to overcome CF on old tasks, API also tries to evaluate the model's plasticity and then adaptively improve the model's plasticity for learning a new task if necessary. Experiments on several real datasets show that API can outperform other state-of-the-art baselines in terms of both accuracy and memory usage.

Towards Better Decision Forests: Forest Alternating Optimization
Carreira-Perpiñán, Miguel Á. and Gabidolla, Magzhan and Zharmagambetov, Arman



Research question: How to optimize decision forest models to improve their accuracy.
Motivation: Although decision forests are among the most accurate models in machine learning, the way they are trained is highly heuristic and hard to optimize.
Method: A new optimization algorithm, Forest Alternating Optimization, learns a forest by jointly optimizing a desired loss and regularization over all its trees and parameters.
Results: Experiments show the resulting forests consistently exceed the accuracy of the state of the art while using fewer, smaller trees.

Decision forests are among the most accurate models in machine learning. This is remarkable given that the way they are trained is highly heuristic: neither the individual trees nor the overall forest optimize any well-defined loss. While diversity mechanisms such as bagging or boosting have been until now critical in the success of forests, we think that a better optimization should lead to better forests---ideally eliminating any need for an ensembling heuristic. However, unlike for most other models, such as neural networks, optimizing forests or trees is not easy, because they define a non-differentiable function. We show, for the first time, that it is possible to learn a forest by optimizing a desirable loss and regularization jointly over all its trees and parameters. Our algorithm, Forest Alternating Optimization, is based on defining a forest as a parametric model with a fixed number of trees and structure (rather than adding trees indefinitely as in bagging or boosting). It then iteratively updates each tree in alternation so that the objective function decreases monotonically. The algorithm is so effective at optimizing that it easily overfits, but this can be corrected by averaging. The result is a forest that consistently exceeds the accuracy of the state-of-the-art while using fewer, smaller trees.

DA Wand: Distortion-Aware Selection Using Neural Mesh Parameterization
Liu, Richard and Aigerman, Noam and Kim, Vladimir G. and Hanocka, Rana



Research question: This paper proposes a neural technique for learning to select a local sub-region around a point for mesh parameterization.
Motivation: The framework is driven by interactive workflows for decaling, texturing, or painting on surfaces.
Method: The key idea is to learn a local parameterization in a data-driven manner, using a novel differentiable parameterization layer within a neural network framework. A segmentation network is trained to select 3D regions that are parameterized into 2D and penalized by the resulting distortion, yielding distortion-aware segmentations. After training, a user can interactively select a point on a mesh and obtain a large, meaningful region around it that induces a low-distortion parameterization.
Results: Experiments show the system parameterizes meshes effectively, and users obtain satisfactory results interactively.

We present a neural technique for learning to select a local sub-region around a point which can be used for mesh parameterization. The motivation for our framework is driven by interactive workflows used for decaling, texturing, or painting on surfaces. Our key idea is to learn a local parameterization in a data-driven manner, using a novel differentiable parameterization layer within a neural network framework. We train a segmentation network to select 3D regions that are parameterized into 2D and penalized by the resulting distortion, giving rise to segmentations which are distortion-aware. Following training, a user can use our system to interactively select a point on the mesh and obtain a large, meaningful region around the selection which induces a low-distortion parameterization. Our code and project page are publicly available.

Disentangled Representation Learning for Unsupervised Neural Quantization
Noh, Haechan and Hyun, Sangeek and Jeong, Woojin and Lim, Hanshin and Heo, Jae-Pil



Research question: Existing deep learning-based quantizers hardly benefit from the residual vector space the way conventional shallow quantizers do.
Motivation: To address this problem, we propose a novel disentangled representation learning method for unsupervised neural quantization.
Method: Analogous to the concept of residual vector space, our method disentangles the information of the inverted index from the vectors, enabling a more compact latent space.
Results: Experimental results on large-scale datasets confirm that our method outperforms state-of-the-art retrieval systems by a large margin.

The inverted index is a widely used data structure to avoid the infeasible exhaustive search. It accelerates retrieval significantly by splitting the database into multiple disjoint sets and restricts distance computation to a small fraction of the database. Moreover, it even improves search quality by allowing quantizers to exploit the compact distribution of residual vector space. However, we first point out that existing deep learning-based quantizers hardly benefit from the residual vector space, unlike conventional shallow quantizers. To cope with this problem, we introduce a novel disentangled representation learning for unsupervised neural quantization. Similar to the concept of residual vector space, the proposed method enables more compact latent space by disentangling information of the inverted index from the vectors. Experimental results on large-scale datasets confirm that our method outperforms the state-of-the-art retrieval systems by a large margin.
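
A minimal sketch of the two ideas the abstract builds on, the inverted index restricting search to one cell and residual vectors forming a more compact distribution than the raw vectors (this is a toy illustration with a one-step k-means coarse quantizer, not the paper's method):

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 8)).astype(np.float32)

# Coarse quantizer: one Lloyd (k-means) step from random data-point seeds.
K = 16
centroids = db[rng.choice(len(db), K, replace=False)]
assign = np.argmin(((db[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
centroids = np.stack([db[assign == k].mean(0) for k in range(K)])
assign = np.argmin(((db[:, None] - centroids[None]) ** 2).sum(-1), axis=1)

# Residuals cluster tightly around zero: by the law of total variance the
# within-cell (residual) variance is below the raw database variance, so a
# fixed-size codebook covers residuals with less quantization error.
residuals = db - centroids[assign]
assert residuals.var() < db.var()

# Query time: probe only the nearest cell instead of the whole database.
q = rng.normal(size=8).astype(np.float32)
cell = np.argmin(((q - centroids) ** 2).sum(-1))
candidates = np.where(assign == cell)[0]
assert len(candidates) < len(db)
```

The paper's contribution is to recover this residual-space benefit for deep (learned) quantizers by disentangling the inverted-index information from the vectors.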

Meta-Learning With a Geometry-Adaptive Preconditioner
Kang, Suhyun and Hwang, Duhun and Eo, Moonjung and Kim, Taesup and Rhee, Wonjong



Research question: How can the effectiveness of meta-learning algorithms be improved?
Motivation: Existing meta-learning algorithms such as MAML have limitations in their optimization structure that call for improvement.
Method: We propose a new algorithm named GAP, which meta-learns a preconditioner that depends on task-specific parameters and satisfies the Riemannian metric condition, thereby improving the inner-loop optimization.
Results: Experiments show that GAP outperforms the state-of-the-art MAML and PGD-MAML families on a variety of few-shot learning tasks.

Model-agnostic meta-learning (MAML) is one of the most successful meta-learning algorithms. It has a bi-level optimization structure where the outer-loop process learns a shared initialization and the inner-loop process optimizes task-specific weights. Although MAML relies on the standard gradient descent in the inner-loop, recent studies have shown that controlling the inner-loop's gradient descent with a meta-learned preconditioner can be beneficial. Existing preconditioners, however, cannot simultaneously adapt in a task-specific and path-dependent way. Additionally, they do not satisfy the Riemannian metric condition, which can enable the steepest descent learning with preconditioned gradient. In this study, we propose Geometry-Adaptive Preconditioned gradient descent (GAP) that can overcome the limitations in MAML; GAP can efficiently meta-learn a preconditioner that is dependent on task-specific parameters, and its preconditioner can be shown to be a Riemannian metric. Thanks to the two properties, the geometry-adaptive preconditioner is effective for improving the inner-loop optimization. Experiment results show that GAP outperforms the state-of-the-art MAML family and preconditioned gradient descent-MAML (PGD-MAML) family in a variety of few-shot learning tasks. Code is available at: https://github.com/Suhyun777/CVPR23-GAP.
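
A toy sketch of the preconditioned inner-loop step and of why the Riemannian metric (symmetric positive definite) condition matters: any SPD preconditioner turns the gradient into a descent direction. The construction M Mᵀ + εI and the quadratic loss below are illustrative assumptions, not GAP's actual parameterization:

```python
import numpy as np

rng = np.random.default_rng(1)

def loss(w):
    return 0.5 * float(w @ w)  # stand-in inner-loop objective

w = rng.normal(size=5)
grad = w  # gradient of 0.5 * ||w||^2

# A valid Riemannian metric must be symmetric positive definite; building it
# as M M^T + eps * I guarantees this for any raw (task-conditioned) matrix M.
M = rng.normal(size=(5, 5))
P = M @ M.T + 1e-3 * np.eye(5)
assert np.allclose(P, P.T)
assert np.all(np.linalg.eigvalsh(P) > 0)

# Preconditioned inner-loop update: w <- w - lr * P @ grad.
# Because P is SPD, this is a descent direction whenever grad != 0.
lr = 1e-2
w_new = w - lr * P @ grad
assert loss(w_new) < loss(w)
```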

Global Vision Transformer Pruning With Hessian-Aware Saliency
Yang, Huanrui and Yin, Hongxu and Shen, Maying and Molchanov, Pavlo and Li, Hai and Kautz, Jan



Research question: How can the computational cost of Transformer models be reduced?
Motivation: Transformer models achieve state-of-the-art results on many tasks, but their heuristically designed architectures incur huge computational costs during inference.
Method: Through the first systematic attempt at global structural pruning, parameters are redistributed, challenging the common design philosophy that all stacked blocks in a ViT model share uniform dimensions.
Results: Iterative pruning of the DeiT-Base model yields a new architecture family, NViT (Novel ViT), whose parameter redistribution uses parameters more efficiently. On ImageNet-1K, NViT-Base achieves a 2.6x FLOPs reduction, 5.1x parameter reduction, and 1.9x run-time speedup in a near-lossless manner. Smaller NViT variants achieve more than 1% accuracy gain at the same throughput, as well as a 3.3x parameter reduction over the SWIN-Small model, outperforming prior art by a large margin. Further analysis reveals the high prunability of ViT models, distinct sensitivities within a ViT block, and unique parameter distribution trends across stacked ViT blocks. These insights yield a simple yet effective parameter redistribution rule toward more efficient ViTs.

Transformers yield state-of-the-art results across many tasks. However, their heuristically designed architectures impose huge computational costs during inference. This work aims at challenging the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage, where we redistribute the parameters both across transformer blocks and between different structures within the block via the first systematic attempt on global structural pruning. Dealing with diverse ViT structural components, we derive a novel Hessian-based structural pruning criterion comparable across all layers and structures, with latency-aware regularization for direct latency reduction. Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently. On ImageNet-1K, NViT-Base achieves a 2.6x FLOPs reduction, 5.1x parameter reduction, and 1.9x run-time speedup over the DeiT-Base model in a near lossless manner. Smaller NViT variants achieve more than 1% accuracy gain at the same throughput of the DeiT Small/Tiny variants, as well as a lossless 3.3x parameter reduction over the SWIN-Small model. These results outperform prior art by a large margin. Further analysis is provided on the parameter redistribution insight of NViT, where we show the high prunability of ViT models, distinct sensitivity within ViT block, and unique parameter distribution trend across stacked ViT blocks. Our insights provide viability for a simple yet effective parameter redistribution rule towards more efficient ViTs for an off-the-shelf performance boost.

Efficient On-Device Training via Gradient Filtering
Yang, Yuedong and Li, Guihong and Marculescu, Radu



Research question: How to train models effectively on-device, particularly for edge AI.
Motivation: On-device training currently suffers from heavy computation and memory consumption, which hampers applications such as federated learning and continual learning.
Method: We propose a new gradient filtering approach that creates a special structure with fewer unique elements in the gradient map, significantly reducing the computational complexity and memory consumption of back-propagation during training.
Results: Extensive experiments on image classification and semantic segmentation with multiple CNN models (e.g., MobileNet, DeepLabV3, UPerNet) and devices (e.g., Raspberry Pi and Jetson Nano) demonstrate the effectiveness and broad applicability of the method. For example, compared with the state of the art, we achieve up to 19x speedup and 77.1% memory savings on ImageNet classification with only 0.1% accuracy loss. Moreover, the method is easy to implement and deploy; over 20x speedup and 90% energy savings are observed compared with highly optimized MKLDNN and CUDNN baselines on the NVIDIA Jetson Nano. The approach thus opens a promising new research direction for on-device training.

Despite its importance for federated learning, continuous learning and many other applications, on-device training remains an open problem for EdgeAI. The problem stems from the large number of operations (e.g., floating point multiplications and additions) and memory consumption required during training by the back-propagation algorithm. Consequently, in this paper, we propose a new gradient filtering approach which enables on-device CNN model training. More precisely, our approach creates a special structure with fewer unique elements in the gradient map, thus significantly reducing the computational complexity and memory consumption of back propagation during training. Extensive experiments on image classification and semantic segmentation with multiple CNN models (e.g., MobileNet, DeepLabV3, UPerNet) and devices (e.g., Raspberry Pi and Jetson Nano) demonstrate the effectiveness and wide applicability of our approach. For example, compared to SOTA, we achieve up to 19x speedup and 77.1% memory savings on ImageNet classification with only 0.1% accuracy loss. Finally, our method is easy to implement and deploy; over 20x speedup and 90% energy savings have been observed compared to highly optimized baselines in MKLDNN and CUDNN on NVIDIA Jetson Nano. Consequently, our approach opens up a new direction of research with a huge potential for on-device training.
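
A toy illustration of the "fewer unique elements in the gradient map" idea: replacing a gradient map by its patch averages leaves one repeated value per patch, so downstream back-propagation arithmetic can be shared. The patch-averaging rule below is an assumption for illustration, not necessarily the paper's exact filter:

```python
import numpy as np

def filter_gradient(g, r):
    """Approximate a gradient map by its r x r patch averages, so each patch
    holds one repeated value (fewer unique elements to propagate and store)."""
    h, w = g.shape
    assert h % r == 0 and w % r == 0
    patches = g.reshape(h // r, r, w // r, r).mean(axis=(1, 3))
    return np.repeat(np.repeat(patches, r, axis=0), r, axis=1)

rng = np.random.default_rng(0)
g = rng.normal(size=(8, 8)).astype(np.float32)
gf = filter_gradient(g, r=4)

assert gf.shape == g.shape
assert len(np.unique(gf)) <= 4           # one value per 4x4 patch
assert np.allclose(gf.mean(), g.mean())  # patch averaging preserves the mean
```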

Learning To Exploit the Sequence-Specific Prior Knowledge for Image Processing Pipelines Optimization
Qin, Haina and Han, Longfei and Xiong, Weihua and Wang, Juan and Ma, Wentao and Li, Bing and Hu, Weiming



Research question: How to optimize the hyperparameters of the image signal processing (ISP) pipeline to improve image quality.
Motivation: Existing ISP hyperparameter optimization methods ignore the sequential relationship among ISP modules and the similarity among parameters, leading to suboptimal results.
Method: We propose a sequential ISP hyperparameter prediction framework that exploits the sequential relationship within ISP modules and the similarity among parameters to guide the model's sequence process.
Results: The method is validated on object detection, image segmentation, and image quality tasks.

The hardware image signal processing (ISP) pipeline is the intermediate layer between the imaging sensor and the downstream application, processing the sensor signal into an RGB image. The ISP is less programmable and consists of a series of processing modules. Each processing module handles a subtask and contains a set of tunable hyperparameters. A large number of hyperparameters form a complex mapping with the ISP output. The industry typically relies on manual and time-consuming hyperparameter tuning by image experts, biased towards human perception. Recently, several automatic ISP hyperparameter optimization methods using downstream evaluation metrics come into sight. However, existing methods for ISP tuning treat the high-dimensional parameter space as a global space for optimization and prediction all at once without inducing the structure knowledge of ISP. To this end, we propose a sequential ISP hyperparameter prediction framework that utilizes the sequential relationship within ISP modules and the similarity among parameters to guide the model sequence process. We validate the proposed method on object detection, image segmentation, and image quality tasks.

Complexity-Guided Slimmable Decoder for Efficient Deep Video Compression
Hu, Zhihao and Xu, Dong



Research question: How can the efficiency of deep video compression be improved?
Motivation: The coding efficiency of current deep video compression methods leaves room for improvement.
Method: We propose the complexity-guided slimmable decoder (cgSlimDecoder) together with skip-adaptive entropy coding (SaEC). cgSlimDecoder introduces new channel width selection modules that automatically decide the optimal channel width of each slimmable convolution layer; the parameters of these modules and of the other decoder modules are learned directly by optimizing a complexity-rate-distortion objective. SaEC further accelerates entropy decoding in the motion and residual decoders by skipping entropy coding for elements that are already well predicted by the hyperprior network.
Results: Experiments show that both methods are general, can be readily incorporated into three widely used deep video codecs (DVC, FVC, and DCVC), and significantly improve coding efficiency with negligible performance drop.

In this work, we propose the complexity-guided slimmable decoder (cgSlimDecoder) in combination with skip-adaptive entropy coding (SaEC) for efficient deep video compression. Specifically, given the target complexity constraints, in our cgSlimDecoder, we introduce a set of new channel width selection modules to automatically decide the optimal channel width of each slimmable convolution layer. By optimizing the complexity-rate-distortion related objective function to directly learn the parameters of the newly introduced channel width selection modules and other modules in the decoder, our cgSlimDecoder can automatically allocate the optimal numbers of parameters for different types of modules (e.g., motion/residual decoder and the motion compensation network) and simultaneously support multiple complexity levels by using a single learnt decoder instead of multiple decoders. In addition, our proposed SaEC can further accelerate the entropy decoding procedure in both motion and residual decoders by simply skipping the entropy coding process for the elements in the encoded feature maps that are already well-predicted by the hyperprior network. As demonstrated in our comprehensive experiments, our newly proposed methods cgSlimDecoder and SaEC are general and can be readily incorporated into three widely used deep video codecs (i.e., DVC, FVC and DCVC) to significantly improve their coding efficiency with negligible performance drop.
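
A heavily simplified sketch of the SaEC skipping idea: elements the prior already predicts exactly need no entropy coding, and the decoder reconstructs them from the prediction alone. In this toy version the skip positions are assumed known to both encoder and decoder (in SaEC the hyperprior supplies that decision); the setup is illustrative, not the paper's codec:

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.integers(0, 8, size=100)   # quantized feature-map elements
pred = feat.copy()
miss = rng.choice(100, size=15, replace=False)
pred[miss] = (pred[miss] + 1) % 8     # the hyperprior mispredicts a few

# Skip-adaptive idea: only mismatching positions enter the entropy codec.
to_code = np.flatnonzero(feat != pred)
assert len(to_code) == 15             # exactly the corrupted positions

decoded = pred.copy()                 # decoder starts from the prediction...
decoded[to_code] = feat[to_code]      # ...and fills in the coded symbols
assert np.array_equal(decoded, feat)  # lossless, with 85% of symbols skipped
```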

Q-DETR: An Efficient Low-Bit Quantized Detection Transformer
Xu, Sheng and Li, Yanjing and Lin, Mingbao and Gao, Peng and Guo, Guodong and Lü, Jinhu and Zhang, Baochang



Research question: How to reduce the computation and memory requirements of a low-bit quantized DETR (Q-DETR) while preserving its performance.
Motivation: Deploying the detection transformer (DETR) on resource-constrained devices requires massive computation and memory. Quantization is a natural solution, but existing quantization methods cause a significant performance drop.
Method: We address this with distribution rectification distillation (DRD), which minimizes query information distortion and introduces a new foreground-aware query matching scheme to effectively transfer teacher information.
Results: Experiments show the method outperforms prior art; for example, a 4-bit Q-DETR can theoretically accelerate DETR with a ResNet-50 backbone by 6.6x and achieves 39.4% AP on the COCO dataset, only 2.6% below its real-valued counterpart.

The recent detection transformer (DETR) has advanced object detection, but its application on resource-constrained devices requires massive computation and memory resources. Quantization stands out as a solution by representing the network in low-bit parameters and operations. However, there is a significant performance drop when performing low-bit quantized DETR (Q-DETR) with existing quantization methods. We find that the bottlenecks of Q-DETR come from the query information distortion through our empirical analyses. This paper addresses this problem based on a distribution rectification distillation (DRD). We formulate our DRD as a bi-level optimization problem, which can be derived by generalizing the information bottleneck (IB) principle to the learning of Q-DETR. At the inner level, we conduct a distribution alignment for the queries to maximize the self-information entropy. At the upper level, we introduce a new foreground-aware query matching scheme to effectively transfer the teacher information to distillation-desired features to minimize the conditional information entropy. Extensive experimental results show that our method performs much better than prior arts. For example, the 4-bit Q-DETR can theoretically accelerate DETR with ResNet-50 backbone by 6.6x and achieve 39.4% AP, with only 2.6% performance gaps than its real-valued counterpart on the COCO dataset.

ERM-KTP: Knowledge-Level Machine Unlearning via Knowledge Transfer
Lin, Shen and Zhang, Xiaoyu and Chen, Chenyang and Chen, Xiaofeng and Susilo, Willy



Research question: Existing machine unlearning approaches are inefficient at removing learned data and suffer from serious security flaws.
Motivation: To address these problems, we define machine unlearning from the knowledge perspective and propose a knowledge-level unlearning method, ERM-KTP.
Method: We propose an entanglement-reduced mask (ERM) structure that reduces knowledge entanglement. Upon an unlearning request, our knowledge transfer and prohibition (KTP) method transfers the knowledge of non-target data points from the original model to the unlearned model while prohibiting the knowledge of the target data points.
Results: Experiments demonstrate that ERM-KTP is effective, efficient, high-fidelity, and scalable; thanks to the ERM structure and the crafted masks, the method is also interpretable.

Machine unlearning can fortify the privacy and security of machine learning applications. Unfortunately, the exact unlearning approaches are inefficient, and the approximate unlearning approaches are unsuitable for complicated CNNs. Moreover, the approximate approaches have serious security flaws because even unlearning completely different data points can produce the same contribution estimation as unlearning the target data points. To address the above problems, we try to define machine unlearning from the knowledge perspective, and we propose a knowledge-level machine unlearning method, namely ERM-KTP. Specifically, we propose an entanglement-reduced mask (ERM) structure to reduce the knowledge entanglement among classes during the training phase. When receiving the unlearning requests, we transfer the knowledge of the non-target data points from the original model to the unlearned model and meanwhile prohibit the knowledge of the target data points via our proposed knowledge transfer and prohibition (KTP) method. Finally, we will get the unlearned model as the result and delete the original model to accomplish the unlearning process. Especially, our proposed ERM-KTP is an interpretable unlearning method because the ERM structure and the crafted masks in KTP can explicitly explain the operation and the effect of unlearning data points. Extensive experiments demonstrate the effectiveness, efficiency, high fidelity, and scalability of the ERM-KTP unlearning method.

DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training
Chen, Yihao and Qi, Xianbiao and Wang, Jianan and Zhang, Lei



Research question: This paper addresses the excessive memory consumption of the contrastive loss when training contrastive learning models.
Motivation: Current contrastive models repeatedly compute gradients across all GPUs during training, incurring huge memory consumption.
Method: We propose DisCo-CLIP, which decomposes the contrastive loss and its gradient computation into two parts: intra-GPU gradients computed on the current GPU, and inter-GPU gradients collected via all_reduce from the other GPUs, avoiding redundant computation.
Results: DisCo-CLIP reduces the GPU memory consumption of the contrastive loss from O(B^2) to O(B^2/N), where B and N are the batch size and the number of GPUs used for training. The decomposition is mathematically equivalent to the original non-distributed contrastive loss, sacrificing no computational accuracy, and is particularly effective for large-batch CLIP training.

We propose DisCo-CLIP, a distributed memory-efficient CLIP training approach, to reduce the memory consumption of contrastive loss when training contrastive learning models. Our approach decomposes the contrastive loss and its gradient computation into two parts, one to calculate the intra-GPU gradients and the other to compute the inter-GPU gradients. According to our decomposition, only the intra-GPU gradients are computed on the current GPU, while the inter-GPU gradients are collected via all_reduce from other GPUs instead of being repeatedly computed on every GPU. In this way, we can reduce the GPU memory consumption of contrastive loss computation from O(B^2) to O(B^2 / N), where B and N are the batch size and the number of GPUs used for training. Such a distributed solution is mathematically equivalent to the original non-distributed contrastive loss computation, without sacrificing any computation accuracy. It is particularly efficient for large-batch CLIP training. For instance, DisCo-CLIP can enable contrastive training of a ViT-B/32 model with a batch size of 32K or 196K using 8 or 64 A100 40GB GPUs, compared with the original CLIP solution which requires 128 A100 40GB GPUs to train a ViT-B/32 model with a batch size of 32K.
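
The memory argument can be seen in a single-process toy simulation: each "GPU" only ever materializes its own (B/N) x B block of the similarity matrix, and the blocks concatenate to the exact full matrix, so nothing is approximated. This sketches only the sharding arithmetic, not the actual gradient all_reduce:

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, d = 8, 2, 4                 # global batch, simulated "GPUs", embed dim
feats = rng.normal(size=(B, d))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

# Non-distributed reference: the full B x B similarity matrix on one device.
full = feats @ feats.T

# DisCo-style sharding: GPU i holds B/N rows and computes a (B/N) x B block.
shards = [feats[i * (B // N):(i + 1) * (B // N)] @ feats.T for i in range(N)]
assert all(s.shape == (B // N, B) for s in shards)   # O(B^2 / N) per device
assert np.allclose(np.vstack(shards), full)          # exact, not approximate
```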

NIRVANA: Neural Implicit Representations of Videos With Adaptive Networks and Autoregressive Patch-Wise Modeling
Maiya, Shishira R. and Girish, Sharath and Ehrlich, Max and Wang, Hanyu and Lee, Kwot Sin and Poirson, Patrick and Wu, Pengxiang and Wang, Chen and Shrivastava, Abhinav



Research question: How to effectively exploit the temporal and spatial redundancy in videos for high-quality video compression.
Motivation: Existing video coding approaches do not fully exploit spatio-temporal redundancy, and their fixed architectures do not scale to long or high-resolution videos.
Method: We propose NIRVANA, which treats a video as groups of frames and fits a separate network to each group for patch-wise prediction. This design shares computation within each group, in both the spatial and temporal dimensions, reducing video encoding time. The representation is autoregressive: the network for the current group is initialized from the weights of the previous group's model. For further efficiency, network parameters are quantized during training, requiring no post-hoc pruning or quantization.
Results: On the UVG dataset, NIRVANA improves encoding quality from 37.36 to 37.70 (in PSNR) and speeds up encoding by 12x at the same compression rate. In contrast to prior video INR work, it adapts well to large resolutions and long videos, and is highly flexible and scalable. By adapting to videos with varying inter-frame motion, NIRVANA achieves variable bitrate compression, decodes 6x faster, and scales well with more GPUs, making it practical for various deployment scenarios.

Implicit Neural Representations (INR) have recently shown to be powerful tool for high-quality video compression. However, existing works are limiting as they do not explicitly exploit the temporal redundancy in videos, leading to a long encoding time. Additionally, these methods have fixed architectures which do not scale to longer videos or higher resolutions. To address these issues, we propose NIRVANA, which treats videos as groups of frames and fits separate networks to each group performing patch-wise prediction. This design shares computation within each group, in the spatial and temporal dimensions, resulting in reduced encoding time of the video. The video representation is modeled autoregressively, with networks fit on a current group initialized using weights from the previous group's model. To further enhance efficiency, we perform quantization of the network parameters during training, requiring no post-hoc pruning or quantization. When compared with previous works on the benchmark UVG dataset, NIRVANA improves encoding quality from 37.36 to 37.70 (in terms of PSNR) and the encoding speed by 12x, while maintaining the same compression rate. In contrast to prior video INR works which struggle with larger resolution and longer videos, we show that our algorithm is highly flexible and scales naturally due to its patch-wise and autoregressive designs. Moreover, our method achieves variable bitrate compression by adapting to videos with varying inter-frame motion. NIRVANA achieves 6x decoding speed and scales well with more GPUs, making it practical for various deployment scenarios.

Preserving Linear Separability in Continual Learning by Backward Feature Projection
Gu, Qiao and Shim, Dongsub and Shkurti, Florian



Research question: How to overcome catastrophic forgetting while balancing stability and plasticity when the model learns new tasks.
Motivation: Existing feature distillation methods reduce forgetting but neglect the plasticity of new features, hurting performance on new tasks.
Method: We propose Backward Feature Projection (BFP), which allows new features to change up to a learnable linear transformation of the old features, preserving the linear separability of old classes.
Results: Experiments show that BFP significantly boosts performance on new tasks and helps learn a better representation space, in which linear separability is well preserved and linear probing achieves high classification accuracy.

Catastrophic forgetting has been a major challenge in continual learning, where the model needs to learn new tasks with limited or no access to data from previously seen tasks. To tackle this challenge, methods based on knowledge distillation in feature space have been proposed and shown to reduce forgetting. However, most feature distillation methods directly constrain the new features to match the old ones, overlooking the need for plasticity. To achieve a better stability-plasticity trade-off, we propose Backward Feature Projection (BFP), a method for continual learning that allows the new features to change up to a learnable linear transformation of the old features. BFP preserves the linear separability of the old classes while allowing the emergence of new feature directions to accommodate new classes. BFP can be integrated with existing experience replay methods and boost performance by a significant margin. We also demonstrate that BFP helps learn a better representation space, in which linear separability is well preserved during continual learning and linear probing achieves high classification accuracy.
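
A toy illustration of why "matching up to a learnable linear map" is looser than plain feature distillation: if the new features are any invertible linear reorganization of the old ones (which preserves linear separability), the BFP-style loss can be driven to zero while the plain distillation loss cannot. The least-squares fit stands in for learning the map A; this is a sketch, not the paper's training setup:

```python
import numpy as np

rng = np.random.default_rng(0)
f_old = rng.normal(size=(32, 4))   # old-task features
R = rng.normal(size=(4, 4))        # an invertible drift of the backbone
f_new = f_old @ R                  # new features: linear transform of old

# Plain feature distillation pins f_new to f_old and penalizes the drift...
plain = float(((f_new - f_old) ** 2).mean())
assert plain > 1e-3

# ...while a BFP-style loss only asks that some learnable linear map A
# project new features back onto old ones: the least-squares A (= R^{-1}
# here) gives zero loss, so linear reorganizations are free.
A, *_ = np.linalg.lstsq(f_new, f_old, rcond=None)
bfp = float(((f_new @ A - f_old) ** 2).mean())
assert bfp < 1e-10
```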

Differentiable Architecture Search With Random Features
Zhang, Xuanyang and Li, Yonggang and Zhang, Xiangyu and Wang, Yongtao and Sun, Jian



Research question: Solving the performance collapse problem of differentiable architecture search (DARTS).
Motivation: DARTS offers high search efficiency and effectiveness among NAS techniques, but it suffers from performance collapse.
Method: We tackle the problem from two aspects. First, we investigate the expressive power of the supernet in DARTS and derive a new DARTS paradigm that trains only BatchNorm. Second, we theoretically find that random features dilute the auxiliary connection role of skip-connections in supernet optimization, letting the search algorithm select operations more fairly and thereby solving the collapse.
Results: Experiments show that RF-DARTS obtains 94.36% test accuracy on CIFAR-10 and a new state-of-the-art top-1 test error of 24.0% on ImageNet. RF-DARTS is also robust across three datasets (CIFAR-10, CIFAR-100, and SVHN) and four search spaces (S1-S4). RF-PCDARTS achieves even better results on ImageNet, namely 23.9% top-1 and 7.1% top-5 test error, surpassing representative single-path, training-free, and partial-channel methods searched directly on ImageNet.

Differentiable architecture search (DARTS) has significantly promoted the development of NAS techniques because of its high search efficiency and effectiveness but suffers from performance collapse. In this paper, we make efforts to alleviate the performance collapse problem for DARTS from two aspects. First, we investigate the expressive power of the supernet in DARTS and then derive a new setup of DARTS paradigm with only training BatchNorm. Second, we theoretically find that random features dilute the auxiliary connection role of skip-connection in supernet optimization and enable search algorithm focus on fairer operation selection, thereby solving the performance collapse problem. We instantiate DARTS and PC-DARTS with random features to build an improved version for each named RF-DARTS and RF-PCDARTS respectively. Experimental results show that RF-DARTS obtains 94.36% test accuracy on CIFAR-10 (which is the nearest optimal result in NAS-Bench-201), and achieves the newest state-of-the-art top-1 test error of 24.0% on ImageNet when transferring from CIFAR-10. Moreover, RF-DARTS performs robustly across three datasets (CIFAR-10, CIFAR-100, and SVHN) and four search spaces (S1-S4). Besides, RF-PCDARTS achieves even better results on ImageNet, that is, 23.9% top-1 and 7.1% top-5 test error, surpassing representative methods like single-path, training-free, and partial-channel paradigms directly searched on ImageNet.

Unified Pose Sequence Modeling
Foo, Lin Geng and Li, Tianjiao and Rahmani, Hossein and Ke, Qiuhong and Liu, Jun



Research question: How to unify pose-based human behavior understanding tasks such as action recognition, 3D pose estimation, and 3D early action prediction.
Motivation: Different pose-based tasks require different output data formats, which limits existing methods to task-specific network architectures for each task.
Method: We propose a novel Unified Pose Sequence (UPS) model that unifies the heterogeneous output formats of these tasks by treating text-based action labels and coordinate-based human poses as language sequences. Optimizing a single auto-regressive transformer then yields a unified output sequence that handles all of the above tasks.
Results: Extensive experiments on four different tasks show that the UPS model performs well on four popular behavior understanding benchmarks.

We propose a Unified Pose Sequence Modeling approach to unify heterogeneous human behavior understanding tasks based on pose data, e.g., action recognition, 3D pose estimation and 3D early action prediction. A major obstacle is that different pose-based tasks require different output data formats. Specifically, the action recognition and prediction tasks require class predictions as outputs, while 3D pose estimation requires a human pose output, which limits existing methods to leverage task-specific network architectures for each task. Hence, in this paper, we propose a novel Unified Pose Sequence (UPS) model to unify heterogeneous output formats for the aforementioned tasks by considering text-based action labels and coordinate-based human poses as language sequences. Then, by optimizing a single auto-regressive transformer, we can obtain a unified output sequence that can handle all the aforementioned tasks. Moreover, to avoid the interference brought by the heterogeneity between different tasks, a dynamic routing mechanism is also proposed to empower our UPS with the ability to learn which subsets of parameters should be shared among different tasks. To evaluate the efficacy of the proposed UPS, extensive experiments are conducted on four different tasks with four popular behavior understanding benchmarks.

Compression-Aware Video Super-Resolution
Wang, Yingwei and Isobe, Takashi and Jia, Xu and Tao, Xin and Lu, Huchuan and Tai, Yu-Wing



Research question: How to improve the quality of compressed videos stored on mobile devices or delivered over the Internet?
Motivation: Most existing video super-resolution (VSR) methods assume ideal inputs, leading to a large performance gap between experimental settings and real-world applications.
Method: We propose a novel, practical compression-aware video super-resolution model that adapts its enhancement process to the estimated compression level. A compression encoder models the compression level of the input frames, and a base VSR model is conditioned on the implicitly computed representation by inserting compression-aware modules. We further strengthen the VSR model by fully exploiting the meta data naturally embedded in compressed video streams during information fusion.
Results: Extensive experiments on compressed VSR benchmarks demonstrate the effectiveness and efficiency of the proposed method.

Videos stored on mobile devices or delivered on the Internet are usually in compressed format and are of various unknown compression parameters, but most video super-resolution (VSR) methods often assume ideal inputs resulting in large performance gap between experimental settings and real-world applications. In spite of a few pioneering works being proposed recently to super-resolve the compressed videos, they are not specially designed to deal with videos of various levels of compression. In this paper, we propose a novel and practical compression-aware video super-resolution model, which could adapt its video enhancement process to the estimated compression level. A compression encoder is designed to model compression levels of input frames, and a base VSR model is then conditioned on the implicitly computed representation by inserting compression-aware modules. In addition, we propose to further strengthen the VSR model by taking full advantage of meta data that is embedded naturally in compressed video streams in the procedure of information fusion. Extensive experiments are conducted to demonstrate the effectiveness and efficiency of the proposed method on compressed VSR benchmarks.

Regularization of Polynomial Networks for Image Recognition
Chrysos, Grigorios G. and Wang, Bohan and Deng, Jiankang and Cevher, Volkan



Research question: How to improve the interpretability of deep neural networks (DNNs) while maintaining their high performance?
Motivation: Although DNNs perform impressively across tasks, their black-box nature makes theoretical analysis difficult. Polynomial Networks (PNs) are an alternative with promising performance and improved interpretability, but they have yet to match strong DNN baselines.
Method: We introduce a new class of PNs that reach ResNet-level performance across six benchmarks, showing that strong regularization is critical for matching DNN performance. We also propose D-PolyNets, which achieve a higher degree of expansion than previously proposed PNs while maintaining similar performance.
Results: The new models help in understanding the role of elementwise activation functions; the source code is publicly available on GitHub.

Deep Neural Networks (DNNs) have obtained impressive performance across tasks, however they still remain as black boxes, e.g., hard to theoretically analyze. At the same time, Polynomial Networks (PNs) have emerged as an alternative method with a promising performance and improved interpretability but have yet to reach the performance of the powerful DNN baselines. In this work, we aim to close this performance gap. We introduce a class of PNs, which are able to reach the performance of ResNet across a range of six benchmarks. We demonstrate that strong regularization is critical and conduct an extensive study of the exact regularization schemes required to match performance. To further motivate the regularization schemes, we introduce D-PolyNets that achieve a higher-degree of expansion than previously proposed polynomial networks. D-PolyNets are more parameter-efficient while achieving a similar performance as other polynomial networks. We expect that our new models can lead to an understanding of the role of elementwise activation functions (which are no longer required for training PNs). The source code is available at https://github.com/grigorisg9gr/regularized_polynomials.

EfficientViT: Memory Efficient Vision Transformer With Cascaded Group Attention
Liu, Xinyu and Peng, Houwen and Zheng, Ningxin and Yang, Yuqing and Hu, Han and Yuan, Yixuan



Research question: This paper addresses the speed and efficiency limitations of vision transformers to meet the demands of real-time applications.
Motivation: Vision transformers have high model capacity, but their remarkable performance comes with heavy computational costs, making them unsuitable for real-time applications.
Method: We propose EfficientViT, a family of high-speed vision transformers. We design a new building block with a sandwich layout, using a single memory-bound MHSA between efficient FFN layers, which improves memory efficiency and enhances channel communication. A cascaded group attention module feeds attention heads with different splits of the full feature, which not only saves computation but also improves attention diversity.
Results: Experiments show that EfficientViT strikes a good speed-accuracy trade-off and outperforms existing efficient models. For example, EfficientViT-M5 surpasses MobileNetV3-Large by 1.9% in accuracy with 40.4% and 45.2% higher throughput on an Nvidia V100 GPU and an Intel Xeon CPU, respectively. Compared with the recent efficient model MobileViT-XXS, EfficientViT-M2 achieves 1.8% higher accuracy while running 5.8x/3.7x faster on GPU/CPU, and 7.4x faster when converted to the ONNX format.

Vision transformers have shown great success due to their high model capabilities. However, their remarkable performance is accompanied by heavy computation costs, which makes them unsuitable for real-time applications. In this paper, we propose a family of high-speed vision transformers named EfficientViT. We find that the speed of existing transformer models is commonly bounded by memory inefficient operations, especially the tensor reshaping and element-wise functions in MHSA. Therefore, we design a new building block with a sandwich layout, i.e., using a single memory-bound MHSA between efficient FFN layers, which improves memory efficiency while enhancing channel communication. Moreover, we discover that the attention maps share high similarities across heads, leading to computational redundancy. To address this, we present a cascaded group attention module feeding attention heads with different splits of the full feature, which not only saves computation cost but also improves attention diversity. Comprehensive experiments demonstrate EfficientViT outperforms existing efficient models, striking a good trade-off between speed and accuracy. For instance, our EfficientViT-M5 surpasses MobileNetV3-Large by 1.9% in accuracy, while getting 40.4% and 45.2% higher throughput on Nvidia V100 GPU and Intel Xeon CPU, respectively. Compared to the recent efficient model MobileViT-XXS, EfficientViT-M2 achieves 1.8% superior accuracy, while running 5.8x/3.7x faster on the GPU/CPU, and 7.4x faster when converted to ONNX format. Code and models will be available soon.
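
A minimal numpy sketch of the cascaded-group-attention shape arithmetic: each head attends over its own split of the feature, and each head's output is added to the next head's input split, so heads see different features and per-head computation shrinks. Projections and normalization are omitted; this is an illustrative skeleton, not the EfficientViT implementation:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
T, d, heads = 6, 8, 2
x = rng.normal(size=(T, d))

splits = np.split(x, heads, axis=1)        # each head gets d/heads channels
outs, carry = [], np.zeros_like(splits[0])
for h in range(heads):
    inp = splits[h] + carry                # cascade: add previous head output
    q, k, v = inp, inp, inp                # learned projections omitted
    attn = softmax(q @ k.T / np.sqrt(inp.shape[1]))
    carry = attn @ v
    outs.append(carry)

out = np.concatenate(outs, axis=1)
assert out.shape == x.shape                # output matches the input feature
```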

Simulated Annealing in Early Layers Leads to Better Generalization
Sarfi, Amir M. and Karimpour, Zahra and Chaudhary, Muawiz and Khalid, Nasir M. and Ravanelli, Mirco and Mudur, Sudhir and Belilovsky, Eugene



Research question: How to improve the generalization of deep learning models?
Motivation: Existing iterative learning methods improve generalization by training for longer periods, with limited effect.
Method: We propose SEAL, which applies simulated annealing to the early layers of the network in place of re-initializing the later layers.
Results: Experiments on the Tiny-ImageNet dataset show that the method outperforms the state-of-the-art LLF on the target task and also achieves better results on transfer learning and few-shot learning tasks.

Recently, a number of iterative learning methods have been introduced to improve generalization. These typically rely on training for longer periods of time in exchange for improved generalization. LLF (later-layer-forgetting) is a state-of-the-art method in this category. It strengthens learning in early layers by periodically re-initializing the last few layers of the network. Our principal innovation in this work is to use Simulated annealing in EArly Layers (SEAL) of the network in place of re-initialization of later layers. Essentially, later layers go through the normal gradient descent process, while the early layers go through short stints of gradient ascent followed by gradient descent. Extensive experiments on the popular Tiny-ImageNet dataset benchmark and a series of transfer learning and few-shot learning tasks show that we outperform LLF by a significant margin. We further show that, compared to normal training, LLF features, although improving on the target task, degrade the transfer learning performance across all datasets we explored. In comparison, our method outperforms LLF across the same target datasets by a large margin. We also show that the prediction depth of our method is significantly lower than that of LLF and normal training, indicating on average better prediction performance.
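
A toy training loop showing the SEAL-style schedule on a quadratic stand-in loss: the later "layers" run plain gradient descent throughout, while the early "layers" periodically take a short stint of gradient ascent (heating) before resuming descent (cooling). The schedule lengths and the loss are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
early = rng.normal(size=3)   # "early layer" parameters
late = rng.normal(size=3)    # "later layer" parameters

def loss(e, l):
    return 0.5 * float(e @ e + l @ l)

init_loss = loss(early, late)
lr, ascent_steps, total = 0.1, 2, 40
for step in range(total):
    ge, gl = early, late                 # gradients of the quadratic loss
    late = late - lr * gl                # later layers: plain descent
    if step % 10 < ascent_steps:
        early = early + lr * ge          # short gradient-ASCENT stint
    else:
        early = early - lr * ge          # then resume descent

# Despite the periodic ascent stints, training still makes net progress.
assert loss(early, late) < init_loss
```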

Integral Neural Networks
Solodskikh, Kirill and Kurbanov, Azim and Aydarkhanov, Ruslan and Zhelavskaya, Irina and Parfenov, Yury and Song, Dehua and Lefkimmiatis, Stamatios



Research question: How to train deep neural networks with continuous layer representations and integral operations, enabling structural pruning while preserving performance.
Motivation: Conventional discrete networks lose substantial performance under structural pruning, whereas the proposed Integral Neural Networks (INNs) sustain high pruning rates without fine-tuning at nearly the same performance.
Method: INNs use continuous layer representations and integral operations to define their weight functions, replacing the discrete transformations of layer inputs with continuous integration operations. At the inference stage, continuous layers can be converted into conventional tensor representations via numerical integration quadratures. This allows discretizing the network to arbitrary sizes with various discretization intervals for the integral kernels.
Results: Experiments show that INNs match the performance of their conventional discrete counterparts on multiple tasks. Without fine-tuning, INNs preserve nearly the same performance at high structural pruning rates (up to 30%, with roughly 2% accuracy loss for ResNet18 on ImageNet), whereas conventional pruning methods lose 65% accuracy under the same conditions.

We introduce a new family of deep neural networks. Instead of the conventional representation of network layers as N-dimensional weight tensors, we use continuous layer representation along the filter and channel dimensions. We call such networks Integral Neural Networks (INNs). In particular, the weights of INNs are represented as continuous functions defined on N-dimensional hypercubes, and the discrete transformations of inputs to the layers are replaced by continuous integration operations, accordingly. During the inference stage, our continuous layers can be converted into the traditional tensor representation via numerical integral quadratures. Such kind of representation allows the discretization of a network to an arbitrary size with various discretization intervals for the integral kernels. This approach can be applied to prune the model directly on the edge device while featuring only a small performance loss at high rates of structural pruning without any fine-tuning. To evaluate the practical benefits of our proposed approach, we have conducted experiments using various neural network architectures for multiple tasks. Our reported results show that the proposed INNs achieve the same performance with their conventional discrete counterparts, while being able to preserve approximately the same performance (2 % accuracy loss for ResNet18 on Imagenet) at a high rate (up to 30%) of structural pruning without fine-tuning, compared to 65 % accuracy loss of the conventional pruning methods under the same conditions.
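
A one-dimensional sketch of the integral-layer idea: the weight is a continuous function w(t), the layer output is the integral of w(t)x(t) over [0, 1], and discretizing at any resolution n just samples the function and applies a quadrature rule, so one continuous layer yields conventional weight tensors of different ("pruned") sizes. The specific functions are arbitrary choices for illustration:

```python
import numpy as np

w = lambda t: np.sin(2 * np.pi * t) + 1.0   # continuous weight function
x = lambda t: t ** 2                         # continuous input signal

def layer_output(n):
    """Discretize the continuous layer at resolution n via the trapezoid rule."""
    t = np.linspace(0.0, 1.0, n)
    y = w(t) * x(t)
    return float(((y[1:] + y[:-1]) * np.diff(t)).sum() / 2)

coarse = layer_output(32)     # low-resolution (heavily "pruned") sampling
fine = layer_output(2048)     # high-resolution sampling

# Both discretizations approximate the same underlying continuous layer,
# which is why resampling at a smaller size costs little performance.
assert abs(coarse - fine) < 1e-2
```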

Accelerating Dataset Distillation via Model Augmentation
Zhang, Lei and Zhang, Jie and Lei, Bowen and Mukherjee, Subhabrata and Pan, Xiang and Zhao, Bo and Ding, Caiwen and Li, Yao and Xu, Dongkuan



Research question: How to accelerate dataset distillation (DD), i.e., generating much smaller yet effective synthetic training datasets from large ones.
Motivation: Existing gradient-matching DD methods achieve leading performance but are extremely computationally intensive, since they continuously optimize a dataset among thousands of randomly initialized models.
Method: Assuming that training the synthetic data with diverse models leads to better generalization, two model augmentation techniques are proposed, namely using early-stage models and parameter perturbation, to learn an informative synthetic set at significantly reduced training cost.
Results: Extensive experiments demonstrate up to 20x speedup with performance on par with state-of-the-art methods.

Dataset Distillation (DD), a newly emerging field, aims at generating much smaller but efficient synthetic training datasets from large ones. Existing DD methods based on gradient matching achieve leading performance; however, they are extremely computationally intensive, as they require continuously optimizing a dataset among thousands of randomly initialized models. In this paper, we assume that training the synthetic data with diverse models leads to better generalization performance. Thus we propose two model augmentation techniques, i.e., using early-stage models and parameter perturbation, to learn an informative synthetic set with significantly reduced training cost. Extensive experiments demonstrate that our method achieves up to 20x speedup with performance on par with state-of-the-art methods.
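The parameter-perturbation idea, i.e. cheaply generating diverse models from one early-stage model instead of training many from scratch, can be sketched as follows. The norm-scaled Gaussian noise model is an assumption for illustration, not the paper's exact scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(params, scale=0.1, rng=rng):
    """Parameter perturbation (sketch): jitter each weight tensor with
    Gaussian noise scaled to the tensor's own RMS magnitude, yielding a
    'new' model for the matching step without any extra training."""
    return [p + scale * np.linalg.norm(p) / np.sqrt(p.size)
            * rng.standard_normal(p.shape)
            for p in params]

# An "early-stage model": one weight matrix and a bias, as stand-ins.
params = [rng.standard_normal((4, 3)), np.zeros(3)]

# Five diverse model variants at the cost of five noise draws.
augmented = [perturb(params) for _ in range(5)]
```

Each perturbed copy can then serve as one of the "diverse models" in the gradient-matching objective.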

Solving Relaxations of MAP-MRF Problems: Combinatorial In-Face Frank-Wolfe Directions
Kolmogorov, Vladimir



Research question: Solving LP relaxations of MAP-MRF inference problems, in particular with the methods recently proposed by Swoboda and Kolmogorov (2019) and Kolmogorov and Pock (2021).
Motivation: As a key computational subroutine, that approach uses a variant of the Frank-Wolfe (FW) method to minimize a smooth convex function over a combinatorial polytope.
Method: We propose an efficient implementation of this subroutine based on in-face Frank-Wolfe directions, introduced by Freund et al. (2017) in a different context. More generally, we define an abstract data structure for combinatorial subproblems that enables in-face FW directions, and describe its specialization to tree-structured MAP-MRF inference subproblems.
Results: Experimental results indicate that the resulting method is the current state-of-the-art LP solver for some classes of problems. Our code is available at pub.ist.ac.at/~vnk/papers/IN-FACE-FW.html.

We consider the problem of solving LP relaxations of MAP-MRF inference problems, and in particular the method proposed recently in (Swoboda, Kolmogorov 2019; Kolmogorov, Pock 2021). As a key computational subroutine, it uses a variant of the Frank-Wolfe (FW) method to minimize a smooth convex function over a combinatorial polytope. We propose an efficient implementation of this subroutine based on in-face Frank-Wolfe directions, introduced in (Freund et al. 2017) in a different context. More generally, we define an abstract data structure for a combinatorial subproblem that enables in-face FW directions, and describe its specialization for tree-structured MAP-MRF inference subproblems. Experimental results indicate that the resulting method is the current state-of-the-art LP solver for some classes of problems. Our code is available at pub.ist.ac.at/~vnk/papers/IN-FACE-FW.html.

Adapting Shortcut With Normalizing Flow: An Efficient Tuning Framework for Visual Recognition
Wang, Yaoming and Shi, Bowen and Zhang, Xiaopeng and Li, Jin and Liu, Yuchen and Dai, Wenrui and Li, Chenglin and Xiong, Hongkai and Tian, Qi



Research question: Pretraining followed by fine-tuning has proven effective for visual recognition, but fine-tuning all parameters can be computationally expensive, especially for large-scale models.
Motivation: To mitigate the computational and storage demands, recent work explores parameter-efficient fine-tuning (PEFT), which tunes a minimal number of parameters for effective adaptation. However, existing methods fail to analyze the impact of the additional parameters on the model, leaving the tuning process unclear and suboptimal.
Method: This paper introduces a novel and effective PEFT paradigm, SNF (Shortcut adaptation via Normalizing Flow), which uses normalizing flows to adjust the shortcut layers. We highlight that layers without Lipschitz constraints can cause error propagation when adapting to downstream datasets. Since modifying the over-parameterized residual connections in these layers is expensive, we focus on adjusting the cheap yet crucial shortcuts. Moreover, learning new information with few parameters in PEFT can be challenging, and information loss can degrade label information; to address this, we propose an information-preserving normalizing flow.
Results: Experiments demonstrate the effectiveness of SNF. With only 0.036M parameters, SNF surpasses previous approaches on both the FGVC and VTAB-1k benchmarks using ViT/B-16 as the backbone. The code is available at https://github.com/Wang-Yaoming/SNF.

Pretraining followed by fine-tuning has proven to be effective in visual recognition tasks. However, fine-tuning all parameters can be computationally expensive, particularly for large-scale models. To mitigate the computational and storage demands, recent research has explored Parameter-Efficient Fine-Tuning (PEFT), which focuses on tuning a minimal number of parameters for efficient adaptation. Existing methods, however, fail to analyze the impact of the additional parameters on the model, resulting in an unclear and suboptimal tuning process. In this paper, we introduce a novel and effective PEFT paradigm, named SNF (Shortcut adaptation via Normalization Flow), which utilizes normalizing flows to adjust the shortcut layers. We highlight that layers without Lipschitz constraints can lead to error propagation when adapting to downstream datasets. Since modifying the over-parameterized residual connections in these layers is expensive, we focus on adjusting the cheap yet crucial shortcuts. Moreover, learning new information with few parameters in PEFT can be challenging, and information loss can result in label information degradation. To address this issue, we propose an information-preserving normalizing flow. Experimental results demonstrate the effectiveness of SNF. Specifically, with only 0.036M parameters, SNF surpasses previous approaches on both the FGVC and VTAB-1k benchmarks using ViT/B-16 as the backbone. The code is available at https://github.com/Wang-Yaoming/SNF

Endpoints Weight Fusion for Class Incremental Semantic Segmentation
Xiao, Jia-Wen and Zhang, Chang-Bin and Feng, Jiekang and Liu, Xialei and van de Weijer, Joost and Cheng, Ming-Ming



Research question: This paper addresses catastrophic forgetting in class incremental semantic segmentation (CISS) in order to improve the model's discrimination ability.
Motivation: Existing methods mainly preserve previous knowledge in the current model through regularization (e.g., knowledge distillation), but merely constraining the representations of the old and new models to be consistent often yields limited gains.
Method: We propose Endpoints Weight Fusion (EWF), which dynamically fuses the model containing old knowledge with the model retaining new knowledge, strengthening the memory of old classes under ever-changing distributions. Distillation is additionally used during optimization so that the parameters to be fused lie closer in parameter space.
Results: Experiments on two widely used datasets achieve state-of-the-art performance.

Class incremental semantic segmentation (CISS) focuses on alleviating catastrophic forgetting to improve discrimination. Previous works mainly exploit regularization (e.g., knowledge distillation) to maintain previous knowledge in the current model. However, distillation alone often yields limited gain, since only the representations of the old and new models are constrained to be consistent. In this paper, we propose a simple yet effective method to obtain a model with a strong memory of old knowledge, named Endpoints Weight Fusion (EWF). In our method, the model containing old knowledge is fused with the model retaining new knowledge in a dynamic fusion manner, strengthening the memory of old classes in ever-changing distributions. In addition, we analyze the relation between our fusion strategy and the popular moving-average technique EMA, which reveals why our method is better suited to class-incremental learning. To facilitate parameter fusion with closer distance in the parameter space, we use distillation to enhance the optimization process. Furthermore, we conduct experiments on two widely used datasets, achieving state-of-the-art performance.
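The fusion step itself is a convex combination of the two endpoint models' weights. A minimal sketch (the paper chooses the fusion coefficient dynamically per incremental step, which this fixed-alpha version does not reproduce):

```python
import numpy as np

def endpoints_weight_fusion(old_params, new_params, alpha):
    """Fuse the model before (old) and after (new) learning the current
    task by a per-parameter convex combination:
        theta = alpha * theta_old + (1 - alpha) * theta_new."""
    return {k: alpha * old_params[k] + (1.0 - alpha) * new_params[k]
            for k in old_params}

# Toy "models": two named parameter tensors each.
old = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
new = {"w": np.array([3.0, 0.0]), "b": np.array([1.0])}

fused = endpoints_weight_fusion(old, new, alpha=0.5)
print(fused["w"])  # [2. 1.]
```

With alpha = 0.5 this coincides with a single step of an exponential moving average, which is the connection to EMA the abstract mentions.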

Efficient Robust Principal Component Analysis via Block Krylov Iteration and CUR Decomposition
Fang, Shun and Xu, Zhengqin and Wu, Shiqian and Xie, Shoulie



Research question: How to perform robust principal component analysis (RPCA) efficiently for large-scale matrices.
Motivation: Existing RPCA algorithms require singular value decomposition (SVD) and therefore enormous computational resources when processing large-scale matrices.
Method: This paper proposes an efficient RPCA (eRPCA) algorithm based on block Krylov iteration and CUR decomposition. Specifically, the Krylov iteration method approximates the eigenvalue decomposition in the rank estimation with complexity O(ndrq + n(rq)^2), where q is a small parameter and r is the target rank. Based on the estimated rank, CUR decomposition replaces SVD in updating the low-rank matrix component, reducing the per-iteration complexity from O(rnd) to O(r^2n).
Results: Experiments show that the proposed eRPCA is more effective and efficient than state-of-the-art methods in various low-level vision applications.

Robust principal component analysis (RPCA) is widely studied in computer vision. Recently, an adaptive rank estimate based RPCA has achieved top performance in low-level vision tasks without a prior rank, but both the rank estimation and the RPCA optimization algorithm involve singular value decomposition (SVD), which requires extremely large computational resources for large-scale matrices. To address these issues, an efficient RPCA (eRPCA) algorithm based on block Krylov iteration and CUR decomposition is proposed in this paper. Specifically, the Krylov iteration method is employed to approximate the eigenvalue decomposition in the rank estimation, which requires O(ndrq + n(rq)^2) operations for an n x d input matrix, where q is a small parameter and r is the target rank. Based on the estimated rank, CUR decomposition is adopted to replace SVD in updating the low-rank matrix component, whose complexity reduces from O(rnd) to O(r^2n) per iteration. Experimental results verify the efficiency and effectiveness of the proposed eRPCA over state-of-the-art methods in various low-level vision applications.
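A CUR decomposition avoids a full SVD by building the low-rank factorization from sampled rows and columns of the matrix itself. A toy sketch with uniform sampling (the function name and the sampling scheme are illustrative; leverage-score sampling is common in practice):

```python
import numpy as np

def cur_approx(M, r, rng):
    """Rank-r CUR approximation: pick r columns C and r rows R of M and
    set U = pinv(C) @ M @ pinv(R), so that M ~ C @ U @ R. Only small
    pseudo-inverses are computed; no SVD of the full matrix is needed."""
    n, d = M.shape
    cols = rng.choice(d, size=r, replace=False)
    rows = rng.choice(n, size=r, replace=False)
    C, R = M[:, cols], M[rows, :]
    U = np.linalg.pinv(C) @ M @ np.linalg.pinv(R)
    return C @ U @ R

rng = np.random.default_rng(0)
# An exactly rank-4 matrix: CUR recovers it (almost surely) from 4
# sampled columns and rows, since they span the column/row spaces.
L = rng.standard_normal((60, 4)) @ rng.standard_normal((4, 50))
approx = cur_approx(L, r=4, rng=rng)
err = np.linalg.norm(L - approx) / np.linalg.norm(L)
print(err)  # numerically ~0 for an exactly low-rank matrix
```

For noisy or approximately low-rank inputs the error degrades gracefully, which is what makes CUR a usable SVD substitute inside an RPCA loop.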

Toward Accurate Post-Training Quantization for Image Super Resolution
Tu, Zhijun and Hu, Jie and Chen, Hanting and Wang, Yunhe



Research question: How to perform post-training quantization (PTQ) for image super resolution using only a few unlabeled calibration images.
Motivation: Super-resolution models must be quantized for deployment on mobile devices, but existing methods require the complete dataset and expensive computational overhead.
Method: Based on an analysis of the asymmetric bounds of activations, a density-based dual clipping is introduced to cut off outliers. A pixel-aware calibration method is further proposed to accommodate the highly dynamic range of different samples.
Results: Experiments show that the method significantly outperforms existing PTQ algorithms across various models and datasets; for example, quantizing EDSRx4 to 4-bit with 100 unlabeled images yields a 2.091 dB gain on the Urban100 benchmark.

Model quantization is a crucial step for deploying super resolution (SR) networks on mobile devices. However, existing works focus on quantization-aware training, which requires the complete dataset and expensive computational overhead. In this paper, we study post-training quantization (PTQ) for image super resolution using only a few unlabeled calibration images. As the SR model aims to maintain the texture and color information of input images, the distributions of activations are long-tailed, asymmetric and highly dynamic compared with classification models. To this end, we introduce density-based dual clipping to cut off the outliers based on analyzing the asymmetric bounds of activations. Moreover, we present a novel pixel-aware calibration method with the supervision of the full-precision model to accommodate the highly dynamic range of different samples. Extensive experiments demonstrate that the proposed method significantly outperforms existing PTQ algorithms on various models and datasets. For instance, we get a 2.091 dB increase on the Urban100 benchmark when quantizing EDSRx4 to 4-bit with 100 unlabeled images. Our code is available at both https://github.com/huawei-noah/Efficient-Computing/tree/master/Quantization/PTQ4SR and https://gitee.com/mindspore/models/tree/master/research/cv/PTQ4SR.
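Clipping long-tailed activations at asymmetric bounds before uniform quantization can be sketched as follows. Plain percentiles stand in here for the paper's density-based dual clipping; the bounds, bit width, and function name are illustrative:

```python
import numpy as np

def clip_and_quantize(x, lo_pct=0.5, hi_pct=99.5, bits=4):
    """Asymmetric clipping + uniform quantization: cut off outliers at
    (possibly asymmetric) lower/upper bounds, then map the surviving
    range onto 2**bits integer levels."""
    lo, hi = np.percentile(x, [lo_pct, hi_pct])  # asymmetric bounds
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q.astype(np.int64), lo, scale         # dequantize: q*scale + lo

rng = np.random.default_rng(0)
# A long-tailed, asymmetric activation distribution with two outliers.
acts = np.concatenate([rng.normal(1.0, 0.5, 10000), [25.0, -30.0]])
q, lo, scale = clip_and_quantize(acts)
dequant = q * scale + lo
print(q.min(), q.max())  # 0 15 -- outliers clipped to the 4-bit range
```

Without the clipping step, the two outliers alone would stretch the quantization grid over [-30, 25] and waste nearly all 16 levels.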

Real-Time Neural Light Field on Mobile Devices
Cao, Junli and Wang, Huan and Chemerys, Pavlo and Shakhrai, Vladislav and Hu, Ju and Fu, Yun and Makoviichuk, Denys and Tulyakov, Sergey and Ren, Jian



Research question: How to improve the running speed and efficiency of neural radiance field (NeRF) models on resource-constrained devices such as mobile phones.
Motivation: Although NeRF models have achieved impressive results on novel view synthesis, their inference is extremely slow due to volumetric rendering, limiting their use on resource-constrained devices.
Method: We propose an efficient neural network that runs in real time on mobile devices for neural rendering. Unlike existing work, we introduce a novel network architecture that runs efficiently on mobile devices with low latency and small size.
Results: Our model achieves high-resolution generation while maintaining real-time inference for both synthetic and real-world scenes on mobile devices, e.g., rendering one 1008x756 image of a real 3D scene in 18.04 ms on an iPhone 13. In addition, it achieves image quality comparable to NeRF and better than MobileNeRF (PSNR 26.15 vs. 25.91 on the real-world forward-facing dataset).

Recent efforts on Neural Radiance Fields (NeRF) have shown impressive results on novel view synthesis by utilizing implicit neural representation to represent 3D scenes. Due to the process of volumetric rendering, the inference speed for NeRF is extremely slow, limiting the application scenarios of utilizing NeRF on resource-constrained hardware, such as mobile devices. Many works have been conducted to reduce the latency of running NeRF models. However, most of them still require a high-end GPU for acceleration or extra storage memory, which is all unavailable on mobile devices. Another emerging direction utilizes the neural light field (NeLF) for speedup, as only one forward pass is performed on a ray to predict the pixel color. Nevertheless, to reach a similar rendering quality as NeRF, the network in NeLF is designed with intensive computation, which is not mobile-friendly. In this work, we propose an efficient network that runs in real-time on mobile devices for neural rendering. We follow the setting of NeLF to train our network. Unlike existing works, we introduce a novel network architecture that runs efficiently on mobile devices with low latency and small size, i.e., saving 15x-24x storage compared with MobileNeRF. Our model achieves high-resolution generation while maintaining real-time inference for both synthetic and real-world scenes on mobile devices, e.g., 18.04 ms (iPhone 13) for rendering one 1008x756 image of real 3D scenes. Additionally, we achieve similar image quality as NeRF and better quality than MobileNeRF (PSNR 26.15 vs. 25.91 on the real-world forward-facing dataset).

Deep Dive Into Gradients: Better Optimization for 3D Object Detection With Gradient-Corrected IoU Supervision
Ming, Qi and Miao, Lingjuan and Ma, Zhe and Zhao, Lin and Zhou, Zhiqiang and Huang, Xuhui and Chen, Yuanpei and Guo, Yufei



Research question: In existing 3D object detection, using Intersection-over-Union (IoU) as the loss suffers from abnormal gradients and slow convergence.
Motivation: To address these issues, an improved IoU loss, the Gradient-Corrected IoU (GCIoU) loss, is proposed.
Method: A gradient correction strategy is designed to endow the 3D IoU loss with a reasonable gradient, ensuring fast convergence early in training and fine-grained bounding-box refinement later. A gradient rescaling strategy is further introduced to adapt the optimization step size to objects of different scales.
Results: Experiments on the KITTI dataset demonstrate the superiority of the method, achieving stable performance gains and faster model convergence.

Intersection-over-Union (IoU) is the most popular metric to evaluate regression performance in 3D object detection. Recently, there are also some methods applying IoU to the optimization of 3D bounding box regression. However, we demonstrate through experiments and mathematical proof that the 3D IoU loss suffers from abnormal gradients w.r.t. angular error and object scale, which further leads to slow convergence and a suboptimal regression process, respectively. In this paper, we propose a Gradient-Corrected IoU (GCIoU) loss to achieve fast and accurate 3D bounding box regression. Specifically, a gradient correction strategy is designed to endow the 3D IoU loss with a reasonable gradient. It ensures that the model converges quickly in the early stage of training, and helps to achieve fine-grained refinement of bounding boxes in the later stage. To solve the suboptimal regression of the 3D IoU loss for objects at different scales, we introduce a gradient rescaling strategy to adaptively optimize the step size. Finally, we integrate the GCIoU loss into multiple models to achieve stable performance gains and faster model convergence. Experiments on the KITTI dataset demonstrate the superiority of the proposed method. The code is available at https://github.com/ming71/GCIoU-loss.

MobileOne: An Improved One Millisecond Mobile Backbone
Vasu, Pavan Kumar Anasosalu and Gabriel, James and Zhu, Jeff and Tuzel, Oncel and Ranjan, Anurag



Research question: Optimizing neural networks for metrics such as FLOPs or parameter count may not accurately reflect their latency on mobile devices.
Motivation: To address this, we perform an extensive analysis by deploying several mobile-friendly networks on a mobile device, and design an efficient network, MobileOne.
Method: We analyze the architectural and optimization bottlenecks of existing efficient neural networks and propose corresponding remedies. We then design a new efficient backbone, MobileOne, achieving an inference time under 1 ms on an iPhone 12.
Results: Experiments show that MobileOne achieves state-of-the-art performance among efficient architectures while being many times faster on mobile devices. The model also generalizes well to multiple tasks, including image classification, object detection, and semantic segmentation.

Efficient neural network backbones for mobile devices are often optimized for metrics such as FLOPs or parameter count. However, these metrics may not correlate well with the latency of the network when deployed on a mobile device. Therefore, we perform extensive analysis of different metrics by deploying several mobile-friendly networks on a mobile device. We identify and analyze architectural and optimization bottlenecks in recent efficient neural networks and provide ways to mitigate these bottlenecks. To this end, we design an efficient backbone MobileOne, with variants achieving an inference time under 1 ms on an iPhone 12 with 75.9% top-1 accuracy on ImageNet. We show that MobileOne achieves state-of-the-art performance within the efficient architectures while being many times faster on mobile. Our best model obtains similar performance on ImageNet as MobileFormer while being 38x faster. Our model obtains 2.3% better top-1 accuracy on ImageNet than EfficientNet at similar latency. Furthermore, we show that our model generalizes to multiple tasks -- image classification, object detection, and semantic segmentation -- with significant improvements in latency and accuracy as compared to existing efficient architectures when deployed on a mobile device.

AccelIR: Task-Aware Image Compression for Accelerating Neural Restoration
Ye, Juncheol and Yeo, Hyunho and Park, Jinwoo and Han, Dongsu



Research question: How to optimize image compression so as to improve image restoration (IR) quality while reducing computational overhead.
Motivation: Although deep neural networks perform well at image restoration, they require heavy computation. Existing approaches mainly design new networks or prune parameters, but most of them ignore the impact of compression on IR quality.
Method: We present AccelIR, a framework that optimizes image compression considering the end-to-end pipeline of IR tasks. AccelIR adjusts compression levels across the blocks within an image according to their impact on IR quality, then runs a lightweight IR network on the compressed image, effectively reducing the IR computation while maintaining the same IR quality and image size.
Results: An extensive evaluation with seven IR networks shows that AccelIR reduces the computing overhead of super-resolution, de-noising, and de-blurring by 49%, 29%, and 32% on average, respectively.

Recently, deep neural networks have been successfully applied for image restoration (IR) (e.g., super-resolution, de-noising, de-blurring). Despite their promising performance, running IR networks requires heavy computation. A large body of work has been devoted to addressing this issue by designing novel neural networks or pruning their parameters. However, the common limitation is that while images are saved in a compressed format before being enhanced by IR, prior work does not consider the impact of compression on the IR quality. In this paper, we present AccelIR, a framework that optimizes image compression considering the end-to-end pipeline of IR tasks. AccelIR encodes an image through IR-aware compression that optimizes compression levels across image blocks within an image according to the impact on the IR quality. Then, it runs a lightweight IR network on the compressed image, effectively reducing IR computation, while maintaining the same IR quality and image size. Our extensive evaluation using seven IR networks shows that AccelIR can reduce the computing overhead of super-resolution, de-noising, and de-blurring by 49%, 29%, and 32% on average, respectively.

One-Shot Model for Mixed-Precision Quantization
Koryakovskiy, Ivan and Yakovleva, Alexandra and Buchnev, Valentin and Isaev, Temur and Odinokikh, Gleb



Research question: How to compress neural network models effectively while preserving their performance.
Motivation: Modern hardware supports quantization in mixed-precision mode, which enables much greater compression rates but adds the challenging task of searching for the optimal bit widths.
Method: This paper studies gradient-based optimization for tensor bit-width search, providing theoretical derivations of several earlier empirical methods and a novel One-Shot method that finds a diverse set of Pareto-front architectures in O(1) time.
Results: Verified on two classification and super-resolution models, the correlation score between predicted and actual model performance exceeds 0.93. Pareto-front architecture selection is simple and straightforward, requiring only 20 to 40 supernet evaluations, which is the best result known to us.

Neural network quantization is a popular approach for model compression. Modern hardware supports quantization in mixed-precision mode, which allows for greater compression rates but adds the challenging task of searching for the optimal bit width. The majority of existing searchers find a single mixed-precision architecture. To select an architecture that is suitable in terms of performance and resource consumption, one has to restart searching multiple times. We focus on a specific class of methods that find tensor bit widths using gradient-based optimization. First, we theoretically derive several methods that were empirically proposed earlier. Second, we present a novel One-Shot method that finds a diverse set of Pareto-front architectures in O(1) time. For large models, the proposed method is 5 times more efficient than existing methods. We verify the method on two classification and super-resolution models and show a correlation score above 0.93 between the predicted and actual model performance. The Pareto-front architecture selection is straightforward and takes only 20 to 40 supernet evaluations, which is the new state-of-the-art result to the best of our knowledge.
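Once a supernet provides (cost, error) estimates for candidate bit-width assignments, extracting the Pareto front is a cheap post-process. A minimal sketch with made-up numbers:

```python
import numpy as np

def pareto_front(cost, error):
    """Return indices of Pareto-optimal candidates: those for which no
    other candidate has both lower cost and lower error. Sweep in order
    of increasing cost, keeping points that improve the best error."""
    idx = np.argsort(cost)
    front, best_err = [], np.inf
    for i in idx:
        if error[i] < best_err:
            front.append(int(i))
            best_err = error[i]
    return front

# Hypothetical mixed-precision candidates: (model cost, validation error).
cost = np.array([1.0, 2.0, 3.0, 4.0])
error = np.array([0.30, 0.20, 0.25, 0.10])
print(pareto_front(cost, error))  # [0, 1, 3] -- candidate 2 is dominated
```

Candidate 2 is dominated because candidate 1 is both cheaper and more accurate; the surviving set is the accuracy/compression trade-off curve one would present to a user.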

Boundary Unlearning: Rapid Forgetting of Deep Networks via Shifting the Decision Boundary
Chen, Min and Gao, Weizhuo and Liu, Gaoyang and Peng, Kai and Wang, Chen



Research question: How to make machine learning models efficiently forget a fraction of their training data and its lineage, to satisfy the "right to be forgotten" and the practical need to remove poisoned data.
Motivation: Existing machine unlearning methods for deep neural networks (DNNs) scrub the model parameters to destroy the influence of the forgetting data, which is prohibitively expensive due to the large dimension of the parameter space.
Method: This paper shifts attention from the parameter space to the decision space of the DNN model and proposes Boundary Unlearning, a rapid yet effective way to unlearn an entire class from a trained DNN model. The key idea is to shift the decision boundary of the original model so as to imitate the decision behavior of a model retrained from scratch.
Results: Extensive evaluation on the CIFAR-10 and Vggface2 datasets shows that Boundary Unlearning effectively forgets the target class on image classification and face recognition tasks, with expected speed-ups of 17x and 19x, respectively, compared with retraining from scratch.

The practical needs of the "right to be forgotten" and poisoned data removal call for efficient machine unlearning techniques, which enable machine learning models to unlearn, or to forget a fraction of training data and its lineage. Recent studies on machine unlearning for deep neural networks (DNNs) attempt to destroy the influence of the forgetting data by scrubbing the model parameters. However, it is prohibitively expensive due to the large dimension of the parameter space. In this paper, we refocus our attention from the parameter space to the decision space of the DNN model, and propose Boundary Unlearning, a rapid yet effective way to unlearn an entire class from a trained DNN model. The key idea is to shift the decision boundary of the original DNN model to imitate the decision behavior of the model retrained from scratch. We develop two novel boundary shift methods, namely Boundary Shrink and Boundary Expanding, both of which can rapidly achieve the utility and privacy guarantees. We extensively evaluate Boundary Unlearning on the CIFAR-10 and Vggface2 datasets, and the results show that Boundary Unlearning can effectively forget the forgetting class on image classification and face recognition tasks, with an expected speed-up of 17x and 19x, respectively, compared with retraining from scratch.
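The relabeling at the heart of a boundary-shrink style update, assigning each forgetting sample to its nearest incorrect class and fine-tuning on those labels, can be sketched with plain logits. This is a simplification: the paper finds the nearest incorrect class via perturbation of the inputs, whereas this sketch reads it directly off the masked logits:

```python
import numpy as np

def boundary_shrink_labels(logits, forget_class):
    """For each sample of the class to forget, pick the highest-scoring
    *other* class as its new label. Fine-tuning on these labels pulls
    the decision boundary over the forgotten class's region."""
    masked = logits.copy()
    masked[:, forget_class] = -np.inf   # the forgotten class may not win
    return masked.argmax(axis=1)

# Logits of two samples whose true class (0) must be forgotten.
logits = np.array([[4.0, 2.5, 0.1],
                   [5.0, 0.3, 3.2]])
new_labels = boundary_shrink_labels(logits, forget_class=0)
print(new_labels)  # [1 2] -- each sample's nearest incorrect class
```

After fine-tuning on such labels, the model's behavior on the forgotten class approximates that of a model that never saw it.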

Latency Matters: Real-Time Action Forecasting Transformer
Girase, Harshayu and Agarwal, Nakul and Choi, Chiho and Mangalam, Karttikeya



Research question: How to achieve low-latency, high-performance real-time action forecasting.
Motivation: Existing methods for real-time action forecasting often suffer from heavy computation and high latency.
Method: We propose RAFTformer, a real-time action forecasting transformer. It adopts a two-stage fully transformer architecture: a video transformer backbone that processes high-resolution short clips, and a head transformer encoder that temporally aggregates information from multiple short clips to span a long-term horizon. A self-supervised shuffled causal masking scheme is further proposed to improve model generalization during training.
Results: Experiments show that RAFTformer's inference latency is 9x smaller than prior work at the same forecasting accuracy; its parsimonious design uses 94% less training compute and 90% fewer parameters, outperforming the state-of-the-art baselines by about 4.9 Top-5 recall (T5R) points in the offline setting, and by up to 4.4 T5R points on the EPIC-Kitchens-100 dataset in the real-time setting.

We present RAFTformer, a real-time action forecasting transformer for latency-aware real-world action forecasting applications. RAFTformer is a two-stage fully transformer based architecture which consists of a video transformer backbone that operates on high-resolution, short-range clips and a head transformer encoder that temporally aggregates information from multiple short-range clips to span a long-term horizon. Additionally, we propose a self-supervised shuffled causal masking scheme to improve model generalization during training. Finally, we also propose a real-time evaluation setting that directly couples model inference latency to overall forecasting performance and brings forth a hitherto overlooked trade-off between latency and action forecasting performance. Our parsimonious network design enables RAFTformer's inference latency to be 9x smaller than that of prior works at the same forecasting accuracy. Owing to its two-staged design, RAFTformer uses 94% less training compute and 90% fewer training parameters to outperform prior state-of-the-art baselines by 4.9 points on EGTEA Gaze+ and by 1.4 points on the EPIC-Kitchens-100 dataset, as measured by Top-5 recall (T5R) in the offline setting. In the real-time setting, RAFTformer outperforms prior works by an even greater margin of up to 4.4 T5R points on the EPIC-Kitchens-100 dataset. Project Webpage: https://karttikeya.github.io/publication/RAFTformer/

CODA-Prompt: COntinual Decomposed Attention-Based Prompting for Rehearsal-Free Continual Learning
Smith, James Seale and Karlinsky, Leonid and Gutta, Vyshnavi and Cascante-Bonilla, Paola and Kim, Donghyun and Arbelle, Assaf and Panda, Rameswar and Feris, Rogerio and Kira, Zsolt



Research question: Computer vision models suffer from catastrophic forgetting when learning novel concepts from continuously shifting training data.
Motivation: Typical solutions to this continual learning problem require extensive rehearsal of previously seen data, which increases memory costs and may violate data privacy.
Method: We propose a novel attention-based end-to-end key-query scheme that learns a set of prompt components, which are then assembled with input-conditioned weights to produce input-conditioned prompts.
Results: Experiments show that the approach outperforms the current state-of-the-art DualPrompt by as much as 4.5% in average final accuracy on established benchmarks, and by as much as 4.4% accuracy on a continual learning benchmark containing both class-incremental and domain-incremental task shifts.

Computer vision models suffer from a phenomenon known as catastrophic forgetting when learning novel concepts from continuously shifting training data. Typical solutions for this continual learning problem require extensive rehearsal of previously seen data, which increases memory costs and may violate data privacy. Recently, the emergence of large-scale pre-trained vision transformer models has enabled prompting approaches as an alternative to data-rehearsal. These approaches rely on a key-query mechanism to generate prompts and have been found to be highly resistant to catastrophic forgetting in the well-established rehearsal-free continual learning setting. However, the key mechanism of these methods is not trained end-to-end with the task sequence. Our experiments show that this leads to a reduction in their plasticity, hence sacrificing new task accuracy, and an inability to benefit from expanded parameter capacity. We instead propose to learn a set of prompt components which are assembled with input-conditioned weights to produce input-conditioned prompts, resulting in a novel attention-based end-to-end key-query scheme. Our experiments show that we outperform the current SOTA method DualPrompt on established benchmarks by as much as 4.5% in average final accuracy. We also outperform the state of the art by as much as 4.4% accuracy on a continual learning benchmark which contains both class-incremental and domain-incremental task shifts, corresponding to many practical settings. Our code is available at https://github.com/GT-RIPL/CODA-Prompt
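The decomposed assembly, prompt components weighted by key-query similarity, can be sketched as follows. The shapes, the attended-query formulation, and the cosine-similarity weighting are illustrative assumptions rather than the paper's exact scheme:

```python
import numpy as np

def assemble_prompt(query, keys, attn, components):
    """Input-conditioned prompt assembly (sketch): an attention vector
    re-weights the query feature, per-component weights are cosine
    similarities to learned keys, and the output prompt is the weighted
    sum of learned prompt components."""
    a = query * attn                                    # attended query
    sims = keys @ a / (np.linalg.norm(keys, axis=1)
                       * np.linalg.norm(a) + 1e-8)      # component weights
    return np.tensordot(sims, components, axes=1)       # weighted prompt sum

rng = np.random.default_rng(0)
d, n, plen = 8, 4, 5                    # embed dim, #components, prompt length
query = rng.standard_normal(d)          # e.g. a [CLS] feature of the input
keys = rng.standard_normal((n, d))      # learned keys, one per component
attn = rng.standard_normal(d)           # learned attention vector
components = rng.standard_normal((n, plen, d))  # learned prompt components
prompt = assemble_prompt(query, keys, attn, components)
print(prompt.shape)  # (5, 8)
```

Because keys, attention vectors, and components are all ordinary parameters, the whole scheme is differentiable and can be trained end-to-end with the task sequence, which is the point of the paper.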

EfficientSCI: Densely Connected Network With Space-Time Factorization for Large-Scale Video Snapshot Compressive Imaging
Wang, Lishun and Cao, Miao and Yuan, Xin



Research question: How to efficiently reconstruct the desired video frames in video snapshot compressive imaging (SCI), where a two-dimensional detector captures consecutive frames during a single exposure.
Motivation: Although recent deep learning-based reconstruction algorithms achieve good results on most tasks, excessive model complexity and GPU memory limitations leave them computationally expensive and usually unable to reconstruct large-scale video frames at high compression ratios.
Method: An efficient network for video SCI, dubbed EfficientSCI, is built using dense connections and a space-time factorization mechanism within a single residual block, establishing spatial-temporal correlations with convolution in the spatial domain and a Transformer in the temporal domain.
Results: For the first time, a UHD color video with a high compression ratio is reconstructed from a single snapshot 2D measurement by an end-to-end deep learning model with PSNR above 32 dB; extensive results on both simulation and real data show that the method significantly outperforms all previous SOTA algorithms with better real-time performance.

Video snapshot compressive imaging (SCI) uses a two-dimensional detector to capture consecutive video frames during a single exposure time. Following this, an efficient reconstruction algorithm needs to be designed to reconstruct the desired video frames. Although recent deep learning-based state-of-the-art (SOTA) reconstruction algorithms have achieved good results in most tasks, they still face the following challenges due to excessive model complexity and GPU memory limitations: 1) these models need high computational cost, and 2) they are usually unable to reconstruct large-scale video frames at high compression ratios. To address these issues, we develop an efficient network for video SCI by using dense connections and a space-time factorization mechanism within a single residual block, dubbed EfficientSCI. The EfficientSCI network establishes spatial-temporal correlations well by using convolution in the spatial domain and a Transformer in the temporal domain. We show, for the first time, that a UHD color video with a high compression ratio can be reconstructed from a snapshot 2D measurement using a single end-to-end deep learning model with PSNR above 32 dB. Extensive results on both simulation and real data show that our method significantly outperforms all previous SOTA algorithms with better real-time performance.

Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures
Iofinova, Eugenia and Peste, Alexandra and Alistarh, Dan



Research question: Neural network pruning may induce or exacerbate bias in the output of the compressed model, but this phenomenon's relationship to pruning is not yet well understood.
Motivation: Despite existing evidence for the phenomenon, the relationship between neural network pruning and induced bias remains unclear.
Method: We systematically investigate and characterize this phenomenon in convolutional neural networks for computer vision. We first show that it is in fact possible to obtain highly sparse models, e.g., with fewer than 10% remaining weights, that neither decrease in accuracy nor substantially increase in bias. At the same time, we find that at higher sparsities, pruned models exhibit higher uncertainty in their outputs, as well as increased correlations, which we directly link to increased bias.
Results: We propose easy-to-use criteria that, based only on the uncompressed model, establish whether bias will increase with pruning, and identify the samples most susceptible to biased predictions after compression.

Pruning - that is, setting a significant subset of the parameters of a neural network to zero - is one of the most popular methods of model compression. Yet, several recent works have raised the issue that pruning may induce or exacerbate bias in the output of the compressed model. Despite existing evidence for this phenomenon, the relationship between neural network pruning and induced bias is not well-understood. In this work, we systematically investigate and characterize this phenomenon in Convolutional Neural Networks for computer vision. First, we show that it is in fact possible to obtain highly-sparse models, e.g. with less than 10% remaining weights, which do not decrease in accuracy nor substantially increase in bias when compared to dense models. At the same time, we also find that, at higher sparsities, pruned models exhibit higher uncertainty in their outputs, as well as increased correlations, which we directly link to increased bias. We propose easy-to-use criteria which, based only on the uncompressed model, establish whether bias will increase with pruning, and identify the samples most susceptible to biased predictions post-compression.
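The sparsification studied here is standard magnitude pruning: setting the smallest-magnitude weights to zero. A minimal sketch of producing a 90%-sparse weight matrix, the regime the abstract contrasts with higher sparsities:

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Magnitude pruning: zero out the given fraction of weights with
    the smallest absolute values, keeping the rest unchanged."""
    k = int(w.size * sparsity)
    # k-th smallest |w| is the cut-off threshold.
    thresh = np.partition(np.abs(w).ravel(), k)[k]
    return np.where(np.abs(w) >= thresh, w, 0.0)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
pruned = magnitude_prune(w, sparsity=0.9)
print((pruned == 0).mean())  # ~0.9 of the weights removed
```

The paper's question is what happens to output bias and uncertainty as this sparsity parameter is pushed past levels like 90%.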

Boundary-Aware Backward-Compatible Representation via Adversarial Learning in Image Retrieval
Pan, Tan and Xu, Furong and Yang, Xudong and He, Sifeng and Jiang, Chen and Guo, Qingpei and Qian, Feng and Zhang, Xiaobo and Cheng, Yuan and Yang, Lei and Chu, Wei



Research question: How to improve the compatibility between new and old retrieval models while keeping retrieval performance unaffected.
Motivation: Traditional model upgrades require re-computing the embeddings of all images in the database, a time-consuming process.
Method: We propose AdvBCT, an Adversarial Backward-Compatible Training method with an elastic boundary constraint. Adversarial learning is employed to minimize the disparity between the embedding distributions of the new and old models, and an elastic boundary constraint is added during training to efficiently improve both compatibility and discrimination.
Results: Experiments on the GLDv2, ROxford, and RParis datasets show that the method outperforms other backward-compatible training methods on both compatibility and discrimination.

Image retrieval plays an important role in the Internet world. Usually, the core parts of mainstream visual retrieval systems include an online service of the embedding model and a large-scale vector database. For traditional model upgrades, the old model will not be replaced by the new one until the embeddings of all the images in the database are re-computed by the new model, which takes days or weeks for a large amount of data. Recently, backward-compatible training (BCT) enables the new model to be immediately deployed online by making the new embeddings directly comparable to the old ones. For BCT, improving the compatibility of two models with less negative impact on retrieval performance is the key challenge. In this paper, we introduce AdvBCT, an Adversarial Backward-Compatible Training method with an elastic boundary constraint that takes both compatibility and discrimination into consideration. We first employ adversarial learning to minimize the distribution disparity between embeddings of the new model and the old model. Meanwhile, we add an elastic boundary constraint during training to improve compatibility and discrimination efficiently. Extensive experiments on GLDv2, Revisited Oxford (ROxford), and Revisited Paris (RParis) demonstrate that our method outperforms other BCT methods on both compatibility and discrimination. The implementation of AdvBCT will be publicly available at https://github.com/Ashespt/AdvBCT.

Sliced Optimal Partial Transport
Bai, Yikun and Schmitzer, Bernhard and Thorpe, Matthew and Kolouri, Soheil



Research question: This paper addresses a limitation of optimal transport (OT) in machine learning, data science, and computer vision, namely its core assumption that the source and target measures have equal total mass.
Motivation: The equal-mass assumption restricts the applicability of OT. Optimal Partial Transport (OPT) has been proposed to address this limitation; however, computing OPT relies on solving a high-dimensional linear programming problem, which can become computationally prohibitive.
Method: The paper proposes an efficient algorithm for the optimal partial transport problem between two non-negative measures in one dimension, and then, following the idea of sliced OT distances, uses slicing to define a sliced OPT distance.
Results: Various numerical experiments demonstrate the computational and accuracy benefits of the sliced OPT-based method; in particular, an application of the proposed Sliced-OPT to noisy point cloud registration is shown.

Optimal transport (OT) has become exceedingly popular in machine learning, data science, and computer vision. The core assumption in the OT problem is the equal total amount of mass in source and target measures, which limits its application. Optimal Partial Transport (OPT) is a recently proposed solution to this limitation. Similar to the OT problem, the computation of OPT relies on solving a linear programming problem (often in high dimensions), which can become computationally prohibitive. In this paper, we propose an efficient algorithm for calculating the OPT problem between two non-negative measures in one dimension. Next, following the idea of sliced OT distances, we utilize slicing to define the sliced OPT distance. Finally, we demonstrate the computational and accuracy benefits of the sliced OPT-based method in various numerical experiments. In particular, we show an application of our proposed Sliced-OPT in noisy point cloud registration.
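The slicing idea reduces a d-dimensional transport problem to many 1-D problems, which have closed-form solutions via sorting. The sketch below covers only the classical equal-mass case; the paper's contribution is the *partial*, unequal-mass 1-D solver, which this simplified version does not reproduce:

```python
import numpy as np

def ot_1d(x, y):
    """Closed-form 1-D optimal transport cost between two equal-size
    empirical measures: sort both samples and pair them in order."""
    return np.abs(np.sort(x) - np.sort(y)).mean()

def sliced_distance(X, Y, n_proj=64, rng=None):
    """Sliced distance: average the 1-D transport cost over random
    one-dimensional projections of the point clouds."""
    if rng is None:
        rng = np.random.default_rng(0)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)          # random unit direction
        total += ot_1d(X @ theta, Y @ theta)
    return total / n_proj

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
Y = X + np.array([2.0, 0.0, 0.0])   # same cloud, shifted along one axis
dist = sliced_distance(X, Y)
print(dist)                          # grows with the shift size
```

Replacing `ot_1d` with a partial 1-D solver that also decides how much mass to transport is exactly the extension the paper provides.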

NVTC: Nonlinear Vector Transform Coding
Feng, Runsen and Guo, Zongyu and Li, Weiping and Chen, Zhibo



Research question: How to improve the performance of neural image compression.
Motivation: Although modern neural networks considerably enhance the compression performance of scalar quantization, an insurmountable gap to vector quantization remains.
Method: A novel neural image compression framework, Nonlinear Vector Transform Coding (NVTC), is proposed. It addresses the critical complexity issue of vector quantization through a multi-stage quantization strategy and nonlinear vector transforms, and applies entropy-constrained vector quantization in the latent space to adaptively determine the quantization boundaries for joint rate-distortion optimization.
Results: Experiments show that NVTC outperforms previous NTC approaches in rate-distortion performance, decoding speed, and model size.

In theory, vector quantization (VQ) is always better than scalar quantization (SQ) in terms of rate-distortion (R-D) performance. Recent state-of-the-art methods for neural image compression are mainly based on nonlinear transform coding (NTC) with uniform scalar quantization, overlooking the benefits of VQ due to its exponentially increased complexity. In this paper, we first investigate some toy sources, demonstrating that even though modern neural networks considerably enhance the compression performance of SQ with nonlinear transforms, there is still an insurmountable chasm between SQ and VQ. Therefore, revolving around VQ, we propose a novel framework for neural image compression named Nonlinear Vector Transform Coding (NVTC). NVTC solves the critical complexity issue of VQ through (1) a multi-stage quantization strategy and (2) nonlinear vector transforms. In addition, we apply entropy-constrained VQ in latent space to adaptively determine the quantization boundaries for joint rate-distortion optimization, which improves the performance both theoretically and experimentally. Compared to previous NTC approaches, NVTC demonstrates superior rate-distortion performance, faster decoding speed, and smaller model size. Our code is available at https://github.com/USTC-IMCL/NVTC.

On the Effectiveness of Partial Variance Reduction in Federated Learning With Heterogeneous Data
Li, Bo and Schmidt, Mikkel N. and Alstrøm, Tommy S. and Stich, Sebastian U.



Research question: Client data heterogeneity is a key challenge in federated learning.
Motivation: Although existing methods achieve fast convergence on convex or simple non-convex problems, their performance on over-parameterized models such as deep neural networks is lacking.
Method: We revisit the widely used FedAvg algorithm on deep neural networks and find that while the feature extraction layers are learned effectively by FedAvg, the substantial diversity of the final classification layers across clients impedes performance. We therefore propose to correct model drift only on the final layers.
Results: Experiments show that this approach significantly outperforms existing benchmarks at a similar or lower communication cost. We also prove the convergence rate of the algorithm.

Data heterogeneity across clients is a key challenge in federated learning. Prior works address this by either aligning client and server models or using control variates to correct client model drift. Although these methods achieve fast convergence in convex or simple non-convex problems, the performance in over-parameterized models such as deep neural networks is lacking. In this paper, we first revisit the widely used FedAvg algorithm in a deep neural network to understand how data heterogeneity influences the gradient updates across the neural network layers. We observe that while the feature extraction layers are learned efficiently by FedAvg, the substantial diversity of the final classification layers across clients impedes the performance. Motivated by this, we propose to correct model drift by variance reduction only on the final layers. We demonstrate that this significantly outperforms existing benchmarks at a similar or lower communication cost. We furthermore provide proof for the convergence rate of our algorithm.
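A minimal numpy sketch of the aggregation idea, assuming models as dicts of arrays. The SCAFFOLD-style control-variate correction is folded into server aggregation for brevity and applied only to the final classifier layer; this is an illustrative rendering of "variance reduction on the final layers," not the authors' exact algorithm.

```python
import numpy as np

# Sketch: FedAvg aggregation where only the final layer ("clf") receives a
# control-variate drift correction; feature layers ("feat") are plain-averaged.
# All names and the exact correction placement are illustrative.

def fedavg_partial_vr(server, client_updates, client_controls, server_control, lr=1.0):
    """Aggregate client updates; apply drift correction only to 'clf'."""
    new_server = {}
    for name, w in server.items():
        avg_update = np.mean([u[name] for u in client_updates], axis=0)
        if name == "clf":
            # variance-reduced correction on the final layer only
            avg_control = np.mean([c[name] for c in client_controls], axis=0)
            avg_update = avg_update - avg_control + server_control[name]
        new_server[name] = w + lr * avg_update
    return new_server

# toy round with two clients
server = {"feat": np.zeros(3), "clf": np.zeros(2)}
updates = [{"feat": np.ones(3), "clf": np.array([1.0, 0.0])},
           {"feat": np.ones(3), "clf": np.array([0.0, 1.0])}]
controls = [{"clf": np.array([0.5, 0.0])}, {"clf": np.array([0.0, 0.5])}]
server_control = {"clf": np.array([0.25, 0.25])}
new_server = fedavg_partial_vr(server, updates, controls, server_control)
```

Because only the classifier's control variates are communicated, the extra cost over vanilla FedAvg is limited to the final layers.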

LVQAC: Lattice Vector Quantization Coupled With Spatially Adaptive Companding for Efficient Learned Image Compression
Zhang, Xi and Wu, Xiaolin



Research question: How to optimize end-to-end image compression neural networks and improve their rate-distortion performance.
Motivation: Most existing end-to-end methods adopt uniform scalar quantizers rather than the information-theoretically optimal vector quantizers, which limits their performance.
Method: Propose a novel Lattice Vector Quantization coupled with spatially Adaptive Companding (LVQAC) scheme. LVQ better exploits inter-feature dependencies while being computationally almost as simple as uniform scalar quantization; to improve its adaptability to source statistics, it is coupled with a spatially adaptive companding mapping.
Results: Experiments show that for any end-to-end CNN image compression model, replacing the uniform quantizer with LVQAC achieves better rate-distortion performance without significantly increasing model complexity.

Recently, numerous end-to-end optimized image compression neural networks have been developed and proved themselves as leaders in rate-distortion performance. The main strength of these learnt compression methods is in powerful nonlinear analysis and synthesis transforms that can be facilitated by deep neural networks. However, out of operational expediency, most of these end-to-end methods adopt uniform scalar quantizers rather than vector quantizers, which are information-theoretically optimal. In this paper, we present a novel Lattice Vector Quantization scheme coupled with a spatially Adaptive Companding (LVQAC) mapping. LVQ can better exploit the inter-feature dependencies than scalar uniform quantization while being computationally almost as simple as the latter. Moreover, to improve the adaptability of LVQ to source statistics, we couple a spatially adaptive companding (AC) mapping with LVQ. The resulting LVQAC design can be easily embedded into any end-to-end optimized image compression system. Extensive experiments demonstrate that for any end-to-end CNN image compression model, replacing the uniform quantizer with LVQAC achieves better rate-distortion performance without significantly increasing the model complexity.

Genie: Show Me the Data for Quantization
Jeon, Yongkweon and Lee, Chungman and Kim, Ho-young



Research question: How to perform zero-shot quantization of lightweight deep neural networks using only the parameters of a pre-trained model when data is inaccessible.
Motivation: Data is often inaccessible owing to issues such as cost and privacy; zero-shot quantization is a promising approach to developing lightweight deep neural networks in this setting.
Method: Generate synthetic data by exploiting the learned parameters (u and sigma) of the batch normalization layers in a pre-trained model, then distill knowledge from the pre-trained model (teacher) to the quantized model (student) so that the quantized model can be optimized on the synthetic dataset. A post-training quantization scheme is also proposed that produces high-quality quantized networks within a few hours.
Results: The proposed GENIE framework generates data suited for quantization, with which robust quantized models can be produced without real datasets, comparable to few-shot quantization. Combined with the post-training quantization algorithm, it bridges the gap between zero-shot and few-shot quantization and significantly improves quantization performance, yielding a unique state-of-the-art zero-shot quantization approach.

Zero-shot quantization is a promising approach for developing lightweight deep neural networks when data is inaccessible owing to various reasons, including cost and issues related to privacy. By exploiting the learned parameters (u and sigma) of batch normalization layers in an FP32-pre-trained model, zero-shot quantization schemes focus on generating synthetic data. Subsequently, they distill knowledge from the pre-trained model (teacher) to the quantized model (student) such that the quantized model can be optimized with the synthetic dataset. However, thus far, zero-shot quantization has primarily been discussed in the context of quantization-aware training methods, which require task-specific losses and long-term optimization as much as retraining. We thus introduce a post-training quantization scheme for zero-shot quantization that produces high-quality quantized networks within a few hours. Furthermore, we propose a framework called GENIE that generates data suited for quantization. With the data synthesized by GENIE, we can produce robust quantized models without real datasets, which is comparable to few-shot quantization. We also propose a post-training quantization algorithm to enhance the performance of quantized models. By combining them, we can bridge the gap between zero-shot and few-shot quantization while significantly improving the quantization performance compared to that of existing approaches. In other words, we can obtain a unique state-of-the-art zero-shot quantization approach.
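The batch-norm-driven data synthesis can be sketched in one dimension: optimize a batch of synthetic values until its statistics match the (u, sigma) stored in a pre-trained model's BN layer. This is an illustrative moment-matching toy, not GENIE's actual generator.

```python
import numpy as np

# Illustrative sketch of the zero-shot trick: synthesize data whose batch
# statistics match the (u, sigma) of a batch-norm layer, by gradient descent
# on a moment-matching loss over a 1-D "feature". Hyper-parameters are toys.

def synthesize(mu, sigma, n=256, steps=2000, lr=10.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, n)
    for _ in range(steps):
        m, s = x.mean(), x.std()
        # gradient of (m - mu)^2 + (s - sigma)^2 with respect to each x_i
        grad = 2 * (m - mu) / n + 2 * (s - sigma) * (x - m) / (n * max(s, 1e-8))
        x = x - lr * grad
    return x

x = synthesize(mu=2.0, sigma=0.5)   # match a hypothetical BN layer's stats
```

In the real setting the loss is accumulated over every BN layer of the teacher network and the gradients flow back through the network to the synthetic images.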

Multi-Agent Automated Machine Learning
Wang, Zhaozhi and Su, Kefan and Zhang, Jian and Jia, Huizhu and Ye, Qixiang and Xie, Xiaodong and Lu, Zongqing



Research question: How to effectively handle the joint optimization of modules in automated machine learning (AutoML).
Motivation: In existing AutoML systems, cooperation among modules during optimization is suboptimal.
Method: Propose multi-agent automated machine learning (MA2ML), which treats each learning module (e.g., data augmentation, neural architecture search, or hyper-parameter optimization) as an agent and the final performance as the reward, formulating a multi-agent reinforcement learning problem. Cooperation among modules is enhanced by explicitly assigning each agent credit according to its marginal contribution, and off-policy learning is incorporated to improve search efficiency.
Results: Experiments show that MA2ML achieves state-of-the-art top-1 accuracy on ImageNet under computational cost constraints, e.g., 79.7%/80.5% with FLOPs fewer than 600M/800M. Extensive ablation studies verify that MA2ML's credit assignment and off-policy learning indeed bring benefits.

In this paper, we propose multi-agent automated machine learning (MA2ML) with the aim to effectively handle joint optimization of modules in automated machine learning (AutoML). MA2ML takes each machine learning module, such as data augmentation (AUG), neural architecture search (NAS), or hyper-parameters (HPO), as an agent and the final performance as the reward, to formulate a multi-agent reinforcement learning problem. MA2ML explicitly assigns credit to each agent according to its marginal contribution to enhance cooperation among modules, and incorporates off-policy learning to improve search efficiency. Theoretically, MA2ML guarantees monotonic improvement of joint optimization. Extensive experiments show that MA2ML yields the state-of-the-art top-1 accuracy on ImageNet under constraints of computational cost, e.g., 79.7%/80.5% with FLOPs fewer than 600M/800M. Extensive ablation studies verify the benefits of credit assignment and off-policy learning of MA2ML.

StructVPR: Distill Structural Knowledge With Weighting Samples for Visual Place Recognition
Shen, Yanqing and Zhou, Sanping and Fu, Jingwen and Wang, Ruotong and Chen, Shitao and Zheng, Nanning



Research question: This paper addresses visual place recognition (VPR): how to extract stable global features from RGB images and exploit spatial structural information to improve performance.
Motivation: Limited by existing training frameworks, most deep-learning-based VPR methods cannot extract sufficiently stable global features and must rely on a time-consuming re-ranking step to exploit spatial structural information for better performance.
Method: This paper proposes StructVPR, a novel training architecture for VPR that feeds segmentation images into a CNN as a more definitive source of structural knowledge and applies knowledge distillation to avoid online segmentation and seg-branch inference at test time, thereby enhancing structural knowledge in RGB global features and improving feature stability.
Results: Experiments show that StructVPR achieves impressive performance on several benchmarks, outperforming many two-stage methods using global retrieval alone. With additional re-ranking, StructVPR achieves state-of-the-art performance while maintaining a low computational cost.

Visual place recognition (VPR) is usually considered as a specific image retrieval problem. Limited by existing training frameworks, most deep learning-based works cannot extract sufficiently stable global features from RGB images and rely on a time-consuming re-ranking step to exploit spatial structural information for better performance. In this paper, we propose StructVPR, a novel training architecture for VPR, to enhance structural knowledge in RGB global features and thus improve feature stability in a constantly changing environment. Specifically, StructVPR uses segmentation images as a more definitive source of structural knowledge input into a CNN network and applies knowledge distillation to avoid online segmentation and inference of seg-branch in testing. Considering that not all samples contain high-quality and helpful knowledge, and some even hurt the performance of distillation, we partition samples and weigh each sample's distillation loss to enhance the expected knowledge precisely. Finally, StructVPR achieves impressive performance on several benchmarks using only global retrieval and even outperforms many two-stage approaches by a large margin. After adding additional re-ranking, ours achieves state-of-the-art performance while maintaining a low computational cost.

Elastic Aggregation for Federated Optimization
Chen, Dengsheng and Hu, Jie and Tan, Vince Junkai and Wei, Xiaoming and Wu, Enhua



Research question: How to train neural network models with privacy preservation via federated learning when data is heterogeneous (non-IID).
Motivation: The existing federated optimizer FedAvg suffers from client drift when data is heterogeneous, leading to unstable and slow convergence.
Method: Propose a novel aggregation approach, elastic aggregation, which interpolates client models adaptively according to parameter sensitivity, measured by computing how much the overall prediction function output changes when each parameter is changed.
Results: Empirical and analytical results show that elastic aggregation leads to efficient training in both convex and non-convex settings, is fully agnostic to client heterogeneity, is robust to large numbers of clients, partial participation, and imbalanced data, and works well with other federated optimizers, achieving significant improvements.

Federated learning enables the privacy-preserving training of neural network models using real-world data across distributed clients. FedAvg has become the preferred optimizer for federated learning because of its simplicity and effectiveness. FedAvg uses naive aggregation to update the server model, interpolating client models based on the number of instances used in their training. However, naive aggregation suffers from client-drift when the data is heterogeneous (non-IID), leading to unstable and slow convergence. In this work, we propose a novel aggregation approach, elastic aggregation, to overcome these issues. Elastic aggregation interpolates client models adaptively according to parameter sensitivity, which is measured by computing how much the overall prediction function output changes when each parameter is changed. This measurement is performed in an unsupervised and online manner. Elastic aggregation reduces the magnitudes of updates to the more sensitive parameters so as to prevent the server model from drifting to any one client distribution, and conversely boosts updates to the less sensitive parameters to better explore different client distributions. Empirical results on real and synthetic data as well as analytical results show that elastic aggregation leads to efficient training in both convex and non-convex settings, while being fully agnostic to client heterogeneity and robust to large numbers of clients, partial participation, and imbalanced data. Finally, elastic aggregation works well with other federated optimizers and achieves significant improvements across the board.
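The damp-sensitive/boost-insensitive rule can be sketched as a per-parameter rescaling of the averaged client update. The scaling formula below (1 + tau - s/max(s)) is an illustrative choice of ours, not the paper's exact interpolation rule, and sensitivities here are simply given rather than measured online.

```python
import numpy as np

# Sketch of sensitivity-weighted aggregation: highly sensitive parameters
# get damped updates, insensitive ones get boosted updates. The specific
# scale rule and the toy sensitivities are illustrative assumptions.

def elastic_aggregate(server_w, client_ws, sensitivity, tau=0.5):
    avg_update = np.mean([w - server_w for w in client_ws], axis=0)
    scale = 1.0 + tau - sensitivity / sensitivity.max()  # in [tau, 1 + tau]
    return server_w + scale * avg_update

server = np.zeros(4)
clients = [np.array([1.0, 1.0, 1.0, 1.0]), np.array([3.0, 1.0, 1.0, 1.0])]
sens = np.array([4.0, 2.0, 2.0, 1.0])   # e.g. mean |d f / d w| over data
new_w = elastic_aggregate(server, clients, sens)
```

The first parameter, where clients disagree most and sensitivity is highest, receives a damped update, while the least sensitive parameter is boosted.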

CABM: Content-Aware Bit Mapping for Single Image Super-Resolution Network With Large Input
Tian, Senmao and Lu, Ming and Liu, Jiaming and Guo, Yandong and Chen, Yurong and Zhang, Shunli



Research question: How to effectively reduce the computational and memory cost of super-resolving high-resolution images.
Motivation: Existing methods split the large input into local patches and merge the SR patches into the output, allocating a subnet for each patch, which suffers from overfitting and underfitting problems.
Method: Propose a novel method named Content-Aware Bit Mapping (CABM), which learns a bit selector for each layer during training, then analyzes the relation between the edge information of input patches and each layer's bit, and designs an Edge-to-Bit lookup table strategy; the bit configuration of the SR network is determined by the lookup tables of all layers.
Results: Experiments show that the method finds better bit configurations, resulting in more efficient mixed-precision networks.

With the development of high-definition display devices, the practical scenario of Super-Resolution (SR) usually needs to super-resolve large input like 2K to higher resolution (4K/8K). To reduce the computational and memory cost, current methods first split the large input into local patches and then merge the SR patches into the output. These methods adaptively allocate a subnet for each patch. Quantization is a very important technique for network acceleration and has been used to design the subnets. Current methods train an MLP bit selector to determine the proper bit for each layer. However, they uniformly sample subnets for training, making simple subnets overfitted and complicated subnets underfitted. Therefore, the trained bit selector fails to determine the optimal bit. Apart from this, the introduced bit selector brings additional cost to each layer of the SR network. In this paper, we propose a novel method named Content-Aware Bit Mapping (CABM), which can remove the bit selector without any performance loss. CABM also learns a bit selector for each layer during training. After training, we analyze the relation between the edge information of an input patch and the bit of each layer. We observe that the edge information can be an effective metric for the selected bit. Therefore, we design a strategy to build an Edge-to-Bit lookup table that maps the edge score of a patch to the bit of each layer during inference. The bit configuration of SR network can be determined by the lookup tables of all layers. Our strategy can find better bit configuration, resulting in more efficient mixed precision networks. We conduct detailed experiments to demonstrate the generalization ability of our method. The code will be released.
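The inference-time path can be sketched as: compute a patch's edge score, bin it, and read one bit-width per layer from a lookup table. The thresholds, table contents, and gradient-magnitude edge score below are made-up illustrations, not CABM's calibrated tables.

```python
import numpy as np

# Sketch of Edge-to-Bit lookup at inference: edge score -> bin -> one
# bit-width per layer. Thresholds and LUT entries are hypothetical.

def edge_score(patch):
    gy, gx = np.gradient(patch.astype(float))
    return float(np.mean(np.hypot(gx, gy)))

def bits_for_patch(patch, luts, thresholds):
    """Quantize the edge score into a bin, then read one bit per layer."""
    s = edge_score(patch)
    b = int(np.searchsorted(thresholds, s))   # bin index
    return [lut[b] for lut in luts]

thresholds = [0.1, 0.5, 1.0]                  # 4 edge-score bins
luts = [[4, 4, 6, 8], [2, 4, 4, 6]]           # one LUT per layer (hypothetical)
flat = np.zeros((8, 8))                       # flat patch -> low edge score
edgy = np.zeros((8, 8)); edgy[:, 4:] = 1.0    # step edge -> higher score
```

Replacing the per-layer MLP selector with table lookups removes all selector compute from the inference path.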

Generalizing Dataset Distillation via Deep Generative Prior
Cazenavette, George and Wang, Tongzhou and Torralba, Antonio and Efros, Alexei A. and Zhu, Jun-Yan



Research question: Existing dataset distillation methods fail to generalize to new architectures and scale to high-resolution datasets.
Motivation: Use the prior learned by pre-trained deep generative models to synthesize the distilled data.
Method: Propose a new optimization algorithm that distills a large number of images into a few intermediate feature vectors in the generative model's latent space.
Results: The method augments existing techniques and significantly improves cross-architecture generalization in all settings.

Dataset Distillation aims to distill an entire dataset's knowledge into a few synthetic images. The idea is to synthesize a small number of synthetic data points that, when given to a learning algorithm as training data, result in a model approximating one trained on the original data. Despite a recent upsurge of progress in the field, existing dataset distillation methods fail to generalize to new architectures and scale to high-resolution datasets. To overcome the above issues, we propose to use the learned prior from pre-trained deep generative models to synthesize the distilled data. To achieve this, we present a new optimization algorithm that distills a large number of images into a few intermediate feature vectors in the generative model's latent space. Our method augments existing techniques, significantly improving cross-architecture generalization in all settings.

Event-Based Shape From Polarization
Muglikar, Manasi and Bauersfeld, Leonard and Moeys, Diederik Paul and Scaramuzza, Davide



Research question: Existing Shape-from-Polarization (SfP) methods suffer from a speed-resolution tradeoff: they either sacrifice the number of polarization angles measured or require lengthy acquisition times due to framerate constraints, compromising accuracy or latency.
Motivation: This paper uses event cameras to resolve this tradeoff.
Method: Propose a setup consisting of a linear polarizer rotating at high speed in front of an event camera. The method uses the continuous event stream caused by the rotation to reconstruct relative intensities at multiple polarizer angles.
Results: Experiments show that the method outperforms physics-based baselines that use frames, reducing the mean absolute error by 25% on synthetic and real-world datasets. In the real world, challenging conditions (i.e., when few events are generated) harm the performance of physics-based solutions; to overcome this, a learning-based approach is proposed that estimates surface normals even at low event rates, improving on the physics-based approach by 52% on the real-world dataset. The proposed system achieves an acquisition speed equivalent to 50 fps (more than twice the framerate of the commercial polarization sensor) while retaining a 1 MP spatial resolution.

State-of-the-art solutions for Shape-from-Polarization (SfP) suffer from a speed-resolution tradeoff: they either sacrifice the number of polarization angles measured or necessitate lengthy acquisition times due to framerate constraints, thus compromising either accuracy or latency. We tackle this tradeoff using event cameras. Event cameras operate at microseconds resolution with negligible motion blur, and output a continuous stream of events that precisely measures how light changes over time asynchronously. We propose a setup that consists of a linear polarizer rotating at high speeds in front of an event camera. Our method uses the continuous event stream caused by the rotation to reconstruct relative intensities at multiple polarizer angles. Experiments demonstrate that our method outperforms physics-based baselines using frames, reducing the MAE by 25% in synthetic and real-world datasets. In the real world, we observe, however, that the challenging conditions (i.e., when few events are generated) harm the performance of physics-based solutions. To overcome this, we propose a learning-based approach that learns to estimate surface normals even at low event-rates, improving the physics-based approach by 52% on the real world dataset. The proposed system achieves an acquisition speed equivalent to 50 fps (>twice the framerate of the commercial polarization sensor) while retaining the spatial resolution of 1MP. Our evaluation is based on the first large-scale dataset for event-based SfP.

Making Vision Transformers Efficient From a Token Sparsification View
Chang, Shuning and Wang, Pichao and Lin, Ming and Wang, Fan and Zhang, David Junhao and Jin, Rong and Shou, Mike Zheng



Research question: This paper addresses the limitation that the quadratic computational complexity of Vision Transformers (ViTs) in the number of tokens restricts their practical use, as well as the efficiency, accuracy, and generality problems of existing methods.
Motivation: Existing methods that prune redundant tokens generally cause dramatic accuracy drops, are difficult to apply to local vision transformers, and cannot serve as backbones for downstream tasks.
Method: This paper proposes a novel Semantic Token ViT (STViT) for efficient global and local vision transformers, which can also be revised to serve as a backbone for downstream tasks. Semantic tokens represent cluster centers; they are initialized by pooling image tokens in space and recovered by attention, and can adaptively represent global or local semantic information. Due to their cluster properties, a few semantic tokens can attain the same effect as vast numbers of image tokens.
Results: Experiments show great success in image classification, and the method is extended to video recognition. In addition, an STViT-R (recovery) network is designed to restore detailed spatial information based on STViT, enabling downstream tasks, which previous token sparsification methods cannot do. Experiments demonstrate that the method achieves results competitive with the original networks in downstream tasks such as object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.

The quadratic computational complexity to the number of tokens limits the practical applications of Vision Transformers (ViTs). Several works propose to prune redundant tokens to achieve efficient ViTs. However, these methods generally suffer from (i) dramatic accuracy drops, (ii) application difficulty in the local vision transformer, and (iii) non-general-purpose networks for downstream tasks. In this work, we propose a novel Semantic Token ViT (STViT), for efficient global and local vision transformers, which can also be revised to serve as backbone for downstream tasks. The semantic tokens represent cluster centers, and they are initialized by pooling image tokens in space and recovered by attention, which can adaptively represent global or local semantic information. Due to the cluster properties, a few semantic tokens can attain the same effect as vast image tokens, for both global and local vision transformers. For instance, only 16 semantic tokens on DeiT-(Tiny,Small,Base) can achieve the same accuracy with more than 100% inference speed improvement and nearly 60% FLOPs reduction; on Swin-(Tiny,Small,Base), we can employ 16 semantic tokens in each window to further speed it up by around 20% with slight accuracy increase. Besides great success in image classification, we also extend our method to video recognition. In addition, we design a STViT-R(ecovery) network to restore the detailed spatial information based on the STViT, making it work for downstream tasks, which previous token sparsification methods cannot do. Experiments demonstrate that our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for backbone.
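The "initialized by pooling, recovered by attention" step can be sketched in numpy. Real STViT uses learned projections and multiple attention layers; the single unprojected attention step and the contiguous-group pooling below are simplifications of ours.

```python
import numpy as np

# Rough sketch of semantic tokens: init a few tokens by spatially pooling
# the image tokens (cluster-center init), then let them re-aggregate detail
# with one attention step over all image tokens. Shapes are illustrative.

def semantic_tokens(image_tokens, n_sem):
    n, d = image_tokens.shape
    # init: average-pool contiguous groups of image tokens
    groups = np.array_split(image_tokens, n_sem)
    sem = np.stack([g.mean(axis=0) for g in groups])
    # recovery: attention with semantic tokens as queries, image tokens as keys/values
    attn = sem @ image_tokens.T / np.sqrt(d)
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn = attn / attn.sum(axis=1, keepdims=True)
    return attn @ image_tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 8))   # e.g. a 14x14 grid of image tokens
sem = semantic_tokens(tokens, 16)    # 16 semantic tokens stand in for 196
```

Subsequent transformer blocks then attend over 16 tokens instead of 196, which is where the quadratic-cost saving comes from.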

Post-Processing Temporal Action Detection
Nag, Sauradip and Zhu, Xiatian and Song, Yi-Zhe and Xiang, Tao



Research question: Existing Temporal Action Detection (TAD) methods typically pre-process an input varying-length video into a fixed-length snippet representation sequence, which temporally downsamples the video, reducing the inference resolution and hampering detection performance at the original temporal resolution.
Motivation: The temporal quantization error introduced by this pre-processing step can severely impact TAD performance but is largely ignored by existing methods.
Method: This paper proposes a novel model-agnostic post-processing method requiring no model redesign or retraining. Specifically, the start and end points of action instances are modeled with Gaussian distributions to enable temporal boundary inference at the sub-snippet level, and an efficient Taylor-expansion-based approximation is introduced, dubbed Gaussian Approximated Post-processing (GAP).
Results: Extensive experiments demonstrate that GAP consistently improves a wide variety of pre-trained off-the-shelf TAD models on the challenging ActivityNet and THUMOS benchmarks (+0.2% to 0.7% and +0.2% to 0.5% in average mAP, respectively). Such gains are already significant and comparable to those achieved by novel model designs. GAP can also be integrated with model training for further gains. Importantly, GAP enables lower temporal resolutions for more efficient inference, benefiting low-resource applications. The code is available at https://github.com/sauradip/GAP.

Existing Temporal Action Detection (TAD) methods typically take a pre-processing step in converting an input varying-length video into a fixed-length snippet representation sequence, before temporal boundary estimation and action classification. This pre-processing step would temporally downsample the video, reducing the inference resolution and hampering the detection performance in the original temporal resolution. In essence, this is due to a temporal quantization error introduced during the resolution downsampling and recovery. This could negatively impact the TAD performance, but is largely ignored by existing methods. To address this problem, in this work we introduce a novel model-agnostic post-processing method without model redesign and retraining. Specifically, we model the start and end points of action instances with a Gaussian distribution for enabling temporal boundary inference at a sub-snippet level. We further introduce an efficient Taylor-expansion based approximation, dubbed as Gaussian Approximated Post-processing (GAP). Extensive experiments demonstrate that our GAP can consistently improve a wide variety of pre-trained off-the-shelf TAD models on the challenging ActivityNet (+0.2% to 0.7% in average mAP) and THUMOS (+0.2% to 0.5% in average mAP) benchmarks. Such performance gains are already significant and highly comparable to those achieved by novel model designs. Also, GAP can be integrated with model training for further performance gain. Importantly, GAP enables lower temporal resolutions for more efficient inference, facilitating low-resource applications. The code is available at https://github.com/sauradip/GAP.
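The Gaussian/Taylor-expansion boundary refinement can be sketched in 1-D: if the boundary score curve is locally Gaussian, its log is quadratic, and one Newton step on the log-scores around the discrete peak recovers the continuous (sub-snippet) peak. This is our reading of the mechanism, not the authors' code.

```python
import numpy as np

# Sketch of sub-snippet boundary refinement: second-order Taylor expansion of
# the log-scores around the discrete argmax, i.e. one Newton step toward the
# peak of the fitted Gaussian.

def refine_boundary(scores):
    """Return the score peak refined to a continuous (sub-snippet) position."""
    m = int(np.argmax(scores))
    if m == 0 or m == len(scores) - 1:
        return float(m)                       # no neighbours to expand around
    ln = np.log(np.maximum(scores, 1e-12))
    d1 = (ln[m + 1] - ln[m - 1]) / 2.0        # first derivative (central diff)
    d2 = ln[m + 1] - 2.0 * ln[m] + ln[m - 1]  # second derivative
    if d2 >= 0:
        return float(m)                       # not a proper peak
    return m - d1 / d2                        # Newton step: Gaussian mean

# a Gaussian boundary score sampled on a snippet grid, true peak at 3.3
grid = np.arange(7)
scores = np.exp(-0.5 * ((grid - 3.3) / 1.2) ** 2)
refined = refine_boundary(scores)
```

For an exactly Gaussian score curve the step is exact, recovering 3.3 from scores sampled only at integer snippet positions.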

MSINet: Twins Contrastive Search of Multi-Scale Interaction for Object ReID
Gu, Jianyang and Wang, Kai and Luo, Hao and Chen, Chen and Jiang, Wei and Fang, Yuqiang and Zhang, Shanghang and You, Yang and Zhao, Jian



Research question: This paper addresses how task-specific architectures can improve retrieval performance in object re-identification (ReID).
Motivation: Although existing methods improve optimizing targets and search spaces, they neglect the difference in training schemes between image classification and ReID.
Method: This paper proposes a novel Twins Contrastive Mechanism (TCM) to provide more appropriate supervision for ReID architecture search, designs a Multi-Scale Interaction (MSI) search space to search for rational interaction operations between multi-scale features, and introduces a Spatial Alignment Module (SAM) to enhance attention consistency when confronted with images from different sources.
Results: Under the proposed NAS scheme, a specific architecture named MSINet is automatically searched. Extensive experiments demonstrate that the method surpasses state-of-the-art ReID methods in both in-domain and cross-domain scenarios.

Neural Architecture Search (NAS) has been increasingly appealing to the society of object Re-Identification (ReID), for that task-specific architectures significantly improve the retrieval performance. Previous works explore new optimizing targets and search spaces for NAS ReID, yet they neglect the difference of training schemes between image classification and ReID. In this work, we propose a novel Twins Contrastive Mechanism (TCM) to provide more appropriate supervision for ReID architecture search. TCM reduces the category overlaps between the training and validation data, and assists NAS in simulating real-world ReID training schemes. We then design a Multi-Scale Interaction (MSI) search space to search for rational interaction operations between multi-scale features. In addition, we introduce a Spatial Alignment Module (SAM) to further enhance the attention consistency confronted with images from different sources. Under the proposed NAS scheme, a specific architecture is automatically searched, named as MSINet. Extensive experiments demonstrate that our method surpasses state-of-the-art ReID methods on both in-domain and cross-domain scenarios.

Memory-Friendly Scalable Super-Resolution via Rewinding Lottery Ticket Hypothesis
Lin, Jin and Luo, Xiaotong and Hong, Ming and Qu, Yanyun and Xie, Yuan and Wu, Zongze



Research question: This paper addresses the excessive memory use of existing dynamic scalable SR models and how to improve their performance by optimizing the model structure.
Motivation: Current dynamic scalable SR methods must save multi-scale models at fixed sizes and thus use too much memory. Inspired by the success of the Lottery Tickets Hypothesis (LTH) in image classification, we explore the existence of unstructured scalable SR deep models, i.e., finding gradually shrinking sub-networks of extreme sparsity, named winning tickets.
Method: This paper proposes a Memory-friendly Scalable SR framework (MSSR), in which a single scalable model covers multiple SR models of different sizes, without reloading SR models of different sizes. Concretely, MSSR consists of forward and backward stages, the former for model compression and the latter for model expansion. In the forward stage, LTH with rewinding weights progressively shrinks the SR model and forms nested sets of pruning masks; stochastic self-distillation (SSD) is further conducted to boost sub-network performance. In the backward stage, smaller SR models can be expanded by recovering and fine-tuning the pruned parameters according to the pruning masks obtained in the forward stage.
Results: Extensive experiments show the effectiveness of MSSR. The smallest sub-network can reach 94% sparsity and outperforms the compared lightweight SR methods.

Scalable deep Super-Resolution (SR) models are increasingly in demand, whose memory can be customized and tuned to the computational resources of the platform. The existing dynamic scalable SR methods are not memory-friendly enough because multi-scale models have to be saved with a fixed size for each model. Inspired by the success of Lottery Tickets Hypothesis (LTH) on image classification, we explore the existence of unstructured scalable SR deep models, that is, we find gradual shrinkage sub-networks of extreme sparsity named winning tickets. In this paper, we propose a Memory-friendly Scalable SR framework (MSSR). The advantage is that only a single scalable model covers multiple SR models with different sizes, instead of reloading SR models of different sizes. Concretely, MSSR consists of the forward and backward stages, the former for model compression and the latter for model expansion. In the forward stage, we take advantage of LTH with rewinding weights to progressively shrink the SR model and the pruning-out masks that form nested sets. Moreover, stochastic self-distillation (SSD) is conducted to boost the performance of sub-networks. By stochastically selecting multiple depths, the current model inputs the selected features into the corresponding parts in the larger model and improves the performance of the current model based on the feedback results of the larger model. In the backward stage, the smaller SR model could be expanded by recovering and fine-tuning the pruned parameters according to the pruning-out masks obtained in the forward stage. Extensive experiments show the effectiveness of MSSR. The smallest-scale sub-network could achieve the sparsity of 94% and outperforms the compared lightweight SR methods.
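The forward-stage mechanics (iterative magnitude pruning on rewound weights, producing nested masks) can be sketched on a toy weight vector. The pruning rate and round count are illustrative; the real framework prunes a trained SR network and fine-tunes between rounds.

```python
import numpy as np

# Sketch of iterative magnitude pruning with nested masks, the LTH procedure
# the forward stage builds on: each round prunes the smallest surviving
# weights of the rewound model, so every mask is contained in the previous one.

def iterative_prune(rewind_w, rounds=3, rate=0.5):
    """Prune `rate` of the surviving weights per round; return nested masks."""
    mask = np.ones_like(rewind_w, dtype=bool)
    masks = []
    for _ in range(rounds):
        alive = np.abs(rewind_w[mask])
        k = int(len(alive) * rate)
        if k:
            thresh = np.sort(alive)[k - 1]
            mask = mask & (np.abs(rewind_w) > thresh)
        masks.append(mask.copy())
    return masks

w = np.array([0.9, -0.1, 0.5, -0.7, 0.05, 0.3, -0.8, 0.2])  # rewound weights
masks = iterative_prune(w)
```

Because the masks are nested, the backward (expansion) stage can grow a small model back to any larger one simply by re-enabling the parameters its mask pruned out.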

YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors
Wang, Chien-Yao and Bochkovskiy, Alexey and Liao, Hong-Yuan Mark



Research question: Real-time object detection is one of the most important research topics in computer vision.
Motivation: As new approaches to architecture optimization and training optimization are continually developed, two research topics have emerged from dealing with these latest state-of-the-art methods.
Method: We propose a trainable bag-of-freebies oriented solution, combining flexible and efficient training tools with the proposed architecture and compound scaling method.
Results: YOLOv7 surpasses all known object detectors in the range from 5 FPS to 120 FPS and has the highest accuracy, 56.8% AP, among all known real-time object detectors with 30 FPS or higher on GPU V100. Source code is released at https://github.com/WongKinYiu/yolov7.

Real-time object detection is one of the most important research topics in computer vision. As new approaches regarding architecture optimization and training optimization are continually being developed, we have found two research topics that have spawned when dealing with these latest state-of-the-art methods. To address the topics, we propose a trainable bag-of-freebies oriented solution. We combine the flexible and efficient training tools with the proposed architecture and the compound scaling method. YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 120 FPS and has the highest accuracy 56.8% AP among all known real-time object detectors with 30 FPS or higher on GPU V100. Source code is released at https://github.com/WongKinYiu/yolov7.

InstantAvatar: Learning Avatars From Monocular Video in 60 Seconds
Jiang, Tianjian and Chen, Xu and Song, Jie and Hilliges, Otmar



Research question: How to rapidly reconstruct human avatars from monocular video.
Motivation: Current monocular neural avatar reconstruction methods are inefficient, requiring hours of training.
Method: Propose a carefully designed and engineered system that leverages emerging acceleration structures for neural fields together with an empty-space skipping strategy for dynamic scenes, enabling highly efficient avatar reconstruction.
Results: The new method converges 130x faster than existing methods, can be trained in minutes instead of hours, and achieves comparable or even better reconstruction quality and novel pose synthesis results. Given the same time budget, it significantly outperforms state-of-the-art methods.

In this paper, we take one step further towards real-world applicability of monocular neural avatar reconstruction by contributing InstantAvatar, a system that can reconstruct human avatars from a monocular video within seconds, and these avatars can be animated and rendered at an interactive rate. To achieve this efficiency we propose a carefully designed and engineered system, that leverages emerging acceleration structures for neural fields, in combination with an efficient empty-space skipping strategy for dynamic scenes. We also contribute an efficient implementation that we will make available for research purposes. Compared to existing methods, InstantAvatar converges 130x faster and can be trained in minutes instead of hours. It achieves comparable or even better reconstruction quality and novel pose synthesis results. When given the same time budget, our method significantly outperforms SoTA methods. InstantAvatar can yield acceptable visual quality in as little as 10 seconds training time. For code and more demo results, please refer to https://ait.ethz.ch/InstantAvatar.

Learned Two-Plane Perspective Prior Based Image Resampling for Efficient Object Detection
Ghosh, Anurag and Reddy, N. Dinesh and Mertz, Christoph and Narasimhan, Srinivasa G.



Research question: How to improve real-time efficient perception for autonomous navigation and city-scale sensing.
Motivation: To improve real-time detection performance, existing streaming perception approaches have exploited adaptive sampling techniques.
Method: This paper proposes a learnable geometry-guided prior that incorporates rough geometry of the 3D scene (a ground plane and a plane above) into image resampling for efficient object detection.
Results: For autonomous navigation, using the same detector and scale, the method improves small-object detection rate and real-time performance over the state of the art. For fixed traffic cameras, it detects small objects at image scales other methods cannot; at the same scale, it improves small-object detection by 195% over naive downsampling and 63% over the state of the art.

Real-time efficient perception is critical for autonomous navigation and city scale sensing. Orthogonal to architectural improvements, streaming perception approaches have exploited adaptive sampling, improving real-time detection performance. In this work, we propose a learnable geometry-guided prior that incorporates rough geometry of the 3D scene (a ground plane and a plane above) to resample images for efficient object detection. This significantly improves small and far-away object detection performance while also being more efficient both in terms of latency and memory. For autonomous navigation, using the same detector and scale, our approach improves detection rate by +4.1 AP_S or +39% and in real-time performance by +5.3 sAP_S or +63% for small objects over state-of-the-art (SOTA). For fixed traffic cameras, our approach detects small objects at image scales other methods cannot. At the same scale, our approach improves detection of small objects by 195% (+12.5 AP_S) over naive-downsampling and 63% (+4.2 AP_S) over SOTA.

Train-Once-for-All Personalization
Chen, Hong-You and Li, Yandong and Cui, Yin and Zhang, Mingda and Chao, Wei-Lun and Zhang, Li



Research question: How to train a "personalization-friendly" model that can be adapted to different users' needs given only task descriptions.
Motivation: Existing approaches first train a generic model and then perform class selection, which may be suboptimal because the model's weights are kept frozen without personalization.
Method: Propose a framework named TAPER that is trained just once and can later customize models for different users given their task descriptions. TAPER learns a set of "basis" models and a mixer predictor such that, given a task description, the weights (not the predictions!) of the basis models can be combined on the fly into a single "personalized" model.
Results: Through extensive experiments on multiple recognition tasks, TAPER consistently outperforms baseline methods in personalized accuracy. Moreover, TAPER can synthesize a much smaller model with performance comparable to a large generic model, making it more "deployment-friendly" for resource-limited end devices. Interestingly, even without a user's task description, TAPER can still specialize to the deployed context based on its past predictions, making it even more "personalization-friendly".

We study the problem of how to train a "personalization-friendly" model such that given only the task descriptions, the model can be adapted to different end-users' needs, e.g., for accurately classifying different subsets of objects. One baseline approach is to train a "generic" model for classifying a wide range of objects, followed by class selection. In our experiments, we however found it suboptimal, perhaps because the model's weights are kept frozen without being personalized. To address this drawback, we propose Train-once-for-All PERsonalization (TAPER), a framework that is trained just once and can later customize a model for different end-users given their task descriptions. TAPER learns a set of "basis" models and a mixer predictor, such that given the task description, the weights (not the predictions!) of the basis models can be on the fly combined into a single "personalized" model. Via extensive experiments on multiple recognition tasks, we show that TAPER consistently outperforms the baseline methods in achieving a higher personalized accuracy. Moreover, we show that TAPER can synthesize a much smaller model to achieve comparable performance to a huge generic model, making it "deployment-friendly" to resource-limited end devices. Interestingly, even without end-users' task descriptions, TAPER can still be specialized to the deployed context based on its past predictions, making it even more "personalization-friendly".
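The "combine weights, not predictions" step can be sketched as follows. The dot-product-softmax mixer is a stand-in of ours for the learned mixer predictor; models as dicts of arrays and all names are illustrative.

```python
import numpy as np

# Sketch of TAPER's core mechanism: a mixer maps a task description to
# coefficients, and the *weights* of the basis models are combined into one
# personalized model, so only a single model runs at inference time.

def mixer(task_embedding, keys):
    """Stand-in mixer: softmax over key/task-embedding similarities."""
    logits = keys @ task_embedding
    e = np.exp(logits - logits.max())
    return e / e.sum()

def mix_weights(basis_models, coeffs):
    """Combine basis weight dicts into one personalized weight dict."""
    return {name: sum(c * m[name] for c, m in zip(coeffs, basis_models))
            for name in basis_models[0]}

basis = [{"w": np.array([1.0, 0.0])}, {"w": np.array([0.0, 1.0])}]
keys = np.array([[1.0, 0.0], [0.0, 1.0]])       # one key per basis model
coeffs = mixer(np.array([10.0, 0.0]), keys)     # task strongly matches basis 0
model = mix_weights(basis, coeffs)
```

Since mixing happens in weight space, the deployed artifact is a single model of basis size, regardless of how many basis models were trained.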

DepGraph: Towards Any Structural Pruning
Fang, Gongfan and Ma, Xinyin and Song, Mingli and Mi, Michael Bi and Wang, Xinchao



Research question: How to achieve structural pruning of neural networks, removing structurally coupled parameters to accelerate models.
Motivation: Existing structural pruning methods rely on manually designed parameter-grouping schemes and therefore do not generalize to new network architectures.
Method: Propose a general, fully automatic method, Dependency Graph (DepGraph), which explicitly models dependencies between layers and comprehensively groups coupled parameters for pruning.
Results: Extensive evaluation on a variety of architectures and tasks, including ResNe(X)t, DenseNet, MobileNet, and Vision Transformers for images, GAT for graphs, DGCNN for 3D point clouds, and LSTM for language, shows that even with a simple norm-based criterion the method consistently yields satisfying performance.

Structural pruning enables model acceleration by removing structurally-grouped parameters from neural networks. However, the parameter-grouping patterns vary widely across different models, making architecture-specific pruners, which rely on manually-designed grouping schemes, non-generalizable to new architectures. In this work, we study a highly-challenging yet barely-explored task, any structural pruning, to tackle general structural pruning of arbitrary architecture like CNNs, RNNs, GNNs and Transformers. The most prominent obstacle towards this goal lies in the structural coupling, which not only forces different layers to be pruned simultaneously, but also expects all removed parameters to be consistently unimportant, thereby avoiding structural issues and significant performance degradation after pruning. To address this problem, we propose a general and fully automatic method, Dependency Graph (DepGraph), to explicitly model the dependency between layers and comprehensively group coupled parameters for pruning. In this work, we extensively evaluate our method on several architectures and tasks, including ResNe(X)t, DenseNet, MobileNet and Vision transformer for images, GAT for graph, DGCNN for 3D point cloud, alongside LSTM for language, and demonstrate that, even with a simple norm-based criterion, the proposed method consistently yields gratifying performances.
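The grouping idea can be sketched as a graph problem: model each channel-coupling (e.g. a conv's output channels feeding a BN, or two tensors summed by a residual connection) as an edge, and recover the sets of parameters that must be pruned together as connected components. The toy "residual block" graph below is made up, not a traced model.

```python
from collections import defaultdict

# Toy sketch of the dependency idea: coupled layers form edges, and pruning
# groups are the connected components of the resulting graph.

def pruning_groups(couplings):
    """Connected components over coupling edges = groups pruned together."""
    adj = defaultdict(set)
    for a, b in couplings:
        adj[a].add(b)
        adj[b].add(a)
    seen, groups = set(), []
    for node in sorted(adj):
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        groups.append(sorted(comp))
    return groups

# made-up residual block: conv1 -> bn1 -> conv2 -> bn2, plus a skip that
# adds the block input to bn2's output (coupling their channel counts)
edges = [("conv1.out", "bn1"), ("bn1", "conv2.in"),
         ("conv2.out", "bn2"), ("bn2", "skip_add"), ("input", "skip_add")]
groups = pruning_groups(edges)
```

Pruning a channel then means removing that channel's slice from every member of its group at once, which is exactly what hand-written, architecture-specific pruners encode manually.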

Network Expansion for Practical Training Acceleration
Ding, Ning and Tang, Yehui and Han, Kai and Xu, Chao and Wang, Yunhe



Research question: How to accelerate the training process of deep neural networks.
Motivation: With the rapid growth of deep networks and training datasets, and the prevalence of transformer-based models in vision tasks, GPU platforms are under increasing pressure; these heavy models consume large amounts of time and compute.
Method: Propose a general network expansion method to reduce the practical time cost of model training. Specifically, both width- and depth-level sparsity of dense models are exploited to accelerate training: a sparse sub-network is first selected from the original dense model as the starting point, and this sparse architecture gradually expands during training until it grows into a dense model.
Results: Extensive experiments show that the acceleration method significantly speeds up the training of modern vision models on ordinary GPU devices with negligible performance drop (e.g., 1.42x faster for ResNet-101 and 1.34x faster for DeiT-base on ImageNet-1k). Code is available from Huawei's repository and Gitee.

Recently, the sizes of deep neural networks and training datasets both increase drastically to pursue better performance in a practical sense. With the prevalence of transformer-based models in vision tasks, even more pressure is laid on the GPU platforms to train these heavy models, which consumes a large amount of time and computing resources as well. Therefore, it's crucial to accelerate the training process of deep neural networks. In this paper, we propose a general network expansion method to reduce the practical time cost of the model training process. Specifically, we utilize both width- and depth-level sparsity of dense models to accelerate the training of deep neural networks. Firstly, we pick a sparse sub-network from the original dense model by reducing the number of parameters as the starting point of training. Then the sparse architecture will gradually expand during the training procedure and finally grow into a dense one. We design different expanding strategies to grow CNNs and ViTs respectively, due to the great heterogeneity between the two architectures. Our method can be easily integrated into popular deep learning frameworks, which saves considerable training time and hardware resources. Extensive experiments show that our acceleration method can significantly speed up the training process of modern vision models on general GPU devices with negligible performance drop (e.g. 1.42x faster for ResNet-101 and 1.34x faster for DeiT-base on ImageNet-1k). The code is available at https://github.com/huawei-noah/Efficient-Computing/tree/master/TrainingAcceleration/NetworkExpansion and https://gitee.com/mindspore/hub/blob/master/mshub_res/assets/noah-cvlab/gpu/1.8/networkexpansion_v1.0_imagenet2012.md.
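The sparse-to-dense growth described above can be sketched as a width schedule: training starts with a fraction of channels active and the fraction grows to 1.0 by the final epoch. The linear schedule, the 0.5 starting fraction, and the `active_width` name are assumptions for illustration; the paper designs separate expansion strategies for CNNs and ViTs.

```python
def active_width(epoch, total_epochs, start_fraction=0.5):
    """Fraction of channels active at a given epoch, growing linearly to 1.0."""
    if total_epochs <= 1:
        return 1.0
    t = min(epoch / (total_epochs - 1), 1.0)
    return start_fraction + (1.0 - start_fraction) * t


# Over a 5-epoch toy run, the model grows from a half-width sub-network
# at epoch 0 into the full dense model at the final epoch.
widths = [active_width(e, 5) for e in range(5)]
```

Early epochs thus run on a cheaper sub-network, which is where the practical wall-clock savings come from.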

Boosting Accuracy and Robustness of Student Models via Adaptive Adversarial Distillation
Huang, Bo and Chen, Mingyang and Wang, Yi and Lu, Junda and Cheng, Minhao and Wang, Wei



Research question: How to improve both the prediction accuracy and adversarial robustness of student models on edge devices.
Motivation: Existing enhancement schemes such as adversarial training have limited performance on compressed networks, and student models in teacher-student architectures are more vulnerable to adversarial attacks at the edge.
Method: Propose an adaptive adversarial distillation (AdaAD) method in which the teacher model participates in the knowledge optimization process, interacting with the student model to adaptively search for the inner results.
Results: Compared with existing methods, AdaAD significantly boosts both the prediction accuracy and adversarial robustness of student models in most scenarios; in particular, a ResNet-18 model trained with AdaAD achieves top-rank robust accuracy (54.23%) on RobustBench.

Distilled student models in teacher-student architectures are widely considered for computational-effective deployment in real-time applications and edge devices. However, there is a higher risk of student models to encounter adversarial attacks at the edge. Popular enhancing schemes such as adversarial training have limited performance on compressed networks. Thus, recent studies concern about adversarial distillation (AD) that aims to inherit not only prediction accuracy but also adversarial robustness of a robust teacher model under the paradigm of robust optimization. In the min-max framework of AD, existing AD methods generally use fixed supervision information from the teacher model to guide the inner optimization for knowledge distillation which often leads to an overcorrection towards model smoothness. In this paper, we propose an adaptive adversarial distillation (AdaAD) that involves the teacher model in the knowledge optimization process in a way interacting with the student model to adaptively search for the inner results. Compared with state-of-the-art methods, the proposed AdaAD can significantly boost both the prediction accuracy and adversarial robustness of student models in most scenarios. In particular, the ResNet-18 model trained by AdaAD achieves top-rank performance (54.23% robust accuracy) on RobustBench under AutoAttack.
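The adaptive inner step described above can be caricatured as follows: rather than distilling against fixed teacher supervision, the inner optimization looks for the input where student and teacher predictions diverge most, measured by KL divergence. Searching over a discrete candidate set and the toy two-class models are hypothetical simplifications of the paper's gradient-based min-max procedure.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def adaptive_inner_search(candidates, student, teacher):
    """Pick the candidate perturbation maximizing KL(student || teacher)."""
    return max(candidates, key=lambda x: kl_divergence(student(x), teacher(x)))


# Toy 2-class models whose predictions depend on a scalar "perturbation" x.
student = lambda x: [0.5 + 0.4 * x, 0.5 - 0.4 * x]
teacher = lambda x: [0.5, 0.5]
worst = adaptive_inner_search([0.0, 0.5, 1.0], student, teacher)
```

Because the teacher is queried at the searched point rather than supplying a fixed target, the supervision adapts to where the student currently disagrees with it.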

RGB No More: Minimally-Decoded JPEG Vision Transformers
Park, Jeongsoo and Johnson, Justin



Research question: How to train Vision Transformers (ViT) directly from JPEG-encoded features, reducing decoding overhead and accelerating data loading.
Motivation: Existing computer vision networks typically infer from RGB images, but these images are usually encoded as JPEG before being saved to disk, imposing an unavoidable decoding overhead on RGB networks.
Method: Train ViTs directly from the encoded features of JPEG, avoiding most of the decoding overhead and speeding up data loading; in addition, apply data augmentation directly to these encoded features.
Results: With these two improvements, ViT and data augmentation, the ViT-Ti model trains up to 39.2% faster and infers 17.9% faster with no loss of accuracy.

Most neural networks for computer vision are designed to infer using RGB images. However, these RGB images are commonly encoded in JPEG before saving to disk; decoding them imposes an unavoidable overhead for RGB networks. Instead, our work focuses on training Vision Transformers (ViT) directly from the encoded features of JPEG. This way, we can avoid most of the decoding overhead, accelerating data load. Existing works have studied this aspect but they focus on CNNs. Due to how these encoded features are structured, CNNs require heavy modification to their architecture to accept such data. Here, we show that this is not the case for ViTs. In addition, we tackle data augmentation directly on these encoded features, which to our knowledge, has not been explored in-depth for training in this setting. With these two improvements -- ViT and data augmentation -- we show that our ViT-Ti model achieves up to 39.2% faster training and 17.9% faster inference with no accuracy loss compared to the RGB counterpart.

CaPriDe Learning: Confidential and Private Decentralized Learning Based on Encryption-Friendly Distillation Loss
Tastan, Nurbek and Nandakumar, Karthik



Research question: How to train accurate deep neural networks while preserving data privacy and confidentiality.
Motivation: Due to privacy concerns and stringent data regulations, entities are often unable to share large volumes of data for learning.
Method: Propose a framework called Confidential and Private Decentralized (CaPriDe) learning, which leverages fully homomorphic encryption to enable collaborative learning without exposing data.
Results: Experiments show that the method improves the accuracy of local models without any central coordination while guaranteeing data confidentiality and privacy; its main limitations are constraints on the model architecture, limited scalability, and the computational complexity of encrypted-domain inference.

Large volumes of data required to train accurate deep neural networks (DNNs) are seldom available with any single entity. Often, privacy concerns and stringent data regulations prevent entities from sharing data with each other or with a third-party learning service provider. While cross-silo federated learning (FL) allows collaborative learning of large DNNs without sharing the data itself, most existing cross-silo FL algorithms have an unacceptable utility-privacy trade-off. In this work, we propose a framework called Confidential and Private Decentralized (CaPriDe) learning, which optimally leverages the power of fully homomorphic encryption (FHE) to enable collaborative learning without compromising on the confidentiality and privacy of data. In CaPriDe learning, participating entities release their private data in an encrypted form allowing other participants to perform inference in the encrypted domain. The crux of CaPriDe learning is mutual knowledge distillation between multiple local models through a novel distillation loss, which is an approximation of the Kullback-Leibler (KL) divergence between the local predictions and encrypted inferences of other participants on the same data that can be computed in the encrypted domain. Extensive experiments on three datasets show that CaPriDe learning can improve the accuracy of local models without any central coordination, provide strong guarantees of data confidentiality and privacy, and has the ability to handle statistical heterogeneity. Constraints on the model architecture (arising from the need to be FHE-friendly), limited scalability, and computational complexity of encrypted domain inference are the main limitations of the proposed approach. The code can be found at https://github.com/tnurbek/capride-learning.

Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers
Wei, Cong and Duke, Brendan and Jiang, Ruowei and Aarabi, Parham and Taylor, Graham W. and Shkurti, Florian



Research question: How to reduce the computational cost of Vision Transformers (ViT) while preserving their performance.
Motivation: ViTs outperform convolutional neural networks but incur high computational cost. Existing methods accelerate ViT multi-head self-attention by restricting each token to a fixed number of spatially nearby tokens, but such structured attention patterns disregard the semantic connections learned by a full attention mask.
Method: Propose learning instance-dependent attention patterns via a lightweight connectivity predictor module that estimates a connectivity score for each pair of tokens. Two tokens receive a high connectivity score if their features are considered relevant either spatially or semantically. Since each token attends to only a few other tokens, the binarized connectivity mask is typically very sparse, offering an opportunity to reduce network FLOPs via sparse computation.
Results: Equipped with the learned unstructured attention pattern, the sparse-attention ViT (Sparsifiner) produces a superior Pareto frontier between FLOPs and top-1 accuracy on ImageNet compared with token sparsity, reducing MHSA FLOPs by 48%-69% with an accuracy drop within 0.4%. Combining attention and token sparsity further reduces ViT FLOPs by over 60%.

Vision Transformers (ViT) have shown competitive advantages in terms of performance compared to convolutional neural networks (CNNs), though they often come with high computational costs. To this end, previous methods explore different attention patterns by limiting a fixed number of spatially nearby tokens to accelerate the ViT's multi-head self-attention (MHSA) operations. However, such structured attention patterns limit the token-to-token connections to their spatial relevance, which disregards learned semantic connections from a full attention mask. In this work, we propose an approach to learn instance-dependent attention patterns, by devising a lightweight connectivity predictor module that estimates the connectivity score of each pair of tokens. Intuitively, two tokens have high connectivity scores if the features are considered relevant either spatially or semantically. As each token only attends to a small number of other tokens, the binarized connectivity masks are often very sparse by nature and therefore provide the opportunity to reduce network FLOPs via sparse computations. Equipped with the learned unstructured attention pattern, sparse attention ViT (Sparsifiner) produces a superior Pareto frontier between FLOPs and top-1 accuracy on ImageNet compared to token sparsity. Our method reduces 48%-69% of MHSA FLOPs while the accuracy drop is within 0.4%. We also show that combining attention and token sparsity reduces ViT FLOPs by over 60%.
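The masking step above can be sketched directly: given pairwise connectivity scores (produced, in the paper, by the learned lightweight predictor; here just a made-up matrix), each token keeps only its top-k connections, yielding a sparse binary attention mask. The `topk_connectivity_mask` helper is an illustrative assumption, not the authors' code.

```python
def topk_connectivity_mask(scores, k):
    """Binarize a square connectivity-score matrix, keeping the k
    highest-scoring connections per row (per query token)."""
    mask = []
    for row in scores:
        keep = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        mask.append([1 if j in keep else 0 for j in range(len(row))])
    return mask


# Hypothetical connectivity scores for 3 tokens; high values may reflect
# spatial or semantic relevance.
scores = [
    [9.0, 0.1, 3.0],
    [0.2, 8.0, 7.0],
    [5.0, 0.3, 6.0],
]
mask = topk_connectivity_mask(scores, 2)
```

Since each row of the resulting mask has exactly k ones, the attention matrix can be computed sparsely, which is where the FLOP reduction comes from.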

Structured Sparsity Learning for Efficient Video Super-Resolution
Xia, Bin and He, Jingwen and Zhang, Yulun and Wang, Yitong and Tian, Yapeng and Yang, Wenming and Van Gool, Luc



Research question: The high computational cost of video super-resolution (VSR) models hinders their deployment on resource-limited devices.
Motivation: Existing VSR models contain many redundant filters that drag down inference efficiency.
Method: Develop a structured pruning scheme called Structured Sparsity Learning (SSL), with pruning schemes designed for several key components of VSR models, including residual blocks, recurrent networks, and upsampling networks.
Results: Experiments show that SSL significantly outperforms recent methods both quantitatively and qualitatively.

The high computational costs of video super-resolution (VSR) models hinder their deployment on resource-limited devices, e.g., smartphones and drones. Existing VSR models contain considerable redundant filters, which drag down the inference efficiency. To prune these unimportant filters, we develop a structured pruning scheme called Structured Sparsity Learning (SSL) according to the properties of VSR. In SSL, we design pruning schemes for several key components in VSR models, including residual blocks, recurrent networks, and upsampling networks. Specifically, we develop a Residual Sparsity Connection (RSC) scheme for residual blocks of recurrent networks to liberate pruning restrictions and preserve the restoration information. For upsampling networks, we design a pixel-shuffle pruning scheme to guarantee the accuracy of feature channel-space conversion. In addition, we observe that pruning error would be amplified as the hidden states propagate along with recurrent networks. To alleviate the issue, we design Temporal Finetuning (TF). Extensive experiments show that SSL can significantly outperform recent methods quantitatively and qualitatively. The code is available at https://github.com/Zj-BinXia/SSL.

MMVC: Learned Multi-Mode Video Compression With Block-Based Prediction Mode Selection and Density-Adaptive Entropy Coding
Liu, Bowen and Chen, Yu and Machineni, Rakesh Chowdary and Liu, Shiyu and Kim, Hun-Seok



Research question: Existing learning-based video compression methods are limited in adapting to diverse motion patterns and entropy models.
Motivation: Propose a multi-mode video compression (MMVC) method that selects the optimal mode for feature-domain prediction to adapt to different motion patterns.
Method: The modes include ConvLSTM-based feature-domain prediction, optical-flow-conditioned feature-domain prediction, and feature propagation, with the feature space partitioned into blocks for spatio-temporal prediction. For entropy coding, both dense and sparse post-quantization residual blocks are considered, and optional run-length coding is applied to sparse residuals to improve the compression rate.
Results: Validated on popular benchmark datasets, the method yields better or comparable results to state-of-the-art video compression schemes and standard codecs, measured in PSNR and MS-SSIM.

Learning-based video compression has been extensively studied over the past years, but it still has limitations in adapting to various motion patterns and entropy models. In this paper, we propose multi-mode video compression (MMVC), a block wise mode ensemble deep video compression framework that selects the optimal mode for feature domain prediction adapting to different motion patterns. Proposed multi-modes include ConvLSTM-based feature domain prediction, optical flow conditioned feature domain prediction, and feature propagation to address a wide range of cases from static scenes without apparent motions to dynamic scenes with a moving camera. We partition the feature space into blocks for temporal prediction in spatial block-based representations. For entropy coding, we consider both dense and sparse post-quantization residual blocks, and apply optional run-length coding to sparse residuals to improve the compression rate. In this sense, our method uses a dual-mode entropy coding scheme guided by a binary density map, which offers significant rate reduction surpassing the extra cost of transmitting the binary selection map. We validate our scheme with some of the most popular benchmarking datasets. Compared with state-of-the-art video compression schemes and standard codecs, our method yields better or competitive results measured with PSNR and MS-SSIM.
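The optional run-length coding applied to sparse residuals above is a classic scheme and can be sketched in a few lines: long zero runs in post-quantization residual blocks collapse into (value, run) pairs, and a binary density map would decide per block whether the sparse path is worthwhile. The encoding format below is illustrative, not MMVC's actual bitstream.

```python
def run_length_encode(values):
    """Encode a sequence as a list of (value, run_length) pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded


def run_length_decode(encoded):
    """Invert run_length_encode."""
    return [v for v, n in encoded for _ in range(n)]


# A sparse post-quantization residual block: mostly zeros.
sparse_residual = [0, 0, 0, 0, 3, 0, 0, 0, 0, 0, -1, 0]
code = run_length_encode(sparse_residual)
```

Here 12 residual symbols become 5 pairs; on a dense block the same scheme could expand the data, which is why the method keeps a dual-mode entropy coder guided by the density map.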

DeepMAD: Mathematical Architecture Design for Deep Convolutional Neural Network
Shen, Xuan and Wang, Yaohua and Lin, Ming and Huang, Yilun and Tang, Hao and Sun, Xiuyu and Wang, Yanzhi



Research question: How to design high-performance CNN models that rival ViT models across various vision tasks.
Motivation: Although ViT models have made remarkable progress on vision tasks, designing equally high-performance CNN models remains challenging and requires deep knowledge of network design.
Method: Propose a new framework, Mathematical Architecture Design for Deep CNN (DeepMAD), which models a CNN as an information processing system whose expressiveness and effectiveness can be analytically formulated from its structural parameters. A constrained mathematical programming (MP) problem is then posed to optimize these parameters; it can be solved easily by off-the-shelf MP solvers on a CPU with a small memory footprint.
Results: DeepMAD is validated on multiple large-scale computer vision benchmarks, demonstrating its superiority. Notably on ImageNet-1k, using only conventional convolutional layers, DeepMAD achieves 0.7% and 1.5% higher top-1 accuracy than ConvNeXt and Swin at the Tiny level, and 0.8% and 0.9% higher at the Small level.

The rapid advances in Vision Transformer (ViT) refresh the state-of-the-art performances in various vision tasks, overshadowing the conventional CNN-based models. This ignites a few recent striking-back research in the CNN world showing that pure CNN models can achieve as good performance as ViT models when carefully tuned. While encouraging, designing such high-performance CNN models is challenging, requiring non-trivial prior knowledge of network design. To this end, a novel framework termed Mathematical Architecture Design for Deep CNN (DeepMAD) is proposed to design high-performance CNN models in a principled way. In DeepMAD, a CNN network is modeled as an information processing system whose expressiveness and effectiveness can be analytically formulated by their structural parameters. Then a constrained mathematical programming (MP) problem is proposed to optimize these structural parameters. The MP problem can be easily solved by off-the-shelf MP solvers on CPUs with a small memory footprint. In addition, DeepMAD is a pure mathematical framework: no GPU or training data is required during network design. The superiority of DeepMAD is validated on multiple large-scale computer vision benchmark datasets. Notably on ImageNet-1k, only using conventional convolutional layers, DeepMAD achieves 0.7% and 1.5% higher top-1 accuracy than ConvNeXt and Swin on Tiny level, and 0.8% and 0.9% higher on Small level.

Batch Model Consolidation: A Multi-Task Model Consolidation Framework
Fostiropoulos, Iordanis and Zhu, Jiaye and Itti, Laurent



Research question: How to let a model in continual learning handle a stream of tasks without significant performance degradation on previously learned tasks.
Motivation: Existing continual learning methods fail on long task sequences spanning diverse domains and difficulties, and many are hard to apply in practice due to high memory cost, long training time, or tight coupling to a single device.
Method: Propose Batch Model Consolidation (BMC), which trains multiple expert models in parallel to support more realistic continual learning. Each expert maintains weight similarity to a base model through a stability loss and builds a buffer from a fraction of the task's data. During the consolidation phase, the knowledge of a "batch" of expert models is merged using a batched consolidation loss over memory data that aggregates all buffers.
Results: Evaluated on the standardized benchmarks Split-CIFAR-100 and Tiny-ImageNet and on a Stream dataset of 71 image classification tasks, the method outperforms the next-best continual learning approach by 70% and is the only one that maintains performance at the end of 71 tasks.

In Continual Learning (CL), a model is required to learn a stream of tasks sequentially without significant performance degradation on previously learned tasks. Current approaches fail for a long sequence of tasks from diverse domains and difficulties. Many of the existing CL approaches are difficult to apply in practice due to excessive memory cost or training time, or are tightly coupled to a single device. With the intuition derived from the widely applied mini-batch training, we propose Batch Model Consolidation (BMC) to support more realistic CL under conditions where multiple agents are exposed to a range of tasks. During a regularization phase, BMC trains multiple expert models in parallel on a set of disjoint tasks. Each expert maintains weight similarity to a base model through a stability loss, and constructs a buffer from a fraction of the task's data. During the consolidation phase, we combine the learned knowledge on 'batches' of expert models using a batched consolidation loss over memory data that aggregates all buffers. We thoroughly evaluate each component of our method in an ablation study and demonstrate the effectiveness on standardized benchmark datasets Split-CIFAR-100, Tiny-ImageNet, and the Stream dataset composed of 71 image classification tasks from diverse domains and difficulties. Our method outperforms the next best CL approach by 70% and is the only approach that can maintain performance at the end of 71 tasks.
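The stability loss described above keeps each expert from drifting too far from the shared base model while it trains on its own task. A squared-L2 penalty is the simplest instantiation; the actual form used in the paper may differ, and the names below are illustrative.

```python
def stability_loss(expert_weights, base_weights, strength=1.0):
    """Penalty on an expert's drift from the base model: scaled squared
    L2 distance between the two (flattened) weight vectors."""
    return strength * sum(
        (e - b) ** 2 for e, b in zip(expert_weights, base_weights)
    )


# Toy flattened weight vectors for the base model and one expert.
base = [1.0, 2.0, 3.0]
expert = [1.0, 2.5, 2.0]
loss = stability_loss(expert, base, strength=0.5)
```

Because every expert is anchored to the same base, their "batch" of updates remains consolidatable into a single model afterwards.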

FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning
Xiong, Yuanhao and Wang, Ruochen and Cheng, Minhao and Yu, Felix and Hsieh, Cho-Jui



Research question: How to achieve collaborative training under privacy and communication constraints.
Motivation: Existing iterative model-averaging federated learning algorithms require many communication rounds to obtain a well-performing model, because data partitioning among different clients is extremely unbalanced and non-i.i.d.
Method: Propose FedDM, which builds synthetic datasets on each client to locally match the loss landscape of the original data, constructing the global training objective from multiple local surrogate functions so that the server gains a more comprehensive view of the loss landscape.
Results: Extensive experiments on three image classification datasets show that FedDM outperforms other federated learning methods in both efficiency and model performance. Moreover, FedDM can be adapted to preserve differential privacy with the Gaussian mechanism and trains a better model under the same privacy budget.

Federated learning (FL) has recently attracted increasing attention from academia and industry, with the ultimate goal of achieving collaborative training under privacy and communication constraints. Existing iterative model averaging based FL algorithms require a large number of communication rounds to obtain a well-performed model due to extremely unbalanced and non-i.i.d data partitioning among different clients. Thus, we propose FedDM to build the global training objective from multiple local surrogate functions, which enables the server to gain a more global view of the loss landscape. In detail, we construct synthetic sets of data on each client to locally match the loss landscape from original data through distribution matching. FedDM reduces communication rounds and improves model quality by transmitting more informative and smaller synthesized data compared with unwieldy model weights. We conduct extensive experiments on three image classification datasets, and results show that our method can outperform other FL counterparts in terms of efficiency and model performance. Moreover, we demonstrate that FedDM can be adapted to preserve differential privacy with Gaussian mechanism and train a better model under the same privacy budget.
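The distribution-matching idea above can be caricatured with the simplest possible statistic: each client tunes a small synthetic set so that its statistics match those of the real local data, then ships the synthetic set instead of model weights. Matching only per-dimension means, as below, is a deliberate oversimplification of the paper's loss-landscape matching; all names are hypothetical.

```python
def mean_vector(samples):
    """Per-dimension mean of a list of equal-length feature vectors."""
    dim = len(samples[0])
    return [sum(s[i] for s in samples) / len(samples) for i in range(dim)]


def matching_loss(real, synthetic):
    """Squared distance between the per-dimension means of the two sets;
    a synthetic set minimizing this mimics the real data's statistics."""
    mr, ms = mean_vector(real), mean_vector(synthetic)
    return sum((a - b) ** 2 for a, b in zip(mr, ms))


real = [[1.0, 2.0], [3.0, 4.0]]   # a client's local data
good_synth = [[2.0, 3.0]]         # one point matching the real mean exactly
bad_synth = [[0.0, 0.0]]
```

A single well-placed synthetic point can already stand in for the real set under this statistic, which is why transmitting small synthesized sets can cut communication versus sending full model weights.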

Bit-Shrinking: Limiting Instantaneous Sharpness for Improving Post-Training Quantization
Lin, Chen and Peng, Bo and Li, Zheyang and Tan, Wenming and Ren, Ye and Xiao, Jun and Pu, Shiliang



Research question: How to effectively compress model size and computational cost while maintaining good performance?
Motivation: Current quantization methods easily fall into poor local minima under low-bit quantization, degrading performance.
Method: Analysis of the loss surfaces of networks quantized at different bit-widths shows that the rugged surface is caused by excessive quantization noise. A Bit-shrinking strategy is therefore proposed to smooth the loss surface by keeping the sharpness term small and stable while optimizing the quantized network.
Results: Experiments show strong results on classification and detection tasks, achieving state-of-the-art performance on both Vision Transformer models and traditional CNNs.

Post-training quantization (PTQ) is an effective compression method to reduce the model size and computational cost. However, quantizing a model into a low-bit one, e.g., lower than 4, is difficult and often results in nonnegligible performance degradation. To address this, we investigate the loss landscapes of quantized networks with various bit-widths. We show that a network with a more ragged loss surface is more easily trapped into bad local minima, which mostly appears in low-bit quantization. A deeper analysis indicates that the ragged surface is caused by the injection of excessive quantization noise. To this end, we detach a sharpness term from the loss which reflects the impact of quantization noise. To smooth the rugged loss surface, we propose to keep the sharpness term small and stable during optimization. Instead of directly optimizing the target bit network, the bit-width of the quantized network follows a self-adapted shrinking scheduler in the continuous domain, from high bit-width down to the target, by limiting the increasing sharpness term within a proper range. It can be viewed as iteratively adding small "instant" quantization noise and adjusting the network to eliminate its impact. Extensive experiments on classification and detection tasks demonstrate the effectiveness of the Bit-shrinking strategy in PTQ. On the Vision Transformer models, our INT8 and INT6 models drop within 0.5% and 1.5% Top-1 accuracy, respectively. On the traditional CNN networks, our INT4 quantized models drop within 1.3% and 3.5% Top-1 accuracy on ResNet18 and MobileNetV2 without fine-tuning, which achieves the state-of-the-art performance.
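The self-adapted shrinking scheduler described above can be rendered as a toy loop: the bit-width shrinks from a high value toward the target, but a shrink step is only taken while the observed sharpness term stays within budget; otherwise the schedule holds so the network can re-adapt. The step size, the sharpness model, and all names are assumptions for illustration.

```python
def shrink_schedule(start_bits, target_bits, sharpness_trace, budget, step=0.5):
    """Return the bit-width after reacting to each observed sharpness value:
    shrink by `step` while sharpness is within `budget`, otherwise hold."""
    bits = float(start_bits)
    history = []
    for sharpness in sharpness_trace:
        if sharpness <= budget and bits > target_bits:
            bits = max(bits - step, float(target_bits))
        history.append(bits)
    return history


# The schedule pauses at the third step, where sharpness exceeds the budget,
# then resumes shrinking toward the 6-bit target.
trace = shrink_schedule(8, 6, [0.1, 0.2, 0.9, 0.2, 0.1], budget=0.5)
```

Each small shrink injects only a small "instant" quantization noise, matching the iterative add-then-adapt view in the abstract.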

PIVOT: Prompting for Video Continual Learning
Villa, Andr\'es and Alc\'azar, Juan Le\'on and Alfarra, Motasem and Alhamoud, Kumail and Hurtado, Julio and Heilbron, Fabian Caba and Soto, Alvaro and Ghanem, Bernard



Research question: This paper addresses the limits of modern machine learning pipelines, such as data availability, storage quotas, privacy regulations, and expensive annotation processes, and in particular the problem of training and updating large models on dynamic annotated sets.
Motivation: Continual learning directly tackles this problem, with the ultimate goal of devising methods by which a deep neural network effectively learns relevant patterns for new (unseen) classes without significantly altering its performance on previously learned ones.
Method: The paper proposes PIVOT, a novel method that leverages the extensive knowledge of pre-trained models, thereby reducing the number of trainable parameters and the associated forgetting. Unlike prior methods, it is the first approach to effectively use prompting mechanisms for continual learning without any in-domain pre-training.
Results: Experiments show that PIVOT improves state-of-the-art methods by 27% on the 20-task ActivityNet setup.

Modern machine learning pipelines are limited due to data availability, storage quotas, privacy regulations, and expensive annotation processes. These constraints make it difficult or impossible to train and update large-scale models on such dynamic annotated sets. Continual learning directly approaches this problem, with the ultimate goal of devising methods where a deep neural network effectively learns relevant patterns for new (unseen) classes, without significantly altering its performance on previously learned ones. In this paper, we address the problem of continual learning for video data. We introduce PIVOT, a novel method that leverages extensive knowledge in pre-trained models from the image domain, thereby reducing the number of trainable parameters and the associated forgetting. Unlike previous methods, ours is the first approach that effectively uses prompting mechanisms for continual learning without any in-domain pre-training. Our experiments show that PIVOT improves state-of-the-art methods by a significant 27% on the 20-task ActivityNet setup.

Focused and Collaborative Feedback Integration for Interactive Image Segmentation
Wei, Qiaoqiao and Zhang, Hui and Yong, Jun-Hai



Research question: How to effectively exploit user feedback in interactive image segmentation.
Motivation: Existing methods overlook the importance of feedback or simply concatenate it with the original input, underutilizing the feedback and requiring more annotations.
Method: Propose Focused and Collaborative Feedback Integration (FCFI), which first focuses on a local region around the new click and corrects the feedback based on the similarities of high-level features, then alternately and collaboratively updates the feedback and deep features to integrate the feedback into the features.
Results: The efficacy and efficiency of FCFI were validated on four benchmarks (GrabCut, Berkeley, SBD, and DAVIS); experiments show that FCFI achieves new state-of-the-art performance with lower computational overhead than previous methods.

Interactive image segmentation aims at obtaining a segmentation mask for an image using simple user annotations. During each round of interaction, the segmentation result from the previous round serves as feedback to guide the user's annotation and provides dense prior information for the segmentation model, effectively acting as a bridge between interactions. Existing methods overlook the importance of feedback or simply concatenate it with the original input, leading to underutilization of feedback and an increase in the number of required annotations. To address this, we propose an approach called Focused and Collaborative Feedback Integration (FCFI) to fully exploit the feedback for click-based interactive image segmentation. FCFI first focuses on a local area around the new click and corrects the feedback based on the similarities of high-level features. It then alternately and collaboratively updates the feedback and deep features to integrate the feedback into the features. The efficacy and efficiency of FCFI were validated on four benchmarks, namely GrabCut, Berkeley, SBD, and DAVIS. Experimental results show that FCFI achieved new state-of-the-art performance with less computational overhead than previous methods. The source code is available at https://github.com/veizgyauzgyauz/FCFI.

Dynamic Neural Network for Multi-Task Learning Searching Across Diverse Network Topologies
Choi, Wonhyeok and Im, Sunghoon



Research question: This paper presents a new multi-task learning (MTL) framework that searches for structures optimized for multiple tasks with diverse graph topologies while sharing features among tasks.
Motivation: Existing MTL frameworks often require designing a separate network for each task when tasks have different graph topologies, which is time- and resource-consuming.
Method: Design a restricted DAG-based central network with read-in/read-out layers to build topologically diverse task-adaptive structures while limiting search space and time. A three-stage training process searches for a single optimized network that serves as multiple task-adaptive sub-networks. To make the network compact and discretized, a flow-based reduction algorithm and a squeeze loss are used during training.
Results: The optimized network is evaluated on various public MTL datasets and achieves state-of-the-art performance; an extensive ablation study validates the effectiveness of the sub-modules and schemes in the framework.

In this paper, we present a new MTL framework that searches for structures optimized for multiple tasks with diverse graph topologies and shares features among tasks. We design a restricted DAG-based central network with read-in/read-out layers to build topologically diverse task-adaptive structures while limiting search space and time. We search for a single optimized network that serves as multiple task-adaptive sub-networks using our three-stage training process. To make the network compact and discretized, we propose a flow-based reduction algorithm and a squeeze loss used in the training process. We evaluate our optimized network on various public MTL datasets and show that ours achieves state-of-the-art performance. An extensive ablation study experimentally validates the effectiveness of the sub-modules and schemes in our framework.

Re-GAN: Data-Efficient GANs Training via Architectural Reconfiguration
Saxena, Divya and Cao, Jiannong and Xu, Jiahao and Kulshrestha, Tarun



Research question: Training Generative Adversarial Networks (GANs) for high-fidelity images usually requires a vast number of training images; finding effective sub-network structures can improve training efficiency.
Motivation: Recent research shows that dense GAN models contain sparse sub-networks, or "lottery tickets", that yield better results under limited data when trained separately; however, finding these tickets requires an expensive train-prune-retrain process.
Method: This paper proposes Re-GAN, a data-efficient GAN training method that dynamically reconfigures the GAN architecture during training to explore different sub-network structures. The method repeatedly prunes unimportant connections to regularize the GAN network and regrows them as needed to reduce the risk of prematurely pruning important connections.
Results: Experiments show that Re-GAN is a generic training methodology that is stable across datasets of varying sizes, domains, and resolutions (CIFAR-10, Tiny-ImageNet, and multiple few-shot generation datasets) and across GAN architectures (SNGAN, ProGAN, StyleGAN2, and AutoGAN). Re-GAN also improves performance when combined with recent augmentation approaches. Moreover, by removing unimportant connections during training, Re-GAN requires fewer floating-point operations and less training time while generating comparable or even higher-quality samples. Compared with the state-of-the-art StyleGAN2, the method performs better without any additional fine-tuning step.

Training Generative Adversarial Networks (GANs) on high-fidelity images usually requires a vast number of training images. Recent research on GAN tickets reveals that dense GANs models contain sparse sub-networks or "lottery tickets" that, when trained separately, yield better results under limited data. However, finding GANs tickets requires an expensive process of train-prune-retrain. In this paper, we propose Re-GAN, a data-efficient GANs training method that dynamically reconfigures the GANs architecture during training to explore different sub-network structures. Our method repeatedly prunes unimportant connections to regularize the GANs network and regrows them to reduce the risk of prematurely pruning important connections. Re-GAN stabilizes the GANs models with less data and offers an alternative to the existing GANs tickets and progressive growing methods. We demonstrate that Re-GAN is a generic training methodology which achieves stability on datasets of varying sizes, domains, and resolutions (CIFAR-10, Tiny-ImageNet, and multiple few-shot generation datasets) as well as different GANs architectures (SNGAN, ProGAN, StyleGAN2 and AutoGAN). Re-GAN also improves performance when combined with the recent augmentation approaches. Moreover, Re-GAN requires fewer floating-point operations (FLOPs) and less training time by removing the unimportant connections during GANs training while generating comparable or even higher-quality samples. When compared to state-of-the-art StyleGAN2, our method performs better without requiring any additional fine-tuning step. Code can be found at this link: https://github.com/IntellicentAI-Lab/Re-GAN
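The prune-and-regrow cycle described above can be sketched with a binary connection mask: connections with the smallest magnitudes are masked out, and some previously masked connections are later re-activated (here, deterministically by index for reproducibility). Rates and selection rules are illustrative assumptions, not the paper's exact policy.

```python
def prune_smallest(weights, mask, n):
    """Zero out the n active connections with the smallest |weight|."""
    active = [i for i, m in enumerate(mask) if m]
    for i in sorted(active, key=lambda i: abs(weights[i]))[:n]:
        mask[i] = 0
    return mask


def regrow(mask, n):
    """Re-activate the first n inactive connections, giving prematurely
    pruned connections another chance."""
    for i in [i for i, m in enumerate(mask) if not m][:n]:
        mask[i] = 1
    return mask


weights = [0.9, -0.01, 0.4, 0.02, -0.7]
mask = [1, 1, 1, 1, 1]
mask = prune_smallest(weights, mask, 2)   # drops indices 1 and 3
mask = regrow(mask, 1)                    # index 1 gets another chance
```

Alternating these two steps during training lets the architecture explore different sub-networks without a separate train-prune-retrain pipeline.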

Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers
Wei, Siyuan and Ye, Tianzhu and Zhang, Shen and Tang, Yao and Liang, Jiajun



Research question: Vision transformers (ViTs) excel at computer vision tasks, but their high computational cost limits practical applications.
Motivation: Previous methods that prune redundant tokens achieve a good trade-off between performance and computational cost, but errors introduced by the pruning policy can cause significant information loss.
Method: We propose a novel joint Token Pruning & Squeezing (TPS) module for compressing vision transformers more efficiently. First, TPS prunes to obtain reserved and pruned subsets. Second, TPS squeezes the information of the pruned tokens into partial reserved tokens via unidirectional nearest-neighbor matching and similarity-oriented fusing steps.
Results: The method outperforms state-of-the-art approaches at all token-pruning intensities. In particular, when shrinking the computational budget of DeiT-tiny & small to 35%, it improves accuracy by 1%-6% over baselines on ImageNet classification. It accelerates the throughput of DeiT-small beyond that of DeiT-tiny while surpassing DeiT-tiny's accuracy by 4.78%. Experiments on various transformers demonstrate the method's effectiveness, and analysis experiments show higher robustness to errors of the token-pruning policy. Code is available at https://github.com/megvii-research/TPS-CVPR2023.

Although vision transformers (ViTs) have shown promising results in various computer vision tasks recently, their high computational cost limits their practical applications. Previous approaches that prune redundant tokens have demonstrated a good trade-off between performance and computation costs. Nevertheless, errors caused by pruning strategies can lead to significant information loss. Our quantitative experiments reveal that the impact of pruned tokens on performance should be noticeable. To address this issue, we propose a novel joint Token Pruning & Squeezing module (TPS) for compressing vision transformers with higher efficiency. Firstly, TPS adopts pruning to get the reserved and pruned subsets. Secondly, TPS squeezes the information of pruned tokens into partial reserved tokens via the unidirectional nearest-neighbor matching and similarity-oriented fusing steps. Compared to state-of-the-art methods, our approach outperforms them under all token pruning intensities. Especially while shrinking DeiT-tiny&small computational budgets to 35%, it improves the accuracy by 1%-6% compared with baselines on ImageNet classification. The proposed method can accelerate the throughput of DeiT-small beyond DeiT-tiny, while its accuracy surpasses DeiT-tiny by 4.78%. Experiments on various transformers demonstrate the effectiveness of our method, while analysis experiments prove our higher robustness to the errors of the token pruning policy. Code is available at https://github.com/megvii-research/TPS-CVPR2023.
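The squeezing step above can be sketched as follows: each pruned token is matched to its most similar reserved token (unidirectional nearest neighbor) and fused into it by a similarity-weighted average. Cosine similarity and the simple averaging rule are stand-ins for the paper's exact fusing function; all names are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def squeeze(reserved, pruned):
    """Fuse each pruned token into its nearest reserved token by a
    similarity-weighted average, so its information is not discarded."""
    fused = [list(t) for t in reserved]
    for p in pruned:
        j = max(range(len(reserved)), key=lambda j: cosine(p, reserved[j]))
        w = cosine(p, reserved[j])
        fused[j] = [(f + w * pi) / (1 + w) for f, pi in zip(fused[j], p)]
    return fused


reserved = [[1.0, 0.0], [0.0, 1.0]]
pruned = [[2.0, 0.0]]              # clearly closest to the first token
fused = squeeze(reserved, pruned)
```

Unlike plain pruning, the dropped token's content survives inside its host token, which is why errors in the pruning policy cost less information.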

Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective
Ma, Yuexiao and Li, Huixia and Zheng, Xiawu and Xiao, Xuefeng and Wang, Rui and Wen, Shilei and Pan, Xin and Chao, Fei and Ji, Rongrong



Research question: This paper addresses the often-overlooked oscillation problem in post-training quantization (PTQ).
Motivation: Oscillation is a critical issue in PTQ, which, thanks to its data privacy and low computational cost, is regarded as one of the most effective compression methods.
Method: We first formulate the oscillation phenomenon in PTQ and prove that it is caused by differences in module capacity. The problem is then solved by selecting the top-k capacity differentials and jointly optimizing and quantizing the corresponding modules.
Results: Experiments show that the method successfully reduces the performance drop and generalizes to different neural networks and PTQ methods. For example, with 2/4-bit ResNet-50 quantization, the method surpasses the previous state of the art by 1.9%; on small-model quantization the gain is larger, e.g., surpassing BRECQ by 6.61% on MobileNetV2*0.5.

Post-training quantization (PTQ) is widely regarded as one of the most efficient compression methods practically, benefiting from its data privacy and low computation costs. We argue that oscillation is an overlooked problem in PTQ methods. In this paper, we take the initiative to explore and present a theoretical proof to explain why such a problem is essential in PTQ. We then solve this problem by introducing a principled and generalized framework theoretically. In particular, we first formulate the oscillation in PTQ and prove the problem is caused by the difference in module capacity. To this end, we define the module capacity (ModCap) under data-dependent and data-free scenarios, where the differentials between adjacent modules are used to measure the degree of oscillation. The problem is then solved by selecting top-k differentials, in which the corresponding modules are jointly optimized and quantized. Extensive experiments demonstrate that our method successfully reduces the performance drop and is generalized to different neural networks and PTQ methods. For example, with 2/4 bit ResNet-50 quantization, our method surpasses the previous state-of-the-art method by 1.9%. It becomes more significant on small model quantization, e.g. surpasses BRECQ method by 6.61% on MobileNetV2*0.5.
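The top-k selection step above can be rendered as a toy: oscillation is measured by the capacity differential between adjacent modules, and the k largest differentials mark the module pairs to be jointly optimized and quantized. The capacity values and function name below are made up for illustration.

```python
def topk_oscillating_pairs(capacities, k):
    """Return indices i of the k largest |capacity[i+1] - capacity[i]|,
    i.e. the adjacent module pairs with the sharpest capacity jumps."""
    diffs = [abs(capacities[i + 1] - capacities[i])
             for i in range(len(capacities) - 1)]
    return sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)[:k]


# Hypothetical per-module capacities along a network.
caps = [1.0, 1.1, 4.0, 3.9, 0.5]
pairs = topk_oscillating_pairs(caps, 2)
```

Here the jumps between modules (1, 2) and (3, 4) dominate, so those pairs would be merged into joint optimization/quantization units while the smooth transitions are left alone.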

Masked Image Modeling With Local Multi-Scale Reconstruction
Wang, Haoqing and Tang, Yehui and Wang, Yunhe and Guo, Jianyuan and Deng, Zhi-Hong and Han, Kai



Research question: Although masked image modeling (MIM) has achieved great success in self-supervised representation learning, existing MIM models carry a heavy computational burden and learn slowly, which limits their industrial application.
Motivation: To address these issues, we propose applying the reconstruction task to multiple local layers, both lower and upper, with a local multi-scale reconstruction strategy that accelerates representation learning and promotes multi-scale semantic understanding of the input.
Method: We apply the reconstruction task to multiple local layers of the encoder, including lower and upper layers, and design a local multi-scale reconstruction strategy in which the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals, respectively.
Results: Experiments show that our model matches or outperforms existing MIM models on classification, detection, and segmentation tasks with a significantly smaller pre-training burden.

Masked Image Modeling (MIM) achieves outstanding success in self-supervised representation learning. Unfortunately, MIM models typically have huge computational burden and slow learning process, which is an inevitable obstacle for their industrial applications. Although the lower layers play the key role in MIM, existing MIM models conduct reconstruction task only at the top layer of encoder. The lower layers are not explicitly guided and the interaction among their patches is only used for calculating new activations. Considering the reconstruction task requires non-trivial inter-patch interactions to reason target signals, we apply it to multiple local layers including lower and upper layers. Further, since the multiple layers expect to learn the information of different scales, we design local multi-scale reconstruction, where the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals respectively. This design not only accelerates the representation learning process by explicitly guiding multiple layers, but also facilitates multi-scale semantic understanding of the input. Extensive experiments show that with significantly less pre-training burden, our model achieves comparable or better performance on classification, detection and segmentation tasks than existing MIM models.
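A minimal sketch of the multi-scale supervision targets, assuming simple average pooling stands in for the paper's coarse-scale target construction:

```python
import numpy as np

def avg_pool(img, s):
    """Average-pool a 2D array by factor s (s=1 returns it unchanged)."""
    h, w = img.shape
    return img[:h - h % s, :w - w % s].reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def multi_scale_targets(image, scales):
    """One reconstruction target per locally supervised layer: lower
    layers get fine-scale signals, upper layers coarse-scale ones."""
    return [avg_pool(image, s) for s in scales]

img = np.arange(16.0).reshape(4, 4)
targets = multi_scale_targets(img, scales=[1, 2, 4])   # fine -> coarse
```

Each locally supervised layer would then regress its own target instead of all supervision flowing through the top layer.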

Learning To Zoom and Unzoom
Thavamani, Chittesh and Li, Mengtian and Ferroni, Francesco and Ramanan, Deva



Research question: Many perception systems in mobile computing, autonomous navigation, and AR/VR face strict compute constraints, which are especially challenging for high-resolution input images.
Motivation: Prior work proposed non-uniform downsamplers that "learn to zoom" on salient image regions, reducing compute while retaining task-relevant image information. However, for tasks with spatial labels (such as 2D/3D object detection and semantic segmentation), such distortions can harm performance.
Method: In this work (LZU), we first "learn to zoom" in on the input image, compute spatial features, and then "unzoom" to revert any deformations. To make unzooming efficient and differentiable, we approximate the zooming warp with an invertible piecewise bilinear mapping.
Results: LZU can be applied to any task with 2D spatial input and any model with 2D spatial features. We evaluate it on a variety of tasks and datasets: object detection on Argoverse-HD, semantic segmentation on Cityscapes, and monocular 3D object detection on nuScenes. Interestingly, we observe performance gains even when high-resolution sensor data is unavailable, implying that LZU can also be used to "learn to upsample."

Many perception systems in mobile computing, autonomous navigation, and AR/VR face strict compute constraints that are particularly challenging for high-resolution input images. Previous works propose nonuniform downsamplers that "learn to zoom" on salient image regions, reducing compute while retaining task-relevant image information. However, for tasks with spatial labels (such as 2D/3D object detection and semantic segmentation), such distortions may harm performance. In this work (LZU), we "learn to zoom" in on the input image, compute spatial features, and then "unzoom" to revert any deformations. To enable efficient and differentiable unzooming, we approximate the zooming warp with a piecewise bilinear mapping that is invertible. LZU can be applied to any task with 2D spatial input and any model with 2D spatial features, and we demonstrate this versatility by evaluating on a variety of tasks and datasets: object detection on Argoverse-HD, semantic segmentation on Cityscapes, and monocular 3D object detection on nuScenes. Interestingly, we observe boosts in performance even when high-resolution sensor data is unavailable, implying that LZU can be used to "learn to upsample" as well. Code and additional visuals are available at https://tchittesh.github.io/lzu/.
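The invertible-warp idea can be illustrated in 1D, where a monotonic piecewise-linear mapping (a simplified analogue of the paper's piecewise bilinear warp) inverts exactly; the knot values below are arbitrary:

```python
import numpy as np

# Knot positions of a monotonic piecewise-linear warp; monotonicity
# guarantees the warp is invertible, so "unzoom" can undo "zoom".
uniform = np.linspace(0.0, 1.0, 5)               # input coordinates
warped = np.array([0.0, 0.35, 0.6, 0.8, 1.0])    # "zoomed" coordinates

def zoom(x):
    return np.interp(x, uniform, warped)     # forward (saliency-driven) warp

def unzoom(y):
    return np.interp(y, warped, uniform)     # inverse warp reverts the deformation

x = np.linspace(0.0, 1.0, 9)
recovered = unzoom(zoom(x))                  # round-trip is the identity
```

In LZU the zoom is applied to the image, features are computed in the warped space, and the inverse map realigns them with the spatial labels.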

Task Difficulty Aware Parameter Allocation & Regularization for Lifelong Learning
Wang, Wenjin and Hu, Yunqing and Chen, Qianglong and Zhang, Yin



Research question: This paper addresses catastrophic forgetting in lifelong learning, in particular the shortcomings of parameter regularization and parameter allocation methods when tasks differ in difficulty.
Motivation: Existing parameter regularization or allocation methods treat all tasks in a sequence uniformly. Parameter regularization methods suffer significant forgetting when a new task is very different from previously learned ones, while parameter allocation methods incur unnecessary parameter overhead on simple tasks.
Method: We propose Parameter Allocation & Regularization (PAR), which adaptively selects the appropriate strategy for each task based on its learning difficulty. We propose a divergence estimation method based on the nearest-prototype distance that measures task relatedness using only features of the new task, and a time-efficient, relatedness-aware sampling-based architecture search strategy to reduce the parameter overhead of allocation.
Results: Experiments show that, compared with the state of the art, our method is scalable and significantly reduces model redundancy while improving performance. Further qualitative analysis indicates that PAR obtains reasonable task relatedness.

Parameter regularization or allocation methods are effective in overcoming catastrophic forgetting in lifelong learning. However, they solve all tasks in a sequence uniformly and ignore the differences in the learning difficulty of different tasks. So parameter regularization methods face significant forgetting when learning a new task very different from learned tasks, and parameter allocation methods face unnecessary parameter overhead when learning simple tasks. In this paper, we propose the Parameter Allocation & Regularization (PAR), which adaptively selects an appropriate strategy for each task from parameter allocation and regularization based on its learning difficulty. A task is easy for a model that has learned tasks related to it and vice versa. We propose a divergence estimation method based on the Nearest-Prototype distance to measure the task relatedness using only features of the new task. Moreover, we propose a time-efficient relatedness-aware sampling-based architecture search strategy to reduce the parameter overhead for allocation. Experimental results on multiple benchmarks demonstrate that, compared with SOTAs, our method is scalable and significantly reduces the model's redundancy while improving the model's performance. Further qualitative analysis indicates that PAR obtains reasonable task-relatedness.
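A minimal sketch of the nearest-prototype relatedness estimate, with toy prototypes and features (the paper's divergence estimator is more elaborate):

```python
import numpy as np

def nearest_prototype_divergence(new_feats, prototypes):
    """Mean distance from each new-task feature to its nearest learned
    class prototype. Small values suggest the new task is related to
    learned tasks (easy -> regularization); large values suggest an
    unrelated task (hard -> parameter allocation)."""
    d = np.linalg.norm(new_feats[:, None, :] - prototypes[None, :, :], axis=-1)
    return d.min(axis=1).mean()

protos = np.array([[0.0, 0.0], [1.0, 1.0]])          # learned class prototypes
related = nearest_prototype_divergence(np.array([[0.1, 0.0], [0.9, 1.0]]), protos)
unrelated = nearest_prototype_divergence(np.array([[5.0, 5.0]]), protos)
```

PAR would then route the new task to regularization or allocation by thresholding this score.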

Polynomial Implicit Neural Representations for Large Diverse Datasets
Singh, Rajhans and Shukla, Ankita and Turaga, Pavan



Research question: How to increase the representational power of implicit neural representation (INR) models for signals and images to meet the demands of more complex tasks.
Motivation: Existing INR architectures rely on sinusoidal positional encoding to capture high-frequency information in the data, but the finite encoding size restricts the model's representational power. Moving from representing a single image to representing large and diverse datasets requires higher representational power.
Method: We propose a new approach that represents an image with a polynomial function, eliminating the need for positional encoding. A progressively higher polynomial degree is achieved through element-wise multiplication between the features after each ReLU layer and affine-transformed coordinate locations.
Results: The method is evaluated qualitatively and quantitatively on large datasets such as ImageNet. The proposed Poly-INR model performs comparably to state-of-the-art generative models without any convolution, normalization, or self-attention layers and with far fewer trainable parameters, paving the way for broader adoption of INR models for generative modeling in complex domains.

Implicit neural representations (INR) have gained significant popularity for signal and image representation for many end-tasks, such as super-resolution, 3D modeling, and more. Most INR architectures rely on sinusoidal positional encoding, which accounts for high-frequency information in data. However, the finite encoding size restricts the model's representational power. Higher representational power is needed to go from representing a single given image to representing large and diverse datasets. Our approach addresses this gap by representing an image with a polynomial function and eliminates the need for positional encodings. To achieve a progressively higher degree of polynomial representation, we use element-wise multiplications between features and affine-transformed coordinate locations after every ReLU layer. The proposed method is evaluated qualitatively and quantitatively on large datasets like ImageNet. The proposed Poly-INR model performs comparably to state-of-the-art generative models without any convolution, normalization, or self-attention layers, and with far fewer trainable parameters. With much fewer training parameters and higher representative power, our approach paves the way for broader adoption of INR models for generative modeling tasks in complex domains. The code is available at https://github.com/Rajhans0/Poly_INR
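The progressive polynomial construction can be sketched as below; the affine maps are random and the network untrained, so this only illustrates how each level multiplies in another affine function of the coordinates, raising the polynomial degree of the features by one:

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.standard_normal((8, 2))    # (x, y) coordinate locations

def poly_inr_features(coords, depth, width=4):
    """After each ReLU, features are multiplied element-wise by an affine
    transform of the coordinates, so after d levels the features are
    degree-d polynomials in (x, y) -- no positional encoding needed."""
    feats = np.ones((coords.shape[0], width))
    for _ in range(depth):
        A = rng.standard_normal((2, width))
        b = rng.standard_normal(width)
        feats = np.maximum(feats, 0.0)         # ReLU
        feats = feats * (coords @ A + b)       # raise polynomial degree by 1
    return feats

f = poly_inr_features(coords, depth=3)
```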

System-Status-Aware Adaptive Network for Online Streaming Video Understanding
Foo, Lin Geng and Gong, Jia and Fan, Zhipeng and Liu, Jun



Research question: Most existing deep network models do not account for real-time fluctuations in device state and available resources, and the impact of varying computational resources on online video understanding tasks has not been investigated.
Motivation: We propose a System-status-aware Adaptive Network (SAN) that considers the device's real-time state to deliver high-quality predictions with low delay.
Method: SAN uses an agent whose policy adapts processing to the current system status. Because labeled training data may be unavailable, and training such an agent on many hardware configurations can be computationally prohibitive, we further propose a Meta Self-supervised Adaptation (MSA) method that adapts the agent's policy to new hardware configurations at test time, allowing easy deployment to unseen platforms.
Results: Experiments show that SAN achieves state-of-the-art performance on two widely used online video understanding tasks while keeping processing delays low, and MSA successfully addresses the problem of adapting the agent to different hardware configurations.

Recent years have witnessed great progress in deep neural networks for real-time applications. However, most existing works do not explicitly consider the general case where the device's state and the available resources fluctuate over time, and none of them investigate or address the impact of varying computational resources for online video understanding tasks. This paper proposes a System-status-aware Adaptive Network (SAN) that considers the device's real-time state to provide high-quality predictions with low delay. Usage of our agent's policy improves efficiency and robustness to fluctuations of the system status. On two widely used video understanding tasks, SAN obtains state-of-the-art performance while constantly keeping processing delays low. Moreover, training such an agent on various types of hardware configurations is not easy as the labeled training data might not be available, or can be computationally prohibitive. To address this challenging problem, we propose a Meta Self-supervised Adaptation (MSA) method that adapts the agent's policy to new hardware configurations at test-time, allowing for easy deployment of the model onto other unseen hardware platforms.

FFCV: Accelerating Training by Removing Data Bottlenecks
Leclerc, Guillaume and Ilyas, Andrew and Engstrom, Logan and Park, Sung Min and Salman, Hadi and Mądry, Aleksander



Research question: How to improve the training efficiency and resource utilization of machine learning models.
Motivation: Existing training pipelines suffer from data bottlenecks that leave GPUs underutilized and slow down training.
Method: We develop the FFCV library, which combines an efficient file storage format, caching, data pre-loading, asynchronous data transfer, and just-in-time compilation to make data loading and transfer more efficient, and offloads as much data processing as possible to the CPU asynchronously, freeing GPU capacity for training.
Results: Using FFCV, training ResNet-18 and ResNet-50 on ImageNet achieves an excellent trade-off between accuracy and training time; for example, a ResNet-50 trained with FFCV reaches the same accuracy as the best baseline in half the time. Several case studies demonstrate FFCV's performance, ease of use, extensibility, and ability to adapt to resource constraints.

We present FFCV, a library for easy, fast, resource-efficient training of machine learning models. FFCV speeds up model training by eliminating (often subtle) data bottlenecks from the training process. In particular, we combine techniques such as an efficient file storage format, caching, data pre-loading, asynchronous data transfer, and just-in-time compilation to (a) make data loading and transfer significantly more efficient, ensuring that GPUs can reach full utilization; and (b) offload as much data processing as possible to the CPU asynchronously, freeing up GPU capacity for training. Using FFCV, we train ResNet-18 and ResNet-50 on the ImageNet dataset with a state-of-the-art tradeoff between accuracy and training time. For example, across the range of ResNet-50 models we test, we obtain the same accuracy as the best baselines in half the time. We demonstrate FFCV's performance, ease-of-use, extensibility, and ability to adapt to resource constraints through several case studies.
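The asynchronous pre-loading idea can be sketched with a plain thread and bounded queue; this is a generic producer-consumer illustration, not FFCV's actual compiled pipeline:

```python
import queue
import threading

def prefetching_loader(batches, buffer_size=2):
    """A background thread prepares batches ahead of the consumer, so the
    training loop never waits on I/O. In FFCV the worker side would do
    file reads, decoding, and augmentation; here it just forwards items."""
    q = queue.Queue(maxsize=buffer_size)   # bounded buffer caps memory use
    SENTINEL = object()

    def worker():
        for b in batches:                  # pretend: read + decode + augment
            q.put(b)
        q.put(SENTINEL)                    # signal end of epoch

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        yield item

out = list(prefetching_loader(range(5)))
```

The bounded queue is the key design choice: it overlaps data preparation with consumption while preventing the producer from racing arbitrarily far ahead.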

Adaptive Channel Sparsity for Federated Learning Under System Heterogeneity
Liao, Dongping and Gao, Xitong and Zhao, Yiren and Xu, Cheng-Zhong



Research question: Owing to the non-i.i.d. nature of client data, channel neurons in federated-learned models may specialize to distinct features for different clients. Yet existing sparse federated learning methods prescribe fixed sparsity strategies for client models, which may prevent clients from training channel neurons collaboratively.
Motivation: To minimize the impact of sparsity on federated learning convergence, we propose Flado, which improves the alignment of client model update trajectories.
Method: Flado tailors the sparsity of individual neurons in each client so that client model update trajectories stay aligned.
Results: Experiments show that while other sparse methods significantly hurt convergence, Flado not only attains the highest task accuracy across a range of datasets under an unlimited budget, but also reduces the training FLOPs required by more than 10x under the same communication budget, pushing the Pareto frontier of the communication/computation trade-off notably further than competing federated learning algorithms.

Owing to the non-i.i.d. nature of client data, channel neurons in federated-learned models may specialize to distinct features for different clients. Yet, existing channel-sparse federated learning (FL) algorithms prescribe fixed sparsity strategies for client models, and may thus prevent clients from training channel neurons collaboratively. To minimize the impact of sparsity on FL convergence, we propose Flado to improve the alignment of client model update trajectories by tailoring the sparsities of individual neurons in each client. Empirical results show that while other sparse methods are surprisingly impactful to convergence, Flado can not only attain the highest task accuracies with unlimited budget across a range of datasets, but also significantly reduce the amount of FLOPs required for training by more than 10x under the same communications budget, and push the Pareto frontier of communication/computation trade-off notably further than competing FL algorithms.

NIPQ: Noise Proxy-Based Integrated Pseudo-Quantization
Shin, Juncheol and So, Junhyuk and Park, Sein and Kang, Seungyeop and Yoo, Sungjoo and Park, Eunhyeok



Research question: How to stabilize and improve the accuracy of quantization-aware training (QAT) when gradient flow through non-differentiable functions must be approximated.
Motivation: Although the straight-through estimator (STE) is popular in QAT, its unstable convergence during training degrades the quality of low-precision representations.
Method: We propose noise proxy-based integrated pseudo-quantization (NIPQ), which integrates the idea of truncation into the pseudo-quantization framework to provide unified pseudo-quantization support for both activations and weights, and updates all quantization parameters as well as network parameters via gradient descent without STE instability, yielding greatly simplified yet reliable precision allocation without human intervention.
Results: Experiments show that NIPQ outperforms existing quantization algorithms across various vision and language applications.

Straight-through estimator (STE), which enables the gradient flow over the non-differentiable function via approximation, has been favored in studies related to quantization-aware training (QAT). However, STE incurs unstable convergence during QAT, resulting in notable quality degradation in low-precision representation. Recently, pseudo-quantization training has been proposed as an alternative approach to updating the learnable parameters using the pseudo-quantization noise instead of STE. In this study, we propose a novel noise proxy-based integrated pseudo-quantization (NIPQ) that enables unified support of pseudo-quantization for both activation and weight with minimal error by integrating the idea of truncation on the pseudo-quantization framework. NIPQ updates all of the quantization parameters (e.g., bit-width and truncation boundary) as well as the network parameters via gradient descent without STE instability, resulting in greatly-simplified but reliable precision allocation without human intervention. Our extensive experiments show that NIPQ outperforms existing quantization algorithms in various vision and language applications by a large margin.
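The pseudo-quantization-noise idea can be illustrated as follows: additive uniform noise scaled by the step size mimics the error of hard rounding while avoiding the non-differentiable round. This is a simplified sketch, not the full NIPQ formulation with learned bit-width and truncation:

```python
import numpy as np

def pseudo_quantize(w, step, rng):
    """Training-time proxy: uniform noise in [-step/2, step/2] stands in
    for rounding error, keeping the operation smooth in `step`."""
    return w + rng.uniform(-0.5, 0.5, size=w.shape) * step

def hard_quantize(w, step):
    """Inference-time quantizer: round to the nearest step multiple."""
    return np.round(w / step) * step

rng = np.random.default_rng(0)
w = rng.standard_normal(10000)
step = 0.1
noise_err = np.mean((pseudo_quantize(w, step, rng) - w) ** 2)   # ~ step^2 / 12
round_err = np.mean((hard_quantize(w, step) - w) ** 2)          # ~ step^2 / 12
```

Both error terms concentrate around step^2/12, which is why the noise proxy is a faithful, differentiable surrogate for rounding during training.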

Poly-PC: A Polyhedral Network for Multiple Point Cloud Tasks at Once
Xie, Tao and Wang, Shiguang and Wang, Ke and Yang, Linqi and Jiang, Zhiqiang and Zhang, Xingcheng and Dai, Kun and Li, Ruifeng and Cheng, Jian



Research question: This paper addresses the difficulty of performing multiple tasks concurrently on point clouds.
Motivation: Multi-task learning on point clouds faces inherent obstacles, such as different model architectures caused by task bias and conflicting gradients caused by multiple dataset domains.
Method: We propose Poly-PC, a framework with an efficient residual set abstraction (Res-SA) layer that scales the network in both width and depth to accommodate the needs of various tasks. We develop a weight-entanglement-based one-shot NAS technique to find optimal architectures for all tasks; it entangles the weights of multiple tasks in each layer to provide task-shared parameters for efficient storage deployment alongside task-specific parameters for learning task-related features. Finally, to facilitate training, we introduce a task-prioritization-based gradient balance algorithm that uses task prioritization to reconcile conflicting gradients, ensuring high performance for all tasks.
Results: Benefiting from these techniques, the models optimized collectively by Poly-PC use fewer total FLOPs and parameters and outperform previous methods. We also show that Poly-PC supports incremental learning and avoids catastrophic forgetting when tuned to a new task.

In this work, we show that it is feasible to perform multiple tasks concurrently on point cloud with a straightforward yet effective multi-task network. Our framework, Poly-PC, tackles the inherent obstacles (e.g., different model architectures caused by task bias and conflicting gradients caused by multiple dataset domains, etc.) of multi-task learning on point cloud. Specifically, we propose a residual set abstraction (Res-SA) layer for efficient and effective scaling in both width and depth of the network, hence accommodating the needs of various tasks. We develop a weight-entanglement-based one-shot NAS technique to find optimal architectures for all tasks. Moreover, such technique entangles the weights of multiple tasks in each layer to offer task-shared parameters for efficient storage deployment while providing ancillary task-specific parameters for learning task-related features. Finally, to facilitate the training of Poly-PC, we introduce a task-prioritization-based gradient balance algorithm that leverages task prioritization to reconcile conflicting gradients, ensuring high performance for all tasks. Benefiting from the suggested techniques, models optimized by Poly-PC collectively for all tasks keep fewer total FLOPs and parameters and outperform previous methods. We also demonstrate that Poly-PC allows incremental learning and evades catastrophic forgetting when tuned to a new task.

Efficient Verification of Neural Networks Against LVM-Based Specifications
Hanspal, Harleen and Lomuscio, Alessio



Research question: How to assure the robustness of neural-network-based perception systems in safety-critical applications.
Motivation: Network robustness requires formal verification, but existing standard approaches only analyze invariance to analytically defined transformations and cannot handle the diverse and ubiquitous changes involving object pose, scene viewpoint, occlusion, and so on.
Method: We present an efficient approach for verifying specifications defined with latent variable models that capture such diverse changes, adding an invertible encoding head to the network under verification so that latent space sets can be verified with minimal reconstruction overhead.
Results: We report verification experiments for three classes of realistic input variations. Unlike previous work, the approach is relatively independent of input dimensionality and scales to a broad class of deep networks and real-world datasets by mitigating the inefficiency and decoder-expressivity dependence of the current state of the art.

The deployment of perception systems based on neural networks in safety critical applications requires assurance on their robustness. Deterministic guarantees on network robustness require formal verification. Standard approaches for verifying robustness analyse invariance to analytically defined transformations, but not the diverse and ubiquitous changes involving object pose, scene viewpoint, occlusions, etc. To this end, we present an efficient approach for verifying specifications definable using Latent Variable Models that capture such diverse changes. The approach involves adding an invertible encoding head to the network to be verified, enabling the verification of latent space sets with minimal reconstruction overhead. We report verification experiments for three classes of proposed latent space specifications, each capturing different types of realistic input variations. Differently from previous work in this area, the proposed approach is relatively independent of input dimensionality and scales to a broad class of deep networks and real-world datasets by mitigating the inefficiency and decoder expressivity dependence in the present state-of-the-art.

A Unified Knowledge Distillation Framework for Deep Directed Graphical Models
Chen, Yizhuo and Liang, Kaizhao and Zeng, Zhe and Yao, Shuochao and Shao, Huajie



Research question: Existing knowledge distillation methods fail to generalize to deep directed graphical models (DGMs) with arbitrary layers of random variables.
Motivation: To address this problem, we propose a unified knowledge distillation framework for deep DGMs.
Method: We leverage the reparameterization trick to hide intermediate latent variables, yielding a compact DGM, and then develop a surrogate distillation loss to reduce error accumulation through multiple layers of random variables.
Results: The framework is evaluated on four applications: data-free hierarchical variational autoencoder (VAE) compression, data-free variational recurrent neural network (VRNN) compression, data-free Helmholtz machine (HM) compression, and VAE continual learning. The results show that our distillation method outperforms the baselines on data-free model compression and significantly improves KD-based continual learning for data generation.

Knowledge distillation (KD) is a technique that transfers the knowledge from a large teacher network to a small student network. It has been widely applied to many different tasks, such as model compression and federated learning. However, existing KD methods fail to generalize to general deep directed graphical models (DGMs) with arbitrary layers of random variables. By deep DGMs, we refer to DGMs whose conditional distributions are parameterized by deep neural networks. In this work, we propose a novel unified knowledge distillation framework for deep DGMs on various applications. Specifically, we leverage the reparameterization trick to hide the intermediate latent variables, resulting in a compact DGM. Then we develop a surrogate distillation loss to reduce error accumulation through multiple layers of random variables. Moreover, we present the connections between our method and some existing knowledge distillation approaches. The proposed framework is evaluated on four applications: data-free hierarchical variational autoencoder (VAE) compression, data-free variational recurrent neural networks (VRNN) compression, data-free Helmholtz Machine (HM) compression, and VAE continual learning. The results show that our distillation method outperforms the baselines in data-free model compression tasks. We further demonstrate that our method significantly improves the performance of KD-based continual learning for data generation. Our source code is available at https://github.com/YizhuoChen99/KD4DGM-CVPR.
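The reparameterization trick the framework builds on can be sketched for a Gaussian latent variable; sampling becomes a deterministic function of the distribution parameters plus exogenous noise, which is what lets intermediate latents be "hidden" inside the network graph:

```python
import numpy as np

def reparameterize(mu, log_var, eps):
    """z = mu + sigma * eps: the randomness is moved into the external
    noise eps, so z is a deterministic, differentiable function of the
    parameters (mu, log_var)."""
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
eps = rng.standard_normal(100000)
z = reparameterize(2.0, np.log(0.25), eps)   # samples from N(2, 0.5^2)
```

With the latent expressed this way, a stack of stochastic layers collapses into one deterministic map of (parameters, noise), which the compact student DGM can then be distilled against.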

DKT: Diverse Knowledge Transfer Transformer for Class Incremental Learning
Gao, Xinyuan and He, Yuhang and Dong, Songlin and Cheng, Jie and Wei, Xing and Gong, Yihong



Research question: Deep neural networks suffer from catastrophic forgetting in class incremental learning: classification accuracy on old classes drops drastically as knowledge of new classes is learned.
Motivation: Existing approaches to class incremental learning either suffer from severe catastrophic forgetting and the stability-plasticity dilemma, or require too many extra parameters and computations.
Method: We propose a novel framework, the Diverse Knowledge Transfer Transformer (DKT), containing two attention-based knowledge transfers that convey task-specific and task-general knowledge to alleviate catastrophic forgetting, a duplex classifier to address the stability-plasticity dilemma, and a new loss function that clusters same-class samples in feature space while discriminating features between old and new tasks, forcing the task-specific knowledge to be more diverse.
Results: Comprehensive experiments on the CIFAR100 and ImageNet100/1000 datasets show that our method outperforms competing methods and achieves state-of-the-art performance.

Deep neural networks suffer from catastrophic forgetting in class incremental learning, where the classification accuracy of old classes drastically deteriorates when the networks learn the knowledge of new classes. Many works have been proposed to solve the class incremental learning problem. However, most of them either suffer from serious catastrophic forgetting and stability-plasticity dilemma or need too many extra parameters and computations. To meet the challenge, we propose a novel framework, Diverse Knowledge Transfer Transformer (DKT), which contains two novel knowledge transfers based on the attention mechanism to transfer the task-general knowledge and task-specific knowledge to the current task to alleviate catastrophic forgetting. Besides, we propose a duplex classifier to address the stability-plasticity dilemma, and a novel loss function to cluster the same categories in feature space and discriminate the features between old and new tasks to force the task-specific knowledge to be more diverse. Our method needs only a few extra parameters, which are negligible, to tackle the increasing number of tasks. We conduct comprehensive experimental results on CIFAR100, ImageNet100/1000 datasets. The experiment results show that our method outperforms other competitive methods and achieves state-of-the-art performance.

DynamicDet: A Unified Dynamic Architecture for Object Detection
Lin, Zhihao and Wang, Yongtao and Zhang, Jinhe and Chu, Xiaojie



Research question: How to design a powerful dynamic detector, given that object detection lacks a suitable dynamic architecture and exiting criterion.
Motivation: Dynamic neural networks are an emerging research topic in deep learning; with adaptive inference, dynamic models can achieve remarkable accuracy and computational efficiency.
Method: We propose DynamicDet, a dynamic framework for object detection. We first carefully design a dynamic architecture based on the nature of the object detection task, then propose an adaptive router that analyzes multi-scale information and automatically decides the inference route. We also present a novel optimization strategy with an exiting criterion based on detection losses, and a variable-speed inference strategy.
Results: Extensive experiments on the COCO benchmark show that DynamicDet achieves new state-of-the-art accuracy-speed trade-offs. For example, at comparable accuracy, the inference speed of our dynamic detector Dy-YOLOv7-W6 surpasses YOLOv7-E6 by 12%, YOLOv7-D6 by 17%, and YOLOv7-E6E by 39%. The code is available at https://github.com/VDIGPKU/DynamicDet.

Dynamic neural network is an emerging research topic in deep learning. With adaptive inference, dynamic models can achieve remarkable accuracy and computational efficiency. However, it is challenging to design a powerful dynamic detector, because there is no suitable dynamic architecture or exiting criterion for object detection. To tackle these difficulties, we propose a dynamic framework for object detection, named DynamicDet. Firstly, we carefully design a dynamic architecture based on the nature of the object detection task. Then, we propose an adaptive router to analyze the multi-scale information and to decide the inference route automatically. We also present a novel optimization strategy with an exiting criterion based on the detection losses for our dynamic detectors. Last, we present a variable-speed inference strategy, which helps to realize a wide range of accuracy-speed trade-offs with only one dynamic detector. Extensive experiments conducted on the COCO benchmark demonstrate that the proposed DynamicDet achieves new state-of-the-art accuracy-speed trade-offs. For instance, with comparable accuracy, the inference speed of our dynamic detector Dy-YOLOv7-W6 surpasses YOLOv7-E6 by 12%, YOLOv7-D6 by 17%, and YOLOv7-E6E by 39%. The code is available at https://github.com/VDIGPKU/DynamicDet.
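The exiting decision can be caricatured as a threshold rule over a per-image difficulty score; the paper's router is a learned module over multi-scale features, so this is only a schematic of the control flow:

```python
def adaptive_route(difficulty_score, threshold):
    """Route 'easy' images through only the first backbone (fast exit)
    and 'hard' ones through the full cascade. Sweeping the threshold
    yields the variable-speed accuracy/latency trade-off with a single
    trained detector."""
    return "fast_exit" if difficulty_score < threshold else "full_model"

# Hypothetical difficulty scores for two images (0 = trivially easy).
routes = [adaptive_route(score, threshold=0.5) for score in (0.1, 0.9)]
```

Moving the threshold toward 1.0 sends more images through the fast exit, trading accuracy for speed without retraining.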

MDL-NAS: A Joint Multi-Domain Learning Framework for Vision Transformer
Wang, Shiguang and Xie, Tao and Cheng, Jian and Zhang, Xingcheng and Liu, Haijun



Research question: How to integrate multiple vision tasks into a manageable supernet and optimize them collectively across diverse dataset domains.
Motivation: Existing multi-task approaches typically design a separate model for each task, which is inefficient in both storage and computation.
Method: We propose MDL-NAS, a framework that integrates multiple tasks into a manageable supernet and optimizes them jointly through a coarse-to-fine search space. In the fine search space, we propose a sequential sharing policy and a mask sharing policy to achieve truly fine-grained parameter sharing.
Results: Experiments show that MDL-NAS delivers performance comparable to state-of-the-art methods on all tasks while maintaining efficient storage deployment and computation, and that it supports incremental learning and avoids forgetting when generalizing to a new task.

In this work, we introduce MDL-NAS, a unified framework that integrates multiple vision tasks into a manageable supernet and optimizes these tasks collectively under diverse dataset domains. MDL-NAS is storage-efficient since multiple models with a majority of shared parameters can be deposited into a single one. Technically, MDL-NAS constructs a coarse-to-fine search space, where the coarse search space offers various optimal architectures for different tasks while the fine search space provides fine-grained parameter sharing to tackle the inherent obstacles of multi-domain learning. In the fine search space, we suggest two parameter sharing policies, i.e., sequential sharing policy and mask sharing policy. Compared with previous works, such two sharing policies allow for the partial sharing and non-sharing of parameters at each layer of the network, hence attaining real fine-grained parameter sharing. Finally, we present a joint-subnet search algorithm that finds the optimal architecture and sharing parameters for each task within total resource constraints, challenging the traditional practice that downstream vision tasks are typically equipped with backbone networks designed for image classification. Experimentally, we demonstrate that MDL-NAS families fitted with non-hierarchical or hierarchical transformers deliver competitive performance for all tasks compared with state-of-the-art methods while maintaining efficient storage deployment and computation. We also demonstrate that MDL-NAS allows incremental learning and evades catastrophic forgetting when generalizing to a new task.

ScaleFL: Resource-Adaptive Federated Learning With Heterogeneous Clients
Ilhan, Fatih and Su, Gong and Liu, Ling



Research question: This paper addresses resource heterogeneity in federated learning, where some clients have limited computation capabilities and can only train much smaller local models.
Motivation: In real-life applications, some clients have severely limited resources and cannot participate in training a deep neural network, so a federated learning method that handles resource heterogeneity is needed.
Method: We propose ScaleFL, a novel federated learning approach that adaptively scales down the deep neural network along the width and depth dimensions by leveraging early exits, finding best-fit local models of various sizes, and uses self-distillation among exit predictions during training to improve aggregation through knowledge transfer among subnetworks.
Results: Experiments show that ScaleFL outperforms representative heterogeneous federated learning methods in global/local model performance, with up to 2x inference latency reduction and 4x model size reduction at a negligible performance drop below 2%.

Federated learning (FL) is an attractive distributed learning paradigm supporting real-time continuous learning and client privacy by default. In most FL approaches, all edge clients are assumed to have sufficient computation capabilities to participate in the learning of a deep neural network (DNN) model. However, in real-life applications, some clients may have severely limited resources and can only train a much smaller local model. This paper presents ScaleFL, a novel FL approach with two distinctive mechanisms to handle resource heterogeneity and provide an equitable FL framework for all clients. First, ScaleFL adaptively scales down the DNN model along width and depth dimensions by leveraging early exits to find the best-fit models for resource-aware local training on distributed clients. In this way, ScaleFL provides an efficient balance of preserving basic and complex features in local model splits with various sizes for joint training while enabling fast inference for model deployment. Second, ScaleFL utilizes self-distillation among exit predictions during training to improve aggregation through knowledge transfer among subnetworks. We conduct extensive experiments on benchmark CV (CIFAR-10/100, ImageNet) and NLP datasets (SST-2, AgNews). We demonstrate that ScaleFL outperforms existing representative heterogeneous FL approaches in terms of global/local model performance and provides inference efficiency, with up to 2x latency and 4x model size reduction with negligible performance drop below 2%.
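The self-distillation among exits can be sketched as a KL term pulling each earlier exit toward the deepest exit's prediction; this simplified form omits the temperature scaling and loss weighting a real implementation would likely use:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_distill_loss(exit_logits):
    """KL(teacher || student) from the deepest exit (teacher) to every
    earlier exit, transferring knowledge among the subnetworks that
    different-capacity clients train."""
    teacher = softmax(exit_logits[-1])
    loss = 0.0
    for logits in exit_logits[:-1]:
        student = softmax(logits)
        loss += np.sum(teacher * (np.log(teacher) - np.log(student)))
    return loss / (len(exit_logits) - 1)

exits = [np.array([[1.0, 0.0, 0.0]]),    # shallow exit (small clients)
         np.array([[2.0, 0.5, 0.0]]),    # middle exit
         np.array([[3.0, 0.0, -1.0]])]   # deepest exit (teacher)
loss = self_distill_loss(exits)
```

Because every client trains some prefix of the same network, aligning exit predictions this way makes the aggregated subnetworks consistent with each other.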

Reliable and Interpretable Personalized Federated Learning
Qin, Zixuan and Yang, Liu and Wang, Qilong and Han, Yahong and Hu, Qinghua



Research question: How to design a reliable personalized federated learning method that makes better use of group knowledge.
Motivation: When client data distributions differ greatly, federated learning needs a reliable client selection strategy and an interpretable client communication framework to better utilize group knowledge.
Method: We propose RIPFL, a reliable personalized federated learning approach that effectively integrates personal information with the social information generated by the global model, from the perspective of Bayesian decision rules and evidence theory.
Results: Experiments show that the method achieves superior robustness and accuracy compared with other state-of-the-art federated learning algorithms.

Federated learning can coordinate multiple users to participate in data training while ensuring data privacy. The collaboration of multiple agents allows for a natural connection between federated learning and collective intelligence. When there are large differences in data distribution among clients, it is crucial for federated learning to design a reliable client selection strategy and an interpretable client communication framework to better utilize group knowledge. Herein, a reliable personalized federated learning approach, termed RIPFL, is proposed and fully interpreted from the perspective of social learning. RIPFL reliably selects and divides the clients involved in training such that each client can use different amounts of social information and more effectively communicate with other clients. Simultaneously, the method effectively integrates personal information with the social information generated by the global model from the perspective of Bayesian decision rules and evidence theory, enabling individuals to grow better with the help of collective wisdom. The interpretable federated learning framework is also highly scalable, and the experimental results indicate that the proposed method has superior robustness and accuracy compared with other state-of-the-art federated learning algorithms.

Equivalent Transformation and Dual Stream Network Construction for Mobile Image Super-Resolution
Chao, Jiahao and Zhou, Zhou and Gao, Hongfan and Gong, Jiali and Yang, Zhengfeng and Zeng, Zhenbing and Dehbi, Lydia



Research question: In recent years, there has been increasing demand for real-time super-resolution networks on mobile devices.
Motivation: Although many lightweight super-resolution models have been proposed, they still contain time-consuming components that increase inference latency, limiting their real-world application on mobile devices.
Method: We propose a new single-image super-resolution model based on Equivalent Transformation and Dual Stream network construction (ETDS). The equivalent transformation converts time-consuming operators into mobile-friendly ones such as convolution and ReLU, and a dual-stream network is designed to alleviate the redundant parameters introduced by the transformation and to enhance feature extraction.
Results: Taking full advantage of the equivalent transformation and the dual-stream structure, we develop the efficient SR model ETDS for mobile devices. Experiments show that ETDS achieves superior inference speed and reconstruction quality compared with prior lightweight SR methods on mobile devices. The code is available at https://github.com/ECNUSR/ETDS.

In recent years, there has been an increasing demand for real-time super-resolution networks on mobile devices. To address this issue, many lightweight super-resolution models have been proposed. However, these models still contain time-consuming components that increase inference latency, limiting their real-world applications on mobile devices. In this paper, we propose a novel model for single-image super-resolution based on Equivalent Transformation and Dual Stream network construction (ETDS). The ET method is proposed to transform time-consuming operators into time-friendly ones such as convolution and ReLU on mobile devices. Then, a dual stream network is designed to alleviate redundant parameters yielded from ET and enhance the feature extraction ability. Taking full advantage of the advance of ET and the dual stream network structure, we develop the efficient SR model ETDS for mobile devices. The experimental results demonstrate that our ETDS achieves superior inference speed and reconstruction quality compared to prior lightweight SR methods on mobile devices. The code is available at https://github.com/ECNUSR/ETDS.

DyNCA: Real-Time Dynamic Texture Synthesis Using Neural Cellular Automata
Pajouheshgar, Ehsan and Xu, Yitao and Zhang, Tong and Süsstrunk, Sabine



Research question: Current dynamic texture synthesis models require a slow iterative optimization process to synthesize a single fixed-size short video and offer no post-training control over the synthesis process.
Motivation: We propose Dynamic Neural Cellular Automata (DyNCA), a framework for real-time and controllable dynamic texture synthesis of realistic video textures.
Method: Built upon the recently introduced NCA models, DyNCA can synthesize infinitely long, arbitrary-size realistic video textures in real time.
Results: Quantitative and qualitative evaluation shows that our synthesized videos appear more realistic than existing results, improving on state-of-the-art DyTS performance by 2 to 4 orders of magnitude. The model also offers several real-time video controls, including motion speed, motion direction, and an editing brush tool. We exhibit our trained models in an online interactive demo that runs on local hardware and is accessible on personal computers and smartphones.

Current Dynamic Texture Synthesis (DyTS) models can synthesize realistic videos. However, they require a slow iterative optimization process to synthesize a single fixed-size short video, and they do not offer any post-training control over the synthesis process. We propose Dynamic Neural Cellular Automata (DyNCA), a framework for real-time and controllable dynamic texture synthesis. Our method is built upon the recently introduced NCA models and can synthesize infinitely long and arbitrary-size realistic video textures in real-time. We quantitatively and qualitatively evaluate our model and show that our synthesized videos appear more realistic than the existing results. We improve the SOTA DyTS performance by 2 to 4 orders of magnitude. Moreover, our model offers several real-time video controls including motion speed, motion direction, and an editing brush tool. We exhibit our trained models in an online interactive demo that runs on local hardware and is accessible on personal computers and smartphones.

Ultrahigh Resolution Image/Video Matting With Spatio-Temporal Sparsity
Sun, Yanan and Tang, Chi-Keung and Tai, Yu-Wing



Research question: How to perform high-quality matting on ultra-high-resolution (UHR) images and videos efficiently.
Motivation: Existing matting algorithms cannot directly process full-resolution UHR images, while patch-based approaches can introduce unsightly artifacts.
Method: We propose SparseMat, a new method that exploits spatial and temporal sparsity to solve general UHR matting. When processing videos, rational use of spatio-temporal sparsity greatly reduces computational redundancy.
Results: Experiments show that SparseMat can effectively and efficiently generate high-quality alpha mattes for UHR images and videos in one shot.

Commodity ultra-high definition (UHD) displays are becoming more affordable which demand imaging in ultra high resolution (UHR). This paper proposes SparseMat, a computationally efficient approach for UHR image/video matting. Note that it is infeasible to directly process UHR images at full resolution in one shot using existing matting algorithms without running out of memory on consumer-level computational platforms, e.g., Nvidia 1080Ti with 11G memory, while patch-based approaches can introduce unsightly artifacts due to patch partitioning. Instead, our method resorts to spatial and temporal sparsity for solving general UHR matting. During processing videos, huge computation redundancy can be reduced through the rational use of spatial and temporal sparsity. In this paper, we show how to effectively estimate spatio-temporal sparsity, which serves as a gate to activate input pixels for the matting model. Under the guidance of such sparsity, our method discards patch-based inference in lieu of memory-efficient and full-resolution matte refinement. Extensive experiments demonstrate that SparseMat can effectively and efficiently generate high-quality alpha matte for UHR images and videos in one shot.
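The temporal part of the sparsity gate can be sketched as a simple frame-difference threshold; the paper's spatio-temporal sparsity estimation is learned and more involved, so this only illustrates the gating mechanism:

```python
import numpy as np

def sparsity_gate(prev_frame, cur_frame, thresh=0.05):
    """Temporal sparsity: only pixels that changed enough since the last
    frame are activated for matte refinement; unchanged pixels can reuse
    cached results, cutting redundant computation on UHR video."""
    return np.abs(cur_frame - prev_frame) > thresh

prev = np.zeros((4, 4))
cur = prev.copy()
cur[1, 1] = 0.5                     # one small moving region
gate = sparsity_gate(prev, cur)     # boolean activation mask
active_ratio = gate.mean()          # fraction of pixels recomputed
```

At 4K/8K resolution, an active ratio this small is exactly what makes one-shot full-resolution refinement affordable.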

Tunable Convolutions With Parametric Multi-Loss Optimization
Maggioni, Matteo and Tanay, Thomas and Babiloni, Francesca and McDonagh, Steven and Leonardis, Aleš



Research question: How to adjust the behavior of a neural network at inference time according to external factors, especially to balance the perception-distortion trade-off in image-to-image translation tasks.
Motivation: In conventional convolutional networks the choice of loss and data is fixed at training time, yet at inference the model should adapt to user preferences or the dynamic characteristics of the data.
Method: Proposes a parametric tunable convolutional layer that contains multiple distinct kernels and is optimized with a parametric multi-loss containing an equal number of objectives. A shared set of parameters dynamically interpolates both the objectives and the kernels: during training these parameters are sampled at random to explicitly optimize all possible combinations of objectives, and at inference they become interactive inputs of the model, enabling reliable and consistent control over its behavior.
Results: Experiments show that tunable convolutions work as a drop-in replacement for traditional convolutions in existing networks at virtually no extra computational cost, outperforming state-of-the-art control strategies across a wide range of applications, including image denoising, deblurring, super-resolution, and style transfer.

Behavior of neural networks is irremediably determined by the specific loss and data used during training. However, it is often desirable to tune the model at inference time based on external factors such as preferences of the user or dynamic characteristics of the data. This is especially important to balance the perception-distortion trade-off of ill-posed image-to-image translation tasks. In this work, we propose to optimize a parametric tunable convolutional layer, which includes a number of different kernels, using a parametric multi-loss, which includes an equal number of objectives. Our key insight is to use a shared set of parameters to dynamically interpolate both the objectives and the kernels. During training, these parameters are sampled at random to explicitly optimize all possible combinations of objectives and consequently disentangle their effect into the corresponding kernels. During inference, these parameters become interactive inputs of the model hence enabling reliable and consistent control over the model behavior. Extensive experimental results demonstrate that our tunable convolutions effectively work as a drop-in replacement for traditional convolutions in existing neural networks at virtually no extra computational cost, outperforming state-of-the-art control strategies in a wide range of applications; including image denoising, deblurring, super-resolution, and style transfer.
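
The shared-parameter interpolation described above can be sketched numerically. This is a hedged toy of the core idea only (one shared weight vector mixing p kernels, with the same weights assumed to mix the p losses), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def tunable_conv_weight(kernels, w):
    """Interpolate p kernels with shared parameters w (simplex weights).
    In the paper's scheme the same w also weights the p training
    objectives, so each kernel disentangles the effect of one loss."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                          # project onto the simplex
    return np.tensordot(w, kernels, axes=1)  # sum_i w_i * K_i

p, k = 3, 3
kernels = rng.standard_normal((p, k, k))     # one kernel per objective
w = [1.0, 0.0, 0.0]                          # pure objective 1 at inference
W = tunable_conv_weight(kernels, w)          # equals kernels[0]
```

At inference, sliding `w` between corners of the simplex would smoothly trade one objective off against another, which is the control behavior the abstract describes.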

Dense Network Expansion for Class Incremental Learning
Hu, Zhiyuan and Li, Yunsheng and Lyu, Jiancheng and Gao, Dashan and Vasconcelos, Nuno



Research question: The problem of class incremental learning (CIL) is considered.
Motivation: Existing methods use dynamic architectures based on network expansion (NE), adding one task expert per task; this is effective but makes the model grow quickly.
Method: Proposes dense network expansion (DNE), a new method that achieves a better trade-off between accuracy and model complexity. Dense connections are introduced between the intermediate layers of the task expert networks, so that knowledge transfers from old to new tasks via feature sharing and reuse.
Results: Experiments show that DNE strictly maintains the feature space of old classes while growing the network and feature scale at a much slower rate than previous methods. In terms of accuracy, DNE outperforms the previous state of the art by 4% with similar or even smaller model scale.

The problem of class incremental learning (CIL) is considered. State-of-the-art approaches use a dynamic architecture based on network expansion (NE), in which a task expert is added per task. While effective from a computational standpoint, these methods lead to models that grow quickly with the number of tasks. A new NE method, dense network expansion (DNE), is proposed to achieve a better trade-off between accuracy and model complexity. This is accomplished by the introduction of dense connections between the intermediate layers of the task expert networks, that enable the transfer of knowledge from old to new tasks via feature sharing and reusing. This sharing is implemented with a cross-task attention mechanism, based on a new task attention block (TAB), that fuses information across tasks. Unlike traditional attention mechanisms, TAB operates at the level of the feature mixing and is decoupled from spatial attention. This is shown to be more effective than a joint spatial-and-task attention for CIL. The proposed DNE approach can strictly maintain the feature space of old classes while growing the network and feature scale at a much slower rate than previous methods. As a result, it outperforms the previous SOTA methods by a margin of 4% in terms of accuracy, with similar or even smaller model scale.

Rethinking Gradient Projection Continual Learning: Stability / Plasticity Feature Space Decoupling
Zhao, Zhen and Zhang, Zhizhong and Tan, Xin and Liu, Jun and Qu, Yanyun and Xie, Yuan and Ma, Lizhuang



Research question: In continual learning, how can a model keep learning new classes without forgetting previously acquired knowledge?
Motivation: Existing methods require the gradient to be fully orthogonal to the whole feature space; as new tasks arrive, the feature space expands without limit and the feasible gradient directions narrow, hurting the model's plasticity.
Method: Proposes a space decoupling (SD) algorithm that decouples the feature space into two complementary subspaces: a stability space I and a plasticity space R. I is built from the intersection of the historic and current feature spaces and contains more task-shared bases; R is the orthogonal complement of I and mainly contains task-specific bases. Putting distinguishing constraints on R and I achieves a better balance between stability and plasticity.
Results: Experiments show that SD, applied to gradient-projection baselines, is model-agnostic and achieves state-of-the-art results on publicly available datasets.

Continual learning aims to incrementally learn novel classes over time, while not forgetting the learned knowledge. Recent studies have found that learning would not forget if the updated gradient is orthogonal to the feature space. However, previous approaches require the gradient to be fully orthogonal to the whole feature space, leading to poor plasticity, as the feasible gradient direction becomes narrow when the tasks continually come, i.e., feature space is unlimitedly expanded. In this paper, we propose a space decoupling (SD) algorithm to decouple the feature space into a pair of complementary subspaces, i.e., the stability space I, and the plasticity space R. I is established by conducting space intersection between the historic and current feature space, and thus I contains more task-shared bases. R is constructed by seeking the orthogonal complementary subspace of I, and thus R mainly contains more task-specific bases. By putting the distinguishing constraints on R and I, our method achieves a better balance between stability and plasticity. Extensive experiments applying SD to gradient projection baselines show that it is model-agnostic and achieves SOTA results on publicly available datasets.
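
The subspace split described above can be illustrated with principal angles. This is a minimal sketch under simplifying assumptions (explicit orthonormal bases, exact intersection via singular values near 1), not the paper's algorithm:

```python
import numpy as np

def decouple_spaces(A, B, tol=1e-6):
    """Split span(B) into a 'stability' part I ~ span(A) ∩ span(B)
    (task-shared directions) and its orthogonal complement R within
    span(B) (task-specific directions), via principal angles."""
    Qa, _ = np.linalg.qr(A)                  # historic feature space
    Qb, _ = np.linalg.qr(B)                  # current feature space
    U, s, Vt = np.linalg.svd(Qa.T @ Qb)
    shared = s > 1 - tol                     # cos(angle) == 1 -> intersection
    I = Qb @ Vt.T[:, shared]                 # stability space
    R = Qb @ Vt.T[:, ~shared]                # plasticity space
    return I, R

d = 5
A = np.eye(d)[:, :3]                         # historic space: e1, e2, e3
B = np.eye(d)[:, 2:5]                        # current space:  e3, e4, e5
I, R = decouple_spaces(A, B)                 # I spans e3; R spans e4, e5
```

A gradient-projection continual learner would then constrain updates differently along `I` (protect stability) and `R` (allow plasticity), which is the balance the abstract describes.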

DisWOT: Student Architecture Search for Distillation WithOut Training
Dong, Peijie and Li, Lujun and Wei, Zimian



Research question: How to effectively train lightweight student models to improve their performance?
Motivation: Existing knowledge distillation (KD) strategies yield limited gains when the architecture gap between the teacher-student pair is large.
Method: We propose a training-free framework that searches for the best student architecture for a given teacher via an evolutionary algorithm. We first empirically find that the optimal model under vanilla training is not necessarily the winner in distillation. Second, we find that the similarity of feature semantics and sample relations between randomly initialized teacher-student networks correlates well with the final distillation performance. We therefore select the best student via similarity matrices conditioned on semantic activation maps, significantly improving performance in the distillation stage with at least 180x training acceleration.
Results: Experiments on CIFAR, ImageNet, and NAS-Bench-201 show that our technique achieves state-of-the-art results across different search spaces.

Knowledge distillation (KD) is an effective training strategy to improve the lightweight student models under the guidance of cumbersome teachers. However, the large architecture difference across the teacher-student pairs limits the distillation gains. In contrast to previous adaptive distillation methods to reduce the teacher-student gap, we explore a novel training-free framework to search for the best student architectures for a given teacher. Our work first empirically shows that the optimal model under vanilla training cannot be the winner in distillation. Secondly, we find that the similarity of feature semantics and sample relations between random-initialized teacher-student networks has a good correlation with final distillation performance. Thus, we efficiently measure similarity matrices conditioned on the semantic activation maps to select the optimal student via an evolutionary algorithm without any training. In this way, our student architecture search for Distillation WithOut Training (DisWOT) significantly improves the performance of the model in the distillation stage with at least 180x training acceleration. Additionally, we extend similarity metrics in DisWOT as new distillers and KD-based zero-proxies. Our experiments on CIFAR, ImageNet and NAS-Bench-201 demonstrate that our technique achieves state-of-the-art results on different search spaces. Our project and code are available at https://lilujunai.github.io/DisWOT-CVPR2023/.
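
A training-free proxy in the spirit of the sample-relation similarity above can be sketched as follows. This is an illustrative toy (random features standing in for the features of randomly initialized networks), not the DisWOT scoring function:

```python
import numpy as np

def relation_similarity(feat_t, feat_s):
    """Toy zero-cost proxy: correlate the sample-relation (Gram) matrices
    of teacher and student features; a higher score suggests the student
    organizes samples like the teacher, hence a better distillation fit."""
    def gram(f):
        f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)
        return f @ f.T
    gt, gs = gram(feat_t).ravel(), gram(feat_s).ravel()
    gt, gs = gt - gt.mean(), gs - gs.mean()
    return float(gt @ gs / (np.linalg.norm(gt) * np.linalg.norm(gs) + 1e-8))

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))               # 8 samples, teacher features
score_same = relation_similarity(feats, feats)     # identical relations -> 1.0
score_rand = relation_similarity(feats, rng.standard_normal((8, 16)))
```

An evolutionary search would rank candidate students by such a score and keep the best, avoiding any training during the search.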

Independent Component Alignment for Multi-Task Learning
Senushkin, Dmitry and Patakin, Nikolay and Kuznetsov, Arseny and Konushin, Anton



Research question: Multi-task learning (MTL) optimization suffers from conflicting and dominating gradients.
Motivation: Propose a new MTL optimization method that improves performance by eliminating instability in the training process.
Method: Proposes a stability criterion based on the condition number of the linear system of gradients, and from it a new MTL optimization method, Aligned-MTL.
Results: Experiments show that the method consistently improves performance on a diverse set of MTL benchmarks, including semantic and instance segmentation, depth estimation, surface normal estimation, and reinforcement learning.

In a multi-task learning (MTL) setting, a single model is trained to tackle a diverse set of tasks jointly. Despite rapid progress in the field, MTL remains challenging due to optimization issues such as conflicting and dominating gradients. In this work, we propose using a condition number of a linear system of gradients as a stability criterion of an MTL optimization. We theoretically demonstrate that a condition number reflects the aforementioned optimization issues. Accordingly, we present Aligned-MTL, a novel MTL optimization approach based on the proposed criterion, that eliminates instability in the training process by aligning the orthogonal components of the linear system of gradients. While many recent MTL approaches guarantee convergence to a minimum, task trade-offs cannot be specified in advance. In contrast, Aligned-MTL provably converges to an optimal point with pre-defined task-specific weights, which provides more control over the optimization result. Through experiments, we show that the proposed approach consistently improves performance on a diverse set of MTL benchmarks, including semantic and instance segmentation, depth estimation, surface normal estimation, and reinforcement learning.
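
The condition-number criterion above is easy to illustrate. This is a toy diagnostic only, showing why the criterion flags dominating gradients; it is not the Aligned-MTL alignment procedure itself:

```python
import numpy as np

def gradient_condition_number(grads):
    """Condition number of the linear system of task gradients
    (rows = per-task gradients). Large values signal dominating or
    conflicting gradients, the instabilities Aligned-MTL targets."""
    G = np.asarray(grads, dtype=float)
    s = np.linalg.svd(G, compute_uv=False)
    return s[0] / s[-1]

g_balanced = [[1.0, 0.0], [0.0, 1.0]]        # orthogonal, equal-norm gradients
g_dominating = [[10.0, 0.0], [0.0, 0.1]]     # one task dominates the other
```

For `g_balanced` the condition number is 1 (perfectly stable); for `g_dominating` it is 100, so Aligned-MTL would rescale the orthogonal components before taking an update step.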

Improved Distribution Matching for Dataset Condensation
Zhao, Ganlong and Li, Guanbin and Qin, Yipeng and Yu, Yizhou



Research question: This paper addresses the heavy computation and low efficiency of conventional dataset condensation methods during optimization.
Motivation: Existing dataset condensation methods mainly compress data by gradient or parameter matching during model optimization, which is computationally intensive even on small datasets and models.
Method: Proposes a new dataset condensation method based on distribution matching, which is more efficient and promising. We identify two important shortcomings of naive distribution matching (imbalanced feature numbers and unvalidated embeddings for distance computation) and address them with three novel techniques: partitioning and expansion augmentation, efficient and enriched model sampling, and class-aware distribution regularization.
Results: Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources, scaling dataset condensation to larger datasets and models. Extensive experiments demonstrate the effectiveness of our method. Code is available at https://github.com/uitrbn/IDM.

Dataset Condensation aims to condense a large dataset into a smaller one while maintaining its ability to train a well-performing model, thus reducing the storage cost and training effort in deep learning applications. However, conventional dataset condensation methods are optimization-oriented and condense the dataset by performing gradient or parameter matching during model optimization, which is computationally intensive even on small datasets and models. In this paper, we propose a novel dataset condensation method based on distribution matching, which is more efficient and promising. Specifically, we identify two important shortcomings of naive distribution matching (i.e., imbalanced feature numbers and unvalidated embeddings for distance computation) and address them with three novel techniques (i.e., partitioning and expansion augmentation, efficient and enriched model sampling, and class-aware distribution regularization). Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources, thereby scaling data condensation to larger datasets and models. Extensive experiments demonstrate the effectiveness of our method. Codes are available at https://github.com/uitrbn/IDM
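
The distribution-matching objective can be sketched in its simplest form. This is a hedged toy (random linear embeddings standing in for sampled networks, matching only first moments), not the paper's improved method:

```python
import numpy as np

rng = np.random.default_rng(0)

def distribution_matching_loss(real, syn, n_models=5, dim=4):
    """Toy distribution-matching objective: under several randomly sampled
    linear embeddings (stand-ins for sampled networks), penalize the
    distance between real and synthetic feature means."""
    d = real.shape[1]
    loss = 0.0
    for _ in range(n_models):
        W = rng.standard_normal((d, dim)) / np.sqrt(d)
        loss += np.sum((real.mean(0) @ W - syn.mean(0) @ W) ** 2)
    return loss / n_models

real = rng.standard_normal((100, 8)) + 2.0   # real data for one class
perfect = real.copy()                        # synthetic set with matching stats
```

Minimizing such a loss over the synthetic pixels avoids the inner model-optimization loop that makes gradient/parameter matching expensive.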

Slimmable Dataset Condensation
Liu, Songhua and Ye, Jingwen and Yu, Runpeng and Wang, Xinchao



Research question: When the budget changes, existing dataset distillation methods must revisit the original dataset and repeat the synthesis process, which is cumbersome and may be infeasible.
Motivation: To solve this, the paper proposes slimmable dataset condensation, which extracts a smaller synthetic dataset given only the previous condensation results.
Method: We first study the limitations of existing dataset condensation algorithms in this successive-compression setting and identify two key factors: (1) the inconsistency of neural networks over different compression times and (2) the underdetermined solution space for synthetic data. We therefore propose a new training objective for slimmable dataset condensation that explicitly accounts for both factors. Moreover, the synthetic dataset adopts a significance-aware parameterization. Theoretical derivation shows that an upper-bounded error can be achieved by discarding the minor components without training; alternatively, if training is allowed, this strategy serves as a strong initialization for fast convergence.
Results: Extensive comparisons and ablations demonstrate that the proposed method outperforms existing methods on multiple benchmarks.

Dataset distillation, also known as dataset condensation, aims to compress a large dataset into a compact synthetic one. Existing methods perform dataset condensation by assuming a fixed storage or transmission budget. When the budget changes, however, they have to repeat the synthesizing process with access to original datasets, which is highly cumbersome if not infeasible at all. In this paper, we explore the problem of slimmable dataset condensation, to extract a smaller synthetic dataset given only previous condensation results. We first study the limitations of existing dataset condensation algorithms on such a successive compression setting and identify two key factors: (1) the inconsistency of neural networks over different compression times and (2) the underdetermined solution space for synthetic data. Accordingly, we propose a novel training objective for slimmable dataset condensation to explicitly account for both factors. Moreover, synthetic datasets in our method adopt a significance-aware parameterization. Theoretical derivation indicates that an upper-bounded error can be achieved by discarding the minor components without training. Alternatively, if training is allowed, this strategy can serve as a strong initialization that enables fast convergence. Extensive comparisons and ablations demonstrate the superiority of the proposed solution over existing methods on multiple benchmarks.
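
The "discard minor components without training" idea can be sketched with a principal-component parameterization. This is a simplified illustration under the assumption that significance is ordered by singular value; it is not the paper's parameterization:

```python
import numpy as np

def slim_by_components(data, keep):
    """Toy significance-aware slimming: parameterize the synthetic set by
    its principal components and shrink the budget by dropping the
    least-significant ones, with no retraining."""
    mu = data.mean(0)
    U, s, Vt = np.linalg.svd(data - mu, full_matrices=False)
    return mu + (U[:, :keep] * s[:keep]) @ Vt[:keep]

rng = np.random.default_rng(0)
base = rng.standard_normal((20, 2))
# 5-D data of intrinsic rank 2: extra columns are linear combinations
data = np.hstack([base, base @ rng.standard_normal((2, 3))])
slimmed = slim_by_components(data, keep=2)   # lossless here, rank is 2
```

When the retained components cover the data's intrinsic rank the slimming is lossless; otherwise the error is bounded by the discarded singular values, matching the upper-bound intuition in the abstract.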

Data-Free Knowledge Distillation via Feature Exchange and Activation Region Constraint
Yu, Shikang and Chen, Jiachen and Han, Hu and Jiang, Shuqiang



Research question: Despite great progress on data-free knowledge distillation (DFKD) based on synthetic data generation, limitations remain in diverse and efficient data synthesis.
Motivation: Simply combining generative-network-based data synthesis with data augmentation does not solve these issues. This paper therefore proposes a new data-free knowledge distillation method (SpaceshipNet) based on channel-wise feature exchange (CFE) and a multi-scale spatial activation region consistency (mSARC) constraint.
Method: Specifically, CFE lets the generative network sample better from the feature space and efficiently synthesize diverse images for learning the student network. However, CFE alone can severely amplify unwanted noise in the synthesized images, which may fail to improve distillation or even harm it. We therefore propose mSARC to ensure that the student network imitates not only the teacher's outputs but also its spatial activation regions, mitigating the influence of unwanted noise in diverse synthetic images on distillation learning.
Results: Extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet, Imagenette, and ImageNet100 show that our method works well with different backbone networks and outperforms state-of-the-art DFKD methods. Code will be available at https://github.com/skgyu/SpaceshipNet.

Despite the tremendous progress on data-free knowledge distillation (DFKD) based on synthetic data generation, there are still limitations in diverse and efficient data synthesis. It is naive to expect that a simple combination of generative network-based data synthesis and data augmentation will solve these issues. Therefore, this paper proposes a novel data-free knowledge distillation method (SpaceshipNet) based on channel-wise feature exchange (CFE) and multi-scale spatial activation region consistency (mSARC) constraint. Specifically, CFE allows our generative network to better sample from the feature space and efficiently synthesize diverse images for learning the student network. However, using CFE alone can severely amplify the unwanted noises in the synthesized images, which may result in failure to improve distillation learning and even have negative effects. Therefore, we propose mSARC to assure the student network can imitate not only the logit output but also the spatial activation region of the teacher network in order to alleviate the influence of unwanted noises in diverse synthetic images on distillation learning. Extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet, Imagenette, and ImageNet100 show that our method can work well with different backbone networks, and outperform the state-of-the-art DFKD methods. Code will be available at: https://github.com/skgyu/SpaceshipNet.

FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation
He, Junjie and Li, Pengyu and Geng, Yifeng and Xie, Xuansong



Research question: This paper addresses the fact that the advantages of query-based models for instance segmentation have not been well demonstrated on efficient real-time benchmarks.
Motivation: Although query-based models are non-maximum suppression (NMS)-free and end-to-end, their superiority on high-accuracy real-time benchmarks remains to be demonstrated.
Method: Proposes FastInst, a simple and effective query-based framework for real-time instance segmentation. FastInst follows the meta-architecture of the recently introduced Mask2Former; its key designs include instance activation-guided queries, a dual-path update strategy, and ground-truth mask-guided learning, which enable a lighter pixel decoder and fewer Transformer decoder layers while achieving better performance.
Results: Experiments show that FastInst outperforms most state-of-the-art real-time counterparts, including strong fully convolutional baselines, in both speed and accuracy.

Recent attention in instance segmentation has focused on query-based models. Despite being non-maximum suppression (NMS)-free and end-to-end, the superiority of these models on high-accuracy real-time benchmarks has not been well demonstrated. In this paper, we show the strong potential of query-based models on efficient instance segmentation algorithm designs. We present FastInst, a simple, effective query-based framework for real-time instance segmentation. FastInst can execute at a real-time speed (i.e., 32.5 FPS) while yielding an AP of more than 40 (i.e., 40.5 AP) on COCO test-dev without bells and whistles. Specifically, FastInst follows the meta-architecture of recently introduced Mask2Former. Its key designs include instance activation-guided queries, dual-path update strategy, and ground truth mask-guided learning, which enable us to use lighter pixel decoders, fewer Transformer decoder layers, while achieving better performance. The experiments show that FastInst outperforms most state-of-the-art real-time counterparts, including strong fully convolutional baselines, in both speed and accuracy. Code can be found at https://github.com/junjiehe96/FastInst.

Transformer-Based Learned Optimization
Gärtner, Erik and Metz, Luke and Andriluka, Mykhaylo and Freeman, C. Daniel and Sminchisescu, Cristian



Research question: Propose a new approach to learned optimization in which the computation of an optimizer's update step is represented by a neural network.
Motivation: Several recent learned-optimization approaches either cannot condition across the dimensions of the target problem's parameter space or must be retrained for optimization tasks of different dimensionality.
Method: Introduces Optimus, a Transformer-based architecture for the learned optimizer inspired by the classic BFGS algorithm. As in BFGS, a preconditioning matrix is estimated as a sum of rank-one updates, but a Transformer-based network predicts these updates jointly with the step length and direction. The optimizer's parameters are learned by training on a set of optimization tasks with the objective of performing minimization efficiently.
Results: Demonstrates the advantages of the approach on a benchmark of objective functions traditionally used to evaluate optimization algorithms, as well as on the real-world task of physics-based visual reconstruction of articulated 3D human motion.

We propose a new approach to learned optimization where we represent the computation of an optimizer's update step using a neural network. The parameters of the optimizer are then learned by training on a set of optimization tasks with the objective to perform minimization efficiently. Our innovation is a new neural network architecture, Optimus, for the learned optimizer inspired by the classic BFGS algorithm. As in BFGS, we estimate a preconditioning matrix as a sum of rank-one updates but use a Transformer-based neural network to predict these updates jointly with the step length and direction. In contrast to several recent learned optimization-based approaches, our formulation allows for conditioning across the dimensions of the parameter space of the target problem while remaining applicable to optimization tasks of variable dimensionality without retraining. We demonstrate the advantages of our approach on a benchmark composed of objective functions traditionally used for the evaluation of optimization algorithms, as well as on the real world-task of physics-based visual reconstruction of articulated 3d human motion.

Dealing With Cross-Task Class Discrimination in Online Continual Learning
Guo, Yiduo and Liu, Bing and Zhao, Dongyan



Research question: This paper examines a challenge overlooked by existing continual learning research, cross-task class discrimination (CTCD): how to establish decision boundaries between the classes of a new task and those of old tasks with no (or limited) access to old task data.
Motivation: Replay-based methods partially address CTCD, but they suffer from a dynamic training bias during online continual learning that reduces the effectiveness of the replay data in solving the CTCD problem.
Method: Proposes a novel optimization objective with a gradient-based adaptive method to deal with the problem dynamically during the online continual learning process.
Results: Experimental results show that the new method achieves much better results in online continual learning.

Existing continual learning (CL) research regards catastrophic forgetting (CF) as almost the only challenge. This paper argues for another challenge in class-incremental learning (CIL), which we call cross-task class discrimination (CTCD), i.e., how to establish decision boundaries between the classes of the new task and old tasks with no (or limited) access to the old task data. CTCD is implicitly and partially dealt with by replay-based methods. A replay method saves a small amount of data (replay data) from previous tasks. When a batch of current task data arrives, the system jointly trains the new data and some sampled replay data. The replay data enables the system to partially learn the decision boundaries between the new classes and the old classes as the amount of the saved data is small. However, this paper argues that the replay approach also has a dynamic training bias issue which reduces the effectiveness of the replay data in solving the CTCD problem. A novel optimization objective with a gradient-based adaptive method is proposed to dynamically deal with the problem in the online CL process. Experimental results show that the new method achieves much better results in online CL.

A-La-Carte Prompt Tuning (APT): Combining Distinct Data via Composable Prompting
Bowman, Benjamin and Achille, Alessandro and Zancato, Luca and Trager, Matthew and Perera, Pramuditha and Paolini, Giovanni and Soatto, Stefano



Research question: How to train individual prompts on distinct data sources and compose them arbitrarily at inference time?
Motivation: To address the problem that a model trained on a specific data source cannot adapt to other data sources.
Method: Proposes A-la-carte Prompt Tuning (APT), a transformer-based scheme in which individual prompts can be trained in isolation, possibly on different devices, at different times, and on different distributions or domains. Each prompt contains information only about the subset of data it was exposed to during training. At inference, models can be assembled from arbitrary selections of data sources, which is called a-la-carte learning.
Results: Experiments show that a-la-carte built models achieve accuracy within 5% of models trained on the union of the respective data sources, with comparable training and inference cost. On the continual learning benchmarks Split CIFAR-100 and CORe50, the method achieves state-of-the-art performance.

We introduce A-la-carte Prompt Tuning (APT), a transformer-based scheme to tune prompts on distinct data so that they can be arbitrarily composed at inference time. The individual prompts can be trained in isolation, possibly on different devices, at different times, and on different distributions or domains. Furthermore each prompt only contains information about the subset of data it was exposed to during training. During inference, models can be assembled based on arbitrary selections of data sources, which we call a-la-carte learning. A-la-carte learning enables constructing bespoke models specific to each user's individual access rights and preferences. We can add or remove information from the model by simply adding or removing the corresponding prompts without retraining from scratch. We demonstrate that a-la-carte built models achieve accuracy within 5% of models trained on the union of the respective sources, with comparable cost in terms of training and inference time. For the continual learning benchmarks Split CIFAR-100 and CORe50, we achieve state-of-the-art performance.

Computationally Budgeted Continual Learning: What Does Matter?
Prabhu, Ameya and Al Kader Hammoud, Hasan Abed and Dokania, Puneet K. and Torr, Philip H. S. and Lim, Ser-Nam and Ghanem, Bernard and Bibi, Adel



Research question: This paper addresses continual learning (CL) in practical settings: how to train models effectively under limited computational and time budgets.
Motivation: Current CL methods focus on restricting access to previously seen data while imposing no constraints on the training compute budget. In real applications, however, systems are constrained primarily by computation and time, not storage.
Method: With a large-scale benchmark, the paper analyzes the performance of traditional CL methods in a compute-constrained setting, evaluating various CL sampling strategies, distillation losses, and partial fine-tuning.
Results: Experiments show that under compute constraints, traditional CL methods fail to outperform a simple baseline. This conclusion is consistent across different numbers of stream time steps and several computational budgets, suggesting that most existing CL methods are too computationally expensive for realistic budgeted deployment.

Continual Learning (CL) aims to sequentially train models on streams of incoming data that vary in distribution by preserving previous knowledge while adapting to new data. Current CL literature focuses on restricted access to previously seen data, while imposing no constraints on the computational budget for training. This is unreasonable for applications in-the-wild, where systems are primarily constrained by computational and time budgets, not storage. We revisit this problem with a large-scale benchmark and analyze the performance of traditional CL approaches in a compute-constrained setting, where effective memory samples used in training can be implicitly restricted as a consequence of limited computation. We conduct experiments evaluating various CL sampling strategies, distillation losses, and partial fine-tuning on two large-scale datasets, namely ImageNet2K and Continual Google Landmarks V2 in data incremental, class incremental, and time incremental settings. Through extensive experiments amounting to a total of over 1500 GPU-hours, we find that, under compute-constrained setting, traditional CL approaches, with no exception, fail to outperform a simple minimal baseline that samples uniformly from memory. Our conclusions are consistent in a different number of stream time steps, e.g., 20 to 200, and under several computational budgets. This suggests that most existing CL methods are particularly too computationally expensive for realistic budgeted deployment. Code for this project is available at: https://github.com/drimpossible/BudgetCL.

Decentralized Learning With Multi-Headed Distillation
Zhmoginov, Andrey and Sandler, Mark and Miller, Nolan and Kristiansen, Gus and Vladymyrov, Max



Research question: In decentralized learning, how can multiple agents with private non-iid data learn from each other without sharing data, weights, or weight updates?
Motivation: To tackle the decentralized learning problem in machine learning, allowing multiple agents to learn from one another without sharing their data.
Method: Proposes a novel distillation-based decentralized learning technique that utilizes an unlabeled public dataset and multiple auxiliary heads per client, greatly improving training efficiency in the case of heterogeneous data.
Results: The approach lets individual models preserve and enhance performance on their private tasks while also dramatically improving performance on the globally aggregated data distribution. The study shows that the agents can significantly improve their performance compared with learning in isolation.

Decentralized learning with private data is a central problem in machine learning. We propose a novel distillation-based decentralized learning technique that allows multiple agents with private non-iid data to learn from each other, without having to share their data, weights or weight updates. Our approach is communication efficient, utilizes an unlabeled public dataset and uses multiple auxiliary heads for each client, greatly improving training efficiency in the case of heterogeneous data. This approach allows individual models to preserve and enhance performance on their private tasks while also dramatically improving their performance on the global aggregated data distribution. We study the effects of data and model architecture heterogeneity and the impact of the underlying communication graph topology on learning efficiency and show that our agents can significantly improve their performance compared to learning in isolation.

Heterogeneous Continual Learning
Madaan, Divyam and Yin, Hongxu and Byeon, Wonmin and Kautz, Jan and Molchanov, Pavlo



Research question: With rapid progress in network architecture design, how can existing continual learning (CL) solutions be adapted to new architectures?
Motivation: Most CL methods focus on adapting a single architecture to new tasks/classes by modifying its weights, but as architecture design evolves rapidly, the problem of adapting existing solutions to novel architectures becomes relevant.
Method: We propose Heterogeneous Continual Learning (HCL), in which a wide range of evolving network architectures emerge continually together with new data/tasks. As a solution, we build on the distillation family of techniques and modify it so that a weaker model takes the role of teacher while a new, stronger architecture acts as student. We further consider a setting with limited access to previous data and propose Quick Deep Inversion (QDI) to recover prior-task visual features in support of knowledge transfer.
Results: Our evaluation shows a significant improvement in accuracy over state-of-the-art methods across various network architectures.

We propose a novel framework and a solution to tackle the continual learning (CL) problem with changing network architectures. Most CL methods focus on adapting a single architecture to a new task/class by modifying its weights. However, with rapid progress in architecture design, the problem of adapting existing solutions to novel architectures becomes relevant. To address this limitation, we propose Heterogeneous Continual Learning (HCL), where a wide range of evolving network architectures emerge continually together with novel data/tasks. As a solution, we build on top of the distillation family of techniques and modify it to a new setting where a weaker model takes the role of a teacher; meanwhile, a new stronger architecture acts as a student. Furthermore, we consider a setup of limited access to previous data and propose Quick Deep Inversion (QDI) to recover prior task visual features to support knowledge transfer. QDI significantly reduces computational costs compared to previous solutions and improves overall performance. In summary, we propose a new setup for CL with a modified knowledge distillation paradigm and design a quick data inversion method to enhance distillation. Our evaluation on various benchmarks shows a significant improvement in accuracy compared to state-of-the-art methods across various network architectures.

Deep Graph Reprogramming
Jing, Yongcheng and Yuan, Chongbin and Ju, Li and Yang, Yiding and Wang, Xinchao and Tao, Dacheng



Research question: This paper explores a novel model-reuse task tailored for graph neural networks (GNNs), termed "deep graph reprogramming".
Motivation: To reprogram a pre-trained GNN to handle cross-level downstream tasks in various domains without amending the raw node features or model parameters.
Method: Proposes two innovative paradigms, data reprogramming and model reprogramming. Data reprogramming addresses the challenge of heterogeneous graph feature dimensions across tasks on the input side, while model reprogramming alleviates the dilemma of fixed per-task-per-model behavior.
Results: Experiments on fourteen datasets show that the proposed methods yield gratifying results, on par with re-training from scratch.

In this paper, we explore a novel model reusing task tailored for graph neural networks (GNNs), termed as "deep graph reprogramming". We strive to reprogram a pre-trained GNN, without amending raw node features nor model parameters, to handle a bunch of cross-level downstream tasks in various domains. To this end, we propose an innovative Data Reprogramming paradigm alongside a Model Reprogramming paradigm. The former one aims to address the challenge of diversified graph feature dimensions for various tasks on the input side, while the latter alleviates the dilemma of fixed per-task-per-model behavior on the model side. For data reprogramming, we specifically devise an elaborated Meta-FeatPadding method to deal with heterogeneous input dimensions, and also develop a transductive Edge-Slimming as well as an inductive Meta-GraPadding approach for diverse homogenous samples. Meanwhile, for model reprogramming, we propose a novel task-adaptive Reprogrammable-Aggregator, to endow the frozen model with larger expressive capacities in handling cross-domain tasks. Experiments on fourteen datasets across node/graph classification/regression, 3D object recognition, and distributed action recognition, demonstrate that the proposed methods yield gratifying results, on par with those by re-training from scratch.

Compacting Binary Neural Networks by Sparse Kernel Selection
Wang, Yikai and Huang, Wenbing and Dong, Yinpeng and Sun, Fuchun and Yao, Anbang



Research question: How to compact typical binary neural networks (BNNs) by learning a non-repetitive binary kernel subspace while preserving performance.
Motivation: The binary kernels in successful BNNs are nearly power-law distributed, with values mostly clustered into a small number of codewords. This phenomenon encourages compacting typical BNNs and approaching their performance by learning non-repetitive kernels within a binary kernel subspace.
Method: We regard the binarization process as kernel grouping in terms of a binary codebook, and the task is to learn to select a smaller subset of codewords from the full codebook. We then leverage the Gumbel-Sinkhorn technique to approximate the codeword selection process, and develop the Permutation Straight-Through Estimator (PSTE), which can both optimize the selection process end-to-end and maintain the non-repetitive occupancy of the selected codewords.
Results: Experiments verify that our method reduces both model size and bit-wise computational costs, and achieves accuracy improvements over state-of-the-art BNNs under comparable budgets.

Binary Neural Network (BNN) represents convolution weights with 1-bit values, which enhances the efficiency of storage and computation. This paper is motivated by a previously revealed phenomenon that the binary kernels in successful BNNs are nearly power-law distributed: their values are mostly clustered into a small number of codewords. This phenomenon encourages us to compact typical BNNs and obtain further close performance through learning non-repetitive kernels within a binary kernel subspace. Specifically, we regard the binarization process as kernel grouping in terms of a binary codebook, and our task lies in learning to select a smaller subset of codewords from the full codebook. We then leverage the Gumbel-Sinkhorn technique to approximate the codeword selection process, and develop the Permutation Straight-Through Estimator (PSTE) that is able to not only optimize the selection process end-to-end but also maintain the non-repetitive occupancy of selected codewords. Experiments verify that our method reduces both the model size and bit-wise computational costs, and achieves accuracy improvements compared with state-of-the-art BNNs under comparable budgets.
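
The kernel-grouping view above can be sketched as nearest-codeword assignment. This is a hedged toy of the quantization step only (the learned Gumbel-Sinkhorn selection of the sub-codebook is not modeled here):

```python
import numpy as np

def assign_to_codewords(kernels, codebook):
    """Toy kernel grouping: replace each ±1 binary kernel by its nearest
    codeword (Hamming distance) from a small selected sub-codebook, as in
    compacting a BNN via kernel grouping over a binary codebook."""
    k = kernels.reshape(len(kernels), -1)
    c = codebook.reshape(len(codebook), -1)
    dists = (k[:, None, :] != c[None, :, :]).sum(-1)   # Hamming distances
    idx = dists.argmin(1)                              # nearest codeword index
    return codebook[idx], idx

codebook = np.array([[[1, 1], [1, 1]],
                     [[-1, -1], [-1, -1]]])            # 2 selected codewords
kernels = np.array([[[1, 1], [1, -1]],                 # closer to codeword 0
                    [[-1, -1], [1, -1]]])              # closer to codeword 1
quantised, idx = assign_to_codewords(kernels, codebook)
```

Storing only codeword indices instead of full kernels is where the model-size reduction comes from; the paper additionally learns which codewords to keep.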

EMT-NAS: Transferring Architectural Knowledge Between Tasks From Different Datasets
Liao, Peng and Jin, Yaochu and Du, Wenli



Research question: How to improve deep learning models by jointly training multiple related tasks while avoiding negative transfer.
Motivation: The success of multi-task learning largely stems from the shared representation of related tasks, which helps models generalise better. However, jointly training weight parameters on multiple related tasks may degrade performance, a phenomenon known as negative transfer.
Method: Proposes an evolutionary multi-tasking neural architecture search (EMT-NAS) algorithm that accelerates the search process by transferring architectural knowledge across related tasks. Unlike traditional multi-task learning, each task in EMT-NAS has a personalised network architecture and its own weights, effectively alleviating negative transfer.
Results: Classification experiments on CIFAR-10, CIFAR-100, and four MedMNIST datasets show that EMT-NAS finds competitive neural architectures faster than its single-task counterparts, taking 8% less time on CIFAR and up to 40% less on MedMNIST.

The success of multi-task learning (MTL) can largely be attributed to the shared representation of related tasks, allowing the models to better generalise. In deep learning, this is usually achieved by sharing a common neural network architecture and jointly training the weights. However, the joint training of weighting parameters on multiple related tasks may lead to performance degradation, known as negative transfer. To address this issue, this work proposes an evolutionary multi-tasking neural architecture search (EMT-NAS) algorithm to accelerate the search process by transferring architectural knowledge across multiple related tasks. In EMT-NAS, unlike the traditional MTL, the model for each task has a personalised network architecture and its own weights, thus offering the capability of effectively alleviating negative transfer. A fitness re-evaluation method is suggested to alleviate fluctuations in performance evaluations resulting from parameter sharing and the mini-batch gradient descent training method, thereby avoiding losing promising solutions during the search process. To rigorously verify the performance of EMT-NAS, the classification tasks used in the empirical assessments are derived from different datasets, including the CIFAR-10 and CIFAR-100, and four MedMNIST datasets. Extensive comparative experiments on different numbers of tasks demonstrate that EMT-NAS takes 8% less time on CIFAR and up to 40% less on MedMNIST to find competitive neural architectures than its single-task counterparts.

Hierarchical B-Frame Video Coding Using Two-Layer CANF Without Motion Coding
Alexandre, David and Hang, Hsueh-Ming and Peng, Wen-Hsiao



Research question: This paper proposes a novel B-frame coding architecture that does not need to transmit any motion information.
Motivation: Conventional video compression systems consist of two main modules, motion coding and residual coding, a general architecture also adopted by deep-learning-based coding schemes. The authors propose a new video compression architecture based on two-layer Conditional Augmented Normalization Flows (CANF) whose striking feature is that no motion information is transmitted.
Method: The proposed idea of video compression without motion coding offers a new direction for learned video coding. The base layer is a low-resolution image compressor that replaces the full-resolution motion compressor. The low-resolution coded image is merged with the warped high-resolution images to generate a high-quality image that serves as the conditioning signal for full-resolution enhancement-layer image coding.
Results: Although the rate-distortion performance of the scheme is slightly below that of the state-of-the-art learned B-frame coding scheme B-CANF, it outperforms other learned B-frame coding schemes. Compared with B-CANF, the scheme saves 45% of multiply-accumulate operations (MACs) for encoding and 27% of MACs for decoding.

Typical video compression systems consist of two main modules: motion coding and residual coding. This general architecture is adopted by classical coding schemes (such as international standards H.265 and H.266) and deep learning-based coding schemes. We propose a novel B-frame coding architecture based on two-layer Conditional Augmented Normalization Flows (CANF). It has the striking feature of not transmitting any motion information. Our proposed idea of video compression without motion coding offers a new direction for learned video coding. Our base layer is a low-resolution image compressor that replaces the full-resolution motion compressor. The low-resolution coded image is merged with the warped high-resolution images to generate a high-quality image as a conditioning signal for the enhancement-layer image coding in full resolution. One advantage of this architecture is significantly reduced computational complexity due to eliminating the motion information compressor. In addition, we adopt a skip-mode coding technique to reduce the transmitted latent samples. The rate-distortion performance of our scheme is slightly lower than that of the state-of-the-art learned B-frame coding scheme, B-CANF, but outperforms other learned B-frame coding schemes. However, compared to B-CANF, our scheme saves 45% of multiply-accumulate operations (MACs) for encoding and 27% of MACs for decoding. The code is available at https://nycu-clab.github.io.

SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer
Chen, Xuanyao and Liu, Zhijian and Tang, Haotian and Yi, Li and Zhao, Hang and Han, Song



Research question: How to reduce the computational complexity of learning rich visual representations from high-resolution images so that they can serve latency-sensitive applications.
Motivation: Although high-resolution images improve what neural networks can learn, the added computational complexity hinders their use in latency-sensitive applications.
Method: This paper proposes SparseViT, which revisits activation sparsity for window-based vision transformers (ViTs). Because window attention naturally batches over blocks, actual speedup from window activation pruning becomes attainable.
Results: Experiments show that, compared with its dense counterpart, SparseViT achieves speedups of 1.5x, 1.4x, and 1.3x on monocular 3D object detection, 2D instance segmentation, and 2D semantic segmentation, respectively, with negligible accuracy loss.

High-resolution images enable neural networks to learn richer visual representations. However, this improved performance comes at the cost of growing computational complexity, hindering their usage in latency-sensitive applications. As not all pixels are equal, skipping computations for less-important regions offers a simple and effective measure to reduce the computation. This, however, is hard to translate into actual speedup for CNNs since it breaks the regularity of the dense convolution workload. In this paper, we introduce SparseViT, which revisits activation sparsity for recent window-based vision transformers (ViTs). As window attentions are naturally batched over blocks, actual speedup with window activation pruning becomes possible: i.e., 50% latency reduction with 60% sparsity. Different layers should be assigned different pruning ratios due to their diverse sensitivities and computational costs. We introduce sparsity-aware adaptation and apply evolutionary search to efficiently find the optimal layerwise sparsity configuration within the vast search space. SparseViT achieves speedups of 1.5x, 1.4x, and 1.3x compared to its dense counterpart in monocular 3D object detection, 2D instance segmentation, and 2D semantic segmentation, respectively, with negligible to no loss of accuracy.
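The key point of the abstract is that whole attention windows can be ranked and pruned as units, so the remaining work stays dense. A minimal numpy sketch of that scoring-and-gathering step (the function name, shapes, and the L2-magnitude score are illustrative assumptions, not the paper's exact criterion):

```python
import numpy as np

def prune_windows(x, sparsity):
    """Keep only the most salient attention windows, ranked by the L2
    magnitude of their activations, and gather their features.

    x: (num_windows, tokens_per_window, channels) windowed features.
    sparsity: fraction of windows to drop (e.g. 0.6 drops 60%).
    """
    num_windows = x.shape[0]
    keep = max(1, int(round(num_windows * (1.0 - sparsity))))
    # Score each window by the L2 norm of its flattened activations.
    scores = np.linalg.norm(x.reshape(num_windows, -1), axis=1)
    kept_idx = np.argsort(scores)[::-1][:keep]  # highest-magnitude windows
    return kept_idx, x[kept_idx]

# Toy example: 8 windows of 4 tokens x 2 channels; window 3 is the "busiest".
rng = np.random.default_rng(0)
feats = rng.normal(scale=0.1, size=(8, 4, 2))
feats[3] += 5.0
idx, kept = prune_windows(feats, sparsity=0.5)
```

Because the kept windows form a smaller dense batch, attention over them runs at full hardware efficiency, which is what makes the latency reduction real rather than theoretical.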

Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos
Hu, Yubin and He, Yuze and Li, Yanghao and Li, Jisheng and Han, Yuxing and Wen, Jiangtao and Liu, Yong-Jin



Research question: Video semantic segmentation (VSS) is a computationally expensive task because it requires per-frame prediction on videos with high frame rates.
Motivation: Existing VSS models and strategies overlook an important factor affecting the computational cost on the input side: the input resolution.
Method: This paper proposes AR-Seg, an altering-resolution framework that performs efficient VSS by lowering the resolution of non-keyframes. A Cross-Resolution Feature Fusion (CReFF) module is designed and supervised with a novel Feature Similarity Training (FST) strategy.
Results: Extensive experiments on CamVid and Cityscapes show that AR-Seg achieves state-of-the-art performance and is compatible with different segmentation backbones. On CamVid, AR-Seg saves 67% of the computational cost (measured in GFLOPs) with the PSPNet18 backbone while maintaining high segmentation accuracy.

Video semantic segmentation (VSS) is a computationally expensive task due to the per-frame prediction for videos of high frame rates. In recent work, compact models or adaptive network strategies have been proposed for efficient VSS. However, they did not consider a crucial factor that affects the computational cost from the input side: the input resolution. In this paper, we propose an altering resolution framework called AR-Seg for compressed videos to achieve efficient VSS. AR-Seg aims to reduce the computational cost by using low resolution for non-keyframes. To prevent the performance degradation caused by downsampling, we design a Cross Resolution Feature Fusion (CReFF) module, and supervise it with a novel Feature Similarity Training (FST) strategy. Specifically, CReFF first makes use of motion vectors stored in a compressed video to warp features from high-resolution keyframes to low-resolution non-keyframes for better spatial alignment, and then selectively aggregates the warped features with local attention mechanism. Furthermore, the proposed FST supervises the aggregated features with high-resolution features through an explicit similarity loss and an implicit constraint from the shared decoding layer. Extensive experiments on CamVid and Cityscapes show that AR-Seg achieves state-of-the-art performance and is compatible with different segmentation backbones. On CamVid, AR-Seg saves 67% computational cost (measured in GFLOPs) with the PSPNet18 backbone while maintaining high segmentation accuracy. Code: https://github.com/THU-LYJ-Lab/AR-Seg.

FlowGrad: Controlling the Output of Generative ODEs With Gradients
Liu, Xingchao and Wu, Lemeng and Zhang, Shujian and Gong, Chengyue and Ping, Wei and Liu, Qiang



Research question: How to control the content generated by a pre-trained ODE-based generative model.
Motivation: Although ODE-based generative models have achieved remarkable results across a variety of applications, little work has focused on controlling what they generate.
Method: The paper optimizes the output of the ODE model according to a guidance function to achieve controllable generation. By decomposing the back-propagation and computing vector-Jacobian products, gradients can be efficiently back-propagated from the output to any intermediate time step on the ODE trajectory. To further accelerate this computation, a non-uniform discretization is proposed to approximate the trajectory: the straightness of the trajectory is measured, and straight parts are gathered into a single discretization step.
Results: The resulting framework, FlowGrad, outperforms state-of-the-art baselines on text-guided image manipulation. Moreover, FlowGrad can find global semantic directions in frozen ODE-based generative models that manipulate new images without extra optimization.

Generative modeling with ordinary differential equations (ODEs) has achieved fantastic results on a variety of applications. Yet, few works have focused on controlling the generated content of a pre-trained ODE-based generative model. In this paper, we propose to optimize the output of ODE models according to a guidance function to achieve controllable generation. We point out that the gradients can be efficiently back-propagated from the output to any intermediate time steps on the ODE trajectory, by decomposing the back-propagation and computing vector-Jacobian products. To further accelerate the computation of the back-propagation, we propose to use a non-uniform discretization to approximate the ODE trajectory, where we measure how straight the trajectory is and gather the straight parts into one discretization step. This allows us to save 90% of the back-propagation time with negligible error. Our framework, named FlowGrad, outperforms the state-of-the-art baselines on text-guided image manipulation. Moreover, FlowGrad enables us to find global semantic directions in frozen ODE-based generative models that can be used to manipulate new images without extra optimization.
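To make the non-uniform discretization concrete, the sketch below greedily merges consecutive trajectory samples for as long as every intermediate point stays within a tolerance of the chord between the chosen endpoints. This is one illustrative reading of "gathering straight parts into one step", not the paper's exact rule:

```python
import numpy as np

def merge_straight_segments(traj, tol=1e-2):
    """Greedily merge consecutive ODE-trajectory samples into one step while
    the path between the chosen endpoints stays nearly straight.

    traj: (T, D) array of states z(t_0), ..., z(t_{T-1}).
    Returns indices of the retained (non-uniform) discretization points.
    """
    kept = [0]
    anchor = 0
    for j in range(2, len(traj)):
        chord = traj[j] - traj[anchor]
        seg = traj[anchor + 1:j] - traj[anchor]
        # Perpendicular distance of intermediate points to the chord.
        unit = chord / (np.linalg.norm(chord) + 1e-12)
        proj = seg @ unit
        perp = np.linalg.norm(seg - np.outer(proj, unit), axis=1)
        if perp.max() > tol:      # the segment bends too much: close the step
            kept.append(j - 1)
            anchor = j - 1
    kept.append(len(traj) - 1)
    return kept

# A 2D trajectory that is straight for the first half, then curves.
t = np.linspace(0.0, 1.0, 11)
traj = np.stack([t, np.where(t < 0.5, 0.0, (t - 0.5) ** 2)], axis=1)
idx = merge_straight_segments(traj, tol=1e-3)
```

Back-propagation then only needs vector-Jacobian products at the retained indices, so a mostly straight trajectory collapses into very few steps.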

SMPConv: Self-Moving Point Representations for Continuous Convolution
Kim, Sanghyeon and Park, Eunbyung



Research question: This paper proposes a way to build continuous convolutions without neural networks, improving computational efficiency and performance.
Motivation: Current implementations of continuous convolution rely mainly on multilayer perceptrons (MLPs), which suffer from high computational cost, complex hyperparameter tuning, and the limited descriptive power of the filters.
Method: We present self-moving point representations, in which weight parameters move freely and interpolation schemes are used to implement continuous functions. When used to construct convolutional kernels, experiments show improved performance as a drop-in replacement in existing frameworks.
Results: Thanks to its lightweight structure, we are the first to demonstrate the effectiveness of continuous convolution in a large-scale setting such as ImageNet, improving over prior art. Our code is available at https://github.com/sangnekim/SMPConv.

Continuous convolution has recently gained prominence due to its ability to handle irregularly sampled data and model long-term dependency. Also, the promising experimental results of using large convolutional kernels have catalyzed the development of continuous convolution since it can construct large kernels very efficiently. Leveraging neural networks, more specifically multilayer perceptrons (MLPs), is by far the most prevalent approach to implementing continuous convolution. However, there are a few drawbacks, such as high computational costs, complex hyperparameter tuning, and limited descriptive power of filters. This paper suggests an alternative approach to building a continuous convolution without neural networks, resulting in more computationally efficient and improved performance. We present self-moving point representations where weight parameters freely move, and interpolation schemes are used to implement continuous functions. When applied to construct convolutional kernels, the experimental results have shown improved performance as a drop-in replacement in the existing frameworks. Due to its lightweight structure, we are the first to demonstrate the effectiveness of continuous convolution in a large-scale setting, e.g., ImageNet, presenting improvements over the prior arts. Our code is available at https://github.com/sangnekim/SMPConv
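The core idea, a continuous kernel defined by a few movable points plus interpolation instead of an MLP, can be sketched in 1D with plain linear interpolation (the function name and the choice of `np.interp` are illustrative assumptions; the paper's interpolation scheme and dimensionality differ):

```python
import numpy as np

def continuous_kernel(positions, weights, coords):
    """Evaluate a 1D continuous kernel defined by a few self-moving points.

    positions: (P,) learnable point locations in [-1, 1], assumed sorted.
    weights:   (P,) learnable kernel values at those points.
    coords:    (K,) coordinates at which to sample the kernel.
    Linear interpolation between neighbouring points yields a continuous
    function with no MLP and only 2P parameters.
    """
    return np.interp(coords, positions, weights)

# Three moving points; sample a 5-tap discrete kernel from the continuum.
pos = np.array([-1.0, 0.2, 1.0])
val = np.array([0.0, 1.0, 0.0])
taps = continuous_kernel(pos, val, np.linspace(-1, 1, 5))
```

Because the kernel is a function of continuous coordinates, the same parameters can be resampled into a kernel of any size, which is what makes large kernels cheap.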

HNeRV: A Hybrid Neural Representation for Videos
Chen, Hao and Gwilliam, Matthew and Lim, Ser-Nam and Shrivastava, Abhinav



Research question: This paper addresses the limitations of implicit neural video representations, such as insufficient regression capacity and internal generalization.
Motivation: Current implicit representations (NeRV, E-NeRV, etc.) reconstruct video frames from fixed, content-agnostic embeddings, which largely limits the regression capacity and internal generalization for video interpolation.
Method: The paper proposes a Hybrid Neural Representation for Videos (HNeRV), in which learnable, content-adaptive embeddings serve as the decoder input. An HNeRV block is also introduced to distribute model parameters evenly across the network, so that higher layers near the output have more capacity to store high-resolution content and video details.
Results: With content-adaptive embeddings and the redesigned architecture, HNeRV outperforms implicit methods (NeRV, E-NeRV) on video regression in both reconstruction quality and convergence speed, and shows better internal generalization. As a simple and efficient video representation, HNeRV also outperforms traditional codecs (H.264, H.265) and learning-based compression methods in decoding speed, flexibility, and ease of deployment. Finally, the paper explores HNeRV's effectiveness on downstream tasks such as video compression and video inpainting.

Implicit neural representations store videos as neural networks and have performed well for vision tasks such as video compression and denoising. With frame index and/or positional index as input, implicit representations (NeRV, E-NeRV, etc.) reconstruct video frames from fixed and content-agnostic embeddings. Such embeddings largely limit the regression capacity and internal generalization for video interpolation. In this paper, we propose a Hybrid Neural Representation for Videos (HNeRV), where learnable and content-adaptive embeddings act as decoder input. Besides the input embedding, we introduce an HNeRV block to make model parameters evenly distributed across the entire network, so that higher layers (layers near the output) have more capacity to store high-resolution content and video details. With content-adaptive embedding and re-designed model architecture, HNeRV outperforms implicit methods (NeRV, E-NeRV) on the video regression task in both reconstruction quality and convergence speed, and shows better internal generalization. As a simple and efficient video representation, HNeRV also shows decoding advantages for speed, flexibility, and deployment, compared to traditional codecs (H.264, H.265) and learning-based compression methods. Finally, we explore the effectiveness of HNeRV on downstream tasks such as video compression and video inpainting.

Decoupling Learning and Remembering: A Bilevel Memory Framework With Knowledge Projection for Task-Incremental Learning
Sun, Wenju and Li, Qingyong and Zhang, Jing and Wang, Wen and Geng, Yangli-ao



Research question: This paper addresses the dilemma between plasticity and stability in incremental learning.
Motivation: The human memory system resolves this dilemma through its multi-level memory structure, which motivates a Bilevel Memory system with Knowledge Projection (BMKP) for incremental learning.
Method: Through a bilevel memory design, BMKP decouples the functions of learning and knowledge remembering: a working memory handles adaptive model learning to ensure plasticity, while a long-term memory enduringly stores the knowledge incorporated in learned models to guarantee stability. To extract the learned knowledge from the working memory and assimilate it into the long-term memory, the authors observe that models learned by the working memory actually reside in a redundant high-dimensional space, while the knowledge they incorporate admits a rather compact representation under a set of pattern bases shared across all incremental learning tasks. They therefore propose a knowledge projection process that adaptively maintains the shared bases, through which the loosely organized model knowledge of the working memory is projected into this compact representation to be remembered in the long-term memory.
Results: BMKP is evaluated on CIFAR-10, CIFAR-100, and Tiny-ImageNet. The experimental results show that BMKP achieves state-of-the-art performance with lower memory usage.

The dilemma between plasticity and stability arises as a common challenge for incremental learning. In contrast, the human memory system is able to remedy this dilemma owing to its multi-level memory structure, which motivates us to propose a Bilevel Memory system with Knowledge Projection (BMKP) for incremental learning. BMKP decouples the functions of learning and knowledge remembering via a bilevel-memory design: a working memory responsible for adaptive model learning, to ensure plasticity; and a long-term memory in charge of enduringly storing the knowledge incorporated within the learned model, to guarantee stability. However, an emerging issue is how to extract the learned knowledge from the working memory and assimilate it into the long-term memory. To approach this issue, we reveal that the models learned by the working memory actually reside in a redundant high-dimensional space, and the knowledge incorporated in the model can have a quite compact representation under a group of pattern bases shared by all incremental learning tasks. Therefore, we propose a knowledge projection process to adaptively maintain the shared bases, with which the loosely organized model knowledge of working memory is projected into the compact representation to be remembered in the long-term memory. We evaluate BMKP on CIFAR-10, CIFAR-100, and Tiny-ImageNet. The experimental results show that BMKP achieves state-of-the-art performance with lower memory usage.
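One plausible reading of the knowledge projection step is a least-squares projection of a learned weight matrix onto the shared pattern basis, storing only the compact coefficients. A minimal numpy sketch under that assumption (the paper's basis-maintenance procedure is not reproduced here):

```python
import numpy as np

def project_to_basis(weight, basis):
    """Project a learned weight matrix onto a shared pattern basis.

    weight: (d, n) weights from the working memory.
    basis:  (d, k) shared basis with k << d.
    Returns the compact coefficients (k, n) to be stored in long-term
    memory, and the reconstruction used at inference time.
    """
    # Least-squares coefficients: argmin_C || basis @ C - weight ||_F
    coeffs, *_ = np.linalg.lstsq(basis, weight, rcond=None)
    return coeffs, basis @ coeffs

rng = np.random.default_rng(1)
basis = rng.normal(size=(16, 4))
# A weight matrix that truly lies in the span of the basis.
true_coeffs = rng.normal(size=(4, 8))
weight = basis @ true_coeffs
coeffs, recon = project_to_basis(weight, basis)
```

Storing `coeffs` (4x8) instead of `weight` (16x8) is what yields the lower memory usage the abstract reports, and the reconstruction is exact whenever the learned weights lie in the span of the shared basis.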

RepMode: Learning to Re-Parameterize Diverse Experts for Subcellular Structure Prediction
Zhou, Donghao and Gu, Chunbin and Xu, Junde and Liu, Furui and Wang, Qiong and Chen, Guangyong and Heng, Pheng-Ann



Research question: This paper tackles the problem that fluorescence staining in biological research is slow, expensive, and harmful to cells, together with the challenge of predicting 3D fluorescent images of subcellular structures.
Motivation: Due to the limitations of current biotechnology, each image in the subcellular structure prediction task is only partially labeled, and the large size variation of subcellular structures raises a multi-scale issue.
Method: The paper proposes Re-parameterizing Mixture-of-Diverse-Experts (RepMode), a network that dynamically organizes its parameters with task-aware priors to handle specified single-label prediction tasks.
Results: Experiments show that RepMode achieves state-of-the-art overall performance on subcellular structure prediction.

In biological research, fluorescence staining is a key technique to reveal the locations and morphology of subcellular structures. However, it is slow, expensive, and harmful to cells. In this paper, we model it as a deep learning task termed subcellular structure prediction (SSP), aiming to predict the 3D fluorescent images of multiple subcellular structures from a 3D transmitted-light image. Unfortunately, due to the limitations of current biotechnology, each image is partially labeled in SSP. Besides, naturally, subcellular structures vary considerably in size, which causes the multi-scale issue of SSP. To overcome these challenges, we propose Re-parameterizing Mixture-of-Diverse-Experts (RepMode), a network that dynamically organizes its parameters with task-aware priors to handle specified single-label prediction tasks. In RepMode, the Mixture-of-Diverse-Experts (MoDE) block is designed to learn the generalized parameters for all tasks, and gating re-parameterization (GatRep) is performed to generate the specialized parameters for each task, by which RepMode can maintain a compact practical topology exactly like a plain network, and meanwhile achieves a powerful theoretical topology. Comprehensive experiments show that RepMode can achieve state-of-the-art overall performance in SSP.

Pruning Parameterization With Bi-Level Optimization for Efficient Semantic Segmentation on the Edge
Yang, Changdi and Zhao, Pu and Li, Yanyu and Niu, Wei and Guan, Jiexiong and Tang, Hao and Qin, Minghai and Ren, Bin and Lin, Xue and Wang, Yanzhi



Research question: How to achieve real-time segmentation on edge devices for applications such as autonomous driving.
Motivation: With the growing popularity of edge devices, the demand for real-time segmentation keeps rising. However, vision transformers (ViTs) with full-attention mechanisms usually consume large amounts of computational resources, making real-time inference on edge devices difficult.
Method: The paper proposes a pruning parameterization method to formulate the pruning problem for semantic segmentation, and adopts a bi-level optimization method to solve it with the help of implicit gradients.
Results: Experiments show that the method achieves 38.9 mIoU on the ADE20K validation set at 56.5 FPS on a Samsung S21, the highest mIoU under the same computation constraint with real-time inference.

With the ever-increasing popularity of edge devices, it is necessary to implement real-time segmentation on the edge for autonomous driving and many other applications. Vision Transformers (ViTs) have shown considerably stronger results for many vision tasks. However, ViTs with the full-attention mechanism usually consume a large amount of computational resources, leading to difficulties for real-time inference on edge devices. In this paper, we aim to derive ViTs with fewer computations and fast inference speed to facilitate the dense prediction of semantic segmentation on edge devices. To achieve this, we propose a pruning parameterization method to formulate the pruning problem of semantic segmentation. Then we adopt a bi-level optimization method to solve this problem with the help of implicit gradients. Our experimental results demonstrate that we can achieve 38.9 mIoU on ADE20K val with a speed of 56.5 FPS on Samsung S21, which is the highest mIoU under the same computation constraint with real-time inference.

Less Is More: Reducing Task and Model Complexity for 3D Point Cloud Semantic Segmentation
Li, Li and Shum, Hubert P.H. and Breckon, Toby P.



Research question: Although the availability of 3D LiDAR point cloud data has grown significantly in recent years, annotation remains expensive and time-consuming, motivating semi-supervised semantic segmentation methods.
Motivation: Existing work typically employs relatively large segmentation backbones to improve accuracy, at the expense of computational cost. In addition, many methods use uniform sampling to reduce the ground-truth data required for learning, which often leads to sub-optimal performance.
Method: We propose a new pipeline with a smaller architecture, whose novel Sparse Depthwise Separable Convolution module greatly reduces the network parameter count while retaining overall task performance. To sub-sample our training data effectively, we propose a new Spatio-Temporal Redundant Frame Downsampling (ST-RFD) method that leverages knowledge of sensor motion within the environment to extract a more diverse set of training frames. To leverage the limited annotated samples, we further propose a soft pseudo-labeling method informed by LiDAR reflectivity.
Results: On the SemanticKITTI (59.5 mIoU at 5% labels) and ScribbleKITTI (58.1 mIoU at 5% labels) benchmarks, our method outperforms contemporary semi-supervised work while using less labeled data, with a 2.3x reduction in model parameters and 641x fewer multiply-add operations, and also shows significant improvements on limited training data.

Whilst the availability of 3D LiDAR point cloud data has significantly grown in recent years, annotation remains expensive and time-consuming, leading to a demand for semi-supervised semantic segmentation methods with application domains such as autonomous driving. Existing work very often employs relatively large segmentation backbone networks to improve segmentation accuracy, at the expense of computational costs. In addition, many use uniform sampling to reduce the ground-truth data required for learning, often resulting in sub-optimal performance. To address these issues, we propose a new pipeline that employs a smaller architecture, requiring fewer ground-truth annotations to achieve superior segmentation accuracy compared to contemporary approaches. This is facilitated via a novel Sparse Depthwise Separable Convolution module that significantly reduces the network parameter count while retaining overall task performance. To effectively sub-sample our training data, we propose a new Spatio-Temporal Redundant Frame Downsampling (ST-RFD) method that leverages knowledge of sensor motion within the environment to extract a more diverse subset of training data frame samples. To leverage the use of limited annotated data samples, we further propose a soft pseudo-label method informed by LiDAR reflectivity. Our method outperforms contemporary semi-supervised work in terms of mIoU, using less labeled data, on the SemanticKITTI (59.5@5%) and ScribbleKITTI (58.1@5%) benchmark datasets, based on a 2.3x reduction in model parameters and 641x fewer multiply-add operations whilst also demonstrating significant performance improvement on limited training data (i.e., Less is More).

Constructing Deep Spiking Neural Networks From Artificial Neural Networks With Knowledge Distillation
Xu, Qi and Li, Yaxin and Shen, Jiangrong and Liu, Jian K. and Tang, Huajin and Pan, Gang



Research question: This paper addresses the limited performance of existing spiking neural networks (SNNs) caused by current network structures and training methods.
Motivation: Although spike-based signals give SNNs high computational and energy efficiency close to biological neural systems, their discrete nature prevents conventional SNNs from directly applying gradient-descent rules to parameter adjustment the way artificial neural networks (ANNs) do.
Method: The paper proposes a new way to build deep SNN models with knowledge distillation (KD), using an ANN as the teacher model and an SNN as the student model. Through an ANN-SNN joint training algorithm, the student SNN can learn rich feature information from the teacher ANN via KD, while avoiding training the SNN from scratch despite the non-differentiable spikes.
Results: The method not only constructs deeper and more efficient spiking structures feasibly and reasonably, but also needs fewer time steps to train the whole model than direct training or ANN-to-SNN conversion. More importantly, it exhibits excellent noise immunity to various artificial noises and natural signals. This offers an effective route to improving SNN performance, with potential use in lightweight, efficient brain-inspired computing for practical scenarios.

Spiking neural networks (SNNs) are well known as brain-inspired models with high computing efficiency, due to a key property: they utilize spikes as information units, close to biological neural systems. Although spiking-based models are energy efficient by taking advantage of discrete spike signals, their performance is limited by current network structures and training methods. As discrete signals, typical SNNs cannot apply gradient descent rules directly to parameter adjustment as artificial neural networks (ANNs) do. Aiming at this limitation, here we propose a novel method of constructing deep SNN models with knowledge distillation (KD) that uses an ANN as the teacher model and an SNN as the student model. Through an ANN-SNN joint training algorithm, the student SNN model can learn rich feature information from the teacher ANN model through the KD method, yet it avoids training the SNN from scratch when communicating with non-differentiable spikes. Our method can not only build a more efficient deep spiking structure feasibly and reasonably, but also uses fewer time steps to train the whole model compared to direct training or ANN-to-SNN methods. More importantly, it exhibits superb noise immunity to various types of artificial noises and natural signals. The proposed method provides efficient ways to improve the performance of SNNs through constructing deeper structures in a high-throughput fashion, with potential usage for lightweight and efficient brain-inspired computing in practical scenarios.
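The ANN-to-SNN transfer builds on the standard response-based distillation objective, a temperature-softened KL divergence between teacher and student outputs. A self-contained numpy sketch of that generic objective (not the paper's exact joint-training loss):

```python
import numpy as np

def softmax(z, temperature):
    z = z / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between temperature-softened teacher and student
    distributions; the T^2 factor keeps gradient scale comparable."""
    p = softmax(teacher_logits, temperature)  # soft targets from the ANN
    q = softmax(student_logits, temperature)  # SNN student predictions
    return float(temperature ** 2 * np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([2.0, 0.5, -1.0])
loss_match = distillation_loss(teacher, teacher)           # zero by construction
loss_mismatch = distillation_loss(np.array([-1.0, 0.5, 2.0]), teacher)
```

In the SNN setting the student logits would come from spike counts or membrane potentials accumulated over time steps, while the surrogate for spike non-differentiability is handled by the joint training algorithm the abstract describes.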

The Differentiable Lens: Compound Lens Search Over Glass Surfaces and Materials for Object Detection
Côté, Geoffroi and Mannan, Fahim and Thibault, Simon and Lalonde, Jean-François



Research question: Most camera lens systems are designed in isolation, separately from downstream computer vision methods.
Motivation: Recently, joint optimization approaches that design lenses alongside other components of the image acquisition and processing pipeline, notably downstream neural networks, have achieved improved imaging quality or better performance on vision tasks. However, these existing methods optimize only a subset of lens parameters and cannot optimize glass materials because of their categorical nature.
Method: We develop a differentiable spherical lens simulation model that accurately captures geometrical aberrations. We propose an optimization strategy to address the challenges of lens design, namely non-convex loss landscapes and many manufacturing constraints, which are exacerbated in joint optimization tasks. Specifically, we introduce quantized continuous glass variables to facilitate optimizing and selecting glass materials in an end-to-end design context, coupled with carefully designed constraints to support manufacturability.
Results: In automotive object detection, we report improved detection performance over existing designs even when simplifying the designs to two- or three-element lenses, despite significantly degraded image quality.

Most camera lens systems are designed in isolation, separately from downstream computer vision methods. Recently, joint optimization approaches that design lenses alongside other components of the image acquisition and processing pipeline--notably, downstream neural networks--have achieved improved imaging quality or better performance on vision tasks. However, these existing methods optimize only a subset of lens parameters and cannot optimize glass materials given their categorical nature. In this work, we develop a differentiable spherical lens simulation model that accurately captures geometrical aberrations. We propose an optimization strategy to address the challenges of lens design--notorious for non-convex loss function landscapes and many manufacturing constraints--that are exacerbated in joint optimization tasks. Specifically, we introduce quantized continuous glass variables to facilitate the optimization and selection of glass materials in an end-to-end design context, and couple this with carefully designed constraints to support manufacturability. In automotive object detection, we report improved detection performance over existing designs even when simplifying designs to two- or three-element lenses, despite significantly degrading the image quality.

Abstract Visual Reasoning: An Algebraic Approach for Solving Raven's Progressive Matrices
Xu, Jingyi and Vaidya, Tushar and Wu, Yufei and Chandra, Saket and Lai, Zhangsheng and Chong, Kai Fong Ernest



Research question: This paper introduces algebraic machine reasoning, a new reasoning framework well suited for abstract reasoning.
Motivation: Algebraic machine reasoning reduces the difficult process of novel problem-solving to routine algebraic computation, effectively lowering its complexity.
Method: Solving Raven's Progressive Matrices (RPMs) is realized as computational problems in algebra, combining well-known algebraic subroutines such as computing the Gröbner basis of an ideal and checking for ideal containment.
Results: In experiments on the I-RAVEN dataset, the model reaches 93.2% overall accuracy, significantly outperforming the current state-of-the-art accuracy of 77.0% and even exceeding human performance at 84.4%.

We introduce algebraic machine reasoning, a new reasoning framework that is well-suited for abstract reasoning. Effectively, algebraic machine reasoning reduces the difficult process of novel problem-solving to routine algebraic computation. The fundamental algebraic objects of interest are the ideals of some suitably initialized polynomial ring. We shall explain how solving Raven's Progressive Matrices (RPMs) can be realized as computational problems in algebra, which combine various well-known algebraic subroutines that include: computing the Gröbner basis of an ideal, checking for ideal containment, etc. Crucially, the additional algebraic structure satisfied by ideals allows for more operations on ideals beyond set-theoretic operations. Our algebraic machine reasoning framework is not only able to select the correct answer from a given answer set, but also able to generate the correct answer with only the question matrix given. Experiments on the I-RAVEN dataset yield an overall 93.2% accuracy, which significantly outperforms the current state-of-the-art accuracy of 77.0% and exceeds human performance at 84.4% accuracy.

ABCD: Arbitrary Bitwise Coefficient for De-Quantization
Han, WooKyoung and Lee, Byeonghun and Park, SangHyun and Jin, KyongHwan



Research question: How to recover de-quantized images from arbitrarily quantized inputs.
Motivation: Existing bit-depth expansion methods handle low bit-depth images poorly, such as the sub-8-bit images produced by compression codecs, which exhibit banding and blurry artifacts.
Method: The paper proposes an implicit neural function with a bit query to recover de-quantized images from arbitrarily quantized inputs, and develops a phasor estimator to exploit the information of the nearest pixels.
Results: The method outperforms prior bit-depth expansion methods on natural and animation images, and also performs well when demonstrated for de-banding on the YouTube UGC dataset.

Modern displays and content support more than 8-bit images and video. However, bit-starving situations such as compression codecs produce low bit-depth (LBD) images (<8 bits), causing banding and blurry artifacts. Previous bit depth expansion (BDE) methods still produce unsatisfactory high bit-depth (HBD) images. To this end, we propose an implicit neural function with a bit query to recover de-quantized images from arbitrarily quantized inputs. We develop a phasor estimator to exploit the information of the nearest pixels. Our method shows superior performance against prior BDE methods on natural and animation images. We also demonstrate our model on YouTube UGC datasets for de-banding. Our source code is available at https://github.com/WooKyoungHan/ABCD

CLIPPING: Distilling CLIP-Based Models With a Student Base for Video-Language Retrieval
Pei, Renjing and Liu, Jianzhuang and Li, Weimian and Shao, Bin and Xu, Songcen and Dai, Peng and Lu, Juwei and Yan, Youliang



Research question: How to effectively transfer the knowledge of pre-trained vision-language models to small models while maintaining accuracy.
Motivation: Pre-trained vision-language models usually have long inference times, and knowledge distillation is an effective technique for transferring a large model's capability to a small one while maintaining accuracy.
Method: The paper proposes a new knowledge distillation method, named CLIPPING, which effectively transfers the plentiful knowledge of a large teacher model that has been fine-tuned for video-language tasks to a small student model entirely at the fine-tuning stage. In particular, a new layer-wise alignment with the student as the base is proposed for distilling the intermediate layers, making the student's layers the bases of the teacher so that the student can fully absorb the teacher's knowledge.
Results: CLIPPING achieves 88.1%-95.3% of its teacher's performance on three video-language retrieval benchmarks, with a vision encoder 19.5x smaller. CLIPPING also significantly outperforms a state-of-the-art small baseline (ALL-in-one-B) on the MSR-VTT dataset, obtaining a relative 7.4% performance gain with 29% fewer parameters and 86.9% fewer FLOPs. Moreover, CLIPPING is comparable or even superior to many large pre-trained models.

Pre-training a vision-language model and then fine-tuning it on downstream tasks has become a popular paradigm. However, pre-trained vision-language models with the Transformer architecture usually take long inference time. Knowledge distillation (KD) has been an efficient technique to transfer the capability of a large model to a small one while maintaining the accuracy, which has achieved remarkable success in natural language processing. However, it faces many problems when applying KD to multi-modality applications. In this paper, we propose a novel knowledge distillation method, named CLIPPING, where the plentiful knowledge of a large teacher model that has been fine-tuned for video-language tasks with the powerful pre-trained CLIP can be effectively transferred to a small student only at the fine-tuning stage. Especially, a new layer-wise alignment with the student as the base is proposed for knowledge distillation of the intermediate layers in CLIPPING, which enables the student's layers to be the bases of the teacher, and thus allows the student to fully absorb the knowledge of the teacher. CLIPPING with MobileViT-v2 as the vision encoder without any vision-language pre-training achieves 88.1%-95.3% of the performance of its teacher on three video-language retrieval benchmarks, with its vision encoder being 19.5x smaller. CLIPPING also significantly outperforms a state-of-the-art small baseline (ALL-in-one-B) on the MSR-VTT dataset, obtaining relatively 7.4% performance gain, with 29% fewer parameters and 86.9% fewer FLOPs. Moreover, CLIPPING is comparable or even superior to many large pre-training models.

Learning Federated Visual Prompt in Null Space for MRI Reconstruction
Feng, Chun-Mei and Li, Bangjun and Xu, Xinxing and Liu, Yong and Fu, Huazhu and Zuo, Wangmeng



Research question: How to use federated Magnetic Resonance Imaging (MRI) reconstruction to enable distributed collaboration among multiple hospitals without aggregating local data, thereby protecting patient privacy.
Motivation: The data heterogeneity caused by different MRI protocols, insufficient local training data, and limited communication bandwidth inevitably impairs the convergence and updating of the global model.
Method: The paper proposes a new algorithm, FedPR, to learn federated visual prompts in the null space of the global prompt for MRI reconstruction. FedPR is a new federated paradigm that adopts a powerful pre-trained model while learning and communicating only prompts with few learnable parameters, significantly reducing communication costs and achieving competitive performance on limited local data. Moreover, to deal with catastrophic forgetting caused by data heterogeneity, FedPR updates efficient federated visual prompts by projecting the local prompts into an approximate null space of the global prompt, thereby suppressing the interference of gradients on server performance.
Results: Extensive experiments on federated MRI show that, given a limited amount of local data, FedPR significantly outperforms state-of-the-art FL algorithms with less than 6% of the communication cost.

Federated Magnetic Resonance Imaging (MRI) reconstruction enables multiple hospitals to collaborate distributedly without aggregating local data, thereby protecting patient privacy. However, the data heterogeneity caused by different MRI protocols, insufficient local training data, and limited communication bandwidth inevitably impairs global model convergence and updating. In this paper, we propose a new algorithm, FedPR, to learn federated visual prompts in the null space of the global prompt for MRI reconstruction. FedPR is a new federated paradigm that adopts a powerful pre-trained model while only learning and communicating the prompts with few learnable parameters, thereby significantly reducing communication costs and achieving competitive performance on limited local data. Moreover, to deal with catastrophic forgetting caused by data heterogeneity, FedPR also updates efficient federated visual prompts that project the local prompts into an approximate null space of the global prompt, thereby suppressing the interference of gradients on the server performance. Extensive experiments on federated MRI show that FedPR significantly outperforms state-of-the-art FL algorithms with < 6% of communication costs when given a limited amount of local data.

Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning
Tu, Cheng-Hao and Mai, Zheda and Chao, Wei-Lun



Research question: How to effectively exploit the intermediate features of a pre-trained model for accurate downstream prediction.
Motivation: The intermediate features of pre-trained models carry important information for downstream tasks, but how to utilize them effectively remains a challenge.
Method: The paper proposes visual query tuning (VQT), which introduces a handful of learnable "query" tokens at each layer and leverages the inner workings of Transformers to "summarize" the rich intermediate features of each layer; the summaries are then used to train the prediction heads of downstream tasks.
Results: Experiments show that VQT outperforms other parameter-efficient fine-tuning methods in many cases and achieves higher accuracy under memory constraints. VQT is also compatible with those methods and can further improve transfer-learning accuracy.

Intermediate features of a pre-trained model have been shown informative for making accurate predictions on downstream tasks, even if the model backbone is frozen. The key challenge is how to utilize them, given the gigantic amount. We propose visual query tuning (VQT), a simple yet effective approach to aggregate intermediate features of Vision Transformers. Through introducing a handful of learnable "query" tokens to each layer, VQT leverages the inner workings of Transformers to "summarize" rich intermediate features of each layer, which can then be used to train the prediction heads of downstream tasks. As VQT keeps the intermediate features intact and only learns to combine them, it enjoys memory efficiency in training, compared to many other parameter-efficient fine-tuning approaches that learn to adapt features and need back-propagation through the entire backbone. This also suggests the complementary role between VQT and those approaches in transfer learning. Empirically, VQT consistently surpasses the state-of-the-art approach that utilizes intermediate features for transfer learning and outperforms full fine-tuning in many cases. Compared to parameter-efficient approaches that adapt features, VQT achieves much higher accuracy under memory constraints. Most importantly, VQT is compatible with these approaches to attain higher accuracy, making it a simple add-on to further boost transfer learning.
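The summarization mechanism VQT describes, query tokens reading out frozen layer features through attention, can be sketched with a single unparameterized attention step (shapes, names, and the absence of learned key/value projections are simplifying assumptions, not the paper's exact formulation):

```python
import numpy as np

def query_summary(queries, features):
    """Summarize frozen intermediate features with learnable query tokens.

    queries:  (Q, D) learnable query tokens for one layer.
    features: (N, D) frozen token features of that layer.
    Returns (Q, D) summaries; the backbone features are left untouched,
    which is what keeps memory usage low during training.
    """
    scores = queries @ features.T / np.sqrt(queries.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ features  # convex combinations of frozen features

rng = np.random.default_rng(2)
feats = rng.normal(size=(196, 64))  # e.g. 14x14 ViT tokens, D=64
qs = rng.normal(size=(4, 64))       # a handful of query tokens
summary = query_summary(qs, feats)
```

Because gradients flow only into the queries and the downstream head, no back-propagation through the frozen backbone is needed, which matches the memory-efficiency argument in the abstract.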

Efficient Scale-Invariant Generator With Column-Row Entangled Pixel Synthesis
Nguyen, Thuan Hoang and Van Le, Thanh and Tran, Anh



Research question: How to efficiently synthesize images at arbitrary scales, especially beyond 2K resolution.
Motivation: Existing GAN-based solutions depend excessively on convolutions and hierarchical architectures, which introduce inconsistency and the "texture sticking" issue when scaling the output resolution. INR-based generators, by contrast, are scale-equivariant by design, but their huge memory footprint and slow inference keep them out of large-scale or real-time systems.
Method: We propose Column-Row Entangled Pixel Synthesis (CREPS), a new generative model that is both efficient and scale-equivariant without using any spatial convolutions or coarse-to-fine design. To save memory and keep the system scalable, we employ a novel bi-line representation that decomposes layer-wise feature maps into separate "thick" column and row encodings.
Results: Experiments on standard datasets, including FFHQ, LSUN-Church, and MetFaces, confirm that CREPS synthesizes scale-consistent, alias-free images up to 4K resolution with proper training and inference speed.

Any-scale image synthesis offers an efficient and scalable solution to synthesize photo-realistic images at any scale, even going beyond 2K resolution. However, existing GAN-based solutions depend excessively on convolutions and a hierarchical architecture, which introduce inconsistency and the "texture sticking" issue when scaling the output resolution. From another perspective, INR-based generators are scale-equivariant by design, but their huge memory footprint and slow inference hinder these networks from being adopted in large-scale or real-time systems. In this work, we propose Column-Row Entangled Pixel Synthesis (CREPS), a new generative model that is both efficient and scale-equivariant without using any spatial convolutions or coarse-to-fine design. To save memory footprint and make the system scalable, we employ a novel bi-line representation that decomposes layer-wise feature maps into separate "thick" column and row encodings. Experiments on standard datasets, including FFHQ, LSUN-Church, and MetFaces, confirm CREPS' ability to synthesize scale-consistent and alias-free images up to 4K resolution with proper training and inference speed.
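The memory argument behind the bi-line representation is that a feature map can be recombined from per-row and per-column encodings, so storage grows with H + W rather than H x W. A minimal numpy sketch of that recombination (an elementwise-product entanglement is an illustrative assumption; the paper's combination rule may differ):

```python
import numpy as np

def bi_line_combine(col_enc, row_enc):
    """Recombine 'thick' column and row encodings into a full feature map.

    col_enc: (H, C) one encoding vector per image row.
    row_enc: (W, C) one encoding vector per image column.
    The (h, w) feature is the elementwise product of the two line encodings,
    so only H + W vectors are stored instead of H * W.
    """
    return col_enc[:, None, :] * row_enc[None, :, :]  # (H, W, C)

H, W, C = 6, 8, 3
rng = np.random.default_rng(3)
col = rng.normal(size=(H, C))
row = rng.normal(size=(W, C))
fmap = bi_line_combine(col, row)
```

Because each line encoding is a function of a continuous coordinate, the same lines can be sampled at any H and W, which is where the scale-equivariance of the representation comes from.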

Active Finetuning: Exploiting Annotation Budget in the Pretraining-Finetuning Paradigm
Xie, Yichen and Lu, Han and Yan, Junchi and Yang, Xiaokang and Tomizuka, Masayoshi and Zhan, Wei



Research question: How to use the annotation budget effectively for sample selection in the pretraining-finetuning paradigm.
Motivation: Although pretraining-finetuning is widely used in computer vision tasks, little research has addressed how to spend the annotation budget during the finetuning stage.
Method: The paper proposes a novel method named ActiveFT, which selects a data subset that is distributed similarly to the entire unlabeled pool while maintaining sufficient diversity, by optimizing a parametric model in a continuous space.
Results: The paper proves that this process also reduces the Earth Mover's distance between the distributions of the selected subset and the entire data pool. On both image classification and semantic segmentation, ActiveFT outperforms baselines in performance and efficiency.

Given the large-scale data and the high annotation cost, pretraining-finetuning has become a popular paradigm in multiple computer vision tasks. Previous research has covered both the unsupervised pretraining and supervised finetuning in this paradigm, while little attention is paid to exploiting the annotation budget for finetuning. To fill in this gap, we formally define this new active finetuning task, focusing on the selection of samples for annotation in the pretraining-finetuning paradigm. We propose a novel method called ActiveFT for the active finetuning task, which selects a subset of data distributed similarly to the entire unlabeled pool while maintaining enough diversity, by optimizing a parametric model in the continuous space. We prove that the Earth Mover's distance between the distributions of the selected subset and the entire data pool is also reduced in this process. Extensive experiments show the leading performance and high efficiency of ActiveFT, superior to baselines on both image classification and semantic segmentation.
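The selection goal, a subset that covers the pool's distribution while staying diverse, can be illustrated with a farthest-point greedy selection; this is a deliberately simplified stand-in for ActiveFT's continuous optimization of a parametric model, not the paper's algorithm (function name and the use of Euclidean distance are assumptions):

```python
import numpy as np

def greedy_diverse_subset(features, budget):
    """Pick a diverse subset covering the unlabeled pool via farthest-point
    greedy selection (a simplified stand-in for ActiveFT's optimization).

    features: (N, D) sample features extracted by the pretrained model.
    budget:   number of samples the annotation budget allows.
    """
    chosen = [0]  # start from an arbitrary sample
    dists = np.linalg.norm(features - features[0], axis=1)
    while len(chosen) < budget:
        nxt = int(dists.argmax())  # farthest point from the current subset
        chosen.append(nxt)
        # Distance to the subset is the min distance over chosen samples.
        dists = np.minimum(dists,
                           np.linalg.norm(features - features[nxt], axis=1))
    return chosen

rng = np.random.default_rng(4)
# Two well-separated clusters: a good subset must touch both of them.
pool = np.concatenate([rng.normal(0, 0.1, size=(50, 2)),
                       rng.normal(5, 0.1, size=(50, 2))])
picked = greedy_diverse_subset(pool, budget=4)
```

A pure coverage heuristic like this ignores the density-matching term that ActiveFT optimizes; the paper's contribution is precisely to balance distribution similarity against diversity in a single continuous objective.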

MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering
Jiang, JingjingandZheng, Nanning



Research question: Finetuning all parameters of pretrained vision-language models (VLMs) for a specific task in low-resource settings is computationally expensive, storage-inefficient, and prone to overfitting.
Motivation: Although existing parameter-efficient tuning methods greatly reduce the number of tunable parameters, a significant performance gap to full finetuning remains for VQA in low-resource settings.
Method: This paper proposes MixPHM, a redundancy-aware parameter-efficient tuning method that outperforms full finetuning on low-resource VQA. MixPHM is a lightweight module implemented by multiple PHM experts in a mixture-of-experts manner. To reduce parameter redundancy, expert weights are reparameterized in a low-rank subspace and partly shared within and across MixPHM modules. In addition, based on a quantitative analysis of representation redundancy, a redundancy regularization is proposed that helps MixPHM reduce task-irrelevant redundancy while promoting task-relevant correlation.
Results: Experiments on VQA v2, GQA, and OK-VQA under different low-resource settings show that MixPHM outperforms state-of-the-art parameter-efficient methods and is the only one that consistently surpasses full finetuning.

Recently, finetuning pretrained vision-language models (VLMs) has been a prevailing paradigm for achieving state-of-the-art performance in VQA. However, as VLMs scale, it becomes computationally expensive, storage inefficient, and prone to overfitting when tuning full model parameters for a specific task in low-resource settings. Although current parameter-efficient tuning methods dramatically reduce the number of tunable parameters, there still exists a significant performance gap with full finetuning. In this paper, we propose MixPHM, a redundancy-aware parameter-efficient tuning method that outperforms full finetuning in low-resource VQA. Specifically, MixPHM is a lightweight module implemented by multiple PHM-experts in a mixture-of-experts manner. To reduce parameter redundancy, we reparameterize expert weights in a low-rank subspace and share part of the weights inside and across MixPHM. Moreover, based on our quantitative analysis of representation redundancy, we propose Redundancy Regularization, which facilitates MixPHM to reduce task-irrelevant redundancy while promoting task-relevant correlation. Experiments conducted on VQA v2, GQA, and OK-VQA with different low-resource settings show that our MixPHM outperforms state-of-the-art parameter-efficient methods and is the only one consistently surpassing full finetuning.
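The two redundancy-reduction devices the abstract names, low-rank reparameterization of expert weights and weight sharing across experts, can be sketched in a few lines. This is a hedged toy: real PHM layers use Kronecker-product parameterizations, and the dimensions and the single shared factor `U` below are invented for illustration:

```python
import math, random

random.seed(0)
D, R, E = 8, 2, 4   # hidden dim, low rank, number of PHM experts

# shared low-rank down-projection, reused by every expert (cuts redundancy)
U = [[random.gauss(0, 0.1) for _ in range(R)] for _ in range(D)]
# expert-specific up-projections living in the shared low-rank subspace
V = [[[random.gauss(0, 0.1) for _ in range(D)] for _ in range(R)] for _ in range(E)]

def mat_vec(M, x):
    # computes x @ M for a len(x)-by-k matrix M
    return [sum(m * xi for m, xi in zip(col, x)) for col in zip(*M)]

def mixphm(x, gate_logits):
    exps = [math.exp(g) for g in gate_logits]
    weights = [e / sum(exps) for e in exps]   # softmax gate over experts
    h = mat_vec(U, x)                         # shared projection, computed once
    out = [0.0] * D
    for e in range(E):
        y = mat_vec(V[e], h)                  # expert-specific low-rank update
        out = [o + weights[e] * yi for o, yi in zip(out, y)]
    return out

full_params = E * D * D                       # dense per-expert weights
shared_lowrank_params = D * R + E * R * D     # shared U + low-rank V's
```

The parameter count comparison at the end is the point of the reparameterization: the shared low-rank factorization needs far fewer weights than dense per-expert matrices.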

A Dynamic Multi-Scale Voxel Flow Network for Video Prediction
Hu, XiaotaoandHuang, ZheweiandHuang, AilinandXu, JunandZhou, Shuchang



Research question: How to improve video prediction performance while reducing computational cost and model size.
Motivation: Most existing video prediction methods require extra inputs (e.g., semantic/depth maps) and suffer from large models and high computational cost.
Method: A Dynamic Multi-scale Voxel Flow Network (DMVFN) is proposed that trains and predicts with RGB images only; a differentiable routing module perceives the motion scales of video frames and selects adaptive sub-networks for each input.
Results: Experiments show that DMVFN is an order of magnitude faster than Deep Voxel Flow and surpasses the state-of-the-art iterative method OPT on generated image quality.

The performance of video prediction has been greatly boosted by advanced deep neural networks. However, most of the current methods suffer from large model sizes and require extra inputs, e.g., semantic/depth maps, for promising performance. For efficiency consideration, in this paper, we propose a Dynamic Multi-scale Voxel Flow Network (DMVFN) to achieve better video prediction performance at lower computational costs with only RGB images, than previous methods. The core of our DMVFN is a differentiable routing module that can effectively perceive the motion scales of video frames. Once trained, our DMVFN selects adaptive sub-networks for different inputs at the inference stage. Experiments on several benchmarks demonstrate that our DMVFN is an order of magnitude faster than Deep Voxel Flow and surpasses the state-of-the-art iterative-based OPT on generated image quality. Our code and demo are available at https://huxiaotaostasy.github.io/DMVFN/.
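The routing idea, spending more sub-networks on frames with larger motion, can be mimicked with a hard threshold on the frame difference. DMVFN's module is learned and differentiable; the thresholds and scale names below are invented for the sketch:

```python
def route_scales(frame_a, frame_b, thresholds=(2.0, 8.0)):
    """Toy routing: the mean absolute frame difference decides how many
    voxel-flow sub-networks (scales) to run. DMVFN learns this decision
    with a differentiable routing module; the thresholds here are made up."""
    diff = sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)
    if diff < thresholds[0]:
        return ["scale_1x"]                      # small motion: cheapest path
    if diff < thresholds[1]:
        return ["scale_1x", "scale_2x"]
    return ["scale_1x", "scale_2x", "scale_4x"]  # large motion: full pyramid
```

Inference cost then scales with the motion actually present in the input, which is where the speedup over fixed-architecture predictors comes from.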

Stitchable Neural Networks
Pan, ZizhengandCai, JianfeiandZhuang, Bohan



Research question: How to efficiently assemble a family of pretrained models for dynamic accuracy-efficiency trade-offs at runtime.
Motivation: Model zoos have reached an unprecedented scale and contain pretrained models of diverse sizes and performance, so an effective way to assemble these models is needed.
Method: The Stitchable Neural Network (SN-Net) framework splits pretrained networks (called anchors) across their blocks/layers and joins them with simple stitching layers that map the activations of one anchor to another, so that within a few training epochs it can effectively interpolate between anchors of different scales.
Results: Experimental results show that on ImageNet classification, SN-Net performs on par with or even better than many individually trained networks while supporting diverse deployment scenarios.

The public model zoo containing enormous powerful pretrained model families (e.g., ResNet/DeiT) has reached an unprecedented scope, which significantly contributes to the success of deep learning. As each model family consists of pretrained models with diverse scales (e.g., DeiT-Ti/S/B), a fundamental question naturally arises: how to efficiently assemble these readily available models in a family for dynamic accuracy-efficiency trade-offs at runtime. To this end, we present Stitchable Neural Networks (SN-Net), a novel scalable and efficient framework for model deployment. It cheaply produces numerous networks with different complexity and performance trade-offs given a family of pretrained neural networks, which we call anchors. Specifically, SN-Net splits the anchors across the blocks/layers and then stitches them together with simple stitching layers to map the activations from one anchor to another. With only a few epochs of training, SN-Net effectively interpolates between the performance of anchors with varying scales. At runtime, SN-Net can instantly adapt to dynamic resource constraints by switching the stitching positions. Extensive experiments on ImageNet classification demonstrate that SN-Net can obtain on-par or even better performance than many individually trained networks while supporting diverse deployment scenarios. For example, by stitching Swin Transformers, we challenge hundreds of models in Timm model zoo with a single network. We believe this new elastic model framework can serve as a strong baseline for further research in wider communities.
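A stitching layer is just a cheap learned map from one anchor's activations to another's. As a toy stand-in, per-dimension affine maps fitted by least squares already show the mechanics (SN-Net uses 1x1 convolutions with a least-squares initialization); `fit_stitch` and `stitch` are illustrative names:

```python
def fit_stitch(acts_a, acts_b):
    """Fit per-dimension affine maps y = w*x + b by least squares from
    anchor A's activations to anchor B's, a toy stand-in for SN-Net's
    stitching layers (which are then finetuned end to end)."""
    n, dims = len(acts_a), len(acts_a[0])
    maps = []
    for d in range(dims):
        xs = [a[d] for a in acts_a]
        ys = [b[d] for b in acts_b]
        mx, my = sum(xs) / n, sum(ys) / n
        var = sum((x - mx) ** 2 for x in xs)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        w = cov / var if var else 0.0
        maps.append((w, my - w * mx))
    return maps

def stitch(maps, act):
    """Map one activation vector from anchor A's space into anchor B's."""
    return [w * x + b for (w, b), x in zip(maps, act)]
```

Once such a map exists at every candidate split point, switching the stitching position at runtime is what trades accuracy against compute.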

Federated Learning With Data-Agnostic Distribution Fusion
Duan, Jian-huiandLi, WenzhongandZou, DerunandLi, RuichenandLu, Sanglu



Research question: In federated learning, data samples across clients are usually not independent and identically distributed (non-IID), which slows the convergence of the global model and degrades its performance.
Motivation: To facilitate model aggregation on non-IID data in federated learning, the unknown global distributions must be inferred without violating the privacy protection policy.
Method: This paper proposes FedFusion, a data-agnostic distribution-fusion based model aggregation method that learns the optimal parameters of the distribution fusion components with a Variational AutoEncoder (VAE) to optimize federated learning with non-IID local datasets.
Results: Extensive experiments on various federated learning scenarios with real-world datasets show that FedFusion achieves significant performance improvements over the state of the art.

Federated learning has emerged as a promising distributed machine learning paradigm to preserve data privacy. One of the fundamental challenges of federated learning is that data samples across clients are usually not independent and identically distributed (non-IID), leading to slow convergence and severe performance drop of the aggregated global model. To facilitate model aggregation on non-IID data, it is desirable to infer the unknown global distributions without violating privacy protection policy. In this paper, we propose a novel data-agnostic distribution fusion based model aggregation method called FedFusion to optimize federated learning with non-IID local datasets, based on which the heterogeneous clients' data distributions can be represented by a global distribution of several virtual fusion components with different parameters and weights. We develop a Variational AutoEncoder (VAE) method to learn the optimal parameters of the distribution fusion components based on limited statistical information extracted from the local models, and apply the derived distribution fusion model to optimize federated model aggregation with non-IID data. Extensive experiments based on various federated learning scenarios with real-world datasets show that FedFusion achieves significant performance improvement compared to the state-of-the-art.

PIDNet: A Real-Time Semantic Segmentation Network Inspired by PID Controllers
Xu, JiacongandXiong, ZixiangandBhattacharyya, ShankarP.



Research question: Although existing two-branch networks are effective for real-time semantic segmentation, the direct fusion of high-resolution details and low-frequency context lets detailed features be overwhelmed by surrounding contextual information, limiting segmentation accuracy.
Motivation: To address this overwhelming problem in two-branch networks, the authors connect convolutional neural networks (CNNs) with proportional-integral-derivative (PID) controllers and reveal that a two-branch network is equivalent to a proportional-integral (PI) controller, which likewise suffers from overshoot.
Method: The authors therefore propose a new three-branch architecture, PIDNet, whose three branches parse detail, context, and boundary information respectively, with boundary attention guiding the fusion of the detail and context branches.
Results: Experiments show that the PIDNet family achieves the best trade-off between inference speed and accuracy, surpassing all existing models with similar inference speed on the Cityscapes and CamVid datasets. Specifically, PIDNet-S achieves 78.6 mIOU at 93.2 FPS on Cityscapes and 80.1 mIOU at 153.7 FPS on CamVid.

Two-branch network architecture has shown its efficiency and effectiveness in real-time semantic segmentation tasks. However, direct fusion of high-resolution details and low-frequency context has the drawback of detailed features being easily overwhelmed by surrounding contextual information. This overshoot phenomenon limits the improvement of the segmentation accuracy of existing two-branch models. In this paper, we make a connection between Convolutional Neural Networks (CNN) and Proportional-Integral-Derivative (PID) controllers and reveal that a two-branch network is equivalent to a Proportional-Integral (PI) controller, which inherently suffers from similar overshoot issues. To alleviate this problem, we propose a novel three-branch network architecture: PIDNet, which contains three branches to parse detailed, context and boundary information, respectively, and employs boundary attention to guide the fusion of detailed and context branches. Our family of PIDNets achieve the best trade-off between inference speed and accuracy and their accuracy surpasses all the existing models with similar inference speed on the Cityscapes and CamVid datasets. Specifically, PIDNet-S achieves 78.6 mIOU with inference speed of 93.2 FPS on Cityscapes and 80.1 mIOU with speed of 153.7 FPS on CamVid.
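The PI-vs-PID analogy is easy to see numerically: on a second-order plant, a PI controller overshoots the setpoint, and adding the derivative (D) term damps the overshoot, which is exactly the role the boundary branch is meant to play in PIDNet. A minimal simulation, with gains and plant chosen purely for illustration:

```python
def simulate(kp, ki, kd, steps=200, dt=0.05):
    """Step response of a double-integrator plant under PID control.
    With kd=0 (a PI controller, analogous to a two-branch net) the
    response overshoots the target of 1.0; the D term damps it."""
    y, v, integral, prev_err = 0.0, 0.0, 0.0, 1.0
    peak = 0.0
    for _ in range(steps):
        err = 1.0 - y
        integral += err * dt
        deriv = (err - prev_err) / dt
        u = kp * err + ki * integral + kd * deriv
        prev_err = err
        v += u * dt          # control signal acts as acceleration
        y += v * dt
        peak = max(peak, y)
    return peak

overshoot_pi = simulate(kp=4.0, ki=1.0, kd=0.0) - 1.0   # PI: overshoots
overshoot_pid = simulate(kp=4.0, ki=1.0, kd=3.0) - 1.0  # PID: damped
```

The gap between the two overshoot values is the quantitative version of the paper's claim that a third, derivative-like branch is needed.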

How To Prevent the Poor Performance Clients for Personalized Federated Learning?
Qu, ZheandLi, XingyuandHan, XiaoandDuan, RuiandShen, ChengchaoandChen, Lixing



Research question: How to provide a customized model solution for each client in the presence of heterogeneous distributed local data.
Motivation: Although many recent studies apply various algorithms to enhance personalization in personalized federated learning, they mainly focus on improving performance from the average or top perspective and overlook the clients that fall into poor performance.
Method: A novel federated learning strategy called Personalize Locally, Generalize Universally (PLGU) is proposed. A Layer-Wised Sharpness-Aware Minimization (LWSAM) algorithm generalizes the fine-grained universal information and moderates its biased performance while keeping the personalization local.
Results: Experimental results show that the proposed PLGU-based strategy achieves competitive generalization bounds on both considered federated learning schemes, and all PLGU-based algorithms achieve state-of-the-art performance.

Personalized federated learning (pFL) collaboratively trains personalized models, which provides a customized model solution for individual clients in the presence of heterogeneous distributed local data. Although many recent studies have applied various algorithms to enhance personalization in pFL, they mainly focus on improving performance from an average or top-client perspective. However, some clients may fall into poor performance, a case that is rarely discussed. Therefore, how to prevent these poor-performing clients should be considered critically. Intuitively, these poor clients may come from biased universal information shared with others. To address this issue, we propose a novel pFL strategy, called Personalize Locally, Generalize Universally (PLGU). PLGU generalizes the fine-grained universal information and moderates its biased performance by designing a Layer-Wised Sharpness Aware Minimization (LWSAM) algorithm while keeping the personalization local. Specifically, we embed our proposed PLGU strategy into two pFL schemes covered in this paper: with/without a global model, and present the training procedures in detail. Through in-depth study, we show that the proposed PLGU strategy achieves competitive generalization bounds on both considered pFL schemes. Our extensive experimental results show that all the proposed PLGU-based algorithms achieve state-of-the-art performance.
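The core of LWSAM is the standard sharpness-aware minimization step: perturb the weights toward the worst case inside a small ball, then descend using the gradient evaluated there. A one-parameter sketch (PLGU applies this layer-wise with per-layer perturbation scales, which is omitted here):

```python
def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step on a scalar parameter:
    ascend to the worst-case point within a rho-ball, then descend with
    the gradient taken at that perturbed point. LWSAM applies this idea
    per layer; grad_fn and the constants here are illustrative."""
    g = grad_fn(w)
    eps = rho * (1.0 if g >= 0 else -1.0)   # rho * g/|g| in one dimension
    g_adv = grad_fn(w + eps)                # gradient at the perturbed point
    return w - lr * g_adv
```

For the quadratic loss w^2 (gradient 2w), a step from w=1.0 lands at 1.0 - 0.1 * 2.1 = 0.79, slightly more aggressive than plain gradient descent because the perturbed point is steeper.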

CP3: Channel Pruning Plug-In for Point-Based Networks
Huang, YaominandLiu, NingandChe, ZhengpingandXu, ZhiyuanandShen, ChaominandPeng, YaxinandZhang, GuixuandLiu, XinmeiandFeng, FeifeiandTang, Jian



Research question: How to effectively reduce the computational cost and memory footprint of 3D point-based neural networks while keeping comparable accuracy.
Motivation: Despite the great success of channel pruning for 2D image-based convolutional neural networks (CNNs), existing works seldom extend it to 3D point-based neural networks (PNNs).
Method: CP^3, a channel pruning plug-in for point-based networks, is elaborately designed to leverage the characteristics of point clouds and PNNs so that 2D channel pruning methods become applicable to PNNs. Specifically, it presents a coordinate-enhanced channel importance metric that reflects the correlation between dimensional information and individual channel features, and it recycles the points discarded in the PNN's sampling process, reconsidering their potentially exclusive information to enhance the robustness of channel pruning.
Results: Experiments on various PNN architectures show that CP^3 consistently improves state-of-the-art 2D CNN pruning methods on different point cloud tasks. For example, the compressed PointNeXt-S achieves 88.52% accuracy on ScanObjectNN with a pruning rate of 57.8%, an accuracy gain of 1.94% over the baseline pruning methods.

Channel pruning has been widely studied as a prevailing method that effectively reduces both computational cost and memory footprint of the original network while keeping a comparable accuracy performance. Though great success has been achieved in channel pruning for 2D image-based convolutional networks (CNNs), existing works seldom extend the channel pruning methods to 3D point-based neural networks (PNNs). Directly applying 2D CNN channel pruning methods to PNNs undermines the performance of PNNs because of the different representations of 2D images and 3D point clouds as well as the network architecture disparity. In this paper, we propose CP^3, a Channel Pruning Plug-in for Point-based networks. CP^3 is elaborately designed to leverage the characteristics of point clouds and PNNs in order to enable 2D channel pruning methods for PNNs. Specifically, it presents a coordinate-enhanced channel importance metric to reflect the correlation between dimensional information and individual channel features, and it recycles the discarded points in PNN's sampling process and reconsiders their potentially-exclusive information to enhance the robustness of channel pruning. Experiments on various PNN architectures show that CP^3 consistently improves state-of-the-art 2D CNN pruning approaches on different point cloud tasks. For instance, our compressed PointNeXt-S on ScanObjectNN achieves an accuracy of 88.52% with a pruning rate of 57.8%, outperforming the baseline pruning methods with an accuracy gain of 1.94%.
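The coordinate-enhanced importance metric can be read, in simplified form, as "how strongly does each channel's response correlate with the point coordinates". The sketch below scores channels by absolute Pearson correlation with the x/y/z axes; this is an illustrative reading, not the paper's exact metric:

```python
def _corr(xs, ys):
    """Pearson correlation, returning 0 for degenerate (constant) inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def channel_importance(coords, feats):
    """Score each channel by the magnitude of its correlation with the
    point coordinates; low-scoring channels become pruning candidates.
    A simplified stand-in for CP^3's coordinate-enhanced metric."""
    n_ch = len(feats[0])
    scores = []
    for c in range(n_ch):
        col = [f[c] for f in feats]
        scores.append(sum(abs(_corr([p[axis] for p in coords], col))
                          for axis in range(3)))
    return scores
```

A channel that tracks geometry gets a high score; a constant (uninformative) channel scores zero and is pruned first.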

MobileVOS: Real-Time Video Object Segmentation Contrastive Learning Meets Knowledge Distillation
Miles, RoyandYucel, MehmetKerimandManganelli, BrunoandSa\`a-Garriga, Albert



Research question: This paper addresses semi-supervised video object segmentation on resource-constrained devices.
Motivation: Efficient video object segmentation is needed on devices with limited resources, such as mobile phones.
Method: The problem is formulated as a knowledge distillation task within a theoretically grounded framework that unifies contrastive representation learning with knowledge distillation, learning jointly from a pretrained teacher model.
Results: On the DAVIS and YouTube benchmarks, the method achieves results competitive with the state of the art while running up to 5x faster and with 32x fewer parameters.

This paper tackles the problem of semi-supervised video object segmentation on resource-constrained devices, such as mobile phones. We formulate this problem as a distillation task, whereby we demonstrate that small space-time-memory networks with finite memory can achieve competitive results with state of the art, but at a fraction of the computational cost (32 milliseconds per frame on a Samsung Galaxy S22). Specifically, we provide a theoretically grounded framework that unifies knowledge distillation with supervised contrastive representation learning. These models are able to jointly benefit from both pixel-wise contrastive learning and distillation from a pre-trained teacher. We validate this loss by achieving competitive J&F to state of the art on both the standard DAVIS and YouTube benchmarks, despite running up to x5 faster, and with x32 fewer parameters.

Unsupervised Continual Semantic Adaptation Through Neural Rendering
Liu, ZhizhengandMilano, FrancescoandFrey, JonasandSiegwart, RolandandBlum, HermannandCadena, Cesar



Research question: How to adapt a semantic segmentation model to new scenes while maintaining performance on previous scenes.
Motivation: Due to the mismatch between training and deployment data, adapting the model to new scenes is often crucial for good performance.
Method: A Semantic-NeRF network is trained for each scene by fusing the predictions of a segmentation model; the view-consistent rendered semantic labels are then used as pseudo-labels to adapt the model.
Results: Evaluated on ScanNet, the approach outperforms both a voxel-based baseline and a state-of-the-art unsupervised domain adaptation method.

An increasing amount of applications rely on data-driven models that are deployed for perception tasks across a sequence of scenes. Due to the mismatch between training and deployment data, adapting the model on the new scenes is often crucial to obtain good performance. In this work, we study continual multi-scene adaptation for the task of semantic segmentation, assuming that no ground-truth labels are available during deployment and that performance on the previous scenes should be maintained. We propose training a Semantic-NeRF network for each scene by fusing the predictions of a segmentation model and then using the view-consistent rendered semantic labels as pseudo-labels to adapt the model. Through joint training with the segmentation model, the Semantic-NeRF model effectively enables 2D-3D knowledge transfer. Furthermore, due to its compact size, it can be stored in a long-term memory and subsequently used to render data from arbitrary viewpoints to reduce forgetting. We evaluate our approach on ScanNet, where we outperform both a voxel-based baseline and a state-of-the-art unsupervised domain adaptation method.
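The pseudo-labeling step rests on a simple idea: predictions for the same surface point rendered from many views should agree, and the fused label is more reliable than any single view. A majority-vote toy version (Semantic-NeRF achieves this through volumetric fusion and re-rendering rather than explicit voting):

```python
from collections import Counter

def pseudo_label(per_view_predictions):
    """Fuse noisy per-view semantic predictions of one surface point into
    a single pseudo-label by majority vote, with an agreement ratio as a
    crude confidence. A stand-in for the view-consistent labels that
    Semantic-NeRF produces by fusing predictions in 3D."""
    votes = Counter(per_view_predictions)
    label, count = votes.most_common(1)[0]
    confidence = count / len(per_view_predictions)
    return label, confidence
```

The fused labels then supervise the 2D segmentation model on the new scene, which is the 2D-3D knowledge transfer the abstract refers to.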

GradMA: A Gradient-Memory-Based Accelerated Federated Learning With Alleviated Catastrophic Forgetting
Luo, KangyangandLi, XiangandLan, YunshiandGao, Ming



Research question: In federated learning, catastrophic forgetting caused by data heterogeneity and partial participation is detrimental to performance.
Motivation: A new federated learning approach (GradMA) is proposed, taking inspiration from continual learning to simultaneously correct the server-side and worker-side update directions and to take full advantage of the server's rich computing and memory resources.
Method: A memory reduction strategy is designed so that GradMA can accommodate federated learning with a large number of workers. Convergence is analyzed theoretically in the smooth non-convex setting, showing that the convergence rate achieves a linear speed-up with respect to the increasing number of sampled active workers.
Results: Extensive experiments on various image classification tasks show that GradMA achieves significant gains in accuracy and communication efficiency compared with SOTA baselines.

Federated Learning (FL) has emerged as a de facto machine learning paradigm and received rapidly increasing research interest from the community. However, catastrophic forgetting caused by data heterogeneity and partial participation poses distinctive challenges for FL, which are detrimental to the performance. To tackle the problems, we propose a new FL approach (namely GradMA), which takes inspiration from continual learning to simultaneously correct the server-side and worker-side update directions as well as take full advantage of server's rich computing and memory resources. Furthermore, we develop a memory reduction strategy to enable GradMA to accommodate FL with a large scale of workers. We then analyze convergence of GradMA theoretically under the smooth non-convex setting and show that its convergence rate achieves a linear speed up w.r.t. the increasing number of sampled active workers. Finally, our extensive experiments on various image classification tasks show that GradMA achieves significant performance gains in accuracy and communication efficiency compared to SOTA baselines. We provide our code here: https://github.com/lkyddd/GradMA.
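Correcting an update direction with stored gradients is the continual-learning trick GradMA borrows: when the new update conflicts with a memory gradient, the conflicting component is projected away. A single-memory sketch (GradMA solves a quadratic program over several memory gradients, not modeled here):

```python
def correct_update(grad, memory_grad):
    """Project the current update away from directions that conflict with
    a stored memory gradient (GEM-style). If the dot product is already
    non-negative there is no conflict and the update passes unchanged."""
    dot = sum(g * m for g, m in zip(grad, memory_grad))
    if dot >= 0:
        return list(grad)
    mm = sum(m * m for m in memory_grad)
    # remove the component of grad opposing memory_grad
    return [g - (dot / mm) * m for g, m in zip(grad, memory_grad)]
```

After the correction, the update no longer decreases progress along the remembered direction, which is how forgetting of earlier (or other clients') knowledge is alleviated.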

POTTER: Pooling Attention Transformer for Efficient Human Mesh Recovery
Zheng, CeandLiu, XianpengandQi, Guo-JunandChen, Chen



Research question: How to reduce the memory and computational overhead of transformer architectures for human mesh recovery (HMR) from monocular images.
Motivation: Although transformer architectures achieve state-of-the-art HMR performance from monocular images, their high memory and computational cost limits real-world use.
Method: A pure transformer architecture named POoling aTtention TransformER (POTTER) is proposed; an efficient pooling attention module significantly reduces memory and computational cost, and a new transformer architecture integrating a High-Resolution (HR) stream exploits high-resolution local and global features to recover a more accurate human mesh.
Results: Experiments show that POTTER outperforms the state-of-the-art method METRO on Human3.6M (PA-MPJPE) and 3DPW (all three metrics) while requiring only 7% of the parameters and 14% of the Multiply-Accumulate Operations.

Transformer architectures have achieved SOTA performance on human mesh recovery (HMR) from monocular images. However, the performance gain has come at the cost of substantial memory and computational overhead. A lightweight and efficient model to reconstruct accurate human mesh is needed for real-world applications. In this paper, we propose a pure transformer architecture named POoling aTtention TransformER (POTTER) for the HMR task from single images. Observing that the conventional attention module is memory and computationally expensive, we propose an efficient pooling attention module, which significantly reduces the memory and computational cost without sacrificing performance. Furthermore, we design a new transformer architecture by integrating a High-Resolution (HR) stream for the HMR task. The high-resolution local and global features from the HR stream can be utilized for recovering more accurate human mesh. Our POTTER outperforms the SOTA method METRO while requiring only 7% of the total parameters and 14% of the Multiply-Accumulate Operations on the Human3.6M (PA-MPJPE) and 3DPW (all three metrics) datasets. Code will be publicly available.
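Pooling attention cuts cost by letting every query attend to a pooled, shorter sequence of keys/values, shrinking the score matrix from N x N to N x (N/stride). A projection-free single-head sketch (real pooling attention uses learned Q/K/V projections, omitted here):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def pool_tokens(tokens, stride):
    """Average-pool the token sequence along its length."""
    return [[sum(t[d] for t in tokens[i:i + stride]) / len(tokens[i:i + stride])
             for d in range(len(tokens[0]))]
            for i in range(0, len(tokens), stride)]

def pooling_attention(q, tokens, stride=2):
    """Single-head attention where keys and values are pooled tokens, so
    the attention cost drops by the pooling stride. A toy version of the
    pooling attention module POTTER builds on."""
    kv = pool_tokens(tokens, stride)
    scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) for k in kv])
    return [sum(s * v[d] for s, v in zip(scores, kv)) for d in range(len(q))]
```

With stride 2 the memory for attention scores halves; the paper's efficiency numbers come from applying this throughout the backbone.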

DynaFed: Tackling Client Data Heterogeneity With Global Dynamics
Pi, RenjieandZhang, WeizhongandXie, YueqiandGao, JiahuiandWang, XiaoyuandKim, SunghunandChen, Qifeng



Research question: Federated learning (FL) faces challenges under heterogeneous client data; local training on non-iid distributed data deflects the local optima, causing client models to drift away from each other and degrading the global model's performance.
Motivation: To address this problem, this paper proposes collecting and leveraging global knowledge on the server without compromising data privacy.
Method: A short trajectory of global model snapshots is first reserved on the server; a small pseudo dataset is then synthesized such that a model trained on it mimics the dynamics of the reserved trajectory. Afterwards, the synthesized data is used to help aggregate the deflected clients into the global model. The method is named DynaFed.
Results: Experimental results show that DynaFed performs well across extensive benchmarks, and the paper also provides insights into and an understanding of the method's underlying mechanism.

The Federated Learning (FL) paradigm is known to face challenges under heterogeneous client data. Local training on non-iid distributed data results in deflected local optimum, which causes the client models drift further away from each other and degrades the aggregated global model's performance. A natural solution is to gather all client data onto the server, such that the server has a global view of the entire data distribution. Unfortunately, this reduces to regular training, which compromises clients' privacy and conflicts with the purpose of FL. In this paper, we put forth an idea to collect and leverage global knowledge on the server without hindering data privacy. We unearth such knowledge from the dynamics of the global model's trajectory. Specifically, we first reserve a short trajectory of global model snapshots on the server. Then, we synthesize a small pseudo dataset such that the model trained on it mimics the dynamics of the reserved global model trajectory. Afterward, the synthesized data is used to help aggregate the deflected clients into the global model. We name our method DynaFed, which enjoys the following advantages: 1) we do not rely on any external on-server dataset, which requires no additional cost for data collection; 2) the pseudo data can be synthesized in early communication rounds, which enables DynaFed to take effect early for boosting the convergence and stabilizing training; 3) the pseudo data only needs to be synthesized once and can be directly utilized on the server to help aggregation in subsequent rounds. Experiments across extensive benchmarks are conducted to showcase the effectiveness of DynaFed. We also provide insights and understanding of the underlying mechanism of our method.

DistilPose: Tokenized Pose Regression With Heatmap Distillation
Ye, SuhangandZhang, YingyiandHu, JieandCao, LiujuanandZhang, ShengchuanandShen, LeiandWang, JunandDing, ShouhongandJi, Rongrong



Research question: How to exploit both heatmap-based and regression-based methods in human pose estimation, improving performance while keeping efficiency.
Motivation: Heatmap-based methods outperform regression-based ones but are slower, while regression-based methods lead in speed but lag in performance; combining the advantages of both remains a challenge.
Method: A novel human pose estimation framework named DistilPose transfers knowledge from a heatmap-based teacher model to a regression-based student model through a Token-distilling Encoder (TDE) and Simulated Heatmaps.
Results: Experiments show that DistilPose significantly improves regression-based models while maintaining efficiency. On the MSCOCO validation dataset, DistilPose-S obtains 71.6% mAP with 5.36M parameters, 2.38 GFLOPs, and 40.2 FPS, saving 12.95x and 7.16x computational cost and running 4.9x faster than its teacher with only a 0.9-point performance drop. Moreover, DistilPose-L obtains 74.4% mAP on the MSCOCO validation dataset, a new state of the art among predominant regression-based models.

In the field of human pose estimation, regression-based methods have been dominated in terms of speed, while heatmap-based methods are far ahead in terms of performance. How to take advantage of both schemes remains a challenging problem. In this paper, we propose a novel human pose estimation framework termed DistilPose, which bridges the gaps between heatmap-based and regression-based methods. Specifically, DistilPose maximizes the transfer of knowledge from the teacher model (heatmap-based) to the student model (regression-based) through Token-distilling Encoder (TDE) and Simulated Heatmaps. TDE aligns the feature spaces of heatmap-based and regression-based models by introducing tokenization, while Simulated Heatmaps transfer explicit guidance (distribution and confidence) from teacher heatmaps into student models. Extensive experiments show that the proposed DistilPose can significantly improve the performance of the regression-based models while maintaining efficiency. Specifically, on the MSCOCO validation dataset, DistilPose-S obtains 71.6% mAP with 5.36M parameter, 2.38 GFLOPs and 40.2 FPS, which saves 12.95x, 7.16x computational cost and is 4.9x faster than its teacher model with only 0.9 points performance drop. Furthermore, DistilPose-L obtains 74.4% mAP on MSCOCO validation dataset, achieving a new state-of-the-art among predominant regression-based models.
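A Simulated Heatmap hands the student explicit distribution-plus-confidence guidance by rendering a Gaussian around a keypoint and scaling it by the confidence. A minimal renderer (the sigma and the exact parameterization are illustrative, not DistilPose's):

```python
import math

def simulated_heatmap(size, keypoint, sigma=1.0, confidence=1.0):
    """Render a confidence-scaled 2-D Gaussian around a predicted keypoint
    on a size x size grid: the kind of explicit (distribution, confidence)
    signal a heatmap teacher can pass to a regression student."""
    cx, cy = keypoint
    return [[confidence * math.exp(-((x - cx) ** 2 + (y - cy) ** 2)
                                   / (2 * sigma ** 2))
             for x in range(size)]
            for y in range(size)]
```

The student regresses coordinates but is also supervised against such maps, so it inherits the teacher's spatial uncertainty without paying heatmap inference cost at test time.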

CUF: Continuous Upsampling Filters
Vasconcelos, CristinaN.andOztireli, CengizandMatthews, MarkandHashemi, MiladandSwersky, KevinandTagliasacchi, Andrea



Research question: How to apply neural fields to upsampling, one of the most important operations in 2D image processing.
Motivation: Although neural fields have been widely adopted for representing 3D signals, their application to classical 2D image processing has been relatively limited.
Method: The upsampling kernels are parameterized as neural fields; this parameterization yields a compact architecture with a 40-fold reduction in parameters compared with competing arbitrary-scale super-resolution architectures.
Results: When upsampling images of size 256x256, the architecture is 2x-10x more efficient than competing arbitrary-scale super-resolution architectures, and more efficient than sub-pixel convolutions when instantiated as a single-scale model.

Neural fields have rapidly been adopted for representing 3D signals, but their application to more classical 2D image-processing has been relatively limited. In this paper, we consider one of the most important operations in image processing: upsampling. In deep learning, learnable upsampling layers have extensively been used for single image super-resolution. We propose to parameterize upsampling kernels as neural fields. This parameterization leads to a compact architecture that obtains a 40-fold reduction in the number of parameters when compared with competing arbitrary-scale super-resolution architectures. When upsampling images of size 256x256 we show that our architecture is 2x-10x more efficient than competing arbitrary-scale super-resolution architectures, and more efficient than sub-pixel convolutions when instantiated to a single-scale model. In the general setting, these gains grow polynomially with the square of the target scale. We validate our method on standard benchmarks showing such efficiency gains can be achieved without sacrifices in super-resolution performance.
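Parameterizing the upsampling kernel as a continuous function of sub-pixel offset and scale is what makes the model arbitrary-scale: the same function is queried at whatever offsets the target resolution induces. Below, a hand-written Gaussian stands in for CUF's learned neural field, so only the mechanics (not the learned kernel) are faithful:

```python
import math

def kernel_field(dx, dy, scale):
    """Continuous kernel: a weight as a function of sub-pixel offset and
    target scale. CUF learns this map with a small neural field; here a
    fixed Gaussian stands in so the mechanics stay visible."""
    s = 0.5 + 0.5 / scale
    return math.exp(-(dx * dx + dy * dy) / (2 * s * s))

def upsample(img, scale):
    """Arbitrary-scale upsampling of a 2-D grid using the continuous kernel."""
    h, w = len(img), len(img[0])
    out = []
    for oy in range(int(h * scale)):
        sy = (oy + 0.5) / scale - 0.5          # source-space coordinate
        row = []
        for ox in range(int(w * scale)):
            sx = (ox + 0.5) / scale - 0.5
            acc = norm = 0.0
            for iy in range(max(0, int(sy) - 1), min(h, int(sy) + 3)):
                for ix in range(max(0, int(sx) - 1), min(w, int(sx) + 3)):
                    wgt = kernel_field(ix - sx, iy - sy, scale)
                    acc += wgt * img[iy][ix]
                    norm += wgt
            row.append(acc / norm)
        out.append(row)
    return out
```

Because `kernel_field` accepts any offset and scale, the same weights serve every target resolution, which is the source of the parameter savings over per-scale sub-pixel convolutions.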

HOTNAS: Hierarchical Optimal Transport for Neural Architecture Search
Yang, JiechaoandLiu, YongandXu, Hongteng



Research question: How to search network architectures over multiple relatively small cells while jointly measuring the similarity of cell micro-architectures and the difference in macro-architectures between cell-based networks.
Motivation: Current NAS methods that search the entire network directly are costly; to reduce search cost, searching for multiple relatively small cells is increasingly common. A major challenge, however, is measuring the similarity and difference between different networks.
Method: A hierarchical optimal transport metric called HOTNN is proposed to measure the similarity of different networks. The cell-level similarity computes the OT distance between cells in various networks by considering the similarity of each node and the differences in information-flow costs between node pairs within each cell; the network-level similarity considers both the cell-level similarity and the variation in the global position of each cell within its respective network. HOTNN is then explored in a Bayesian optimization framework called HOTNAS and shown to be effective on diverse tasks.
Results: Experiments show that HOTNAS can discover network architectures with better performance in multiple modular cell-based search spaces.

Instead of searching the entire network directly, current NAS approaches increasingly search for multiple relatively small cells to reduce search costs. A major challenge is to jointly measure the similarity of cell micro-architectures and the difference in macro-architectures between different cell-based networks. Recently, optimal transport (OT) has been successfully applied to NAS as it can capture the operational and structural similarity across various networks. However, existing OT-based NAS methods either ignore the cell similarity or focus solely on searching for a single cell architecture. To address these issues, we propose a hierarchical optimal transport metric called HOTNN for measuring the similarity of different networks. In HOTNN, the cell-level similarity computes the OT distance between cells in various networks by considering the similarity of each node and the differences in the information flow costs between node pairs within each cell in terms of operational and structural information. The network-level similarity calculates OT distance between networks by considering both the cell-level similarity and the variation in the global position of each cell within their respective networks. We then explore HOTNN in a Bayesian optimization framework called HOTNAS, and demonstrate its efficacy in diverse tasks. Extensive experiments demonstrate that HOTNAS can discover network architectures with better performance in multiple modular cell-based search spaces.
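The building block of any such metric is an OT distance between discrete distributions. In one dimension with shared bins it reduces to the area between the two CDFs, which is enough to show what "transport cost" means before HOTNN nests it hierarchically (cell-level distances feeding a network-level transport problem):

```python
def emd_1d(p, q):
    """Earth Mover's distance between two histograms over the same 1-D
    bins, computed via the |CDF_p - CDF_q| identity. The hierarchical
    metric in HOTNN composes OT distances like this across levels."""
    assert abs(sum(p) - sum(q)) < 1e-9, "inputs must have equal total mass"
    cdf_diff, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_diff += pi - qi       # running difference of the two CDFs
        total += abs(cdf_diff)
    return total
```

Moving all mass from the first bin to the third costs 2 (one unit of mass over two bins), while identical histograms cost 0; the same notion of cost, with a learned ground metric between nodes and cells, underlies HOTNN.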

Practical Network Acceleration With Tiny Sets
Wang, Guo-HuaandWu, Jianxin



Research question: How to accelerate neural networks with only a small number of training samples.
Motivation: Due to data privacy issues, accelerating networks with tiny training sets has become a critical need in practice.
Method: The paper reveals that dropping blocks is a fundamentally superior way to compress networks in this scenario, and defines a new concept, recoverability, to measure how hard a compressed network is to recover.
Results: Experimental results show the method outperforms previous approaches at reducing network latency, surpassing them by on average 7% on ImageNet-1k. It also generalizes well, working under data-free or out-of-domain data settings.

Due to data privacy issues, accelerating networks with tiny training sets has become a critical need in practice. Previous methods mainly adopt filter-level pruning to accelerate networks with scarce training samples. In this paper, we reveal that dropping blocks is a fundamentally superior approach in this scenario. It enjoys a higher acceleration ratio and results in a better latency-accuracy performance under the few-shot setting. To choose which blocks to drop, we propose a new concept namely recoverability to measure the difficulty of recovering the compressed network. Our recoverability is efficient and effective for choosing which blocks to drop. Finally, we propose an algorithm named PRACTISE to accelerate networks using only tiny sets of training images. PRACTISE outperforms previous methods by a significant margin. For 22% latency reduction, PRACTISE surpasses previous methods by on average 7% on ImageNet-1k. It also enjoys high generalization ability, working well under data-free or out-of-domain data settings, too. Our code is at https://github.com/DoctorKey/Practise.
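The block-dropping view suggests a simple baseline for choosing which block to drop: skip each block in turn and measure how much the output changes on the tiny calibration set. PRACTISE's recoverability is more refined (it also accounts for how well finetuning can repair the drop), so treat this as an illustrative stand-in:

```python
def rank_blocks_to_drop(blocks, tiny_set):
    """Score each block by how little the network's output changes when it
    is skipped on a tiny calibration set; lower change = better drop
    candidate. `blocks` is a list of callables applied in sequence."""
    def forward(x, skip=None):
        for i, f in enumerate(blocks):
            if i != skip:
                x = f(x)
        return x
    scores = []
    for i in range(len(blocks)):
        err = sum(abs(forward(x) - forward(x, skip=i)) for x in tiny_set)
        scores.append((err, i))
    return [i for _, i in sorted(scores)]   # best drop candidates first
```

On a toy three-block "network" where the last block is nearly the identity, that block is ranked first for dropping, matching the intuition that recoverable blocks are cheap to remove.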

AstroNet: When Astrocyte Meets Artificial Neural Network
Han, MengqiaoandPan, LiyuanandLiu, Xiabi



Research question: How to optimize network structure and improve efficiency without sacrificing performance.
Motivation: By studying astrocytes, a new mechanism for regulating neuron connections, a model (AstroNet) that adaptively optimizes neuron connections is proposed.
Method: Built on the constructed Astrocyte-Neuron model and inspired by the bidirectional communication property of astrocytes, AstroNet combines a temporal regulation mechanism with a global connection mechanism: a neural network performs the task while an astrocyte network continuously optimizes the neural network's connections, i.e., adaptively assigns weights to its neuron units.
Results: Experiments on classification tasks show that AstroNet achieves state-of-the-art accuracy while efficiently optimizing the network structure.

Network structure learning aims to optimize network architectures and make them more efficient without compromising performance. In this paper, we first study the astrocytes, a new mechanism to regulate connections in the classic M-P neuron. Then, with the astrocytes, we propose AstroNet, which adaptively optimizes neuron connections and thereby performs structure learning for higher accuracy and efficiency. AstroNet is based on our built Astrocyte-Neuron model, with a temporal regulation mechanism and a global connection mechanism, which is inspired by the bidirectional communication property of astrocytes. With the model, the proposed AstroNet uses a neural network (NN) for performing tasks, and an astrocyte network (AN) to continuously optimize the connections of NN, i.e., assigning weight to the neuron units in the NN adaptively. Experiments on the classification task demonstrate that our AstroNet can efficiently optimize the network structure while achieving state-of-the-art (SOTA) accuracy.

Parameter Efficient Local Implicit Image Function Network for Face Segmentation
Sarkar, MausoomandNikitha, SRandHemani, MayurandJain, RishabhandKrishnamurthy, Balaji



Research question: This paper proposes a lightweight face parsing method that exploits the structural consistency of the human face for per-pixel labeling.
Motivation: Existing face parsing models have large parameter counts and are unsuitable for low-compute or low-bandwidth devices.
Method: A lightweight face parsing method based on a Local Implicit Image Function network (FP-LIIF) is proposed, consisting of a convolutional encoder and a pixel MLP decoder with only 1/26th the parameters of state-of-the-art models and no pretraining.
Results: On multiple datasets such as CelebAMask-HQ and LaPa, the method generates segmentations at different resolutions without changing the input resolution, matching or outperforming state-of-the-art models while offering a higher FPS and a smaller model size suited to low-compute or low-bandwidth devices.

Face parsing is defined as the per-pixel labeling of images containing human faces. The labels are defined to identify key facial regions like eyes, lips, nose, hair, etc. In this work, we make use of the structural consistency of the human face to propose a lightweight face-parsing method using a Local Implicit Function network, FP-LIIF. We propose a simple architecture having a convolutional encoder and a pixel MLP decoder that uses 1/26th the number of parameters of state-of-the-art models and yet matches or outperforms them on multiple datasets, like CelebAMask-HQ and LaPa. We do not use any pretraining, and compared to other works, our network can also generate segmentation at different resolutions without any changes in the input resolution. This work enables the use of facial segmentation on low-compute or low-bandwidth devices because of its higher FPS and smaller model size.

Modality-Invariant Visual Odometry for Embodied Vision
Memmel, MariusandBachmann, RomanandZamir, Amir



Research question: How to effectively localize an agent in a realistic, noisy setting.
Motivation: In realistic environments, visual odometry (VO) is a practical substitute for unreliable GPS and compass sensors, but existing deep VO models fail catastrophically when sensors fail or change.
Method: A Transformer-based modality-invariant visual odometry approach is proposed that can deal with diverse or changing sensor suites of navigation agents.
Results: The model outperforms previous methods while training on only a fraction of the data, and the authors hope the approach opens the door to a broader range of real-world applications that can benefit from flexible, learned VO models.

Effectively localizing an agent in a realistic, noisy setting is crucial for many embodied vision tasks. Visual Odometry (VO) is a practical substitute for unreliable GPS and compass sensors, especially in indoor environments. While SLAM-based methods show a solid performance without large data requirements, they are less flexible and robust w.r.t. noise and changes in the sensor suite compared to learning-based approaches. Recent deep VO models, however, limit themselves to a fixed set of input modalities, e.g., RGB and depth, while training on millions of samples. When sensors fail, sensor suites change, or modalities are intentionally looped out due to available resources, e.g., power consumption, the models fail catastrophically. Furthermore, training these models from scratch is even more expensive without simulator access or suitable existing models that can be fine-tuned. While such scenarios get mostly ignored in simulation, they commonly hinder a model's reusability in real-world applications. We propose a Transformer-based modality-invariant VO approach that can deal with diverse or changing sensor suites of navigation agents. Our model outperforms previous methods while training on only a fraction of the data. We hope this method opens the door to a broader range of real-world applications that can benefit from flexible and learned VO models.

Towards a Smaller Student: Capacity Dynamic Distillation for Efficient Image Retrieval
Xie, YiandZhang, HuaidongandXu, XuemiaoandZhu, JianqingandHe, Shengfeng



Research question: How to perform efficient, knowledge-distillation-based image retrieval without the final performance degradation caused by a low-capacity student model.
Motivation: A lightweight student lacks adequate representation capacity to imitate the teacher's knowledge during the most critical early training period, which degrades its final performance.
Method: We propose a Capacity Dynamic Distillation framework in which the student starts as a heavy model and is gradually compressed during training; a learnable convolutional layer inserted into each residual block serves as a channel importance indicator, optimized jointly by the retrieval loss and a compression loss, with a retrieval-guided gradient resetting mechanism to release gradient conflicts.
Results: The method achieves superior inference speed and accuracy; on VeRi-776 with a ResNet101 teacher, it saves 67.13% of model parameters and 65.67% of FLOPs without sacrificing accuracy.

Previous Knowledge Distillation based efficient image retrieval methods employ a lightweight network as the student model for fast inference. However, the lightweight student model lacks adequate representation capacity for effective knowledge imitation during the most critical early training period, causing final performance degeneration. To tackle this issue, we propose a Capacity Dynamic Distillation framework, which constructs a student model with editable representation capacity. Specifically, the employed student model is initially a heavy model to fruitfully learn distilled knowledge in the early training epochs, and the student model is gradually compressed during the training. To dynamically adjust the model capacity, our dynamic framework inserts a learnable convolutional layer within each residual block in the student model as the channel importance indicator. The indicator is optimized simultaneously by the image retrieval loss and the compression loss, and a retrieval-guided gradient resetting mechanism is proposed to release the gradient conflict. Extensive experiments show that our method has superior inference speed and accuracy, e.g., on the VeRi-776 dataset, given the ResNet101 as a teacher, our method saves 67.13% model parameters and 65.67% FLOPs without sacrificing accuracy.
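The channel-importance mechanism can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: `compression_loss`, the L1 penalty weight, and the pruning threshold are all assumptions; in the paper the indicator is a learnable convolutional layer optimized jointly with the retrieval loss.

```python
# Hedged sketch: each channel gets a learnable importance score, an L1
# penalty drives scores toward zero, and low-score channels are pruned
# as training proceeds. Names and constants are illustrative.

def compression_loss(importances, weight=1e-2):
    """L1 sparsity penalty on channel-importance scores."""
    return weight * sum(abs(s) for s in importances)

def prune_channels(importances, threshold=0.05):
    """Indices of channels whose importance survives the threshold."""
    return [i for i, s in enumerate(importances) if abs(s) > threshold]

scores = [0.9, 0.01, 0.4, -0.02, 0.6]
print(prune_channels(scores))  # channels 0, 2, 4 survive
```

The point of starting heavy is that all channels are available while the scores are being learned; compression only bites once the indicator has identified redundant channels.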

Federated Incremental Semantic Segmentation
Dong, JiahuaandZhang, DuzhenandCong, YangandCong, WeiandDing, HenghuiandDai, Dengxin



Research question: How to overcome forgetting of old categories in federated semantic segmentation.
Motivation: Existing federated learning models severely forget old categories when handling new ones, and cannot cope with new clients that bring novel classes into the global training.
Method: We propose a Forgetting-Balanced Learning (FBL) model: under the guidance of pseudo labels generated via adaptive class-balanced pseudo labeling, a forgetting-balanced semantic compensation loss and a forgetting-balanced relation consistency loss rectify intra-client heterogeneous forgetting of old categories under background shift; a task transition monitor additionally tackles inter-client heterogeneous forgetting.
Results: Experiments show that FBL brings clear improvements over comparison methods on both old and new categories.

Federated learning-based semantic segmentation (FSS) has drawn widespread attention via decentralized training on local clients. However, most FSS models assume categories are fixed in advance, thus suffering heavily from forgetting old categories in practical applications where local clients receive new categories incrementally while having no memory storage to access old classes. Moreover, new clients collecting novel classes may join in the global training of FSS, which further exacerbates catastrophic forgetting. To surmount the above challenges, we propose a Forgetting-Balanced Learning (FBL) model to address heterogeneous forgetting on old classes from both intra-client and inter-client aspects. Specifically, under the guidance of pseudo labels generated via adaptive class-balanced pseudo labeling, we develop a forgetting-balanced semantic compensation loss and a forgetting-balanced relation consistency loss to rectify intra-client heterogeneous forgetting of old categories with background shift. It performs balanced gradient propagation and relation consistency distillation within local clients. Moreover, to tackle heterogeneous forgetting from the inter-client aspect, we propose a task transition monitor. It can identify new classes under privacy protection and store the latest old global model for relation distillation. Qualitative experiments reveal large improvement of our model against comparison methods. The code is available at https://github.com/JiahuaDong/FISS.
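The "adaptive class-balanced" part of the pseudo labeling can be illustrated with a toy frequency-scaled threshold rule. This is a hedged sketch only: the function name, the `base`/`floor` constants, and the linear scaling are assumptions, not the paper's exact formulation.

```python
def class_balanced_thresholds(class_freqs, base=0.9, floor=0.5):
    """Scale each class's confidence threshold by its frequency so that
    rare classes are not starved of pseudo labels by frequent ones."""
    max_f = max(class_freqs.values())
    return {c: max(floor, base * f / max_f) for c, f in class_freqs.items()}

thresholds = class_balanced_thresholds({"road": 1000, "car": 500, "bicycle": 50})
# rare "bicycle" pixels only need to clear the floor threshold
```

With a single fixed threshold, predictions for rare classes would almost never be confident enough to become pseudo labels, which is exactly the imbalance such a scheme tries to counter.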

Avatars Grow Legs: Generating Smooth Human Motion From Sparse Tracking Inputs With Diffusion Model
Du, YumingandKips, RobinandPumarola, AlbertandStarke, SebastianandThabet, AliandSanakoyeu, Artsiom



Research question: How to accurately control full-body 3D avatars, particularly when only sparse upper-body tracking signals are available.
Motivation: With the popularity of AR/VR applications, demand for realistic, accurate control of full-body 3D avatars keeps growing, but available tracking signals are usually limited to the user's head and wrists, so the lower body must be synthesized from the limited information provided by upper-body joints.
Method: We propose AGRoL, a novel conditional diffusion model built on a simple multi-layer perceptron (MLP) architecture and a new conditioning scheme for motion data, which predicts accurate full-body motion, in particular the challenging lower-body movement, from sparse upper-body tracking signals.
Results: Trained and evaluated on the AMASS motion capture dataset, AGRoL outperforms existing methods in generated motion accuracy and smoothness, and its compact design allows real-time operation, making it well suited for online body-tracking applications.

With the recent surge in popularity of AR/VR applications, realistic and accurate control of 3D full-body avatars has become a highly demanded feature. A particular challenge is that only a sparse tracking signal is available from standalone HMDs (Head Mounted Devices), often limited to tracking the user's head and wrists. While this signal is resourceful for reconstructing the upper body motion, the lower body is not tracked and must be synthesized from the limited information provided by the upper body joints. In this paper, we present AGRoL, a novel conditional diffusion model specifically designed to track full bodies given sparse upper-body tracking signals. Our model is based on a simple multi-layer perceptron (MLP) architecture and a novel conditioning scheme for motion data. It can predict accurate and smooth full-body motion, particularly the challenging lower body movement. Unlike common diffusion architectures, our compact architecture can run in real-time, making it suitable for online body-tracking applications. We train and evaluate our model on AMASS motion capture dataset, and demonstrate that our approach outperforms state-of-the-art methods in generated motion accuracy and smoothness. We further justify our design choices through extensive experiments and ablation studies.

NAR-Former: Neural Architecture Representation Learning Towards Holistic Attributes Prediction
Yi, YunandZhang, HaokuiandHu, WenzeandWang, NannanandWang, Xiaoyu



Research question: How to model and learn representations of neural networks themselves in order to estimate attributes of different architectures, such as accuracy and latency.
Motivation: With the wide and deep adoption of deep learning models in real applications, there is a growing need to model and learn representations of the neural networks themselves.
Method: We propose a neural architecture representation model that estimates these attributes holistically. Specifically, a simple and effective tokenizer first encodes both the operation and topology information of a neural network into a single sequence; a multi-stage fusion transformer then builds a compact vector representation from the converted sequence. For effective training, we further propose an information flow consistency augmentation and a corresponding architecture consistency loss, which bring more benefit with fewer augmented samples than previous random augmentation strategies.
Results: Experiments on NAS-Bench-101, NAS-Bench-201, the DARTS search space, and NNLQP show that the proposed framework can predict the accuracy and latency of both cell architectures and whole deep neural networks with promising performance.

With the wide and deep adoption of deep learning models in real applications, there is an increasing need to model and learn the representations of the neural networks themselves. These models can be used to estimate attributes of different neural network architectures such as the accuracy and latency, without running the actual training or inference tasks. In this paper, we propose a neural architecture representation model that can be used to estimate these attributes holistically. Specifically, we first propose a simple and effective tokenizer to encode both the operation and topology information of a neural network into a single sequence. Then, we design a multi-stage fusion transformer to build a compact vector representation from the converted sequence. For efficient model training, we further propose an information flow consistency augmentation and correspondingly design an architecture consistency loss, which brings more benefits with fewer augmentation samples compared with previous random augmentation strategies. Experimental results on NAS-Bench-101, NAS-Bench-201, the DARTS search space and NNLQP show that our proposed framework can be used to predict the aforementioned latency and accuracy attributes of both cell architectures and whole deep neural networks, and achieves promising performance. Code is available at https://github.com/yuny220/NAR-Former.
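The idea of flattening a network's operations and topology into one sequence can be sketched as follows. This is a deliberately simplified, hypothetical tokenizer (the `OP_VOCAB` mapping and the per-node "op id followed by predecessor ids" layout are illustrative); NAR-Former's actual tokenizer uses a more elaborate encoding.

```python
# Toy vocabulary and tokenizer; both are assumptions for illustration.
OP_VOCAB = {"input": 0, "conv3x3": 1, "conv1x1": 2, "maxpool": 3, "output": 4}

def tokenize_architecture(ops, edges):
    """Flatten a DAG into one token sequence: for each node i, emit its
    operation id followed by the indices of its predecessor nodes."""
    preds = {i: [] for i in range(len(ops))}
    for src, dst in edges:
        preds[dst].append(src)
    seq = []
    for i, op in enumerate(ops):
        seq.append(OP_VOCAB[op])
        seq.extend(preds[i])
    return seq

ops = ["input", "conv3x3", "maxpool", "output"]
edges = [(0, 1), (1, 2), (0, 3), (2, 3)]
print(tokenize_architecture(ops, edges))
```

Once the architecture is a single sequence, a transformer can consume it like any other token stream, which is what makes the downstream attribute predictor architecture-agnostic.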

Accelerated Coordinate Encoding: Learning to Relocalize in Minutes Using RGB and Poses
Brachmann, EricandCavallari, TommasoandPrisacariu, VictorAdrian



Research question: Learning-based visual relocalization methods achieve high accuracy but require hours or days of training, making them impractical for most applications.
Motivation: To cut training time so that learning-based relocalization becomes practical in real deployments.
Method: Split the relocalization network into a scene-agnostic feature backbone and a scene-specific prediction head, and use an MLP prediction head to optimize over thousands of viewpoints simultaneously in each training iteration, yielding fast convergence.
Results: The method maps up to 300x faster than state-of-the-art scene coordinate regression while keeping accuracy on par.

Learning-based visual relocalizers exhibit leading pose accuracy, but require hours or days of training. Since training needs to happen on each new scene again, long training times make learning-based relocalization impractical for most applications, despite its promise of high accuracy. In this paper we show how such a system can actually achieve the same accuracy in less than 5 minutes. We start from the obvious: a relocalization network can be split into a scene-agnostic feature backbone, and a scene-specific prediction head. Less obvious: using an MLP prediction head allows us to optimize across thousands of view points simultaneously in each single training iteration. This leads to stable and extremely fast convergence. Furthermore, we substitute effective but slow end-to-end training using a robust pose solver with a curriculum over a reprojection loss. Our approach does not require privileged knowledge, such as depth maps or a 3D model, for speedy training. Overall, our approach is up to 300x faster in mapping than state-of-the-art scene coordinate regression, while keeping accuracy on par. Code is available: https://nianticlabs.github.io/ace
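The "thousands of view points per iteration" trick works because an MLP head scores each pixel independently, so a training batch can mix pixels from every mapped view rather than the pixels of one image. A minimal sketch under assumed names (the buffer layout and `make_batch` helper are illustrative, not the ACE codebase):

```python
import random

def make_batch(view_buffer, batch_size, rng=random):
    """Sample (feature, scene-coordinate) pairs uniformly across all
    cached views; because an MLP head treats pixels independently, one
    batch can span thousands of viewpoints."""
    pairs = [pair for view in view_buffer for pair in view]
    return rng.sample(pairs, batch_size)

# three toy views, four (feature, coordinate) pairs each
views = [[(f"feat{v}_{i}", f"xyz{v}_{i}") for i in range(4)] for v in range(3)]
batch = make_batch(views, 5)
```

A convolutional head, by contrast, would tie each prediction to its image neighborhood, forcing whole-image batches and far slower, less stable convergence.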

Switchable Representation Learning Framework With Self-Compatibility
Wu, ShengsenandBai, YanandLou, YihangandLinghu, XiongkunandHe, JianzhongandDuan, Ling-Yu



Research question: How to deploy a visual search system across multiple platforms with different computing and storage resources while keeping features aligned across models.
Motivation: A unified model suited to the most constrained platform has limited accuracy; models of different capacities are needed to fit the resource constraints, and the features they extract must be aligned in the metric space.
Method: We propose a Switchable representation learning Framework with Self-Compatibility (SFSC), which generates a series of compatible sub-models of different capacities in a single training process. Sub-model priorities are adjusted dynamically via uncertainty estimation to co-optimize the sub-models properly, and gradients with conflicting directions are projected to avoid mutual interference.
Results: SFSC achieves state-of-the-art performance on the evaluated datasets.

Real-world visual search systems involve deployments on multiple platforms with different computing and storage resources. Deploying a unified model that suits the most resource-constrained platform leads to limited accuracy. It is expected to deploy models with different capacities adapting to the resource constraints, which requires features extracted by these models to be aligned in the metric space. The method to achieve feature alignments is called "compatible learning". Existing research mainly focuses on the one-to-one compatible paradigm, which is limited in learning compatibility among multiple models. We propose a Switchable representation learning Framework with Self-Compatibility (SFSC). SFSC generates a series of compatible sub-models with different capacities through one training process. The optimization of sub-models faces gradients conflict, and we mitigate this problem from the perspective of the magnitude and direction. We adjust the priorities of sub-models dynamically through uncertainty estimation to co-optimize sub-models properly. Besides, the gradients with conflicting directions are projected to avoid mutual interference. SFSC achieves state-of-the-art performance on the evaluated datasets.
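Projecting away the conflicting gradient component can be sketched as in PCGrad-style methods; whether SFSC uses exactly this projection is an assumption here — the sketch only shows the generic "remove the conflicting component" idea.

```python
def project_conflicting(g1, g2):
    """If g1 conflicts with g2 (negative dot product), remove from g1
    its component along g2; otherwise return g1 unchanged."""
    dot = sum(a * b for a, b in zip(g1, g2))
    if dot >= 0:
        return list(g1)
    scale = dot / sum(b * b for b in g2)
    return [a - scale * b for a, b in zip(g1, g2)]

print(project_conflicting([1.0, -1.0], [0.0, 1.0]))  # -> [1.0, 0.0]
```

After projection, applying the modified gradient no longer pushes the shared parameters against the other sub-model's objective, which is the interference the paper sets out to avoid.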

Partial Network Cloning
Ye, JingwenandLiu, SonghuaandWang, Xinchao



Research question: This paper studies a novel partial knowledge transfer task, Partial Network Cloning (PNC).
Motivation: Unlike existing methods, which update all or at least part of the target network's parameters during knowledge transfer, PNC "clones" partial parameters from a source network and injects the cloned module into the target without modifying the target's parameters.
Method: We introduce an innovative learning scheme that simultaneously determines which component to clone from the source and where to insert it in the target network, so as to ensure optimal performance.
Results: Experiments show that, compared with parameter-tuning-based methods, our approach improves accuracy by 5% and locality by 50%.

In this paper, we study a novel task that enables partial knowledge transfer from pre-trained models, which we term as Partial Network Cloning (PNC). Unlike prior methods that update all or at least part of the parameters in the target network throughout the knowledge transfer process, PNC conducts partial parametric "cloning" from a source network and then injects the cloned module to the target, without modifying its parameters. Thanks to the transferred module, the target network is expected to gain additional functionality, such as inference on new classes; whenever needed, the cloned module can be readily removed from the target, with its original parameters and competence kept intact. Specifically, we introduce an innovative learning scheme that allows us to identify simultaneously the component to be cloned from the source and the position to be inserted within the target network, so as to ensure the optimal performance. Experimental results on several datasets demonstrate that, our method yields a significant improvement of 5% in accuracy and 50% in locality when compared with parameter-tuning based methods.

Principles of Forgetting in Domain-Incremental Semantic Segmentation in Adverse Weather Conditions
Kalb, TobiasandBeyerer, J\"urgen



Research question: Deep neural networks for scene perception in automated vehicles achieve excellent results on the domains they were trained on, but in real-world conditions the operating domain and its underlying data distribution change; how can the resulting performance drop be reduced?
Motivation: Adverse weather conditions significantly degrade model performance when such data are unavailable during training; moreover, incrementally adapting a model to new domains causes catastrophic forgetting, sharply reducing performance on previously observed domains.
Method: Through experiments and representational analyses, we study how the representations of semantic segmentation models change during domain-incremental learning in adverse weather conditions.
Results: The analyses indicate that catastrophic forgetting is primarily caused by changes to low-level features during domain-incremental learning. Learning more general features on the source domain via pre-training and image augmentations enables efficient feature reuse in subsequent tasks and drastically reduces catastrophic forgetting. These findings highlight the importance of methods that promote generalized features for effective continual learning algorithms.

Deep neural networks for scene perception in automated vehicles achieve excellent results for the domains they were trained on. However, in real-world conditions, the domain of operation and its underlying data distribution are subject to change. Adverse weather conditions, in particular, can significantly decrease model performance when such data are not available during training. Additionally, when a model is incrementally adapted to a new domain, it suffers from catastrophic forgetting, causing a significant drop in performance on previously observed domains. Despite recent progress in reducing catastrophic forgetting, its causes and effects remain obscure. Therefore, we study how the representations of semantic segmentation models are affected during domain-incremental learning in adverse weather conditions. Our experiments and representational analyses indicate that catastrophic forgetting is primarily caused by changes to low-level features in domain-incremental learning and that learning more general features on the source domain using pre-training and image augmentations leads to efficient feature reuse in subsequent tasks, which drastically reduces catastrophic forgetting. These findings highlight the importance of methods that facilitate generalized features for effective continual learning algorithms.

topic-8

Topic words :  image,  diffusion,  generation,  images,  face,  quality,  high,  latent

DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-Aware Scene Synthesis
Xu, YinghaoandChai, MengleiandShi, ZifanandPeng, SidaandSkorokhodov, IvanandSiarohin, AliaksandrandYang, CeyuanandShen, YujunandLee, Hsin-YingandZhou, BoleiandTulyakov, Sergey



Research question: Existing 3D-aware image synthesis methods mainly focus on generating single canonical objects and have limited capacity to compose complex scenes containing multiple objects.
Motivation: To address this problem, this paper presents DisCoScene, a 3D-aware generative model for high-quality and controllable scene synthesis.
Method: The key ingredient of DisCoScene is a very abstract object-level representation (3D bounding boxes without semantic annotation) serving as the scene layout prior. This representation is easy to obtain, general enough to describe various scene contents, and informative enough to disentangle objects from the background; it also serves as an intuitive user control for scene editing. Based on this prior, the proposed model spatially disentangles the whole scene into object-centric generative radiance fields, learning from only 2D images with global-local discrimination.
Results: DisCoScene achieves state-of-the-art performance on many scene datasets, including the challenging Waymo outdoor dataset. Our code will be publicly released.

Existing 3D-aware image synthesis approaches mainly focus on generating a single canonical object and show limited capacity in composing a complex scene containing a variety of objects. This work presents DisCoScene: a 3D-aware generative model for high-quality and controllable scene synthesis. The key ingredient of our method is a very abstract object-level representation (i.e., 3D bounding boxes without semantic annotation) as the scene layout prior, which is simple to obtain, general to describe various scene contents, and yet informative to disentangle objects and background. Moreover, it serves as an intuitive user control for scene editing. Based on such a prior, the proposed model spatially disentangles the whole scene into object-centric generative radiance fields by learning on only 2D images with the global-local discrimination. Our model obtains the generation fidelity and editing flexibility of individual objects while being able to efficiently compose objects and the background into a complete scene. We demonstrate state-of-the-art performance on many scene datasets, including the challenging Waymo outdoor dataset. Our code will be made publicly available.

An Image Quality Assessment Dataset for Portraits
Chahine, NicolasandCalarasanu, StefaniaandGarcia-Civiero, DavideandCayla, Th\'eoandFerradans, SiraandPonce, Jean



Research question: How to improve the quality of smartphone photos, particularly in portrait photography.
Motivation: As demand for better smartphone photos keeps growing, manufacturers rely on perceptual quality criteria when developing smartphone cameras; this costly process can be partially replaced by learning-based automatic image quality assessment.
Method: This paper introduces PIQ23, a portrait-specific IQA dataset of 5116 images covering 50 predefined scenarios, acquired by 100 smartphones across a variety of brands, models, and use cases. The dataset includes individuals of various genders and ethnicities who gave explicit, informed consent for their portraits to be used in public research. It is annotated via pairwise comparisons (PWC) collected from over 30 image quality experts for three image attributes: face detail preservation, face target exposure, and overall image quality.
Results: An in-depth statistical analysis of these annotations allows the consistency of PIQ23 to be evaluated. Finally, an extensive comparison with existing baselines shows that semantic information (image context) can be used to improve IQA predictions.

Year after year, the demand for ever-better smartphone photos continues to grow, in particular in the domain of portrait photography. Manufacturers thus use perceptual quality criteria throughout the development of smartphone cameras. This costly procedure can be partially replaced by automated learning-based methods for image quality assessment (IQA). Due to its subjective nature, it is necessary to estimate and guarantee the consistency of the IQA process, a characteristic lacking in the mean opinion scores (MOS) widely used for crowdsourcing IQA. In addition, existing blind IQA (BIQA) datasets pay little attention to the difficulty of cross-content assessment, which may degrade the quality of annotations. This paper introduces PIQ23, a portrait-specific IQA dataset of 5116 images of 50 predefined scenarios acquired by 100 smartphones, covering a high variety of brands, models, and use cases. The dataset includes individuals of various genders and ethnicities who have given explicit and informed consent for their photographs to be used in public research. It is annotated by pairwise comparisons (PWC) collected from over 30 image quality experts for three image attributes: face detail preservation, face target exposure, and overall image quality. An in-depth statistical analysis of these annotations allows us to evaluate their consistency over PIQ23. Finally, we show through an extensive comparison with existing baselines that semantic information (image context) can be used to improve IQA predictions.
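Pairwise comparisons like PIQ23's can be converted into per-image quality scores with a Bradley-Terry model. Whether the paper's own analysis pipeline uses this exact estimator is an assumption; the simple minorize-maximize iteration below is a generic sketch of the idea.

```python
def bradley_terry(wins, n_items, iters=200):
    """Estimate latent quality scores from a win matrix wins[i][j] =
    number of times item i beat item j, via simple MM updates."""
    p = [1.0] * n_items
    for _ in range(iters):
        new = []
        for i in range(n_items):
            num = sum(wins[i][j] for j in range(n_items) if j != i)
            den = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                      for j in range(n_items) if j != i)
            new.append(num / den if den else p[i])
        total = sum(new)
        p = [x * n_items / total for x in new]
    return p

scores = bradley_terry([[0, 8], [2, 0]], 2)
# item 0 wins 8 of 10 comparisons, so it receives the higher score
```

Unlike raw mean opinion scores, scores derived from pairwise wins are invariant to each rater's personal scale, which is the consistency argument the paper makes for PWC annotation.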

Text-Guided Unsupervised Latent Transformation for Multi-Attribute Image Manipulation
Wei, XiwenandXu, ZhenandLiu, ChengandWu, SiandYu, ZhiwenandWong, HauSan



Research question: Existing image editing methods mainly rely on supervised learning of semantically meaningful latent-space traversal directions, with each manipulation step typically determined for a single attribute.
Motivation: To address this limitation, we propose a Text-guided Unsupervised StyleGAN Latent Transformation (TUSLT) model, which adaptively infers a single transformation step in StyleGAN's latent space to simultaneously manipulate multiple attributes of a given input image.
Method: We adopt a two-stage latent mapping network that breaks the transformation process into two manageable steps: the network first learns a diverse set of semantic directions tailored to the input image, then nonlinearly fuses the directions associated with the target attributes to infer a residual vector.
Results: By leveraging CLIP's cross-modal text-image representation, pseudo annotations can be produced from the semantic similarity between preset attribute text descriptions and training images, and an auxiliary attribute classifier is further trained jointly with the latent mapping network to provide semantic guidance. Experiments show that the adopted strategies contribute to TUSLT's strong performance.

Great progress has been made in StyleGAN-based image editing. To associate with preset attributes, most existing approaches focus on supervised learning for semantically meaningful latent space traversal directions, and each manipulation step is typically determined for an individual attribute. To address this limitation, we propose a Text-guided Unsupervised StyleGAN Latent Transformation (TUSLT) model, which adaptively infers a single transformation step in the latent space of StyleGAN to simultaneously manipulate multiple attributes on a given input image. Specifically, we adopt a two-stage architecture for a latent mapping network to break down the transformation process into two manageable steps. Our network first learns a diverse set of semantic directions tailored to an input image, and later nonlinearly fuses the ones associated with the target attributes to infer a residual vector. The resulting tightly interlinked two-stage architecture delivers the flexibility to handle diverse attribute combinations. By leveraging the cross-modal text-image representation of CLIP, we can perform pseudo annotations based on the semantic similarity between preset attribute text descriptions and training images, and further jointly train an auxiliary attribute classifier with the latent mapping network to provide semantic guidance. We perform extensive experiments to demonstrate that the adopted strategies contribute to the superior performance of TUSLT.
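The CLIP-based pseudo annotation reduces to thresholding cosine similarity between an image embedding and each attribute's text embedding. A minimal sketch with toy 2-D embeddings; the threshold value and attribute names are illustrative assumptions, and real CLIP embeddings are high-dimensional.

```python
def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def pseudo_label(image_emb, attr_text_embs, threshold=0.5):
    """Assign an attribute pseudo label whenever the image embedding is
    close enough to that attribute's text embedding."""
    return [name for name, t in attr_text_embs.items()
            if cosine(image_emb, t) > threshold]

labels = pseudo_label([1.0, 0.0],
                      {"smiling": [0.9, 0.1], "eyeglasses": [0.0, 1.0]})
```

These pseudo labels are what let the auxiliary attribute classifier be trained without any manually annotated attribute data.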

SIEDOB: Semantic Image Editing by Disentangling Object and Background
Luo, WuyangandYang, SuandZhang, XinjianandZhang, Weishan



Research question: Existing semantic image editing methods treat background and objects as a whole, which limits their ability to process content-rich images and can produce unrealistic objects and texture-inconsistent backgrounds.
Motivation: To address this problem, we propose a new paradigm: Semantic Image Editing by Disentangling Object and Background (SIEDOB).
Method: SIEDOB decomposes the edited input into background regions and instance-level objects, which are fed into dedicated generators. All synthesized parts are embedded back into their original locations, and a fusion network produces a harmonized result. We further propose several innovative designs, including a Semantic-Aware Self-Propagation Module, a Boundary-Anchored Patch Discriminator, and a Style-Diversity Object Generator, and integrate them into SIEDOB.
Results: Extensive experiments on the Cityscapes and ADE20K-Room datasets show that our method significantly outperforms the baselines, especially in synthesizing realistic, diverse objects and texture-consistent backgrounds.

Semantic image editing provides users with a flexible tool to modify a given image guided by a corresponding segmentation map. In this task, the features of the foreground objects and the backgrounds are quite different. However, all previous methods handle backgrounds and objects as a whole using a monolithic model. Consequently, they remain limited in processing content-rich images and suffer from generating unrealistic objects and texture-inconsistent backgrounds. To address this issue, we propose a novel paradigm, Semantic Image Editing by Disentangling Object and Background (SIEDOB), the core idea of which is to explicitly leverage several heterogeneous subnetworks for objects and backgrounds. First, SIEDOB disassembles the edited input into background regions and instance-level objects. Then, we feed them into the dedicated generators. Finally, all synthesized parts are embedded in their original locations, and a fusion network is used to obtain a harmonized result. Moreover, to produce high-quality edited images, we propose some innovative designs, including Semantic-Aware Self-Propagation Module, Boundary-Anchored Patch Discriminator, and Style-Diversity Object Generator, and integrate them into SIEDOB. We conduct extensive experiments on Cityscapes and ADE20K-Room datasets and exhibit that our method remarkably outperforms the baselines, especially in synthesizing realistic and diverse objects and texture-consistent backgrounds.

Learning Semantic-Aware Disentangled Representation for Flexible 3D Human Body Editing
Sun, XiaokunandFeng, QiaoandLi, XiongzhengandZhang, JinsongandLai, Yu-KunandYang, JingyuandLi, Kun



Research question: 3D human body representation learning has attracted increasing attention in recent years, but due to coarse semantics and insufficient representation capacity, especially in the absence of supervised data, existing methods cannot represent human bodies flexibly, controllably, and accurately.
Motivation: This paper proposes a human body representation with fine-grained semantics and high reconstruction accuracy in an unsupervised setting.
Method: A part-aware skeleton-separated decoupling strategy establishes a correspondence between latent vectors and the geometric measures of body parts, so that human bodies can be edited controllably by modifying the corresponding latent codes.
Results: Experiments on public datasets demonstrate the method's accurate reconstruction and flexible editing capabilities.

3D human body representation learning has received increasing attention in recent years. However, existing works cannot flexibly, controllably and accurately represent human bodies, limited by coarse semantics and unsatisfactory representation capability, particularly in the absence of supervised data. In this paper, we propose a human body representation with fine-grained semantics and high reconstruction-accuracy in an unsupervised setting. Specifically, we establish a correspondence between latent vectors and geometric measures of body parts by designing a part-aware skeleton-separated decoupling strategy, which facilitates controllable editing of human bodies by modifying the corresponding latent codes. With the help of a bone-guided auto-encoder and an orientation-adaptive weighting strategy, our representation can be trained in an unsupervised manner. With the geometrically meaningful latent space, it can be applied to a wide range of applications, from human body editing to latent code interpolation and shape style transfer. Experimental results on public datasets demonstrate the accurate reconstruction and flexible editing abilities of the proposed method. The code will be available at http://cic.tju.edu.cn/faculty/likun/projects/SemanticHuman.

Paint by Example: Exemplar-Based Image Editing With Diffusion Models
Yang, BinxinandGu, ShuyangandZhang, BoandZhang, TingandChen, XuejinandSun, XiaoyanandChen, DongandWen, Fang



Research question: This paper aims to achieve more precise control by leveraging self-supervised training to disentangle and reorganize the source image and the exemplar.
Motivation: Language-guided image editing has achieved great success, but naively copying and pasting the exemplar image causes obvious fusing artifacts.
Method: An information bottleneck and strong augmentations are designed to avoid the trivial solution of directly copying and pasting the exemplar image. Meanwhile, to keep the editing process controllable, an arbitrary-shape mask is designed for the exemplar image, and classifier-free guidance is leveraged to increase similarity to the exemplar. The whole framework involves a single forward pass of the diffusion model without any iterative optimization.
Results: Experiments show that the method achieves impressive performance on in-the-wild images and enables high-fidelity, controllable editing.

Language-guided image editing has achieved great success recently. In this paper, we investigate exemplar-guided image editing for more precise control. We achieve this goal by leveraging self-supervised training to disentangle and re-organize the source image and the exemplar. However, the naive approach will cause obvious fusing artifacts. We carefully analyze it and propose an information bottleneck and strong augmentations to avoid the trivial solution of directly copying and pasting the exemplar image. Meanwhile, to ensure the controllability of the editing process, we design an arbitrary shape mask for the exemplar image and leverage the classifier-free guidance to increase the similarity to the exemplar image. The whole framework involves a single forward of the diffusion model without any iterative optimization. We demonstrate that our method achieves an impressive performance and enables controllable editing on in-the-wild images with high fidelity.
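The classifier-free guidance used to pull the result toward the exemplar follows the standard combination rule: extrapolate the conditional noise prediction away from the unconditional one. The sketch below is only that generic formula, not the paper's full sampler; the vectors stand in for noise predictions.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: eps = eps_u + s * (eps_c - eps_u).
    Larger scales push the sample harder toward the condition."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(eps_uncond, eps_cond)]

print(cfg_combine([0.0, 0.0], [1.0, -1.0], 2.0))  # -> [2.0, -2.0]
```

At scale 1.0 the rule reduces to the purely conditional prediction; scales above 1.0 are what increase similarity to the exemplar at each denoising step.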

Graphics Capsule: Learning Hierarchical 3D Face Representations From 2D Images
Yu, ChangandZhu, XiangyuandZhang, XiaomeiandZhang, ZhaoxiangandLei, Zhen



Research question: How to use capsule networks to learn hierarchical 3D face representations from large-scale unlabeled images.
Motivation: Existing capsule networks describe objects only in 2D space, limiting their ability to imitate the intrinsic 3D perception of humans.
Method: We propose an Inverse Graphics Capsule Network (IGC-Net) that learns hierarchical 3D face representations by decomposing objects into a set of semantically consistent part-level descriptions and then assembling them into object-level descriptions to build the hierarchy.
Results: Experiments show that IGC-Net reveals how neural networks, oriented at visual perception, understand faces as hierarchies of 3D models; the discovered parts can also be deployed for unsupervised face segmentation to evaluate the semantic consistency of our method.

The function of constructing the hierarchy of objects is important to the visual process of the human brain. Previous studies have successfully adopted capsule networks to decompose the digits and faces into parts in an unsupervised manner to investigate the similar perception mechanism of neural networks. However, their descriptions are restricted to the 2D space, limiting their capacities to imitate the intrinsic 3D perception ability of humans. In this paper, we propose an Inverse Graphics Capsule Network (IGC-Net) to learn the hierarchical 3D face representations from large-scale unlabeled images. The core of IGC-Net is a new type of capsule, named graphics capsule, which represents 3D primitives with interpretable parameters in computer graphics (CG), including depth, albedo, and 3D pose. Specifically, IGC-Net first decomposes the objects into a set of semantic-consistent part-level descriptions and then assembles them into object-level descriptions to build the hierarchy. The learned graphics capsules reveal how the neural networks, oriented at visual perception, understand faces as a hierarchy of 3D models. Besides, the discovered parts can be deployed to the unsupervised face segmentation task to evaluate the semantic consistency of our method. Moreover, the part-level descriptions with explicit physical meanings provide insight into the face analysis that originally runs in a black box, such as the importance of shape and texture for face recognition. Experiments on CelebA, BP4D, and Multi-PIE demonstrate the characteristics of our IGC-Net.

Make-a-Story: Visual Memory Conditioned Consistent Story Generation
Rahman, TanzilaandLee, Hsin-YingandRen, JianandTulyakov, SergeyandMahajan, ShwetaandSigal, Leonid



Research question: How to generate high-quality images or videos from text descriptions, in particular handling natural references and co-references within complex storylines.
Motivation: Existing text-conditioned generation models mostly rely on explicit descriptions of scenes and main actors; complex tasks such as story visualization, which require reasoning about when to maintain consistency (and when not to) as the story progresses, remain challenging.
Method: We propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures actor and background context across the generated frames. Sentence-conditioned soft attention over the memories enables effective reference resolution and learns to maintain scene and actor consistency when needed.
Results: Story generation experiments on the MUGEN, PororoSV, and FlintstonesSV datasets show that the method not only outperforms prior state of the art in generating high-quality frames consistent with the story, but also models appropriate correspondences between characters and backgrounds.

There has been a recent explosion of impressive generative models that can produce high quality images (or videos) conditioned on text descriptions. However, all such approaches rely on conditional sentences that contain unambiguous descriptions of scenes and main actors in them. Therefore, employing such models for the more complex task of story visualization, where references and co-references naturally exist, and one needs to reason about when to maintain consistency of actors and backgrounds across frames/scenes, and when not to, based on story progression, remains a challenge. In this work, we address the aforementioned challenges and propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context across the generated frames. Sentence-conditioned soft attention over the memories enables effective reference resolution and learns to maintain scene and actor consistency when needed. To validate the effectiveness of our approach, we extend the MUGEN dataset and introduce additional characters, backgrounds and referencing in multi-sentence storylines. Our experiments for story generation on the MUGEN, the PororoSV and the FlintstonesSV dataset show that our method not only outperforms prior state-of-the-art in generating frames with high visual quality, which are consistent with the story, but also models appropriate correspondences between the characters and the background.

StyleGAN Salon: Multi-View Latent Optimization for Pose-Invariant Hairstyle Transfer
Khwanmuang, SasikarnandPhongthawee, PakkaponandSangkloy, PatsornandSuwajanakorn, Supasorn



Research question: This paper aims to transfer the hairstyle of a reference image onto an input photo for virtual hair try-on.
Motivation: Existing solutions use StyleGAN to hallucinate any missing parts and produce a seamless face-hair composite via so-called GAN inversion or projection. However, controlling these hallucinations to accurately transfer the hairstyle while preserving the input's face shape and identity remains challenging.
Method: We propose a multi-view optimization framework that uses "two different views" of reference composites to semantically guide occluded or ambiguous regions. The optimization shares information between the two poses, allowing us to produce high-fidelity, realistic results from incomplete references.
Results: The framework produces high-quality results and outperforms prior work in a user study involving considerably more challenging hair-transfer scenarios than previously studied.

Our paper seeks to transfer the hairstyle of a reference image to an input photo for virtual hair try-on. We target a variety of challenging scenarios, such as transforming a long hairstyle with bangs to a pixie cut, which requires removing the existing hair and inferring how the forehead would look, or transferring partially visible hair from a hat-wearing person in a different pose. Past solutions leverage StyleGAN for hallucinating any missing parts and producing a seamless face-hair composite through so-called GAN inversion or projection. However, there remains a challenge in controlling the hallucinations to accurately transfer hairstyle and preserve the face shape and identity of the input. To overcome this, we propose a multi-view optimization framework that uses "two different views" of reference composites to semantically guide occluded or ambiguous regions. Our optimization shares information between two poses, which allows us to produce high fidelity and realistic results from incomplete references. Our framework produces high-quality results and outperforms prior work in a user study that consists of significantly more challenging hair transfer scenarios than previously studied. Project page: https://stylegan-salon.github.io/.

Neural Preset for Color Style Transfer
Ke, ZhanghanandLiu, YuhaoandZhu, LeiandZhao, NanxuanandLau, RynsonW.H.



Research question: This paper aims to address the limitations of existing color style transfer methods, including visual artifacts, large memory requirements, and slow style-switching speed.
Motivation: To solve these problems, we propose a Neural Preset technique based on Deterministic Neural Color Mapping (DNCM) and a two-stage pipeline.
Method: First, an image-adaptive color mapping matrix operates consistently on every pixel, avoiding artifacts and supporting high-resolution inputs with a small memory footprint. Second, the task is divided into color normalization and stylization stages, enabling efficient style switching by extracting color styles as presets and reusing them on normalized input images. Since pairwise datasets are unavailable, we describe how to train Neural Preset with a self-supervised strategy.
Results: Experiments show that Neural Preset outperforms existing methods across various applications, and the trained model naturally supports multiple tasks without fine-tuning, including low-light image enhancement, underwater image correction, image dehazing, and image harmonization.

In this paper, we present a Neural Preset technique to address the limitations of existing color style transfer methods, including visual artifacts, vast memory requirement, and slow style switching speed. Our method is based on two core designs. First, we propose Deterministic Neural Color Mapping (DNCM) to consistently operate on each pixel via an image-adaptive color mapping matrix, avoiding artifacts and supporting high-resolution inputs with a small memory footprint. Second, we develop a two-stage pipeline by dividing the task into color normalization and stylization, which allows efficient style switching by extracting color styles as presets and reusing them on normalized input images. Due to the unavailability of pairwise datasets, we describe how to train Neural Preset via a self-supervised strategy. Various advantages of Neural Preset over existing methods are demonstrated through comprehensive evaluations. Besides, we show that our trained model can naturally support multiple applications without fine-tuning, including low-light image enhancement, underwater image correction, image dehazing, and image harmonization.
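The deterministic per-pixel color mapping at the heart of DNCM amounts to multiplying each pixel's color by an image-adaptive matrix predicted by the network. The 3x3-matrix sketch below is a simplification for illustration (the paper's mapping is learned and may operate through a higher-dimensional projection):

```python
def apply_color_matrix(pixel, matrix):
    """Map an RGB pixel through a 3x3 color-mapping matrix (row-major).
    Applying one matrix to every pixel is what keeps the operation
    deterministic and artifact-free, regardless of image resolution."""
    return [sum(matrix[r][c] * pixel[c] for c in range(3)) for r in range(3)]

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(apply_color_matrix([0.2, 0.5, 0.8], identity))
```

Because the matrix is small and independent of image size, a style can be stored as a "preset" and reapplied to any normalized image almost for free, which is what makes style switching fast.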

PosterLayout: A New Benchmark and Approach for Content-Aware Visual-Textual Presentation Layout
Hsu, HsiaoYuanandHe, XiangtengandPeng, YuxinandKong, HaoandZhang, Qing



Research question: How to automatically arrange predefined elements, including text, logos, and underlays, on a given canvas for template-free creative graphic design.
Motivation: Existing methods handle inter-element and inter-layer relationships poorly, e.g., insufficient layout variety or poor spatial alignment.
Method: We first construct a new dataset named PKU PosterLayout, containing 9,974 poster-layout pairs and 905 non-empty canvas images. We then propose design sequence formation (DSF) to reorganize elements in layouts, imitating the design process of human designers, and present a novel CNN-LSTM-based conditional generative adversarial network (GAN) to generate proper layouts.
Results: Experimental results verify the usefulness of the new benchmark and the effectiveness of the proposed method, which achieves the best performance by generating suitable layouts for diverse canvases.

Content-aware visual-textual presentation layout aims at arranging spatial space on the given canvas for pre-defined elements, including text, logo, and underlay, which is a key to automatic template-free creative graphic design. In practical applications, e.g., poster designs, the canvas is originally non-empty, and both inter-element relationships as well as inter-layer relationships should be concerned when generating a proper layout. A few recent works deal with them simultaneously, but they still suffer from poor graphic performance, such as a lack of layout variety or spatial non-alignment. Since content-aware visual-textual presentation layout is a novel task, we first construct a new dataset named PKU PosterLayout, which consists of 9,974 poster-layout pairs and 905 images, i.e., non-empty canvases. It is more challenging and useful for greater layout variety, domain diversity, and content diversity. Then, we propose design sequence formation (DSF) that reorganizes elements in layouts to imitate the design processes of human designers, and a novel CNN-LSTM-based conditional generative adversarial network (GAN) is presented to generate proper layouts. Specifically, the discriminator is design-sequence-aware and will supervise the "design" process of the generator. Experimental results verify the usefulness of the new benchmark and the effectiveness of the proposed approach, which achieves the best performance by generating suitable layouts for diverse canvases. The dataset and the source code are available at https://github.com/PKU-ICST-MIPL/PosterLayout-CVPR2023.

ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model With Knowledge-Enhanced Mixture-of-Denoising-Experts
Feng, ZhidaandZhang, ZhenyuandYu, XintongandFang, YeweiandLi, LanxinandChen, XuyiandLu, YuxiangandLiu, JiaxiangandYin, WeichongandFeng, ShikunandSun, YuandChen, LiandTian, HaoandWu, HuaandWang, Haifeng



Research question: Existing text-to-image generation techniques are limited in how far they can improve image fidelity and text relevancy.
Motivation: To address these problems, we propose ERNIE-ViLG 2.0, a large-scale Chinese text-to-image diffusion model.
Method: We incorporate fine-grained textual and visual knowledge of key elements in the scene, and employ different denoising experts at different denoising stages.
Results: Experiments show that ERNIE-ViLG 2.0 not only achieves a new state of the art on MS-COCO, but also significantly outperforms recent models in image fidelity and image-text alignment.

Recent progress in diffusion models has revolutionized the popular technology of text-to-image generation. While existing approaches could produce photorealistic high-resolution images with text conditions, there are still several open problems to be solved, which limits the further improvement of image fidelity and text relevancy. In this paper, we propose ERNIE-ViLG 2.0, a large-scale Chinese text-to-image diffusion model, to progressively upgrade the quality of generated images by: (1) incorporating fine-grained textual and visual knowledge of key elements in the scene, and (2) utilizing different denoising experts at different denoising stages. With the proposed mechanisms, ERNIE-ViLG 2.0 not only achieves a new state-of-the-art on MS-COCO with zero-shot FID-30k score of 6.75, but also significantly outperforms recent models in terms of image fidelity and image-text alignment, with side-by-side human evaluation on the bilingual prompt set ViLG-300.

Learning To Generate Image Embeddings With User-Level Differential Privacy
Xu, ZhengandCollins, MaxwellandWang, YuxiaoandPanait, LiviuandOh, SewoongandAugenstein, SeanandLiu, TingandSchroff, FlorianandMcMahan, H.Brendan



Research question: How to achieve user-level differential privacy (DP) when training large image-embedding feature extractors.
Motivation: Existing methods can fail when directly applied to learn embedding models using supervised training data with a large class space.
Method: We propose DP-FedEmb, a variant of federated learning algorithms with per-user sensitivity control and noise addition, to train from user-partitioned data centralized in the datacenter. DP-FedEmb combines virtual clients, partial aggregation, private local fine-tuning, and public pretraining to achieve strong privacy-utility trade-offs.
Results: We apply DP-FedEmb to train image embedding models for faces, landmarks, and natural species, and demonstrate its superior utility under the same privacy budget on the benchmark datasets DigiFace, GLD, and iNaturalist. When millions of users participate in training, strong user-level DP guarantees (epsilon < 2) can be achieved while keeping the utility drop within 5%.

Small on-device models have been successfully trained with user-level differential privacy (DP) for next word prediction and image classification tasks in the past. However, existing methods can fail when directly applied to learn embedding models using supervised training data with a large class space. To achieve user-level DP for large image-to-embedding feature extractors, we propose DP-FedEmb, a variant of federated learning algorithms with per-user sensitivity control and noise addition, to train from user-partitioned data centralized in the datacenter. DP-FedEmb combines virtual clients, partial aggregation, private local fine-tuning, and public pretraining to achieve strong privacy-utility trade-offs. We apply DP-FedEmb to train image embedding models for faces, landmarks and natural species, and demonstrate its superior utility under the same privacy budget on benchmark datasets DigiFace, GLD and iNaturalist. We further illustrate that it is possible to achieve strong user-level DP guarantees of epsilon < 2 while controlling the utility drop within 5%, when millions of users can participate in training.

BlendFields: Few-Shot Example-Driven Facial Modeling
Kania, KacperandGarbin, StephanJ.andTagliasacchi, AndreaandEstellers, VirginiaandYi, KwangMooandValentin, JulienandTrzci\'nski, TomaszandKowalski, Marek



Research question: How to generate faithful visualizations of human faces that capture both coarse- and fine-level details of face geometry and appearance.
Motivation: Existing methods either require extensive data that is not publicly accessible to the research community, or fail to capture fine texture details because they rely on geometric face models that can represent only coarse details.
Method: Drawing on traditional computer graphics techniques, our method models unseen expressions by blending appearance: it measures local volumetric changes in extreme expressions and locally reproduces their appearance.
Results: Our method is general, adds fine-grained effects on top of smooth volumetric face deformations, and generalizes beyond faces.

Generating faithful visualizations of human faces requires capturing both coarse and fine-level details of the face geometry and appearance. Existing methods are either data-driven, requiring an extensive corpus of data not publicly accessible to the research community, or fail to capture fine details because they rely on geometric face models that cannot represent fine-grained details in texture with a mesh discretization and linear deformation designed to model only a coarse face geometry. We introduce a method that bridges this gap by drawing inspiration from traditional computer graphics techniques. Unseen expressions are modeled by blending appearance from a sparse set of extreme poses. This blending is performed by measuring local volumetric changes in those expressions and locally reproducing their appearance whenever a similar expression is performed at test time. We show that our method generalizes to unseen expressions, adding fine-grained effects on top of smooth volumetric deformations of a face, and demonstrate how it generalizes beyond faces.

3D GAN Inversion With Facial Symmetry Prior
Yin, FeiandZhang, YongandWang, XuanandWang, TengfeiandLi, XiaoyuandGong, YuanandFan, YanboandCun, XiaodongandShan, YingandOztireli, CengizandYang, Yujiu



Research question: How to achieve 3D GAN inversion by projecting a real image into the generator's latent space, enabling consistent synthesis and editing.
Motivation: Although pre-trained 3D GANs preserve a facial prior, reconstructing a 3D portrait from a single monocular image remains an ill-posed problem. Directly applying 2D GAN inversion methods focuses only on texture similarity while ignoring the correctness of the 3D geometry, which can cause geometry-collapse artifacts.
Method: We propose a novel method that introduces a facial symmetry prior to improve 3D GAN inversion. We design a pipeline and constraints that fully exploit the pseudo auxiliary view obtained by image flipping, yielding a view-consistent and well-structured geometry during inversion. To enhance texture fidelity in unobserved viewpoints, pseudo labels from depth-guided 3D warping provide extra supervision. We also design constraints to filter out conflicting regions in asymmetric cases during optimization.
Results: Comprehensive quantitative and qualitative evaluations on image reconstruction and editing demonstrate the superiority of our method.

Recently, a surge of high-quality 3D-aware GANs has emerged, leveraging the generative power of neural rendering. It is natural to pair 3D GANs with GAN inversion methods that project a real image into the generator's latent space, allowing free-view consistent synthesis and editing, referred to as 3D GAN inversion. Although the facial prior is preserved in pre-trained 3D GANs, reconstructing a 3D portrait from only one monocular image is still an ill-posed problem. The straightforward application of 2D GAN inversion methods focuses on texture similarity only while ignoring the correctness of 3D geometry shapes. It may cause geometry collapse effects, especially when reconstructing a side face under an extreme pose. Besides, the synthetic results in novel views are prone to be blurry. In this work, we propose a novel method to promote 3D GAN inversion by introducing a facial symmetry prior. We design a pipeline and constraints to make full use of the pseudo auxiliary view obtained via image flipping, which helps obtain a view-consistent and well-structured geometry shape during the inversion process. To enhance texture fidelity in unobserved viewpoints, pseudo labels from depth-guided 3D warping can provide extra supervision. We design constraints aimed at filtering out conflict areas for optimization in asymmetric situations. Comprehensive quantitative and qualitative evaluations on image reconstruction and editing demonstrate the superiority of our method.
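The pseudo auxiliary view from the symmetry prior is essentially a horizontal flip of the portrait; a minimal sketch, assuming an H x W x C array convention:

```python
import numpy as np

def pseudo_auxiliary_view(image):
    """Horizontally flip the portrait to obtain the pseudo auxiliary view
    exploited by the facial symmetry prior: a roughly symmetric face seen
    from the mirrored camera pose."""
    return np.ascontiguousarray(image[:, ::-1])
```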

SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation
Cheng, Yen-ChiandLee, Hsin-YingandTulyakov, SergeyandSchwing, AlexanderG.andGui, Liang-Yan



Research question: This paper aims to provide a simplified 3D asset generation framework for amateur users.
Motivation: To enable interactive generation, our method supports multiple input modalities that are easy for humans to provide, including images, text, partially observed shapes, and combinations of these, and allows adjusting the strength of each input.
Method: At the core of our approach is an encoder-decoder that compresses 3D shapes into a compact latent representation, on which a diffusion model is learned. To support multi-modal inputs, we employ task-specific encoders with dropout and a cross-attention mechanism.
Results: Thanks to its flexibility, our model naturally supports a variety of tasks and outperforms prior work on shape completion, image-based 3D reconstruction, and text-to-3D. Most interestingly, it combines all these tasks into one Swiss-army-knife tool, letting users generate shapes from incomplete shapes, images, and text descriptions simultaneously, with relative weights for each input to facilitate interactivity. Although our method is shape-only, we further show an efficient way to texture the generated shapes using large-scale text-to-image models.

In this work, we present a novel framework built to simplify 3D asset generation for amateur users. To enable interactive generation, our method supports a variety of input modalities that can be easily provided by a human, including images, texts, partially observed shapes and combinations of these, further allowing for adjusting the strength of each input. At the core of our approach is an encoder-decoder, compressing 3D shapes into a compact latent representation, upon which a diffusion model is learned. To enable a variety of multi-modal inputs, we employ task-specific encoders with dropout followed by a cross-attention mechanism. Due to its flexibility, our model naturally supports a variety of tasks outperforming prior works on shape completion, image-based 3D reconstruction, and text-to-3D. Most interestingly, our model can combine all these tasks into one swiss-army-knife tool, enabling the user to perform shape generation using incomplete shapes, images, and textual descriptions at the same time, providing the relative weights for each input and facilitating interactivity. Despite our approach being shape-only, we further show an efficient method to texture the generated shapes using large-scale text-to-image models.

TryOnDiffusion: A Tale of Two UNets
Zhu, LuyangandYang, DaweiandZhu, TylerandReda, FitsumandChan, WilliamandSaharia, ChitwanandNorouzi, MohammadandKemelmacher-Shlizerman, Ira



Research question: How to generate realistic virtual try-on visualizations while accommodating changes in the wearer's body pose and shape.
Motivation: Existing methods either focus on garment-detail preservation without handling pose and shape variation effectively, or allow try-on in the desired shape and pose but lack garment details.
Method: We propose a diffusion-based architecture that unifies two UNets (Parallel-UNet), preserving garment details and accommodating significant pose and body changes within a single network.
Results: Experimental results show that TryOnDiffusion achieves state-of-the-art performance both qualitatively and quantitatively.

Given two images depicting a person and a garment worn by another person, our goal is to generate a visualization of how the garment might look on the input person. A key challenge is to synthesize a photorealistic detail-preserving visualization of the garment, while warping the garment to accommodate a significant body pose and shape change across the subjects. Previous methods either focus on garment detail preservation without effective pose and shape variation, or allow try-on with the desired shape and pose but lack garment details. In this paper, we propose a diffusion-based architecture that unifies two UNets (referred to as Parallel-UNet), which allows us to preserve garment details and warp the garment for significant pose and body change in a single network. The key ideas behind Parallel-UNet include: 1) garment is warped implicitly via a cross attention mechanism, 2) garment warp and person blend happen as part of a unified process as opposed to a sequence of two separate tasks. Experimental results indicate that TryOnDiffusion achieves state-of-the-art performance both qualitatively and quantitatively.

Automatic High Resolution Wire Segmentation and Removal
Chiu, MangTikandZhang, XuanerandWei, ZijunandZhou, YuqianandShechtman, EliandBarnes, ConnellyandLin, ZheandKainz, FlorianandAmirghodsi, SohrabandShi, Humphrey



Research question: How to automatically and effectively clean up wires in photos to improve their aesthetics.
Motivation: Manually segmenting and removing wires precisely is tedious and time-consuming, especially in high-resolution photos where wires may span the entire frame, which greatly increases the difficulty of processing.
Method: We propose a two-stage approach: first, we leverage both global and local context to accurately segment wires in high-resolution images; then, we apply a tile-based inpainting strategy to remove the wires given the predicted segmentation masks. We also introduce WireSegHR, the first wire segmentation benchmark dataset.
Results: Experiments show that our automatic wire clean-up system removes wires of various appearances fully automatically, greatly improving efficiency and accuracy.

Wires and powerlines are common visual distractions that often undermine the aesthetics of photographs. The manual process of precisely segmenting and removing them is extremely tedious and may take up to hours, especially on high-resolution photos where wires may span the entire space. In this paper, we present an automatic wire clean-up system that eases the process of wire segmentation and removal/inpainting to within a few seconds. We observe several unique challenges: wires are thin, lengthy, and sparse. These are rare properties of subjects that common segmentation tasks cannot handle, especially in high-resolution images. We thus propose a two-stage method that leverages both global and local context to accurately segment wires in high-resolution images efficiently, and a tile-based inpainting strategy to remove the wires given our predicted segmentation masks. We also introduce the first wire segmentation benchmark dataset, WireSegHR. Finally, we demonstrate quantitatively and qualitatively that our wire clean-up system enables fully automated wire removal and generalizes well to various wire appearances.
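The tile-based inpainting step can be sketched as selecting only the tiles that contain predicted wire pixels; the tile size and function name below are illustrative, not from the paper:

```python
import numpy as np

def tiles_to_inpaint(mask, tile=512):
    """Given a binary wire-segmentation mask for a high-resolution photo,
    return the (row, col) indices of tiles that contain wire pixels, so
    only those tiles need to be passed to the inpainting model."""
    h, w = mask.shape
    hits = []
    for i in range(0, h, tile):
        for j in range(0, w, tile):
            if mask[i:i + tile, j:j + tile].any():
                hits.append((i // tile, j // tile))
    return hits
```

Skipping wire-free tiles is what keeps the inpainting pass fast even when the photo is tens of megapixels but the wires cover only a thin sliver of it.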

Multi-Realism Image Compression With a Conditional Generator
Agustsson, EirikurandMinnen, DavidandToderici, GeorgeandMentzer, Fabian



Research question: By optimizing the rate-distortion-realism trade-off, generative compression methods can produce detailed, realistic images even at low bit rates, instead of the blurry reconstructions of rate-distortion-optimized models.
Motivation: Previous methods do not explicitly control how much detail is synthesized, which raises a common criticism: users may worry that the generated reconstruction is a misleading departure from the input image.
Method: We mitigate these concerns by training a decoder that bridges the two regimes and navigates the distortion-realism trade-off. From a single compressed representation, the receiver can choose a low mean-squared-error reconstruction close to the input, a realistic reconstruction with high perceptual quality, or anything in between.
Results: Our method sets a new state of the art in distortion-realism, pushing the frontier of achievable distortion-realism pairs: it achieves better distortion at high realism and better realism at low distortion.

By optimizing the rate-distortion-realism trade-off, generative compression approaches produce detailed, realistic images, even at low bit rates, instead of the blurry reconstructions produced by rate-distortion optimized models. However, previous methods do not explicitly control how much detail is synthesized, which results in a common criticism of these methods: users might be worried that a misleading reconstruction far from the input image is generated. In this work, we alleviate these concerns by training a decoder that can bridge the two regimes and navigate the distortion-realism trade-off. From a single compressed representation, the receiver can decide to either reconstruct a low mean squared error reconstruction that is close to the input, a realistic reconstruction with high perceptual quality, or anything in between. With our method, we set a new state-of-the-art in distortion-realism, pushing the frontier of achievable distortion-realism pairs, i.e., our method achieves better distortions at high realism and better realism at low distortion than ever before.

High-Fidelity 3D Face Generation From Natural Language Descriptions
Wu, MenghuaandZhu, HaoandHuang, LinjiaandZhuang, YiyuandLu, YuanxunandCao, Xun



Research question: How to synthesize high-quality 3D face models from natural language descriptions.
Motivation: Synthesizing high-quality 3D face models is valuable for many applications, including avatar creation, virtual reality, and telepresence, yet little research has tapped into this task.
Method: We build the DESCRIBE3D dataset, the first large-scale dataset with fine-grained text descriptions for the text-to-3D face generation task. We then propose a two-stage framework that first generates a 3D face matching the concrete descriptions, then optimizes parameters in the 3D shape and texture space with abstract descriptions to refine the 3D face model.
Results: Extensive experiments show that our method produces faithful 3D faces that conform to the input descriptions with higher accuracy and quality than previous methods.

Synthesizing high-quality 3D face models from natural language descriptions is very valuable for many applications, including avatar creation, virtual reality, and telepresence. However, little research ever tapped into this task. We argue the major obstacle lies in 1) the lack of high-quality 3D face data with descriptive text annotation, and 2) the complex mapping relationship between descriptive language space and shape/appearance space. To solve these problems, we build DESCRIBE3D dataset, the first large-scale dataset with fine-grained text descriptions for text-to-3D face generation task. Then we propose a two-stage framework to first generate a 3D face that matches the concrete descriptions, then optimize the parameters in the 3D shape and texture space with abstract description to refine the 3D face model. Extensive experimental results show that our method can produce a faithful 3D face that conforms to the input descriptions with higher accuracy and quality than previous methods. The code and DESCRIBE3D dataset are released at https://github.com/zhuhao-nju/describe3d.

On Distillation of Guided Diffusion Models
Meng, ChenlinandRombach, RobinandGao, RuiqiandKingma, DiederikandErmon, StefanoandHo, JonathanandSalimans, Tim



Research question: How to reduce the inference-time computational cost of classifier-free guided diffusion models.
Motivation: Although highly effective for high-resolution image generation, classifier-free guided diffusion models are computationally expensive: they require evaluating two diffusion models, a conditional and an unconditional one, tens to hundreds of times.
Method: We propose distilling classifier-free guided diffusion models into fast-sampling models. We first learn a single model to match the output of the combined conditional and unconditional models, then progressively distill it into a diffusion model that requires far fewer sampling steps.
Results: For standard pixel-space diffusion models, our approach generates images visually comparable to the original model with as few as 4 sampling steps on ImageNet 64x64 and CIFAR-10, achieving comparable FID/IS scores while sampling up to 256 times faster. For latent-space diffusion models (e.g., Stable Diffusion), it generates high-fidelity images with as few as 1 to 4 denoising steps, at least 10x faster than existing methods on ImageNet 256x256 and the LAION datasets. We further demonstrate effectiveness on text-guided image editing and inpainting, where the distilled model generates high-quality results with only 2 to 4 denoising steps.

Classifier-free guided diffusion models have recently been shown to be highly effective at high-resolution image generation, and they have been widely used in large-scale diffusion frameworks including DALL*E 2, Stable Diffusion and Imagen. However, a downside of classifier-free guided diffusion models is that they are computationally expensive at inference time since they require evaluating two diffusion models, a class-conditional model and an unconditional model, tens to hundreds of times. To deal with this limitation, we propose an approach to distilling classifier-free guided diffusion models into models that are fast to sample from: Given a pre-trained classifier-free guided model, we first learn a single model to match the output of the combined conditional and unconditional models, and then we progressively distill that model into a diffusion model that requires far fewer sampling steps. For standard diffusion models trained on the pixel-space, our approach is able to generate images visually comparable to that of the original model using as few as 4 sampling steps on ImageNet 64x64 and CIFAR-10, achieving FID/IS scores comparable to that of the original model while being up to 256 times faster to sample from. For diffusion models trained on the latent-space (e.g., Stable Diffusion), our approach is able to generate high-fidelity images using as few as 1 to 4 denoising steps, accelerating inference by at least 10-fold compared to existing methods on ImageNet 256x256 and LAION datasets. We further demonstrate the effectiveness of our approach on text-guided image editing and inpainting, where our distilled model is able to generate high-quality results using as few as 2-4 denoising steps.
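The combined teacher output that the first-stage student is trained to match is the usual classifier-free guidance combination. A minimal sketch, assuming the common (1 + w) parameterization of the guidance weight (the function name is illustrative):

```python
import numpy as np

def guided_prediction(x_uncond, x_cond, w):
    """Classifier-free guided output: extrapolate from the unconditional
    prediction toward the conditional one with guidance weight w. The
    distilled single model learns to reproduce this combination, so that
    sampling no longer needs two network evaluations per step."""
    return (1.0 + w) * np.asarray(x_cond) - w * np.asarray(x_uncond)
```

With w = 0 this reduces to the plain conditional model; larger w pushes samples further toward the condition, which is exactly the trade-off the teacher exposes and the student must absorb.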

Zero-Shot Pose Transfer for Unrigged Stylized 3D Characters
Wang, JiashunandLi, XuetingandLiu, SifeiandDeMello, ShaliniandGallo, OrazioandWang, XiaolongandKautz, Jan



Research question: Transferring the pose of a reference avatar to stylized 3D characters of various shapes is a fundamental task in computer graphics.
Motivation: Existing methods either require the stylized characters to be rigged, or use the stylized character in the desired pose as ground truth during training. We present a zero-shot approach that requires only widely available deformed non-stylized avatars during training and, at inference, deforms stylized characters of significantly different shapes.
Method: We introduce a semi-supervised shape-understanding module to bypass the need for explicit correspondences at test time, and an implicit pose deformation module that deforms individual surface points to match the target pose. Furthermore, to encourage realistic and accurate deformation of stylized characters, we introduce an efficient volume-based test-time training procedure.
Results: Because our model requires neither rigging nor deformed stylized characters at training time, it generalizes to categories with scarce annotation, such as stylized quadrupeds. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art approaches trained with comparable or more supervision.

Transferring the pose of a reference avatar to stylized 3D characters of various shapes is a fundamental task in computer graphics. Existing methods either require the stylized characters to be rigged, or they use the stylized character in the desired pose as ground truth at training. We present a zero-shot approach that requires only the widely available deformed non-stylized avatars in training, and deforms stylized characters of significantly different shapes at inference. Classical methods achieve strong generalization by deforming the mesh at the triangle level, but this requires labelled correspondences. We leverage the power of local deformation, but without requiring explicit correspondence labels. We introduce a semi-supervised shape-understanding module to bypass the need for explicit correspondences at test time, and an implicit pose deformation module that deforms individual surface points to match the target pose. Furthermore, to encourage realistic and accurate deformation of stylized characters, we introduce an efficient volume-based test-time training procedure. Because it does not need rigging, nor the deformed stylized character at training time, our model generalizes to categories with scarce annotation, such as stylized quadrupeds. Extensive experiments demonstrate the effectiveness of the proposed method compared to the state-of-the-art approaches trained with comparable or more supervision. Our project page is available at https://jiashunwang.github.io/ZPT

OTAvatar: One-Shot Talking Face Avatar With Controllable Tri-Plane Rendering
Ma, ZhiyuanandZhu, XiangyuandQi, Guo-JunandLei, ZhenandZhang, Lei



Research question: How to construct face avatars represented by a neural implicit field that are simultaneously controllable, generalizable, and efficient.
Motivation: Existing methods cannot satisfy all three requirements at once: they either focus on static portraits, restricting the representation to a specific subject, or incur high computational cost, limiting flexibility.
Method: This paper proposes One-shot Talking face Avatar (OTAvatar), which constructs face avatars via a generalized controllable tri-plane rendering solution, so that each personalized avatar can be built from only a single portrait as reference. Specifically, OTAvatar first inverts a portrait image to a motion-free identity code; then the identity code and a motion code modulate an efficient CNN to generate a tri-plane volume encoding the subject in the desired motion; finally, volume rendering generates an image in any view.
Results: Benefiting from the efficient tri-plane representation, we achieve controllable rendering of generalized face avatars at 35 FPS on an A100. Experiments show promising cross-identity reenactment on subjects outside the training set, with better 3D consistency.

Controllability, generalizability and efficiency are the major objectives of constructing face avatars represented by neural implicit field. However, existing methods have not managed to accommodate the three requirements simultaneously. They either focus on static portraits, restricting the representation ability to a specific subject, or suffer from substantial computational cost, limiting their flexibility. In this paper, we propose One-shot Talking face Avatar (OTAvatar), which constructs face avatars by a generalized controllable tri-plane rendering solution so that each personalized avatar can be constructed from only one portrait as the reference. Specifically, OTAvatar first inverts a portrait image to a motion-free identity code. Second, the identity code and a motion code are utilized to modulate an efficient CNN to generate a tri-plane formulated volume, which encodes the subject in the desired motion. Finally, volume rendering is employed to generate an image in any view. The core of our solution is a novel decoupling-by-inverting strategy that disentangles identity and motion in the latent code via optimization-based inversion. Benefiting from the efficient tri-plane representation, we achieve controllable rendering of generalized face avatar at 35 FPS on A100. Experiments show promising performance of cross-identity reenactment on subjects out of the training set and better 3D consistency. The code is available at https://github.com/theEricMa/OTAvatar.

HOLODIFFUSION: Training a 3D Diffusion Model Using 2D Images
Karnewar, AnimeshandVedaldi, AndreaandNovotny, DavidandMitra, NiloyJ.



Research question: How to extend diffusion models to 3D generation.
Motivation: Although diffusion models excel at 2D image generation, extending them to 3D faces the challenges of complex data acquisition and large memory and compute requirements.
Method: We propose a new training setup supervised with only posed 2D images, together with a new image formation model that decouples model memory from spatial memory.
Results: Experiments on the CO3D dataset show that our method is scalable, trains robustly, and is competitive in sample quality and fidelity with existing 3D generative models.

Diffusion models have emerged as the best approach for generative modeling of 2D images. Part of their success is due to the possibility of training them on millions if not billions of images with a stable learning objective. However, extending these models to 3D remains difficult for two reasons. First, finding a large quantity of 3D training data is much more complex than for 2D images. Second, while it is conceptually trivial to extend the models to operate on 3D rather than 2D grids, the associated cubic growth in memory and compute complexity makes this infeasible. We address the first challenge by introducing a new diffusion setup that can be trained, end-to-end, with only posed 2D images for supervision; and the second challenge by proposing an image formation model that decouples model memory from spatial memory. We evaluate our method on real-world data, using the CO3D dataset which has not been used to train 3D generative models before. We show that our diffusion models are scalable, train robustly, and are competitive in terms of sample quality and fidelity to existing approaches for 3D generative modeling.

NeRFInvertor: High Fidelity NeRF-GAN Inversion for Single-Shot Real Image Animation
Yin, YuandGhasedi, KamranandWu, HsiangTaoandYang, JiaolongandTong, XinandFu, Yun



Research question: How to use NeRF-GAN models to generate high-quality face images of real subjects.
Motivation: Although existing NeRF-GAN models can synthesize fake-identity images randomly sampled from latent space, generating face images of real subjects remains challenging due to the so-called inversion issue.
Method: This paper proposes a universal method to surgically fine-tune these NeRF-GAN models, achieving high-fidelity animation of real subjects from a single image. Given the optimized latent code of an out-of-domain real image, we apply 2D loss functions on the rendered image to reduce the identity gap. Furthermore, our method leverages explicit and implicit 3D regularizations, using in-domain neighborhood samples around the optimized latent code to remove geometric and visual artifacts.
Results: Experiments confirm that our method achieves realistic, high-fidelity, and 3D-consistent animation of real faces on multiple NeRF-GAN models across different datasets.

NeRF-based generative models have shown impressive capacity in generating high-quality images with consistent 3D geometry. Despite successful synthesis of fake identity images randomly sampled from latent space, adopting these models for generating face images of real subjects is still a challenging task due to the so-called inversion issue. In this paper, we propose a universal method to surgically fine-tune these NeRF-GAN models in order to achieve high-fidelity animation of real subjects from only a single image. Given the optimized latent code for an out-of-domain real image, we employ 2D loss functions on the rendered image to reduce the identity gap. Furthermore, our method leverages explicit and implicit 3D regularizations using the in-domain neighborhood samples around the optimized latent code to remove geometrical and visual artifacts. Our experiments confirm the effectiveness of our method in realistic, high-fidelity, and 3D consistent animation of real faces on multiple NeRF-GAN models across different datasets.

Disentangling Writer and Character Styles for Handwriting Generation
Dai, GangandZhang, YifanandWang, QingfengandDu, QingandYu, ZhuliangandLiu, ZhuomanandHuang, Shuangping



Research question: Training machines to synthesize diverse handwriting is an intriguing task, but existing RNN-based methods mainly focus on capturing a person's overall writing style, neglecting subtle style inconsistencies between characters written by the same person.
Motivation: Although a person's handwriting typically exhibits general uniformity (e.g., glyph slant and aspect ratio), small style variations remain in finer details (e.g., stroke length and curvature). We therefore propose to disentangle writer- and character-level style representations from individual handwriting to synthesize realistic online handwritten characters.
Method: We present the style-disentangled Transformer (SDT), which employs two complementary contrastive objectives to extract the style commonalities of reference samples and to capture the detailed style patterns of each sample.
Results: Extensive experiments on various language scripts demonstrate the effectiveness of SDT. Notably, our empirical findings reveal that the two learned style representations provide information at different frequency magnitudes, underscoring the importance of separate style extraction.

Training machines to synthesize diverse handwritings is an intriguing task. Recently, RNN-based methods have been proposed to generate stylized online Chinese characters. However, these methods mainly focus on capturing a person's overall writing style, neglecting subtle style inconsistencies between characters written by the same person. For example, while a person's handwriting typically exhibits general uniformity (e.g., glyph slant and aspect ratios), there are still small style variations in finer details (e.g., stroke length and curvature) of characters. In light of this, we propose to disentangle the style representations at both writer and character levels from individual handwritings to synthesize realistic stylized online handwritten characters. Specifically, we present the style-disentangled Transformer (SDT), which employs two complementary contrastive objectives to extract the style commonalities of reference samples and capture the detailed style patterns of each sample, respectively. Extensive experiments on various language scripts demonstrate the effectiveness of SDT. Notably, our empirical findings reveal that the two learned style representations provide information at different frequency magnitudes, underscoring the importance of separate style extraction. Our source code is public at: https://github.com/dailenson/SDT.

StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-Based Generator
Guan, JiazhiandZhang, ZhanwangandZhou, HangandHu, TianshuandWang, KaisiyuanandHe, DongliangandFeng, HaochengandLiu, JingtuoandDing, ErruiandLiu, ZiweiandWang, Jingdong



Research question: How to balance generation quality and generalization ability when syncing lip movements with arbitrary audio.
Motivation: Previous studies either require long-term data for training or produce a similar, low-quality movement pattern across all subjects.
Method: We propose StyleSync, an effective framework for high-fidelity lip synchronization built on a style-based generator, covering both one-shot and few-shot scenarios. We design a mask-guided spatial information encoding module that preserves the details of the given face, and modify mouth shapes from audio through modulated convolutions. Personalized lip-sync is further enabled by style-space and generator refinement on only a limited number of frames, so the identity and talking style of the target person are accurately preserved.
Results: Extensive experiments demonstrate that our method produces high-fidelity results across a variety of scenes.

Despite recent advances in syncing lip movements with any audio waves, current methods still struggle to balance generation quality and the model's generalization ability. Previous studies either require long-term data for training or produce a similar movement pattern on all subjects with low quality. In this paper, we propose StyleSync, an effective framework that enables high-fidelity lip synchronization. We identify that a style-based generator would sufficiently enable such a charming property on both one-shot and few-shot scenarios. Specifically, we design a mask-guided spatial information encoding module that preserves the details of the given face. The mouth shapes are accurately modified by audio through modulated convolutions. Moreover, our design also enables personalized lip-sync by introducing style space and generator refinement on only limited frames. Thus the identity and talking style of a target person could be accurately preserved. Extensive experiments demonstrate the effectiveness of our method in producing high-fidelity results on a variety of scenes.

High-Fidelity and Freely Controllable Talking Head Video Generation
Gao, YueandZhou, YuanandWang, JingluandLi, XiaoandMing, XiangandLu, Yan



Research question: How to generate high-quality, controllable talking-head videos.
Motivation: Current methods face several challenges: generated faces often show unexpected deformation and severe distortion, motion-related information is not explicitly disentangled, and the videos frequently flicker, all of which limit quality and controllability.
Method: We propose a novel model that leverages both self-supervised learned landmarks and 3D face-model-based landmarks to model motion, and introduce a novel motion-aware multi-scale feature alignment module to transfer motion effectively without face distortion. We further enhance the smoothness of the synthesized talking-head videos with a feature context adaptation and propagation module.
Results: Evaluations on challenging datasets show that the model achieves state-of-the-art performance across various metrics.

Talking head generation is to generate video based on a given source identity and target motion. However, current methods face several challenges that limit the quality and controllability of the generated videos. First, the generated face often has unexpected deformation and severe distortions. Second, the driving image does not explicitly disentangle movement-relevant information, such as poses and expressions, which restricts the manipulation of different attributes during generation. Third, the generated videos tend to have flickering artifacts due to the inconsistency of the extracted landmarks between adjacent frames. In this paper, we propose a novel model that produces high-fidelity talking head videos with free control over head pose and expression. Our method leverages both self-supervised learned landmarks and 3D face model-based landmarks to model the motion. We also introduce a novel motion-aware multi-scale feature alignment module to effectively transfer the motion without face distortion. Furthermore, we enhance the smoothness of the synthesized talking head videos with a feature context adaptation and propagation module. We evaluate our model on challenging datasets and demonstrate its state-of-the-art performance. More information is available at https://yuegao.me/PECHead.

Towards Accurate Image Coding: Improved Autoregressive Image Generation With Dynamic Vector Quantization
Huang, MengqiandMao, ZhendongandChen, ZhuoweiandZhang, Yongdong



Research question: Existing vector-quantization-based autoregressive models encode fixed-size image regions into fixed-length codes, ignoring differences in information density across regions; this under-represents important regions, wastes codes on unimportant ones, and ultimately degrades generation quality and speed.
Motivation: To address these problems, we propose a novel two-stage framework: a Dynamic-Quantization VAE (DQ-VAE) that encodes image regions into variable-length codes based on their information density, and a DQ-Transformer that generates images coarse-to-fine.
Method: First, DQ-VAE encodes image regions into variable-length codes according to their information density. Then, through a stacked-transformer architecture with shared-content and non-shared-position input layers, DQ-Transformer alternately models the position and content of codes at each granularity, generating images from coarse to fine.
Results: Comprehensive experiments on various generation tasks validate the method's superiority in both effectiveness and efficiency.

Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm that first learns a codebook to encode images as discrete codes, and then completes generation based on the learned codebook. However, they encode fixed-size image regions into fixed-length codes and ignore their naturally different information densities, which results in insufficiency in important regions and redundancy in unimportant ones, and finally degrades the generation quality and speed. Moreover, the fixed-length coding leads to an unnatural raster-scan autoregressive generation. To address these problems, we propose a novel two-stage framework: (1) Dynamic-Quantization VAE (DQ-VAE), which encodes image regions into variable-length codes based on their information densities for an accurate & compact code representation. (2) DQ-Transformer, which thereby generates images autoregressively from coarse-grained (smooth regions with fewer codes) to fine-grained (detail regions with more codes) by modeling the position and content of codes in each granularity alternately, through a novel stacked-transformer architecture and a shared-content, non-shared-position input-layer design. Comprehensive experiments on various generation tasks validate our superiorities in both effectiveness and efficiency.
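A toy illustration of the density-aware allocation behind DQ-VAE: split the image into a grid and give high-variance regions more codes. Using variance as the density proxy, and the specific thresholds and code budgets, are all assumptions invented for illustration, not the paper's mechanism:

```python
import numpy as np

def codes_per_region(image, grid=4, budgets=(1, 4, 16)):
    """Split a grayscale image into a grid x grid layout and assign more
    codes to high-variance (detail-rich) regions, mimicking DQ-VAE's
    variable-length coding. Thresholds and budgets are illustrative."""
    h, w = image.shape[:2]
    rh, rw = h // grid, w // grid
    alloc = np.empty((grid, grid), dtype=int)
    for i in range(grid):
        for j in range(grid):
            v = image[i * rh:(i + 1) * rh, j * rw:(j + 1) * rw].var()
            alloc[i, j] = budgets[0] if v < 0.01 else budgets[1] if v < 0.05 else budgets[2]
    return alloc
```

Smooth regions end up with a single code while textured regions get many, which is the "accurate & compact" representation the abstract describes.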

ReCo: Region-Controlled Text-to-Image Generation
Yang, Zhengyuan and Wang, Jianfeng and Gan, Zhe and Li, Linjie and Lin, Kevin and Wu, Chenfei and Duan, Nan and Liu, Zicheng and Liu, Ce and Zeng, Michael and Wang, Lijuan



Research question: How to improve the controllability of text-to-image generation models so that the content of specific regions can be precisely specified with free-form regional descriptions.
Motivation: Current large-scale text-to-image (T2I) models excel at generating high-fidelity images but offer limited controllability; for example, they cannot precisely specify the content of a particular region from a free-form text description.
Method: An effective region-control technique that augments the T2I model's input with extra position tokens. Each region is represented by four position tokens for its top-left and bottom-right corners, followed by an open-ended natural-language regional description; a pre-trained T2I model is then fine-tuned with this new input interface.
Results: Experiments show that the ReCo (Region-Controlled T2I) model achieves better image quality and more accurate object placement for arbitrary objects described by open-ended regional text, improving FID from 8.82 to 7.36 and SceneFID from 15.54 to 6.51 on COCO relative to a T2I model strengthened with positional words, together with a 20.40% gain in region classification accuracy on COCO. ReCo also better controls object count, spatial relationships, and region attributes such as color/size. Human evaluation on PaintSkill shows ReCo is +19.28% and +17.21% more accurate than the T2I model at generating images with the correct object count and spatial relationships.

Recently, large-scale text-to-image (T2I) models have shown impressive performance in generating high-fidelity images, but with limited controllability, e.g., precisely specifying the content in a specific region with a free-form text description. In this paper, we propose an effective technique for such regional control in T2I generation. We augment T2I models' inputs with an extra set of position tokens, which represent the quantized spatial coordinates. Each region is specified by four position tokens to represent the top-left and bottom-right corners, followed by an open-ended natural language regional description. Then, we fine-tune a pre-trained T2I model with this new input interface. Our model, dubbed ReCo (Region-Controlled T2I), enables the region control for arbitrary objects described by open-ended regional texts rather than by object labels from a constrained category set. Empirically, ReCo achieves better image quality than the T2I model strengthened by positional words (FID: 8.82 -> 7.36, SceneFID: 15.54 -> 6.51 on COCO), together with objects being more accurately placed, amounting to a 20.40% region classification accuracy improvement on COCO. Furthermore, we demonstrate that ReCo can better control the object count, spatial relationship, and region attributes such as color/size, with the free-form regional description. Human evaluation on PaintSkill shows that ReCo is +19.28% and +17.21% more accurate in generating images with correct object count and spatial relationship than the T2I model.
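The input interface described above can be sketched in a few lines: quantize each normalized box corner into a discrete position token and interleave the tokens with the regional text. The bin count, token spelling (`<bin_k>`), and prompt layout here are assumptions for illustration, not ReCo's exact vocabulary or tokenizer.

```python
def box_to_position_tokens(box, num_bins=1000):
    """Quantize a normalized box (x0, y0, x1, y1) into four position tokens
    for the top-left and bottom-right corners."""
    return [f"<bin_{min(int(c * num_bins), num_bins - 1)}>" for c in box]

def build_reco_prompt(global_text, regions):
    """Assemble the model input: global caption, then for each region its
    four position tokens followed by an open-ended regional description.

    `regions` is a list of (box, regional_description) pairs.
    """
    parts = [global_text]
    for box, desc in regions:
        parts.extend(box_to_position_tokens(box))
        parts.append(desc)
    return " ".join(parts)

prompt = build_reco_prompt(
    "a city street",
    [((0.1, 0.2, 0.5, 0.9), "a red double-decker bus")],
)
```

Because the regional description is free-form text rather than a category label, any object expressible in language can be placed this way.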

Fix the Noise: Disentangling Source Feature for Controllable Domain Translation
Lee, Dongyeun and Lee, Jae Young and Kim, Doyeon and Choi, Jaehyun and Yoo, Jaejun and Kim, Junmo



Research question: How to achieve high-quality domain translation with a single model while better controlling features from different domains.
Motivation: Existing methods require additional models, which is computationally demanding and yields unsatisfactory visual quality, and their restricted control steps prevent smooth transitions.
Method: Preserve source features within a disentangled subspace of the target feature space, so that a single model can generate images from an entirely new domain while smoothly controlling the degree to which source features are preserved.
Results: Experiments show the method produces more consistent and realistic images than previous works and maintains precise controllability over different levels of transformation.

Recent studies show strong generative performance in domain translation especially by using transfer learning techniques on the unconditional generator. However, the control between different domain features using a single model is still challenging. Existing methods often require additional models, which is computationally demanding and leads to unsatisfactory visual quality. In addition, they have restricted control steps, which prevents a smooth transition. In this paper, we propose a new approach for high-quality domain translation with better controllability. The key idea is to preserve source features within a disentangled subspace of a target feature space. This allows our method to smoothly control the degree to which it preserves source features while generating images from an entirely new domain using only a single model. Our extensive experiments show that the proposed method can produce more consistent and realistic images than previous works and maintain precise controllability over different levels of transformation. The code is available at LeeDongYeun/FixNoise.

FaceLit: Neural 3D Relightable Faces
Ranjan, Anurag and Yi, Kwang Moo and Chang, Jen-Hao Rick and Tuzel, Oncel



Research question: How to generate, purely from 2D images, 3D faces that can be rendered under various user-defined lighting conditions and viewpoints.
Motivation: Existing methods require careful capture setups or human labor, whereas this approach relies on off-the-shelf pose and illumination estimators and needs no manual annotation.
Method: A generative framework, FaceLit, that learns to generate 3D faces from in-the-wild 2D images by incorporating the Phong reflectance model into a neural volume rendering framework.
Results: The method enables photorealistic face generation with explicit illumination and view control on multiple datasets (FFHQ, MetFaces, and CelebA-HQ), achieving state-of-the-art photorealism among 3D-aware GANs on FFHQ with an FID score of 3.5.

We propose a generative framework, FaceLit, capable of generating a 3D face that can be rendered at various user-defined lighting conditions and views, learned purely from 2D images in-the-wild without any manual annotation. Unlike existing works that require careful capture setup or human labor, we rely on off-the-shelf pose and illumination estimators. With these estimates, we incorporate the Phong reflectance model in the neural volume rendering framework. Our model learns to generate shape and material properties of a face such that, when rendered according to the natural statistics of pose and illumination, produces photorealistic face images with multiview 3D and illumination consistency. Our method enables photorealistic generation of faces with explicit illumination and view controls on multiple datasets -- FFHQ, MetFaces and CelebA-HQ. We show state-of-the-art photorealism among 3D aware GANs on FFHQ dataset achieving an FID score of 3.5.

StyleGene: Crossover and Mutation of Region-Level Facial Genes for Kinship Face Synthesis
Li, Hao and Hou, Xianxu and Huang, Zepeng and Shen, Linlin



Research question: How to synthesize high-quality descendant face images despite the lack of large-scale, high-quality annotated kinship data.
Motivation: Because large-scale, high-quality annotated kinship data are lacking, synthesizing high-quality descendant faces with genetic relations is challenging.
Method: A Region-level Facial Gene (RFG) extraction framework that uses an Image-based Gene Encoder (IGE), a Latent-based Gene Encoder (LGE), and a Gene Decoder to learn the RFGs of a given face image and their relationship to StyleGAN2's latent space. Cycle-like losses measure the L_2 distances between the outputs of the Gene Decoder and the image encoder, and between the outputs of the LGE and IGE, so only face images are needed to train the framework.
Results: Qualitative, quantitative, and subjective experiments on the FIW, TSKinFace, and FF databases clearly show that the kinship faces generated by this approach are much better in quality and diversity than those of existing state-of-the-art methods.

High-fidelity kinship face synthesis has many potential applications, such as kinship verification, missing child identification, and social media analysis. However, it is challenging to synthesize high-quality descendant faces with genetic relations due to the lack of large-scale, high-quality annotated kinship data. This paper proposes RFG (Region-level Facial Gene) extraction framework to address this issue. We propose to use IGE (Image-based Gene Encoder), LGE (Latent-based Gene Encoder) and Gene Decoder to learn the RFGs of a given face image, and the relationships between RFGs and the latent space of StyleGAN2. As cycle-like losses are designed to measure the L_2 distances between the output of Gene Decoder and image encoder, and that between the output of LGE and IGE, only face images are required to train our framework, i.e. no paired kinship face data is required. Based upon the proposed RFGs, a crossover and mutation module is further designed to inherit the facial parts of parents. A Gene Pool has also been used to introduce the variations into the mutation of RFGs. The diversity of the faces of descendants can thus be significantly increased. Qualitative, quantitative, and subjective experiments on FIW, TSKinFace, and FF-Databases clearly show that the quality and diversity of kinship faces generated by our approach are much better than the existing state-of-the-art methods.
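The crossover-and-mutation step over region-level genes can be sketched abstractly: each facial region's gene is inherited from one parent, and a Gene Pool occasionally injects variation. The region names, uniform crossover, and 50/50 blend below are illustrative stand-ins; the paper's RFGs are learned vectors tied to StyleGAN2's latent space, not hand-built arrays.

```python
import numpy as np

def synthesize_child_genes(father, mother, gene_pool, mutate_prob=0.2, seed=0):
    """Toy crossover/mutation over region-level gene vectors.

    `father`/`mother`: dict region name -> gene vector.
    `gene_pool`: dict region name -> array of candidate donor vectors.
    """
    rng = np.random.default_rng(seed)
    child = {}
    for region in father:
        # Crossover: inherit each region's gene from one parent at random.
        gene = father[region] if rng.random() < 0.5 else mother[region]
        # Mutation: occasionally blend in a donor gene drawn from the pool,
        # which increases the diversity of the descendants' faces.
        if rng.random() < mutate_prob:
            donor = gene_pool[region][rng.integers(len(gene_pool[region]))]
            gene = 0.5 * gene + 0.5 * donor
        child[region] = gene
    return child

regions = ["eyes", "nose", "mouth"]
dad = {r: np.full(4, 1.0) for r in regions}
mom = {r: np.full(4, -1.0) for r in regions}
pool = {r: np.zeros((8, 4)) for r in regions}
child = synthesize_child_genes(dad, mom, pool)
```

The child's genes would then be mapped back through the Gene Decoder into StyleGAN2's latent space to render the descendant face.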

3D Cinemagraphy From a Single Image
Li, Xingyi and Cao, Zhiguo and Sun, Huiqiang and Zhang, Jianming and Xian, Ke and Lin, Guosheng



Research question: How to marry 2D image animation with 3D photography to generate videos containing both visual content animation and camera motion.
Motivation: Naively combining existing 2D image animation and 3D photography methods produces obvious artifacts or inconsistent animation.
Method: First convert the input image into feature-based layered depth images and unproject them to a feature point cloud. To animate the scene, perform motion estimation and lift the 2D motion into 3D scene flow. Finally, to resolve holes that appear as points move forward, bidirectionally displace the point cloud according to the scene flow, project each copy to the target image plane separately, and blend the results.
Results: Extensive experiments demonstrate the method's effectiveness, and a user study validates its compelling rendering results.

We present 3D Cinemagraphy, a new technique that marries 2D image animation with 3D photography. Given a single still image as input, our goal is to generate a video that contains both visual content animation and camera motion. We empirically find that naively combining existing 2D image animation and 3D photography methods leads to obvious artifacts or inconsistent animation. Our key insight is that representing and animating the scene in 3D space offers a natural solution to this task. To this end, we first convert the input image into feature-based layered depth images using predicted depth values, followed by unprojecting them to a feature point cloud. To animate the scene, we perform motion estimation and lift the 2D motion into the 3D scene flow. Finally, to resolve the problem of hole emergence as points move forward, we propose to bidirectionally displace the point cloud as per the scene flow and synthesize novel views by separately projecting them into target image planes and blending the results. Extensive experiments demonstrate the effectiveness of our method. A user study is also conducted to validate the compelling rendering results of our method.
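The bidirectional displacement idea can be sketched geometrically: at time t, one copy of the point cloud has moved forward by t of the scene flow and a second copy backward by (1-t), so holes opened by one copy are covered by the other; blending weights (1-t) and t favor whichever copy has moved less. The linear blend below is an illustrative stand-in for the paper's learned projection and blending.

```python
import numpy as np

def bidirectional_blend(points, colors, flow, t):
    """Displace a point cloud bidirectionally along the scene flow.

    Returns the merged point set, duplicated colors, and per-point blend
    weights for time t in [0, 1].
    """
    fwd = points + t * flow          # copy advanced forward in time
    bwd = points + (t - 1.0) * flow  # copy pulled backward from the loop end
    merged = np.concatenate([fwd, bwd])
    weights = np.concatenate([np.full(len(points), 1.0 - t),
                              np.full(len(points), t)])
    merged_colors = np.concatenate([colors, colors])
    return merged, merged_colors, weights

pts = np.array([[0.0, 0.0, 1.0]])
cols = np.array([[255, 0, 0]])
flw = np.array([[1.0, 0.0, 0.0]])   # scene flow per point
m, c, w = bidirectional_blend(pts, cols, flw, t=0.25)
```

Each merged point would then be projected to the target image plane and composited with its weight, which is where the hole-filling benefit materializes.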

Inversion-Based Style Transfer With Diffusion Models
Zhang, Yuxin and Huang, Nisha and Tang, Fan and Huang, Haibin and Ma, Chongyang and Dong, Weiming and Xu, Changsheng



Research question: How a single model can learn an artistic style directly from one painting and guide image synthesis without complex textual descriptions.
Motivation: Existing artistic-style generation methods often fail to control shape changes or convey elements, while pre-trained text-to-image diffusion probabilistic models achieve excellent quality but usually require detailed textual descriptions to accurately portray the attributes of a particular painting.
Method: An inversion-based style transfer method (InST) that efficiently and accurately learns the key information of an image, thereby capturing and transferring the artistic style of a painting.
Results: The quality and efficiency of the method are demonstrated on numerous paintings by various artists and in various styles.

The artistic style within a painting is the means of expression, which includes not only the painting material, colors, and brushstrokes, but also the high-level attributes, including semantic elements and object shapes. Previous arbitrary example-guided artistic image generation methods often fail to control shape changes or convey elements. Pre-trained text-to-image synthesis diffusion probabilistic models have achieved remarkable quality but often require extensive textual descriptions to accurately portray the attributes of a particular painting. The uniqueness of an artwork lies in the fact that it cannot be adequately explained with normal language. Our key idea is to learn the artistic style directly from a single painting and then guide the synthesis without providing complex textual descriptions. Specifically, we perceive style as a learnable textual description of a painting. We propose an inversion-based style transfer method (InST), which can efficiently and accurately learn the key information of an image, thus capturing and transferring the artistic style of a painting. We demonstrate the quality and efficiency of our method on numerous paintings of various artists and styles. Codes are available at https://github.com/zyxElsa/InST.

StyleRes: Transforming the Residuals for Real Image Editing With StyleGAN
Pehlivan, Hamza and Dalva, Yusuf and Dundar, Aysegul



Research question: How to achieve high-fidelity image inversion together with high-quality attribute editing.
Motivation: When inverting real images into StyleGAN's latent space, balancing image reconstruction fidelity against image editing quality remains an open challenge.
Method: Achieve high-fidelity inversion by learning residual features in higher-rate latent codes that the lower-rate codes cannot encode, and high-quality editing by learning how to transform those residual features to adapt to manipulations of the latent codes.
Results: The framework is trained to extract residual features and transform them via a novel architectural pipeline and cycle-consistency losses; extensive experiments and comparisons with state-of-the-art inversion methods show significant improvements.

We present a novel image inversion framework and a training pipeline to achieve high-fidelity image inversion with high-quality attribute editing. Inverting real images into StyleGAN's latent space is an extensively studied problem, yet the trade-off between the image reconstruction fidelity and image editing quality remains an open challenge. The low-rate latent spaces are limited in their expressiveness power for high-fidelity reconstruction. On the other hand, high-rate latent spaces result in degradation in editing quality. In this work, to achieve high-fidelity inversion, we learn residual features in higher latent codes that lower latent codes were not able to encode. This enables preserving image details in reconstruction. To achieve high-quality editing, we learn how to transform the residual features for adapting to manipulations in latent codes. We train the framework to extract residual features and transform them via a novel architecture pipeline and cycle consistency losses. We run extensive experiments and compare our method with state-of-the-art inversion methods. Qualitative metrics and visual comparisons show significant improvements.

Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding
Kim, Gyeongman and Shim, Hajin and Kim, Hyunsu and Choi, Yunjey and Kim, Junho and Yang, Eunho



Research question: How to extend recent face image editing methods to the face video editing task while solving the temporal consistency problem across edited frames.
Motivation: Existing face video editing methods cannot effectively ensure temporal consistency among edited frames.
Method: A novel face video editing framework based on diffusion autoencoders that extracts decomposed identity and motion features from a given video and edits the video by manipulating these features.
Results: Experiments show the method achieves excellent face video editing in various scenarios and outperforms existing GAN-based methods.

Inspired by the impressive performance of recent face image editing methods, several studies have been naturally proposed to extend these methods to the face video editing task. One of the main challenges here is temporal consistency among edited frames, which is still unresolved. To this end, we propose a novel face video editing framework based on diffusion autoencoders that can successfully extract the decomposed features - for the first time as a face video editing model - of identity and motion from a given video. This modeling allows us to edit the video by simply manipulating the temporally invariant feature to the desired direction for the consistency. Another unique strength of our model is that, since our model is based on diffusion models, it can satisfy both reconstruction and edit capabilities at the same time, and is robust to corner cases in wild face videos (e.g. occluded faces) unlike the existing GAN-based methods.

Conditional Text Image Generation With Diffusion Models
Zhu, Yuanzhi and Li, Zhaohai and Wang, Tianwei and He, Mengchao and Yao, Cong



Research question: Text image generation: leveraging the power of diffusion models to produce photo-realistic and diverse image samples under given conditions.
Motivation: Current text recognition systems rely heavily on image synthesis and augmentation, since collecting and annotating enough real text images to capture real-world complexity and diversity is difficult.
Method: A method called Conditional Text Image Generation with Diffusion Models (CTIG-DM) that exploits diffusion models' ability to generate realistic, diverse samples under given conditions, devising three conditions (image, text, and style) to control the attributes, contents, and styles of the samples during generation.
Results: Experiments show that CTIG-DM generates image samples that simulate real-world complexity and diversity and can thus boost the performance of existing text recognizers; it also shows appealing potential in domain adaptation and in generating images containing out-of-vocabulary words.

Current text recognition systems, including those for handwritten scripts and scene text, have relied heavily on image synthesis and augmentation, since it is difficult to realize real-world complexity and diversity through collecting and annotating enough real text images. In this paper, we explore the problem of text image generation, by taking advantage of the powerful abilities of Diffusion Models in generating photo-realistic and diverse image samples with given conditions, and propose a method called Conditional Text Image Generation with Diffusion Models (CTIG-DM for short). To conform to the characteristics of text images, we devise three conditions: image condition, text condition, and style condition, which can be used to control the attributes, contents, and styles of the samples in the image generation process. Specifically, four text image generation modes, namely: (1) synthesis mode, (2) augmentation mode, (3) recovery mode, and (4) imitation mode, can be derived by combining and configuring these three conditions. Extensive experiments on both handwritten and scene text demonstrate that the proposed CTIG-DM is able to produce image samples that simulate real-world complexity and diversity, and thus can boost the performance of existing text recognizers. Besides, CTIG-DM shows its appealing potential in domain adaptation and generating images containing Out-Of-Vocabulary (OOV) words.

Transforming Radiance Field With Lipschitz Network for Photorealistic 3D Scene Stylization
Zhang, Zicheng and Liu, Yinglu and Han, Congying and Pan, Yingwei and Guo, Tiande and Yao, Ting



Research question: How to exploit Neural Radiance Fields (NeRF) for photorealistic 3D scene stylization.
Motivation: Although NeRFs have greatly advanced 3D scene representation and novel view synthesis, directly using them to produce visually consistent, photorealistic stylized scenes remains challenging.
Method: The LipRF framework first pre-trains a radiance field to reconstruct the 3D scene, then emulates the style on each view with 2D photorealistic style transfer (PST) and learns a Lipschitz network to stylize the pre-trained appearance. An adaptive regularization strategy balances reconstruction and stylization, and a gradual gradient aggregation strategy is introduced to optimize LipRF cost-efficiently.
Results: Experiments show LipRF delivers high-quality, robust performance on both photorealistic 3D stylization and object appearance editing.

Recent advances in 3D scene representation and novel view synthesis have witnessed the rise of Neural Radiance Fields (NeRFs). Nevertheless, it is not trivial to exploit NeRF for the photorealistic 3D scene stylization task, which aims to generate visually consistent and photorealistic stylized scenes from novel views. Simply coupling NeRF with photorealistic style transfer (PST) will result in cross-view inconsistency and degradation of stylized view syntheses. Through a thorough analysis, we demonstrate that this non-trivial task can be simplified in a new light: When transforming the appearance representation of a pre-trained NeRF with Lipschitz mapping, the consistency and photorealism across source views will be seamlessly encoded into the syntheses. That motivates us to build a concise and flexible learning framework namely LipRF, which upgrades arbitrary 2D PST methods with Lipschitz mapping tailored for the 3D scene. Technically, LipRF first pre-trains a radiance field to reconstruct the 3D scene, and then emulates the style on each view by 2D PST as the prior to learn a Lipschitz network to stylize the pre-trained appearance. In view of that Lipschitz condition highly impacts the expressivity of the neural network, we devise an adaptive regularization to balance the reconstruction and stylization. A gradual gradient aggregation strategy is further introduced to optimize LipRF in a cost-efficient manner. We conduct extensive experiments to show the high quality and robust performance of LipRF on both photorealistic 3D stylization and object appearance editing.

DiffCollage: Parallel Generation of Large Content With Diffusion Models
Zhang, Qinsheng and Song, Jiaming and Huang, Xun and Chen, Yongxin and Liu, Ming-Yu



Research question: This paper proposes DiffCollage, a compositional diffusion model that generates large content by leveraging diffusion models trained on pieces of that content.
Motivation: Existing generative models must produce such content through an autoregressive process, which is inefficient; diffusion models trained only on individual pieces can instead be combined to generate content of arbitrary size and shape in parallel.
Method: DiffCollage uses a factor-graph representation in which each factor node represents a portion of the content and each variable node represents an overlap. This representation allows the intermediate outputs of diffusion models defined on individual nodes to be aggregated, generating content of arbitrary size and shape in parallel without resorting to an autoregressive generation procedure.
Results: DiffCollage is applied to various tasks, including infinite image generation, panorama image generation, and long-duration text-guided motion generation; extensive experiments against strong autoregressive baselines verify the method's effectiveness.

We present DiffCollage, a compositional diffusion model that can generate large content by leveraging diffusion models trained on generating pieces of the large content. Our approach is based on a factor graph representation where each factor node represents a portion of the content and a variable node represents their overlap. This representation allows us to aggregate intermediate outputs from diffusion models defined on individual nodes to generate content of arbitrary size and shape in parallel without resorting to an autoregressive generation procedure. We apply DiffCollage to various tasks, including infinite image generation, panorama image generation, and long-duration text-guided motion generation. Extensive experimental results with a comparison to strong autoregressive baselines verify the effectiveness of our approach.
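The factor-graph aggregation can be sketched in one dimension: the score of the long sample is the sum of scores on factor segments minus the scores on their overlaps, each evaluated by the same small model. The toy "model" below is the exact score of an independent standard Gaussian (grad log p(x) = -x), chosen so the aggregation can be checked in closed form; the slice layout is an illustrative assumption.

```python
import numpy as np

def diffcollage_score(x, factor_slices, overlap_slices, score_fn):
    """Factor-graph score aggregation over a 1-D sample.

    Adds the score on every factor segment and subtracts the score on every
    overlap, so each position is covered exactly once on net.
    """
    total = np.zeros_like(x)
    for s in factor_slices:
        total[s] += score_fn(x[s])
    for s in overlap_slices:
        total[s] -= score_fn(x[s])
    return total

# Toy score model: for an independent standard Gaussian, score(x) = -x.
score = lambda seg: -seg
x = np.arange(6, dtype=float)
# Two length-4 factors overlapping on positions 2..3.
factors = [slice(0, 4), slice(2, 6)]
overlaps = [slice(2, 4)]
s = diffcollage_score(x, factors, overlaps, score)
```

For this factorized distribution the aggregate reproduces the true score of the long sample exactly, which is the property that lets every factor's diffusion model run in parallel at each denoising step.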

Mofusion: A Framework for Denoising-Diffusion-Based Motion Synthesis
Dabral, Rishabh and Mughal, Muhammad Hamza and Golyanik, Vladislav and Theobalt, Christian



Research question: Existing human motion synthesis methods are either deterministic or struggle with the trade-off between motion diversity and motion quality.
Motivation: To address these limitations, MoFusion is proposed: a new denoising-diffusion-based framework for high-quality conditional human motion synthesis.
Method: Within the denoising-diffusion framework, synthesize long, temporally plausible, and semantically accurate motions from various conditioning contexts (such as music and text), and introduce well-known kinematic plausibility losses into the motion-diffusion framework via a scheduled weighting strategy.
Results: Comprehensive quantitative evaluations and a perceptual user study demonstrate that MoFusion outperforms the state of the art on established benchmarks in the literature.

Conventional methods for human motion synthesis have either been deterministic or have had to struggle with the trade-off between motion diversity vs motion quality. In response to these limitations, we introduce MoFusion, i.e., a new denoising-diffusion-based framework for high-quality conditional human motion synthesis that can synthesise long, temporally plausible, and semantically accurate motions based on a range of conditioning contexts (such as music and text). We also present ways to introduce well-known kinematic losses for motion plausibility within the motion-diffusion framework through our scheduled weighting strategy. The learned latent space can be used for several interactive motion-editing applications like in-betweening, seed-conditioning, and text-based editing, thus, providing crucial abilities for virtual-character animation and robotics. Through comprehensive quantitative evaluations and a perceptual user study, we demonstrate the effectiveness of MoFusion compared to the state-of-the-art on established benchmarks in the literature. We urge the reader to watch our supplementary video. The source code will be released.

Recognizability Embedding Enhancement for Very Low-Resolution Face Recognition and Quality Estimation
Chai, Jacky Chen Long and Ng, Tiong-Sik and Low, Cheng-Yaw and Park, Jaewoo and Teoh, Andrew Beng Jin



Research question: Addressing the challenges of very low-resolution face recognition (VLRFR), such as tiny regions of interest and poor resolution caused by extreme standoff distance or wide viewing angles of the acquisition device.
Motivation: Existing methods focus mainly on improving visual quality rather than elevating face recognizability in the embedding space.
Method: A learning-based face recognizability measure, the recognizability index (RI), together with an index diversion loss that pushes hard-to-recognize face embeddings with low RI away from the unrecognizable-faces cluster to boost their RI. A perceptibility-aware attention mechanism further attends to salient recognizable face regions, providing better explanatory and discriminative content for embedding learning.
Results: Extensive evaluations on three challenging low-resolution datasets and comparisons with state-of-the-art methods demonstrate the proposed model's superiority on the very low-resolution face recognition task.

Very low-resolution face recognition (VLRFR) poses unique challenges, such as tiny regions of interest and poor resolution due to extreme standoff distance or wide viewing angle of the acquisition device. In this paper, we study principled approaches to elevate the recognizability of a face in the embedding space instead of the visual quality. We first formulate a robust learning-based face recognizability measure, namely recognizability index (RI), based on two criteria: (i) proximity of each face embedding against the unrecognizable faces cluster center and (ii) closeness of each face embedding against its positive and negative class prototypes. We then devise an index diversion loss to push the hard-to-recognize face embedding with low RI away from unrecognizable faces cluster to boost the RI, which reflects better recognizability. Additionally, a perceptibility-aware attention mechanism is introduced to attend to the salient recognizable face regions, which offers better explanatory and discriminative content for embedding learning. Our proposed model is trained end-to-end and simultaneously serves recognizability-aware embedding learning and face quality estimation. To address VLRFR, extensive evaluations on three challenging low-resolution datasets and face quality assessment demonstrate the superiority of the proposed model over the state-of-the-art methods.
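The two criteria behind the recognizability index can be sketched with cosine similarities: distance from the unrecognizable-faces cluster center, plus the margin between the positive prototype and the hardest negative. The closed-form combination and equal weighting below are assumptions for illustration; the paper formulates RI as a learned measure, not this fixed formula.

```python
import numpy as np

def recognizability_index(embedding, ur_center, pos_proto, neg_protos):
    """Toy RI score from the paper's two criteria:
    (i) proximity to the unrecognizable (UR) faces cluster center, and
    (ii) closeness to the positive vs. negative class prototypes."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    ur_term = 1.0 - cos(embedding, ur_center)  # larger when far from UR cluster
    margin = cos(embedding, pos_proto) - max(cos(embedding, n) for n in neg_protos)
    return 0.5 * ur_term + 0.5 * margin

emb = np.array([1.0, 0.0])
ur = np.array([0.0, 1.0])      # UR cluster orthogonal to the embedding
pos = np.array([1.0, 0.0])     # aligned positive prototype
negs = [np.array([-1.0, 0.0])]
ri = recognizability_index(emb, ur, pos, negs)
```

An index diversion loss would then penalize embeddings whose score like this one falls near the unrecognizable cluster, pushing them toward higher RI.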

Shape-Aware Text-Driven Layered Video Editing
Lee, Yao-Chih and Jang, Ji-Ze Genevieve and Chen, Yi-Ting and Qiu, Elizabeth and Huang, Jia-Bin



Research question: This paper addresses shape change in video editing; existing methods can edit only object appearance and cannot handle shape changes.
Motivation: Because they use a fixed UV mapping field for the texture atlas, existing video editing methods cannot handle shape changes.
Method: A shape-aware, text-driven video editing method: first propagate the deformation field between the input and edited keyframes to all frames, then leverage a pre-trained text-conditioned diffusion model as guidance to refine shape distortion and complete unseen regions.
Results: Experiments show the method achieves shape-aware, consistent video editing and compares favorably with the state of the art.

Temporal consistency is essential for video editing applications. Existing work on layered representation of videos allows propagating edits consistently to each frame. These methods, however, can only edit object appearance rather than object shape changes due to the limitation of using a fixed UV mapping field for texture atlas. We present a shape-aware, text-driven video editing method to tackle this challenge. To handle shape changes in video editing, we first propagate the deformation field between the input and edited keyframe to all frames. We then leverage a pre-trained text-conditioned diffusion model as guidance for refining shape distortion and completing unseen regions. The experimental results demonstrate that our method can achieve shape-aware consistent video editing and compare favorably with the state-of-the-art.

QuantArt: Quantizing Image Style Transfer Towards High Visual Fidelity
Huang, Siyu and An, Jie and Wei, Donglai and Luo, Jiebo and Pfister, Hanspeter



Research question: How to achieve high visual fidelity in image style transfer, i.e., generated artworks that are indistinguishable from real ones.
Motivation: Existing style transfer algorithms minimize a hybrid loss that pushes the generated image toward high content and style similarity, but this cannot guarantee visual fidelity.
Method: A new style transfer framework, QuantArt, that pushes the latent representation of the generated artwork toward the centroids of the real-artwork distribution via vector quantization; fusing the quantized and continuous latent representations allows flexible control over content preservation, style similarity, and visual fidelity.
Results: Experiments on various style transfer settings show that QuantArt achieves significantly higher visual fidelity than existing style transfer methods.

The mechanism of existing style transfer algorithms is by minimizing a hybrid loss function to push the generated image toward high similarities in both content and style. However, this type of approach cannot guarantee visual fidelity, i.e., the generated artworks should be indistinguishable from real ones. In this paper, we devise a new style transfer framework called QuantArt for high visual-fidelity stylization. QuantArt pushes the latent representation of the generated artwork toward the centroids of the real artwork distribution with vector quantization. By fusing the quantized and continuous latent representations, QuantArt allows flexible control over the generated artworks in terms of content preservation, style similarity, and visual fidelity. Experiments on various style transfer settings show that our QuantArt framework achieves significantly higher visual fidelity compared with the existing style transfer methods.

Neural Transformation Fields for Arbitrary-Styled Font Generation
Fu, Bin and He, Junjun and Wang, Jianjun and Qiao, Yu



Research question: How to generate font images from only a few samples.
Motivation: Few-shot font generation (FFG) has become an emerging topic in recent years owing to its academic and commercial value.
Method: Model font generation as a continuous transformation process from the source character image to the target font image via the creation and dissipation of font pixels, embedding the corresponding transformations in a neural transformation field.
Results: Experiments show the method achieves state-of-the-art performance on the few-shot font generation task, demonstrating the effectiveness of the proposed model.

Few-shot font generation (FFG), aiming at generating font images with a few samples, is an emerging topic in recent years due to the academic and commercial values. Typically, the FFG approaches follow the style-content disentanglement paradigm, which transfers the target font styles to characters by combining the content representations of source characters and the style codes of reference samples. Most existing methods attempt to increase font generation ability via exploring powerful style representations, which may be a sub-optimal solution for the FFG task due to the lack of modeling spatial transformation in transferring font styles. In this paper, we model font generation as a continuous transformation process from the source character image to the target font image via the creation and dissipation of font pixels, and embed the corresponding transformations into a neural transformation field. With the estimated transformation path, the neural transformation field generates a set of intermediate transformation results via the sampling process, and a font rendering formula is developed to accumulate them into the target font image. Extensive experiments show that our method achieves state-of-the-art performance on few-shot font generation task, which demonstrates the effectiveness of our proposed model. Our implementation is available at: https://github.com/fubinfb/NTF.

EDICT: Exact Diffusion Inversion via Coupled Transformations
Wallace, Bram and Gokul, Akash and Naik, Nikhil



Research question: Finding an initial noise vector that produces the input image when fed into the diffusion process (known as inversion) is an important problem in denoising diffusion models (DDMs), with applications to real image editing.
Motivation: The standard approach for editing and inverting real images uses denoising diffusion implicit models (DDIMs) to deterministically noise the image to the intermediate state along the path the denoising would follow given the original conditioning. However, this is unstable for real images because it relies on local linearization assumptions, so errors propagate and lead to incorrect image reconstruction and loss of content.
Method: Exact Diffusion Inversion via Coupled Transformations (EDICT), an inversion method inspired by affine coupling layers. EDICT maintains two coupled noise vectors that invert each other in an alternating fashion, enabling mathematically exact inversion of both real and model-generated images.
Results: Experiments on Stable Diffusion, a state-of-the-art latent diffusion model, show that EDICT reconstructs real images with high fidelity; on complex datasets such as MS-COCO its reconstruction significantly outperforms DDIM, improving the mean squared reconstruction error by a factor of two. Using noise vectors inverted from real images, EDICT enables a wide range of image edits, from local and global semantic edits to image stylization, while remaining faithful to the original image structure. EDICT requires no model training/fine-tuning, prompt tuning, or extra data and can be combined with any pretrained DDM.

Finding an initial noise vector that produces an input image when fed into the diffusion process (known as inversion) is an important problem in denoising diffusion models (DDMs), with applications for real image editing. The standard approach for real image editing with inversion uses denoising diffusion implicit models (DDIMs) to deterministically noise the image to the intermediate state along the path that the denoising would follow given the original conditioning. However, DDIM inversion for real images is unstable as it relies on local linearization assumptions, which result in the propagation of errors, leading to incorrect image reconstruction and loss of content. To alleviate these problems, we propose Exact Diffusion Inversion via Coupled Transformations (EDICT), an inversion method that draws inspiration from affine coupling layers. EDICT enables mathematically exact inversion of real and model-generated images by maintaining two coupled noise vectors which are used to invert each other in an alternating fashion. Using Stable Diffusion [25], a state-of-the-art latent diffusion model, we demonstrate that EDICT successfully reconstructs real images with high fidelity. On complex image datasets like MS-COCO, EDICT reconstruction significantly outperforms DDIM, improving the mean square error of reconstruction by a factor of two. Using noise vectors inverted from real images, EDICT enables a wide range of image edits--from local and global semantic edits to image stylization--while maintaining fidelity to the original image structure. EDICT requires no model training/finetuning, prompt tuning, or extra data and can be combined with any pretrained DDM.
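The coupling trick that makes the inversion exact can be sketched with a toy step: each of the two sequences is updated using a noise prediction computed on the *other*, so every update can be undone algebraically regardless of how nonlinear the model is. The affine form and constants below are illustrative, not Stable Diffusion's actual schedule, and `eps` stands in for the diffusion model.

```python
import numpy as np

def edict_step(x, y, eps, a=0.98, b=0.1):
    """One coupled denoising step in the style of an affine coupling layer."""
    x_next = a * x + b * eps(y)        # x updated from y only
    y_next = a * y + b * eps(x_next)   # y updated from the new x only
    return x_next, y_next

def edict_step_inverse(x_next, y_next, eps, a=0.98, b=0.1):
    """Exact inverse: undo the updates in reverse order; every quantity on
    the right-hand side is already known, so no linearization is needed."""
    y = (y_next - b * eps(x_next)) / a
    x = (x_next - b * eps(y)) / a
    return x, y

eps = lambda z: np.tanh(z)             # arbitrary nonlinear stand-in "model"
x0 = np.array([1.0, -2.0])
y0 = np.array([1.0, -2.0])             # EDICT initializes both copies equally
x1, y1 = edict_step(x0, y0, eps)
xr, yr = edict_step_inverse(x1, y1, eps)
```

The round trip recovers the inputs up to floating-point error, which is the property DDIM inversion lacks and EDICT exploits for faithful real-image editing.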

Image Super-Resolution Using T-Tetromino Pixels
Grosche, Simon and Regensky, Andy and Seiler, Jürgen and Kaup, André



Research question: How to improve the performance of modern high-resolution imaging sensors in low-light conditions and when high frame rates are required.
Motivation: Pixel binning is performed in low light and when high frame rates are needed, and single-image super-resolution can recover the original spatial resolution; to achieve higher image quality after upscaling, a novel binning concept using T-tetromino-shaped pixels is proposed.
Method: The new binning concept is embedded in the compressed sensing framework, and its coherence is computed to motivate the sensor layouts used; the reconstruction quality of T-tetromino pixels is then investigated for the first time in the literature.
Results: Reconstruction uses a locally fully connected reconstruction (LFCR) network as well as two classical compressed-sensing methods. Combined with the proposed T-tetromino layout, the LFCR network achieves better image quality, in terms of PSNR, SSIM, and visually, than conventional single-image super-resolution with the very deep super-resolution (VDSR) network, with PSNR gains of up to +1.92 dB.

For modern high-resolution imaging sensors, pixel binning is performed in low-lighting conditions and in case high frame rates are required. To recover the original spatial resolution, single-image super-resolution techniques can be applied for upscaling. To achieve a higher image quality after upscaling, we propose a novel binning concept using tetromino-shaped pixels. It is embedded into the field of compressed sensing and the coherence is calculated to motivate the sensor layouts used. Next, we investigate the reconstruction quality using tetromino pixels for the first time in literature. Instead of using different types of tetrominoes as proposed elsewhere, we show that using a small repeating cell consisting of only four T-tetrominoes is sufficient. For reconstruction, we use a locally fully connected reconstruction (LFCR) network as well as two classical reconstruction methods from the field of compressed sensing. Using the LFCR network in combination with the proposed tetromino layout, we achieve superior image quality in terms of PSNR, SSIM, and visually compared to conventional single-image super-resolution using the very deep super-resolution (VDSR) network. For PSNR, a gain of up to +1.92 dB is achieved.
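The repeating cell of four T-tetrominoes and the resulting binning operator can be sketched directly; the particular tiling below is one valid arrangement of four T-tetrominoes in a 4x4 cell, chosen for illustration and not necessarily the paper's exact layout.

```python
import numpy as np

# A 4x4 cell tiled by four T-tetrominoes (labels 0..3); each label marks the
# four pixels summed into one measurement.
CELL = np.array([[0, 0, 0, 1],
                 [2, 0, 1, 1],
                 [2, 2, 3, 1],
                 [2, 3, 3, 3]])

def tetromino_bin(image):
    """Sum each T-tetromino's four pixels into one measurement.

    `image` height/width must be multiples of 4; returns one binned value
    per tetromino, i.e. the same 4x compression as 2x2 binning but with
    T-shaped (non-square) pixel support.
    """
    h, w = image.shape
    out = np.zeros((h // 4, w // 4, 4))
    for i in range(h // 4):
        for j in range(w // 4):
            block = image[4*i:4*i+4, 4*j:4*j+4]
            for t in range(4):
                out[i, j, t] = block[CELL == t].sum()
    return out

img = np.ones((4, 4))
measurements = tetromino_bin(img)
```

In compressed-sensing terms, each row of the implied sensing matrix is the indicator of one tetromino; the non-square supports lower the coherence relative to square binning, which is what motivates the layout.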

VIVE3D: Viewpoint-Independent Video Editing Using 3D-Aware GANs
Fr\"uhst\"uck, AnnaandSarafianos, NikolaosandXu, YuanluandWonka, PeterandTung, Tony



Research question: How to extend the capabilities of image-based 3D GANs to video editing while representing the input video in an identity-preserving and temporally consistent way.
Motivation: Current 3D GANs have limited video editing capabilities, and a new approach is needed to extend them.
Method: A novel GAN inversion technique tailored to 3D GANs that jointly embeds multiple frames and optimizes the camera parameters. Besides traditional semantic face edits (e.g., age and expression), novel head-view edits are demonstrated for the first time, enabled by the inherent properties of 3D GANs and an optical-flow-guided compositing technique.
Results: Experiments show VIVE3D generates high-fidelity face edits from a range of camera viewpoints, composited with the original video in a temporally and spatially consistent manner.

We introduce VIVE3D, a novel approach that extends the capabilities of image-based 3D GANs to video editing and is able to represent the input video in an identity-preserving and temporally consistent way. We propose two new building blocks. First, we introduce a novel GAN inversion technique specifically tailored to 3D GANs by jointly embedding multiple frames and optimizing for the camera parameters. Second, besides traditional semantic face edits (e.g. for age and expression), we are the first to demonstrate edits that show novel views of the head enabled by the inherent properties of 3D GANs and our optical flow-guided compositing technique to combine the head with the background video. Our experiments demonstrate that VIVE3D generates high-fidelity face edits at consistent quality from a range of camera viewpoints which are composited with the original video in a temporally and spatially-consistent manner.

StyleRF: Zero-Shot 3D Style Transfer of Neural Radiance Fields
Liu, Kunhao and Zhan, Fangneng and Chen, Yiwen and Zhang, Jiahui and Yu, Yingchen and El Saddik, Abdulmotaleb and Lu, Shijian and Xing, Eric P.



Research question: Existing 3D style transfer techniques face a three-way dilemma among accurate geometry reconstruction, high-quality stylization, and generalization to arbitrary new styles.
Motivation: An innovative 3D style transfer technique is proposed that resolves this dilemma by performing style transformation within the feature space of a radiance field.
Method: An explicit grid of high-level features represents the 3D scene, from which high-fidelity geometry can be reliably recovered via volume rendering; transforming the grid features according to the reference style then directly yields high-quality zero-shot style transfer.
Results: Experiments show that StyleRF achieves high-quality 3D stylization with precise geometry reconstruction and generalizes to various new styles in a zero-shot manner.

3D style transfer aims to render stylized novel views of a 3D scene with multi-view consistency. However, most existing work suffers from a three-way dilemma over accurate geometry reconstruction, high-quality stylization, and being generalizable to arbitrary new styles. We propose StyleRF (Style Radiance Fields), an innovative 3D style transfer technique that resolves the three-way dilemma by performing style transformation within the feature space of a radiance field. StyleRF employs an explicit grid of high-level features to represent 3D scenes, with which high-fidelity geometry can be reliably restored via volume rendering. In addition, it transforms the grid features according to the reference style which directly leads to high-quality zero-shot style transfer. StyleRF consists of two innovative designs. The first is sampling-invariant content transformation that makes the transformation invariant to the holistic statistics of the sampled 3D points and accordingly ensures multi-view consistency. The second is deferred style transformation of 2D feature maps which is equivalent to the transformation of 3D points but greatly reduces memory footprint without degrading multi-view consistency. Extensive experiments show that StyleRF achieves superior 3D stylization quality with precise geometry reconstruction and it can generalize to various new styles in a zero-shot manner. Project website: https://kunhao-liu.github.io/StyleRF/
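The deferred style transformation rests on a linearity argument worth making concrete: volume rendering composites per-point features as F = sum_i w_i f_i, so for a linear style transform W, W(sum_i w_i f_i) = sum_i w_i (W f_i). Transforming the rendered 2D feature map is therefore equivalent to transforming every sampled 3D point, at a fraction of the memory. The sketch below checks this identity numerically with arbitrary values; the single-ray setting is a simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))   # features of 16 sampled points on a ray
weights = rng.random(16)
weights /= weights.sum()           # volume-rendering compositing weights
W = rng.normal(size=(8, 8))        # a linear style transformation

# Deferred: render first, then apply the style transform to the 2D feature.
rendered_then_styled = W @ (weights @ feats)

# Per-point: style every 3D point feature, then render.
styled_then_rendered = weights @ (feats @ W.T)
```

Because the two orderings agree, StyleRF can defer the transform to the rendered 2D map without breaking multi-view consistency, greatly reducing the memory footprint.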

DPE: Disentanglement of Pose and Expression for General Video Portrait Editing
Pang, Youxin and Zhang, Yong and Quan, Weize and Fan, Yanbo and Cun, Xiaodong and Shan, Ying and Yan, Dong-Ming



Research question: This paper addresses the entanglement of head pose and facial expression in facial motion, particularly for video portrait editing scenarios that require modifying the expression while keeping the pose unchanged.
Motivation: Because head pose and facial expression are entangled, current facial motion transfer methods cannot be applied directly to video portrait editing; the lack of paired data (e.g., the same pose with different expressions) adds to the challenge.
Method: A novel self-supervised disentanglement framework that decouples pose and expression without 3DMMs or paired data, consisting of a motion editing module, a pose generator, and an expression generator. The editing module projects faces into a latent space where pose motion and expression motion can be disentangled, so pose or expression transfer can be conveniently performed in the latent space via addition; the two generators render the modified latent codes to images.
Results: Experiments show the method can control pose or expression independently and can be used for general video editing.

One-shot video-driven talking face generation aims at producing a synthetic talking video by transferring the facial motion from a video to an arbitrary portrait image. Head pose and facial expression are always entangled in facial motion and transferred simultaneously. However, the entanglement sets up a barrier for these methods to be used in video portrait editing directly, where it may require to modify the expression only while maintaining the pose unchanged. One challenge of decoupling pose and expression is the lack of paired data, such as the same pose but different expressions. Only a few methods attempt to tackle this challenge with the feat of 3D Morphable Models (3DMMs) for explicit disentanglement. But 3DMMs are not accurate enough to capture facial details due to the limited number of Blendshapes, which has side effects on motion transfer. In this paper, we introduce a novel self-supervised disentanglement framework to decouple pose and expression without 3DMMs and paired data, which consists of a motion editing module, a pose generator, and an expression generator. The editing module projects faces into a latent space where pose motion and expression motion can be disentangled, and the pose or expression transfer can be performed in the latent space conveniently via addition. The two generators render the modified latent codes to images, respectively. Moreover, to guarantee the disentanglement, we propose a bidirectional cyclic training strategy with well-designed constraints. Evaluations demonstrate our method can control pose or expression independently and be used for general video editing.

Implicit Diffusion Models for Continuous Super-Resolution
Gao, Sicheng and Liu, Xuhui and Zeng, Bohan and Xu, Sheng and Li, Yanjing and Luo, Xiaoyan and Liu, Jianzhuang and Zhen, Xiantong and Zhang, Baochang



Research question: Address the over-smoothing and artifact problems common in current image super-resolution methods, as well as their restriction to fixed magnification factors.
Motivation: Image super-resolution (SR) has attracted increasing attention due to its wide applications, yet existing SR methods generally suffer from over-smoothing and artifacts, and most work only with fixed magnifications.
Method: An Implicit Diffusion Model (IDM) for high-fidelity continuous image super-resolution. IDM integrates an implicit neural representation and a denoising diffusion model in a unified end-to-end framework, where the implicit neural representation is adopted in the decoding process to learn continuous-resolution representations. A scale-controllable conditioning mechanism, consisting of a low-resolution (LR) conditioning network and a scaling factor, adjusts the resolution and accordingly modulates the proportion of LR information and generated features in the final output, allowing the model to meet continuous-resolution requirements.
Results: Extensive experiments validate the effectiveness of IDM and demonstrate its superior performance over prior art.

Image super-resolution (SR) has attracted increasing attention due to its wide applications. However, current SR methods generally suffer from over-smoothing and artifacts, and most work only with fixed magnifications. This paper introduces an Implicit Diffusion Model (IDM) for high-fidelity continuous image super-resolution. IDM integrates an implicit neural representation and a denoising diffusion model in a unified end-to-end framework, where the implicit neural representation is adopted in the decoding process to learn continuous-resolution representation. Furthermore, we design a scale-controllable conditioning mechanism that consists of a low-resolution (LR) conditioning network and a scaling factor. The scaling factor regulates the resolution and accordingly modulates the proportion of the LR information and generated features in the final output, which enables the model to accommodate the continuous-resolution requirement. Extensive experiments validate the effectiveness of our IDM and demonstrate its superior performance over prior arts.
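One plausible reading of the scale-controllable conditioning is a scale-dependent blend of LR-conditioning features and generated features. The linear schedule and the name `scale_conditioned_mix` below are illustrative assumptions, not the paper's exact mechanism:

```python
import numpy as np

def scale_conditioned_mix(lr_feat, gen_feat, scale, s_min=1.0, s_max=4.0):
    """Blend LR-conditioning features with generated features by scale.

    At small magnifications the output leans on the LR input; at large
    magnifications the generated features dominate. The linear schedule
    is a toy choice for illustration.
    """
    t = (scale - s_min) / (s_max - s_min)
    t = min(max(t, 0.0), 1.0)  # clamp to [0, 1]
    return (1.0 - t) * lr_feat + t * gen_feat

lr = np.ones((8, 8))    # stand-in for LR-conditioning features
gen = np.zeros((8, 8))  # stand-in for generated diffusion features
mid = scale_conditioned_mix(lr, gen, 2.5)
```

The key property the abstract describes is preserved: the scaling factor continuously regulates how much of the final output comes from the LR information versus the generated features.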

VGFlow: Visibility Guided Flow Network for Human Reposing
Jain, Rishabh and Singh, Krishna Kumar and Hemani, Mayur and Lu, Jingwan and Sarkar, Mausoom and Ceylan, Duygu and Krishnamurthy, Balaji



Research question: Generating realistic images of a human model in an arbitrary conceivable pose involves multiple difficulties, including preserving texture, maintaining pattern coherence, respecting cloth boundaries, handling occlusions, and managing skin generation.
Motivation: Existing methods are limited in texture preservation, pattern coherence, cloth boundaries, occlusion handling, and skin generation; moreover, the space of possible human poses is large and variable, clothing is highly non-rigid, and body shapes vary widely across the population.
Method: VGFlow uses a visibility-guided flow module to disentangle the flow into visible and invisible parts of the target, enabling simultaneous texture preservation and style manipulation. To handle distinct body shapes and avoid network artifacts, a self-supervised patch-wise "realness" loss further improves output quality.
Results: VGFlow achieves state-of-the-art results, both qualitatively and quantitatively, on image quality metrics such as SSIM, LPIPS, and FID.

The task of human reposing involves generating a realistic image of a model standing in an arbitrary conceivable pose. There are multiple difficulties in generating perceptually accurate images, and existing methods suffer from limitations in preserving texture, maintaining pattern coherence, respecting cloth boundaries, handling occlusions, manipulating skin generation, etc. These difficulties are further exacerbated by the fact that the possible space of pose orientations for humans is large and variable, the nature of clothing items is highly non-rigid, and body shapes differ largely among the population. To alleviate these difficulties and synthesize perceptually accurate images, we propose VGFlow, a model which uses a visibility guided flow module to disentangle the flow into visible and invisible parts of the target for simultaneous texture preservation and style manipulation. Furthermore, to tackle distinct body shapes and avoid network artifacts, we also incorporate a self-supervised patch-wise "realness" loss to further improve the output. VGFlow achieves state-of-the-art results as observed qualitatively and quantitatively on different image quality metrics (SSIM, LPIPS, FID).

CoralStyleCLIP: Co-Optimized Region and Layer Selection for Image Editing
Revanur, Ambareesh and Basu, Debraj and Agrawal, Shradha and Agarwal, Dhwanit and Pai, Deepak



Research question: Edit fidelity is a significant issue in open-world controllable generative image editing.
Motivation: Recent CLIP-based methods alleviate these problems by introducing spatial attention in a handpicked StyleGAN layer, but at the cost of simplicity.
Method: CoralStyleCLIP incorporates a multi-layer attention-guided blending strategy in the feature space of StyleGAN2 to obtain high-fidelity edits.
Results: The method achieves high-quality edits while remaining easy to use; experiments show CoralStyleCLIP keeps time complexity low across architectural intricacies without sacrificing edit quality.

Edit fidelity is a significant issue in open-world controllable generative image editing. Recently, CLIP-based approaches have traded off simplicity to alleviate these problems by introducing spatial attention in a handpicked layer of a StyleGAN. In this paper, we propose CoralStyleCLIP, which incorporates a multi-layer attention-guided blending strategy in the feature space of StyleGAN2 for obtaining high-fidelity edits. We propose multiple forms of our co-optimized region and layer selection strategy to demonstrate the variation of time complexity with the quality of edits over different architectural intricacies while preserving simplicity. We conduct extensive experimental analysis and benchmark our method against state-of-the-art CLIP-based methods. Our findings suggest that CoralStyleCLIP results in high-quality edits while preserving the ease of use.

GLeaD: Improving GANs With a Generator-Leading Task
Bai, Qingyan and Yang, Ceyuan and Xu, Yinghao and Liu, Xihui and Yang, Yujiu and Shen, Yujun



Research question: How to make generative adversarial network (GAN) training a fairer game so as to improve model performance.
Motivation: In current GAN training, the discriminator (D) tends to dominate the competition, which can lead to suboptimal results.
Method: A new adversarial training paradigm in which the generator (G) also assigns a task to the discriminator (D): given an image, D must extract representative features that G can decode to reconstruct the input.
Results: Experiments on multiple datasets show substantial gains over baselines, e.g., improving the FID of StyleGAN2 from 4.30 to 2.55 on LSUN Bedroom and from 4.04 to 2.82 on LSUN Church.

Generative adversarial network (GAN) is formulated as a two-player game between a generator (G) and a discriminator (D), where D is asked to differentiate whether an image comes from real data or is produced by G. Under such a formulation, D plays as the rule maker and hence tends to dominate the competition. Towards a fairer game in GANs, we propose a new paradigm for adversarial training, which makes G assign a task to D as well. Specifically, given an image, we expect D to extract representative features that can be adequately decoded by G to reconstruct the input. That way, instead of learning freely, D is urged to align with the view of G for domain classification. Experimental results on various datasets demonstrate the substantial superiority of our approach over the baselines. For instance, we improve the FID of StyleGAN2 from 4.30 to 2.55 on LSUN Bedroom and from 4.04 to 2.82 on LSUN Church. We believe that the pioneering attempt presented in this work could inspire the community with better designed generator-leading tasks for GAN improvement. Project page is at https://ezioby.github.io/glead/.
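The generator-leading task can be sketched as a reconstruction objective layered on top of the usual GAN losses: D produces features, G decodes them, and the L2 error penalizes D for features G cannot decode. The single-linear-map stand-ins for D and G below are toy assumptions, not the paper's networks:

```python
import numpy as np

rng = np.random.default_rng(1)

def discriminator_features(image, W_d):
    """Toy stand-in for D's feature extractor (one linear map + tanh)."""
    return np.tanh(W_d @ image.ravel())

def generator_decode(feat, W_g, shape):
    """Toy stand-in for G decoding D's features back to an image."""
    return (W_g @ feat).reshape(shape)

def glead_reconstruction_loss(image, W_d, W_g):
    """Generator-leading task: D must yield features that G can decode to
    reconstruct the input; this term is added to the adversarial objective."""
    feat = discriminator_features(image, W_d)
    recon = generator_decode(feat, W_g, image.shape)
    return float(np.mean((image - recon) ** 2))

img = rng.normal(size=(4, 4))
W_d = rng.normal(size=(8, 16)) * 0.1
W_g = rng.normal(size=(16, 8)) * 0.1
loss = glead_reconstruction_loss(img, W_d, W_g)
```

Minimizing this term with respect to D's parameters is what, per the abstract, urges D to align with G's view instead of learning freely.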

GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis
Tao, Ming and Bao, Bing-Kun and Tang, Hao and Xu, Changsheng



Research question: How to synthesize high-fidelity complex images from text.
Motivation: Although large pretrained autoregressive and diffusion models have made notable progress in image synthesis, they require tremendous training data and parameters, their multi-step generation design slows synthesis heavily, and their synthesized visual features are hard to control.
Method: GALIP (Generative Adversarial CLIPs) leverages the powerful pretrained CLIP model in both the discriminator and the generator: a CLIP-based discriminator, and a CLIP-empowered generator that induces visual concepts from CLIP through bridge features and prompts.
Results: With only about 3% of the training data and 6% of the learnable parameters, GALIP achieves results comparable to large pretrained autoregressive and diffusion models, synthesizes images about 120 times faster, and inherits the smooth latent space of GANs. Experiments demonstrate its excellent performance.

Synthesizing high-fidelity complex images from text is challenging. Based on large pretraining, the autoregressive and diffusion models can synthesize photo-realistic images. Although these large models have shown notable progress, there remain three flaws. 1) These models require tremendous training data and parameters to achieve good performance. 2) The multi-step generation design slows the image synthesis process heavily. 3) The synthesized visual features are challenging to control and require delicately designed prompts. To enable high-quality, efficient, fast, and controllable text-to-image synthesis, we propose Generative Adversarial CLIPs, namely GALIP. GALIP leverages the powerful pretrained CLIP model both in the discriminator and generator. Specifically, we propose a CLIP-based discriminator. The complex scene understanding ability of CLIP enables the discriminator to accurately assess the image quality. Furthermore, we propose a CLIP-empowered generator that induces the visual concepts from CLIP through bridge features and prompts. The CLIP-integrated generator and discriminator boost training efficiency, and as a result, our model only requires about 3% training data and 6% learnable parameters, achieving comparable results to large pretrained autoregressive and diffusion models. Moreover, our model achieves 120 times faster synthesis speed and inherits the smooth latent space from GAN. The extensive experimental results demonstrate the excellent performance of our GALIP. Code is available at https://github.com/tobran/GALIP.

3DAvatarGAN: Bridging Domains for Personalized Editable Avatars
Abdal, Rameen and Lee, Hsin-Ying and Zhu, Peihao and Chai, Menglei and Siarohin, Aliaksandr and Wonka, Peter and Tulyakov, Sergey



Research question: Can a 3D-GAN be trained on artistic data while maintaining multi-view consistency and texture quality?
Motivation: Existing 3D-GANs are trained on large-scale datasets with consistent structure; training on stylized, artistic data, with often unknown and highly variable geometry and camera information, has not yet been shown possible.
Method: An adaptation framework in which the source domain is a pre-trained 3D-GAN and the target domain is a 2D-GAN trained on artistic datasets; knowledge from the 2D generator is distilled into the source 3D generator. To this end: (i) an optimization-based method aligns the camera parameter distributions across domains; (ii) regularizations enable learning high-quality texture while avoiding degenerate geometric solutions such as flat shapes; (iii) a deformation-based technique models the exaggerated geometry of artistic domains, enabling personalized geometric editing; and (iv) a novel 3D-GAN inversion method links the latent spaces of the source and target domains.
Results: These contributions, for the first time, allow the generation, editing, and animation of personalized artistic 3D avatars on artistic datasets.

Modern 3D-GANs synthesize geometry and texture by training on large-scale datasets with a consistent structure. Training such models on stylized, artistic data, with often unknown, highly variable geometry, and camera information has not yet been shown possible. Can we train a 3D GAN on such artistic data, while maintaining multi-view consistency and texture quality? To this end, we propose an adaptation framework, where the source domain is a pre-trained 3D-GAN, while the target domain is a 2D-GAN trained on artistic datasets. We, then, distill the knowledge from a 2D generator to the source 3D generator. To do that, we first propose an optimization-based method to align the distributions of camera parameters across domains. Second, we propose regularizations necessary to learn high-quality texture, while avoiding degenerate geometric solutions, such as flat shapes. Third, we show a deformation-based technique for modeling exaggerated geometry of artistic domains, enabling---as a byproduct---personalized geometric editing. Finally, we propose a novel inversion method for 3D-GANs linking the latent spaces of the source and the target domains. Our contributions---for the first time---allow for the generation, editing, and animation of personalized artistic 3D avatars on artistic datasets.

Person Image Synthesis via Denoising Diffusion Model
Bhunia, Ankan Kumar and Khan, Salman and Cholakkal, Hisham and Anwer, Rao Muhammad and Laaksonen, Jorma and Shah, Mubarak and Khan, Fahad Shahbaz



Research question: How to synthesize person images in arbitrary poses.
Motivation: Existing approaches either fail to maintain realistic textures or require dense correspondences that struggle with complex deformations and severe occlusions.
Method: Apply denoising diffusion models to high-fidelity person image synthesis, disintegrating the complex transfer problem into a series of simpler forward-backward denoising steps.
Results: On two large-scale benchmarks and in a user study, the method demonstrates photorealism under challenging scenarios, and its generated images help in downstream tasks.

The pose-guided person image generation task requires synthesizing photorealistic images of humans in arbitrary poses. The existing approaches use generative adversarial networks that do not necessarily maintain realistic textures or need dense correspondences that struggle to handle complex deformations and severe occlusions. In this work, we show how denoising diffusion models can be applied for high-fidelity person image synthesis with strong sample diversity and enhanced mode coverage of the learnt data distribution. Our proposed Person Image Diffusion Model (PIDM) disintegrates the complex transfer problem into a series of simpler forward-backward denoising steps. This helps in learning plausible source-to-target transformation trajectories that result in faithful textures and undistorted appearance details. We introduce a 'texture diffusion module' based on cross-attention to accurately model the correspondences between appearance and pose information available in source and target images. Further, we propose 'disentangled classifier-free guidance' to ensure close resemblance between the conditional inputs and the synthesized output in terms of both pose and appearance information. Our extensive results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios. We also show how our generated images can help in downstream tasks.
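The "disentangled classifier-free guidance" can be sketched as separate guidance terms for the pose and appearance conditions. The additive composition, the weights, and the function name `disentangled_cfg` are illustrative assumptions, not the paper's exact formula:

```python
import numpy as np

def disentangled_cfg(eps_uncond, eps_pose, eps_app, w_pose=2.0, w_app=2.0):
    """Combine an unconditional noise prediction with pose- and
    appearance-conditioned predictions using separate guidance weights,
    so each condition's influence on the sample can be tuned independently."""
    return (eps_uncond
            + w_pose * (eps_pose - eps_uncond)
            + w_app * (eps_app - eps_uncond))

u = np.zeros(4)         # unconditional prediction (stand-in)
p = np.ones(4)          # pose-conditioned prediction (stand-in)
a = np.full(4, 2.0)     # appearance-conditioned prediction (stand-in)
guided = disentangled_cfg(u, p, a, w_pose=1.0, w_app=1.0)
```

Setting either weight to zero removes that condition's pull entirely, which is the sense in which the guidance is disentangled.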

Implicit Neural Head Synthesis via Controllable Local Deformation Fields
Chen, Chuhan and O'



Research question: How to reconstruct high-quality, controllable 3D head avatars from 2D videos for virtual human applications in movies, games, and telepresence.
Motivation: Existing methods cannot finely control facial parts or extrapolate asymmetric expressions from monocular videos, and most condition only on 3DMM parameters, which lack locality.
Method: Building on part-based implicit shape models, the global deformation field is decomposed into local ones. The novel formulation models multiple implicit deformation fields with local semantic rig-like control via 3DMM-based parameters and representative facial landmarks; a local control loss and an attention mask mechanism promote the sparsity of each learned deformation field.
Results: The method renders sharper, locally controllable nonlinear deformations than previous implicit monocular approaches, especially for the mouth interior, asymmetric expressions, and facial details.

High-quality reconstruction of controllable 3D head avatars from 2D videos is highly desirable for virtual human applications in movies, games, and telepresence. Neural implicit fields provide a powerful representation to model 3D head avatars with personalized shape, expressions, and facial parts, e.g., hair and mouth interior, that go beyond the linear 3D morphable model (3DMM). However, existing methods do not model faces with fine-scale facial features, or local control of facial parts that extrapolate asymmetric expressions from monocular videos. Further, most condition only on 3DMM parameters with poor(er) locality, and resolve local features with a global neural field. We build on part-based implicit shape models that decompose a global deformation field into local ones. Our novel formulation models multiple implicit deformation fields with local semantic rig-like control via 3DMM-based parameters, and representative facial landmarks. Further, we propose a local control loss and attention mask mechanism that promote sparsity of each learned deformation field. Our formulation renders sharper locally controllable nonlinear deformations than previous implicit monocular approaches, especially mouth interior, asymmetric expressions, and facial details. Project page: https://imaging.cs.cmu.edu/local_deformation_fields/

GANHead: Towards Generative Animatable Neural Head Avatars
Wu, Sijing and Yan, Yichao and Li, Yunhao and Cheng, Yuhao and Zhu, Wenhan and Gao, Ke and Li, Xiaobo and Zhai, Guangtao



Research question: How to efficiently generate complete, realistic, and animatable head avatars.
Motivation: Existing methods struggle to satisfy all three requirements (complete, realistic, animatable) at once.
Method: GANHead, a novel generative head model that combines the fine-grained control of explicit expression parameters with the realistic rendering of implicit representations. GANHead represents coarse geometry, fine details, and texture via three networks in canonical space to generate complete and realistic head avatars. For flexible animation, a deformation field is defined by standard linear blend skinning (LBS) with learned continuous pose and expression bases and LBS weights, so avatars can be animated directly by FLAME parameters and generalize well to unseen poses and expressions.
Results: Compared to state-of-the-art methods, GANHead achieves superior performance on head avatar generation and raw scan fitting.

To bring digital avatars into people's lives, it is highly demanded to efficiently generate complete, realistic, and animatable head avatars. This task is challenging, and it is difficult for existing methods to satisfy all the requirements at once. To achieve these goals, we propose GANHead (Generative Animatable Neural Head Avatar), a novel generative head model that takes advantage of both the fine-grained control over the explicit expression parameters and the realistic rendering results of implicit representations. Specifically, GANHead represents coarse geometry, fine-grained details and texture via three networks in canonical space to obtain the ability to generate complete and realistic head avatars. To achieve flexible animation, we define the deformation field by standard linear blend skinning (LBS), with the learned continuous pose and expression bases and LBS weights. This allows the avatars to be directly animated by FLAME parameters and generalize well to unseen poses and expressions. Compared to state-of-the-art (SOTA) methods, GANHead achieves superior performance on head avatar generation and raw scan fitting.

NeuralField-LDM: Scene Generation With Hierarchical Latent Diffusion Models
Kim, Seung Wook and Brown, Bradley and Yin, Kangxue and Kreis, Karsten and Schwarz, Katja and Li, Daiqing and Rombach, Robin and Torralba, Antonio and Fidler, Sanja



Research question: How to automatically generate high-quality real-world 3D scenes for applications such as virtual reality and robotics simulation.
Motivation: Latent diffusion models have been successfully used for efficient, high-quality 2D content creation; NeuralField-LDM extends this capability to synthesizing complex 3D environments.
Method: First, a scene auto-encoder is trained to express sets of image-pose pairs as a neural field, represented as density and feature voxel grids that can be projected to produce novel views of the scene. To compress this representation further, a latent autoencoder maps the voxel grids to a set of latent representations, and a hierarchical diffusion model is then fit to the latents to complete the scene generation pipeline.
Results: The method substantially improves over existing state-of-the-art scene generation models and supports a variety of 3D content creation applications, including conditional scene generation, scene inpainting, and scene style manipulation.

Automatically generating high-quality real world 3D scenes is of enormous interest for applications such as virtual reality and robotics simulation. Towards this goal, we introduce NeuralField-LDM, a generative model capable of synthesizing complex 3D environments. We leverage Latent Diffusion Models that have been successfully utilized for efficient high-quality 2D content creation. We first train a scene auto-encoder to express a set of image and pose pairs as a neural field, represented as density and feature voxel grids that can be projected to produce novel views of the scene. To further compress this representation, we train a latent-autoencoder that maps the voxel grids to a set of latent representations. A hierarchical diffusion model is then fit to the latents to complete the scene generation pipeline. We achieve a substantial improvement over existing state-of-the-art scene generation models. Additionally, we show how NeuralField-LDM can be used for a variety of 3D content creation applications, including conditional scene generation, scene inpainting and scene style manipulation.

NUWA-LIP: Language-Guided Image Inpainting With Defect-Free VQGAN
Ni, Minheng and Li, Xiaoming and Zuo, Wangmeng



Research question: How to inpaint images under text guidance while keeping non-defective regions unchanged.
Motivation: Directly encoding the defective image adversely affects non-defective regions, producing distorted structures on non-defective parts.
Method: NUWA-LIP, comprising a defect-free VQGAN (DF-VQGAN) and a multi-perspective sequence-to-sequence module (MP-S2S). DF-VQGAN introduces relative estimation to carefully control receptive spreading and symmetrical connections to keep structural details unchanged; MP-S2S harmoniously embeds text guidance into the locally defective regions by aggregating complementary perspectives from low-level pixels, high-level tokens, and the text description.
Results: Experiments show DF-VQGAN effectively aids inpainting while avoiding unintended changes in non-defective regions; the method outperforms the state of the art on three open-domain benchmarks.

Language-guided image inpainting aims to fill the defective regions of an image under the guidance of text while keeping the non-defective regions unchanged. However, directly encoding the defective images is prone to have an adverse effect on the non-defective regions, giving rise to distorted structures on non-defective parts. To better adapt the text guidance to the inpainting task, this paper proposes NUWA-LIP, which involves defect-free VQGAN (DF-VQGAN) and a multi-perspective sequence-to-sequence module (MP-S2S). To be specific, DF-VQGAN introduces relative estimation to carefully control the receptive spreading, as well as symmetrical connections to protect structure details unchanged. For harmoniously embedding text guidance into the locally defective regions, MP-S2S is employed by aggregating the complementary perspectives from low-level pixels, high-level tokens as well as the text description. Experiments show that our DF-VQGAN effectively aids the inpainting process while avoiding unexpected changes in non-defective regions. Results on three open-domain benchmarks demonstrate the superior performance of our method against state-of-the-arts. Our code, datasets, and model will be made publicly available.

MARLIN: Masked Autoencoder for Facial Video Representation LearnINg
Cai, Zhixi and Ghosh, Shreya and Stefanov, Kalin and Dhall, Abhinav and Cai, Jianfei and Rezatofighi, Hamid and Haffari, Reza and Hayat, Munawar



Research question: This paper proposes a self-supervised approach to learn universal facial representations from videos that can transfer across a variety of facial analysis tasks.
Motivation: Abundantly available, non-annotated web-crawled facial videos offer a source for learning highly robust and generic facial embeddings without manual labels.
Method: A facial video masked autoencoder framework, named MARLIN, that learns highly robust and generic facial embeddings from large amounts of non-annotated web-crawled facial videos.
Results: Experiments show MARLIN performs consistently well across downstream tasks, including facial attribute recognition (FAR), facial expression recognition (FER), deepfake detection (DFD), and lip synchronization (LS), and also performs well in low-data regimes.

This paper proposes a self-supervised approach to learn universal facial representations from videos, that can transfer across a variety of facial analysis tasks such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder, that learns highly robust and generic facial embeddings from abundantly available non-annotated web crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from the densely masked facial regions which mainly include eyes, nose, mouth, lips, and skin to capture local and global aspects that in turn help in encoding generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate MARLIN to be an excellent facial video encoder as well as feature extractor, that performs consistently well across a variety of downstream tasks including FAR (1.13% gain over supervised benchmark), FER (2.64% gain over unsupervised benchmark), DFD (1.86% gain over unsupervised benchmark), LS (29.36% gain for Frechet Inception Distance), and even in low data regime. Our code and models are available at https://github.com/ControlNet/MARLIN.

3D-Aware Face Swapping
Li, Yixuan and Ma, Chao and Yan, Yichao and Zhu, Wenhan and Yang, Xiaokang



Research question: Address the undesirable artifacts that existing face swapping methods produce on the swapped face under large pose variations.
Motivation: Existing methods directly learn to swap 2D face images without considering the geometric information of human faces, so undesirable artifacts always appear when there is a large pose difference between the source and target faces.
Method: A novel 3D-aware face swapping method that generates high-fidelity, multi-view-consistent swapped faces from single-view source and target images. It exploits the strong geometry and texture priors of 3D human faces by projecting the 2D faces into the latent space of a 3D generative model; by disentangling identity and attribute features in that latent space, faces are swapped in a 3D-aware manner that is robust to pose variations while transferring fine-grained facial details.
Results: Extensive experiments demonstrate the superiority of the 3D-aware face swapping framework in visual quality, identity similarity, and multi-view consistency.

Face swapping is an important research topic in computer vision with wide applications in entertainment and privacy protection. Existing methods directly learn to swap 2D facial images, taking no account of the geometric information of human faces. In the presence of large pose variance between the source and the target faces, there always exist undesirable artifacts on the swapped face. In this paper, we present a novel 3D-aware face swapping method that generates high-fidelity and multi-view-consistent swapped faces from single-view source and target images. To achieve this, we take advantage of the strong geometry and texture prior of 3D human faces, where the 2D faces are projected into the latent space of a 3D generative model. By disentangling the identity and attribute features in the latent space, we succeed in swapping faces in a 3D-aware manner, being robust to pose variations while transferring fine-grained facial details. Extensive experiments demonstrate the superiority of our 3D-aware face swapping framework in terms of visual quality, identity similarity, and multi-view consistency. Code is available at https://lyx0208.github.io/3dSwap.

RODIN: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion
Wang, Tengfei and Zhang, Bo and Zhang, Ting and Gu, Shuyang and Bao, Jianmin and Baltrusaitis, Tadas and Shen, Jingjing and Chen, Dong and Wen, Fang and Chen, Qifeng and Guo, Baining



Research question: How to efficiently generate high-quality 3D digital avatars.
Motivation: For existing 3D diffusion models, memory and processing costs are prohibitive for producing high-quality results with rich details.
Method: The roll-out diffusion network (RODIN) represents a 3D NeRF model as multiple 2D feature maps and rolls them out onto a single 2D feature plane, within which 3D-aware diffusion is performed.
Results: RODIN greatly improves computational efficiency while preserving the integrity of 3D diffusion, using 3D-aware convolution that attends to projected features in the 2D plane according to their original 3D relationships. Latent conditioning orchestrates feature generation for global coherence, yielding highly realistic avatars and enabling semantic editing from text prompts; hierarchical synthesis further enhances details.

This paper presents a 3D diffusion model that automatically generates 3D digital avatars represented as neural radiance fields (NeRFs). A significant challenge for 3D diffusion is that the memory and processing costs are prohibitive for producing high-quality results with rich details. To tackle this problem, we propose the roll-out diffusion network (RODIN), which takes a 3D NeRF model represented as multiple 2D feature maps and rolls them out onto a single 2D feature plane within which we perform 3D-aware diffusion. The RODIN model brings much-needed computational efficiency while preserving the integrity of 3D diffusion by using 3D-aware convolution that attends to projected features in the 2D plane according to their original relationships in 3D. We also use latent conditioning to orchestrate the feature generation with global coherence, leading to high-fidelity avatars and enabling semantic editing based on text prompts. Finally, we use hierarchical synthesis to further enhance details. The 3D avatars generated by our model compare favorably with those produced by existing techniques. We can generate highly detailed avatars with realistic hairstyles and facial hair. We also demonstrate 3D avatar generation from image or text, as well as text-guided editability.

High-Fidelity Guided Image Synthesis With Latent Diffusion Models
Singh, Jaskirat and Gould, Stephen and Zheng, Liang



Research question: Existing user-scribble-based controllable image synthesis methods suffer from an intrinsic domain shift problem: the generated outputs often lack details and resemble simplistic representations of the target domain.
Motivation: To address this problem, the paper proposes a novel guided image synthesis framework that models the output image as the solution of a constrained optimization problem.
Method: While computing an exact solution to the optimization is infeasible, an approximation can be achieved with just a single pass of the reverse diffusion process. Moreover, by defining a cross-attention-based correspondence between input text tokens and the user stroke painting, the user can control the semantics of different painted regions without any conditional training or fine-tuning.
Results: A human user study shows the approach outperforms the previous state of the art by over 85.32% on overall user satisfaction scores.

Controllable image synthesis with user scribbles has gained huge public interest with the recent advent of text-conditioned latent diffusion models. The user scribbles control the color composition while the text prompt provides control over the overall image semantics. However, we find that prior works suffer from an intrinsic domain shift problem wherein the generated outputs often lack details and resemble simplistic representations of the target domain. In this paper, we propose a novel guided image synthesis framework, which addresses this problem by modeling the output image as the solution of a constrained optimization problem. We show that while computing an exact solution to the optimization is infeasible, an approximation of the same can be achieved while just requiring a single pass of the reverse diffusion process. Additionally, we show that by simply defining a cross-attention based correspondence between the input text tokens and the user stroke-painting, the user is also able to control the semantics of different painted regions without requiring any conditional training or finetuning. Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores. Project page for our paper is available at https://1jsingh.github.io/gradop.

CodeTalker: Speech-Driven 3D Facial Animation With Discrete Motion Prior
Xing, Jinbo and Xia, Menghan and Zhang, Yuechen and Cun, Xiaodong and Wang, Jue and Wong, Tien-Tsin



Research question: How to achieve more realistic and vivid speech-driven 3D facial animation.
Motivation: Due to the highly ill-posed nature of the task and the scarcity of audio-visual data, existing work still falls short of realistic and vivid facial animation.
Method: Cast speech-driven facial animation as a code query task in the finite proxy space of a learned codebook, which effectively promotes the vividness of generated motions by reducing cross-modal mapping uncertainty.
Results: Experiments show the method outperforms current state-of-the-art methods both qualitatively and quantitatively, and a user study further confirms its superior perceptual quality.

Speech-driven 3D facial animation has been widely studied, yet there is still a gap to achieving realism and vividness due to the highly ill-posed nature and scarcity of audio-visual data. Existing works typically formulate the cross-modal mapping into a regression task, which suffers from the regression-to-mean problem leading to over-smoothed facial motions. In this paper, we propose to cast speech-driven facial animation as a code query task in a finite proxy space of the learned codebook, which effectively promotes the vividness of the generated motions by reducing the cross-modal mapping uncertainty. The codebook is learned by self-reconstruction over real facial motions and thus embedded with realistic facial motion priors. Over the discrete motion space, a temporal autoregressive model is employed to sequentially synthesize facial motions from the input speech signal, which guarantees lip-sync as well as plausible facial expressions. We demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. Also, a user study further justifies our superiority in perceptual quality.
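At its core, the code-query formulation replaces direct regression with a lookup in the learned motion codebook, restricting outputs to the finite proxy space of realistic motions. A minimal sketch, assuming a toy 2-D codebook (the real codebook is learned by self-reconstruction over facial motions):

```python
import numpy as np

def code_query(motion_feat, codebook):
    """Quantize a predicted motion feature to its nearest codebook entry.

    codebook: (K, D) array of learned facial-motion codes.
    Returns the code index and the quantized vector, so the autoregressive
    model can synthesize motions as sequences of discrete codes.
    """
    dists = np.sum((codebook - motion_feat) ** 2, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [2.0, 0.0]])
idx, quantized = code_query(np.array([0.9, 1.2]), codebook)
```

Because every output is snapped to a realistic motion prior stored in the codebook, the regression-to-mean over-smoothing the abstract describes is avoided by construction.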

Semi-Supervised Parametric Real-World Image Harmonization
Wang, Ke and Gharbi, Michaël and Zhang, He and Xia, Zhihao and Shechtman, Eli



Research question: Existing learning-based image harmonization techniques are typically trained to undo a single synthetic global transformation and cannot handle the complex local variations found in real composites.
Motivation: To address this problem, a new semi-supervised training strategy is proposed that learns complex local appearance harmonization from unpaired real composites.
Method: The model is fully parametric: it uses RGB curves to correct global color and tone, and a shading map to model local variations.
Results: The approach outperforms previous work on established benchmarks and on real composites, as shown in a user study, and processes high-resolution images interactively.

Learning-based image harmonization techniques are usually trained to undo synthetic global transformations, applied to a masked foreground in a single ground truth photo. This simulated data does not model many important appearance mismatches (illumination, object boundaries, etc.) between foreground and background in real composites, leading to models that do not generalize well and cannot model complex local changes. We propose a new semi-supervised training strategy that addresses this problem and lets us learn complex local appearance harmonization from unpaired real composites, where foreground and background come from different images. Our model is fully parametric. It uses RGB curves to correct the global colors and tone and a shading map to model local variations. Our approach outperforms previous work on established benchmarks and real composites, as shown in a user study, and processes high-resolution images interactively. The code and project page are available at https://kewang0622.github.io/sprih/.
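The fully parametric formulation (global per-channel RGB curves plus a local shading map) is simple enough to sketch directly; the uniform control-point parameterization of the curves below is an assumption for illustration, not the paper's exact parameterization:

```python
import numpy as np

def harmonize(foreground, curve_points, shading):
    """Apply a per-channel tone curve (global color/tone correction)
    followed by a multiplicative shading map (local variation).

    foreground:   (H, W, 3) composited foreground in [0, 1].
    curve_points: (3, K) y-values of K curve control points per channel,
                  sampled at uniform x positions in [0, 1].
    shading:      (H, W) local gain map.
    """
    xs = np.linspace(0.0, 1.0, curve_points.shape[1])
    out = np.stack(
        [np.interp(foreground[..., c], xs, curve_points[c]) for c in range(3)],
        axis=-1,
    )
    return np.clip(out * shading[..., None], 0.0, 1.0)

fg = np.full((2, 2, 3), 0.5)
identity_curves = np.tile(np.linspace(0.0, 1.0, 5), (3, 1))  # no-op curves
flat_shading = np.ones((2, 2))
result = harmonize(fg, identity_curves, flat_shading)
```

Because the edit is expressed as a small set of curve points and a shading map rather than a full pixel-to-pixel network output, it can be predicted at low resolution and applied to the full-resolution image, which is what makes interactive high-resolution processing feasible.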

VecFontSDF: Learning To Reconstruct and Synthesize High-Quality Vector Fonts via Signed Distance Functions
Xia, Zeqing and Xiong, Bojun and Lian, Zhouhui



Research question: Develop an algorithm for automatically synthesizing vector fonts, so as to significantly simplify the font design process.
Motivation: Existing methods mainly concentrate on raster image generation; only a few can directly synthesize vector fonts.
Method: VecFontSDF, an end-to-end trainable method that reconstructs and synthesizes high-quality vector fonts using signed distance functions (SDFs). Based on the proposed SDF-based implicit shape representation, VecFontSDF learns to model each glyph as shape primitives enclosed by several parabolic curves, which can be precisely converted to the quadratic Bézier curves widely used in vector font products.
Results: Qualitative and quantitative experiments on a publicly available dataset show high-quality results on several tasks, including vector font reconstruction, interpolation, and few-shot vector font synthesis, markedly outperforming the state of the art.

Font design is of vital importance in the digital content design and modern printing industry. Developing algorithms capable of automatically synthesizing vector fonts can significantly facilitate the font design process. However, existing methods mainly concentrate on raster image generation, and only a few approaches can directly synthesize vector fonts. This paper proposes an end-to-end trainable method, VecFontSDF, to reconstruct and synthesize high-quality vector fonts using signed distance functions (SDFs). Specifically, based on the proposed SDF-based implicit shape representation, VecFontSDF learns to model each glyph as shape primitives enclosed by several parabolic curves, which can be precisely converted to quadratic Bezier curves that are widely used in vector font products. In this manner, most image generation methods can be easily extended to synthesize vector fonts. Qualitative and quantitative experiments conducted on a publicly-available dataset demonstrate that our method obtains high-quality results on several tasks, including vector font reconstruction, interpolation, and few-shot vector font synthesis, markedly outperforming the state of the art.

Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation
Huang, Mengqi and Mao, Zhendong and Wang, Quan and Zhang, Yongdong



Research question: Existing autoregressive models encode redundant region information during image reconstruction and generation, limiting model structure and efficiency.
Motivation: To remove this redundancy and improve the efficiency and quality of image generation.
Method: A novel two-stage framework consisting of Masked Quantization VAE (MQ-VAE) and Stackformer. MQ-VAE introduces an adaptive mask module to drop redundant region features before quantization and an adaptive de-mask module to recover the original grid feature map after quantization, so the original image can be faithfully reconstructed. Stackformer then learns to predict the next code together with its position in the feature map.
Results: Experiments on various image generation tasks demonstrate the efficiency and effectiveness of the method.

Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook. However, existing codebook learning simply models all local region information of images without distinguishing their different perceptual importance, which brings redundancy in the learned codebook that not only limits the next stage's autoregressive model's ability to model important structure but also results in high training cost and slow generation speed. In this study, we borrow the idea of importance perception from classical image coding theory and propose a novel two-stage framework, which consists of Masked Quantization VAE (MQ-VAE) and Stackformer, to relieve the model from modeling redundancy. Specifically, MQ-VAE incorporates an adaptive mask module for masking redundant region features before quantization and an adaptive de-mask module for recovering the original grid image feature map to faithfully reconstruct the original images after quantization. Then, Stackformer learns to predict the combination of the next code and its position in the feature map. Comprehensive experiments on various image generation validate our effectiveness and efficiency.
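A minimal sketch of the adaptive mask/de-mask idea, assuming importance scores are given directly (in MQ-VAE they come from a learned module) and using zero-filling as a stand-in for the learned recovery performed by the adaptive de-mask module:

```python
import numpy as np

def adaptive_mask(features, scores, keep_ratio=0.5):
    """Keep only the highest-scoring region features before quantization.

    features: (N, D) grid features flattened to N regions.
    scores:   (N,) importance scores (here supplied directly, not learned).
    """
    n_keep = max(1, int(features.shape[0] * keep_ratio))
    keep = np.argsort(scores)[::-1][:n_keep]   # indices of the important regions
    return features[keep], keep

def de_mask(quantized, keep, n_regions, dim):
    """Scatter quantized codes back onto the full grid; masked slots stay zero
    (a stand-in for the learned de-mask recovery)."""
    grid = np.zeros((n_regions, dim))
    grid[keep] = quantized
    return grid

rng = np.random.default_rng(1)
feats = rng.normal(size=(16, 4))
scores = rng.uniform(size=16)
kept, idx = adaptive_mask(feats, scores, keep_ratio=0.25)
full = de_mask(kept, idx, 16, 4)
```

Because only the kept regions enter the codebook stage, the autoregressive model at stage two also has to predict positions, which is what Stackformer's next-code-plus-position prediction addresses.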

Identity-Preserving Talking Face Generation With Landmark and Appearance Priors
Zhong, Weizhi and Fang, Chaowei and Cai, Yinqi and Wei, Pengxu and Zhao, Gangming and Lin, Liang and Li, Guanbin



Research question: How to generate realistic, lip-synced, and identity-preserving talking face videos from audio.
Motivation: Existing person-specific methods require footage of the target speaker for training or fine-tuning, while person-generic methods have difficulty generating realistic and lip-synced videos.
Method: A two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering. First, a Transformer-based landmark generator infers lip and jaw landmarks from audio. Then a video rendering model translates the generated landmarks into face images; during this stage, appearance information extracted from the lower-half-occluded target face and static reference images helps generate realistic, identity-preserving visual content.
Results: Extensive experiments demonstrate that the method produces more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.

Generating talking face videos from audio attracts lots of research interest. A few person-specific methods can generate vivid videos but require the target speaker's videos for training or fine-tuning. Existing person-generic methods have difficulty in generating realistic and lip-synced videos while preserving identity information. To tackle this problem, we propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures. First, we devise a novel Transformer-based landmark generator to infer lip and jaw landmarks from the audio. Prior landmark characteristics of the speaker's face are employed to make the generated landmarks coincide with the facial outline of the speaker. Then, a video rendering model is built to translate the generated landmarks into face images. During this stage, prior appearance information is extracted from the lower-half occluded target face and static reference images, which helps generate realistic and identity-preserving visual content. For effectively exploring the prior information of static reference images, we align static reference images with the target face's pose and expression based on motion fields. Moreover, auditory features are reused to guarantee that the generated face images are well synchronized with the audio. Extensive experiments demonstrate that our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.

MetaPortrait: Identity-Preserving Talking Head Generation With Fast Personalized Adaptation
Zhang, Bowen and Qi, Chenyang and Zhang, Pan and Zhang, Bo and Wu, Hsiang-Tao and Chen, Dong and Chen, Qifeng and Wang, Yong and Wen, Fang



Research question: Propose an identity-preserving talking head generation framework that advances previous methods.
Motivation: As opposed to interpolating from sparse flow, dense landmarks are argued to be crucial for accurate geometry-aware flow fields; inspired by face-swapping methods, the source identity is adaptively fused during synthesis so the network better preserves the key characteristics of the portrait.
Method: Dense landmarks drive geometry-aware flow fields, and the source identity is adaptively fused during synthesis. Because personalized fine-tuning is computationally demanding for standard users, a fast adaptation model based on meta-learning is proposed that can be adapted into a high-quality personalized model in as little as 30 seconds, and a spatial-temporal enhancement module improves fine details while ensuring temporal coherency.
Results: Extensive experiments demonstrate significant superiority over the state of the art in both one-shot and personalized settings.

In this work, we propose an ID-preserving talking head generation framework, which advances previous methods in two aspects. First, as opposed to interpolating from sparse flow, we claim that dense landmarks are crucial to achieving accurate geometry-aware flow fields. Second, inspired by face-swapping methods, we adaptively fuse the source identity during synthesis, so that the network better preserves the key characteristics of the image portrait. Although the proposed model surpasses prior generation fidelity on established benchmarks, personalized fine-tuning is still needed to further make the talking head generation qualified for real usage. However, this process is rather computationally demanding that is unaffordable to standard users. To alleviate this, we propose a fast adaptation model using a meta-learning approach. The learned model can be adapted to a high-quality personalized model as fast as 30 seconds. Last but not least, a spatial-temporal enhancement module is proposed to improve the fine details while ensuring temporal coherency. Extensive experiments prove the significant superiority of our approach over the state of the arts in both one-shot and personalized settings.

CelebV-Text: A Large-Scale Facial Text-Video Dataset
Yu, Jianhui and Zhu, Hao and Jiang, Liming and Loy, Chen Change and Cai, Weidong and Wu, Wayne



Research question: This paper addresses facial text-driven video generation, which remains challenging due to the lack of a suitable dataset containing high-quality videos and highly relevant texts.
Motivation: Text-driven generation models are flourishing in video generation and editing, but face-centric text-to-video generation remains a challenge.
Method: CelebV-Text, a large-scale, diverse, and high-quality dataset of facial text-video pairs, to facilitate research on facial text-to-video generation. CelebV-Text comprises 70,000 in-the-wild face video clips, each paired with 20 relevant texts produced by the proposed semi-automatic text generation strategy.
Results: Comprehensive statistical analysis of the videos, texts, and text-video relevance demonstrates the superiority of CelebV-Text over other datasets, and extensive self-evaluation further shows its effectiveness and potential. A benchmark with representative methods standardizes evaluation of the facial text-to-video generation task. All data and models are publicly available.

Text-driven generation models are flourishing in video generation and editing. However, face-centric text-to-video generation remains a challenge due to the lack of a suitable dataset containing high-quality videos and highly relevant texts. This paper presents CelebV-Text, a large-scale, diverse, and high-quality dataset of facial text-video pairs, to facilitate research on facial text-to-video generation tasks. CelebV-Text comprises 70,000 in-the-wild face video clips with diverse visual content, each paired with 20 texts generated using the proposed semi-automatic text generation strategy. The provided texts are of high quality, describing both static and dynamic attributes precisely. The superiority of CelebV-Text over other datasets is demonstrated via comprehensive statistical analysis of the videos, texts, and text-video relevance. The effectiveness and potential of CelebV-Text are further shown through extensive self-evaluation. A benchmark is constructed with representative methods to standardize the evaluation of the facial text-to-video generation task. All data and models are publicly available.

Diffusion-SDF: Text-To-Shape via Voxelized Diffusion
Li, Muheng and Duan, Yueqi and Zhou, Jie and Lu, Jiwen



Research question: How to generate novel 3D content according to specified conditions such as text.
Motivation: With rising industrial attention to 3D virtual modeling technology, generating new 3D content under specified conditions has become a hot issue.
Method: A new generative 3D modeling framework, Diffusion-SDF, for the challenging task of text-to-shape synthesis. The framework consists of an SDF autoencoder and a Voxelized Diffusion model that learn and generate voxelized signed distance field (SDF) representations of 3D shapes.
Results: Experimental results show that Diffusion-SDF generates higher-quality and more diversified 3D shapes than previous approaches, and that these shapes conform well to the given text descriptions.

With the rising industrial attention to 3D virtual modeling technology, generating novel 3D content based on specified conditions (e.g. text) has become a hot issue. In this paper, we propose a new generative 3D modeling framework called Diffusion-SDF for the challenging task of text-to-shape synthesis. Previous approaches lack flexibility in both 3D data representation and shape generation, thereby failing to generate highly diversified 3D shapes conforming to the given text descriptions. To address this, we propose a SDF autoencoder together with the Voxelized Diffusion model to learn and generate representations for voxelized signed distance fields (SDFs) of 3D shapes. Specifically, we design a novel UinU-Net architecture that implants a local-focused inner network inside the standard U-Net architecture, which enables better reconstruction of patch-independent SDF representations. We extend our approach to further text-to-shape tasks including text-conditioned shape completion and manipulation. Experimental results show that Diffusion-SDF generates both higher quality and more diversified 3D shapes that conform well to given text descriptions when compared to previous approaches. Code is available at: https://github.com/ttlmh/Diffusion-SDF.

Semantic-Conditional Diffusion Networks for Image Captioning
Luo, Jianjie and Li, Yehao and Pan, Yingwei and Yao, Ting and Feng, Jianlin and Chao, Hongyang and Mei, Tao



Research question: How to exploit diffusion models in image captioning to capture the dependency among discrete words while pursuing complex visual-language alignment.
Motivation: Existing Transformer-based encoder-decoder models for image captioning face challenges that call for new model designs.
Method: A new diffusion-model-based framework, Semantic-Conditional Diffusion Networks (SCD-Net). A cross-modal retrieval model finds sentences semantically relevant to the input image, and their rich semantics serve as a semantic prior that triggers the learning of a Diffusion Transformer, which produces the output sentence.
Results: Extensive experiments on the COCO dataset demonstrate the promising potential of diffusion models for the challenging image captioning task.

Recent advances on text-to-image generation have witnessed the rise of diffusion models which act as powerful generative models. Nevertheless, it is not trivial to exploit such latent variable models to capture the dependency among discrete words and meanwhile pursue complex visual-language alignment in image captioning. In this paper, we break the deeply rooted conventions in learning Transformer-based encoder-decoder, and propose a new diffusion model based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net). Technically, for each input image, we first search the semantically relevant sentences via cross-modal retrieval model to convey the comprehensive semantic information. The rich semantics are further regarded as semantic prior to trigger the learning of Diffusion Transformer, which produces the output sentence in a diffusion process. In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better visional-language alignment and linguistical coherence in a cascaded manner. Furthermore, to stabilize the diffusion process, a new self-critical sequence training strategy is designed to guide the learning of SCD-Net with the knowledge of a standard autoregressive Transformer model. Extensive experiments on COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task. Source code is available at https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/scdnet.

Unite and Conquer: Plug \& Play Multi-Modal Synthesis Using Diffusion Models
Nair, Nithin Gopalakrishnan and Bandara, Wele Gedara Chaminda and Patel, Vishal M.



Research question: How to generate photos satisfying multiple constraints, a problem of broad utility in the content creation industry.
Motivation: Current methods need paired data covering all modalities and their corresponding outputs, and introducing a new condition requires retraining with paired data across all modalities.
Method: A solution based on denoising diffusion probabilistic models (DDPMs). Because diffusion models have a flexible internal structure and each sampling step follows a Gaussian distribution, there exists a closed-form solution for generating an image under various constraints. The method can use a single diffusion model trained on multiple sub-tasks and improves the combined task through the proposed sampling strategy.
Results: Experiments on various standard multimodal tasks demonstrate the effectiveness of the approach.

Generating photos satisfying multiple constraints finds broad utility in the content creation industry. A key hurdle to accomplishing this task is the need for paired data consisting of all modalities (i.e., constraints) and their corresponding output. Moreover, existing methods need retraining using paired data across all modalities to introduce a new condition. This paper proposes a solution to this problem based on denoising diffusion probabilistic models (DDPMs). Our motivation for choosing diffusion models over other generative models comes from the flexible internal structure of diffusion models. Since each sampling step in the DDPM follows a Gaussian distribution, we show that there exists a closed-form solution for generating an image given various constraints. Our method can utilize a single diffusion model trained on multiple sub-tasks and improve the combined task through our proposed sampling strategy. We also introduce a novel reliability parameter that allows using different off-the-shelf diffusion models trained across various datasets during sampling time alone to guide it to the desired outcome satisfying multiple constraints. We perform experiments on various standard multimodal tasks to demonstrate the effectiveness of our approach. More details can be found at: https://nithin-gk.github.io/projectpages/Multidiff
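The idea of fusing several diffusion models at sampling time can be sketched with a classifier-free-guidance-style combination of noise predictions; the paper's closed-form solution and reliability parameter are more involved, so the weights and names here are illustrative assumptions:

```python
import numpy as np

def combined_noise(eps_uncond, eps_conds, weights):
    """Fuse the noise predictions of several conditional diffusion models.

    Each condition contributes its deviation from the unconditional prediction,
    scaled by a reliability-style weight (a CFG-flavored sketch, not the paper's
    exact closed form).
    """
    eps = eps_uncond.copy()
    for e, w in zip(eps_conds, weights):
        eps += w * (e - eps_uncond)
    return eps

eps_u = np.full((2, 2), 0.5)       # unconditional prediction
eps_a = np.full((2, 2), 1.0)       # e.g. a text-conditioned model
eps_b = np.full((2, 2), 0.0)       # e.g. a sketch-conditioned model
fused = combined_noise(eps_u, [eps_a, eps_b], weights=[1.0, 1.0])
```

Because only noise predictions are combined, the conditional models can be off-the-shelf networks trained on different datasets, which is what avoids retraining when a new condition is introduced.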

Magic3D: High-Resolution Text-to-3D Content Creation
Lin, Chen-Hsuan and Gao, Jun and Tang, Luming and Takikawa, Towaki and Zeng, Xiaohui and Huang, Xun and Kreis, Karsten and Fidler, Sanja and Liu, Ming-Yu and Lin, Tsung-Yi



Research question: How to optimize the Neural Radiance Field (NeRF) representation while addressing DreamFusion's extremely slow optimization and the low-quality 3D models caused by low-resolution image supervision.
Motivation: Current text-to-3D synthesis methods optimize the NeRF representation slowly, requiring long waits, and the resulting 3D models are of low quality because supervision uses low-resolution images.
Method: A two-stage coarse-to-fine optimization framework. First, a sparse 3D neural representation accelerates optimization while using a low-resolution diffusion prior. Then a textured mesh model, initialized from the coarse neural representation, is optimized with an efficient differentiable renderer interacting with high-resolution images. The method is dubbed Magic3D.
Results: Magic3D creates a 3D mesh model in 40 minutes, 2x faster than DreamFusion, at 8x higher resolution. In a user study, 61.7% of raters preferred Magic3D. Combined with image-conditioned generation capabilities, it gives users new ways to control 3D synthesis, opening avenues for various creative applications.

Recently, DreamFusion demonstrated the utility of a pretrained text-to-image diffusion model to optimize Neural Radiance Fields (NeRF), achieving remarkable text-to-3D synthesis results. However, the method has two inherent limitations: 1) optimization of the NeRF representation is extremely slow, 2) NeRF is supervised by images at a low resolution (64x64), thus leading to low-quality 3D models with a long wait time. In this paper, we address these limitations by utilizing a two-stage coarse-to-fine optimization framework. In the first stage, we use a sparse 3D neural representation to accelerate optimization while using a low-resolution diffusion prior. In the second stage, we use a textured mesh model initialized from the coarse neural representation, allowing us to perform optimization with a very efficient differentiable renderer interacting with high-resolution images. Our method, dubbed Magic3D, can create a 3D mesh model in 40 minutes, 2x faster than DreamFusion (reportedly taking 1.5 hours on average), while achieving 8x higher resolution. User studies show 61.7% raters to prefer our approach than DreamFusion. Together with the image-conditioned generation capabilities, we provide users with new ways to control 3D synthesis, opening up new avenues to various creative applications.

SINE: Semantic-Driven Image-Based NeRF Editing With Prior-Guided Editing Field
Bao, Chong and Zhang, Yinda and Yang, Bangbang and Fan, Tianxing and Yang, Zesong and Bao, Hujun and Zhang, Guofeng and Cui, Zhaopeng



Research question: Despite great success in 2D editing, similar capabilities in the 3D domain remain limited.
Motivation: This paper presents a novel semantic-driven NeRF editing approach that enables users to edit a neural radiance field with a single image and faithfully delivers edited novel views with high fidelity and multi-view consistency.
Method: A prior-guided editing field encodes fine-grained geometric and texture edits in 3D space, supported by a series of techniques that aid the editing process: cyclic constraints with a proxy mesh to facilitate geometric supervision, a color compositing mechanism to stabilize semantic-driven texture editing, and feature-cluster-based regularization to keep irrelevant content unchanged.
Results: Extensive experiments and editing examples demonstrate photo-realistic 3D editing from only a single edited image, pushing the bound of semantic-driven 3D editing in real-world scenes.

Despite the great success in 2D editing using user-friendly tools, such as Photoshop, semantic strokes, or even text prompts, similar capabilities in 3D areas are still limited, either relying on 3D modeling skills or allowing editing within only a few categories. In this paper, we present a novel semantic-driven NeRF editing approach, which enables users to edit a neural radiance field with a single image, and faithfully delivers edited novel views with high fidelity and multi-view consistency. To achieve this goal, we propose a prior-guided editing field to encode fine-grained geometric and texture editing in 3D space, and develop a series of techniques to aid the editing process, including cyclic constraints with a proxy mesh to facilitate geometric supervision, a color compositing mechanism to stabilize semantic-driven texture editing, and a feature-cluster-based regularization to preserve the irrelevant content unchanged. Extensive experiments and editing examples on both real-world and synthetic data demonstrate that our method achieves photo-realistic 3D editing using only a single edited image, pushing the bound of semantic-driven editing in 3D real-world scenes.

Fine-Grained Face Swapping via Regional GAN Inversion
Liu, Zhian and Li, Maomao and Zhang, Yong and Wang, Cairong and Zhang, Qi and Wang, Jue and Nie, Yongwei



Research question: How to achieve high-fidelity face swapping while preserving the desired subtle geometry and texture details.
Motivation: Rethink face swapping from the perspective of fine-grained face editing, i.e., editing for swapping (E4S), and propose a framework based on the explicit disentanglement of the shape and texture of facial components.
Method: A novel Regional GAN Inversion (RGI) method based on explicitly disentangled shape and texture, performing face swapping in the latent space of StyleGAN. A multi-scale mask-guided encoder projects the texture of each facial component into regional style codes, and a mask-guided injection module manipulates feature maps with the style codes.
Results: Extensive experiments and comparisons with current state-of-the-art methods demonstrate the superiority of the approach in preserving texture and shape details and in handling high-resolution images.

We present a novel paradigm for high-fidelity face swapping that faithfully preserves the desired subtle geometry and texture details. We rethink face swapping from the perspective of fine-grained face editing, i.e., editing for swapping (E4S), and propose a framework that is based on the explicit disentanglement of the shape and texture of facial components. Following the E4S principle, our framework enables both global and local swapping of facial features, as well as controlling the amount of partial swapping specified by the user. Furthermore, the E4S paradigm is inherently capable of handling facial occlusions by means of facial masks. At the core of our system lies a novel Regional GAN Inversion (RGI) method, which allows the explicit disentanglement of shape and texture. It also allows face swapping to be performed in the latent space of StyleGAN. Specifically, we design a multi-scale mask-guided encoder to project the texture of each facial component into regional style codes. We also design a mask-guided injection module to manipulate the feature maps with the style codes. Based on the disentanglement, face swapping is reformulated as a simplified problem of style and mask swapping. Extensive experiments and comparisons with current state-of-the-art methods demonstrate the superiority of our approach in preserving texture and shape details, as well as working with high resolution images. The project page is https://e4s2022.github.io

Where Is My Spot? Few-Shot Image Generation via Latent Subspace Optimization
Zheng, Chenxi and Liu, Bangzhen and Zhang, Huaidong and Xu, Xuemiao and He, Shengfeng



Research question: This paper addresses the difficulty image generation models face in producing diverse images of an unseen category from only a few examples.
Motivation: Current image generation models need massive training data to generate diverse images of unseen categories, which is impractical; this paper instead maps the sparse few-shot samples into a continuous latent space.
Method: The few-shot samples are projected into a continuous latent space that can potentially generate infinite unseen samples. Specifically, a centroid latent position is first located in a conditional StyleGAN such that the corresponding output image maximizes the similarity to the given samples. The latent subspace around the centroid is assumed to belong to the novel category, and two latent subspace optimization objectives are introduced. The first uses the few-shot samples as positive anchors of the novel class and adjusts the StyleGAN to produce the corresponding results under the new class label. The second governs the generation process from the other direction, altering the centroid and its surrounding latent subspace for a more precise generation of the novel class. These reciprocal objectives inject the novel class into the StyleGAN latent subspace, so new unseen samples can easily be produced by sampling from it.
Results: Experiments show superior few-shot generation performance over state-of-the-art methods, especially in terms of diversity and generation quality.

Image generation relies on massive training data that can hardly produce diverse images of an unseen category according to a few examples. In this paper, we address this dilemma by projecting sparse few-shot samples into a continuous latent space that can potentially generate infinite unseen samples. The rationale behind is that we aim to locate a centroid latent position in a conditional StyleGAN, where the corresponding output image on that centroid can maximize the similarity with the given samples. Although the given samples are unseen for the conditional StyleGAN, we assume the neighboring latent subspace around the centroid belongs to the novel category, and therefore introduce two latent subspace optimization objectives. In the first one we use few-shot samples as positive anchors of the novel class, and adjust the StyleGAN to produce the corresponding results with the new class label condition. The second objective is to govern the generation process from the other way around, by altering the centroid and its surrounding latent subspace for a more precise generation of the novel class. These reciprocal optimization objectives inject a novel class into the StyleGAN latent subspace, and therefore new unseen samples can be easily produced by sampling images from it. Extensive experiments demonstrate superior few-shot generation performances compared with state-of-the-art methods, especially in terms of diversity and generation quality. Code is available at https://github.com/chansey0529/LSO.
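The centroid-and-subspace idea can be sketched numerically: locate a centroid latent that minimizes the distance to the few-shot latents, then sample the novel class from a small neighborhood around it. A toy NumPy illustration under those assumptions (the actual objectives operate through the conditional StyleGAN, not on raw latents):

```python
import numpy as np

def optimize_centroid(latents, steps=100, lr=0.1):
    """Toy stand-in for locating the centroid latent: gradient descent on the
    mean squared distance to the few-shot latents (whose minimizer is the mean)."""
    c = np.zeros_like(latents[0])
    for _ in range(steps):
        c -= lr * 2 * (c - latents.mean(axis=0))   # gradient of mean squared distance
    return c

def sample_novel(centroid, n, sigma=0.1, seed=0):
    """Draw unseen samples from the latent subspace around the centroid."""
    rng = np.random.default_rng(seed)
    return centroid + sigma * rng.normal(size=(n, centroid.shape[0]))

few_shot = np.array([[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]])   # hypothetical latents
c = optimize_centroid(few_shot)
z = sample_novel(c, n=4)
```

Sampling a neighborhood rather than reusing the few-shot points directly is what turns a handful of examples into a source of diverse novel-class images.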

Discrete Point-Wise Attack Is Not Enough: Generalized Manifold Adversarial Attack for Face Recognition
Li, Qian and Hu, Yuxiao and Liu, Ye and Zhang, Dongxiao and Jin, Xin and Chen, Yuntian



Research question: How to improve the effectiveness and generalization of adversarial attacks against face recognition models.
Motivation: Existing adversarial attacks on face recognition perform poorly against unknown identity states and can easily be defended against.
Method: A Generalized Manifold Adversarial Attack (GMAA) that expands the targets from one to many and the attack from discrete points to a manifold, enlarging the attack range and improving the attack effect. A dual supervision with local and global constraints is further designed to improve the visual quality of the generated adversarial examples.
Results: Experiments demonstrate that GMAA delivers a semantically continuous adversarial space with higher generalization ability and visual quality.

Classical adversarial attacks for Face Recognition (FR) models typically generate discrete examples for target identity with a single state image. However, such paradigm of point-wise attack exhibits poor generalization against numerous unknown states of identity and can be easily defended. In this paper, by rethinking the inherent relationship between the face of target identity and its variants, we introduce a new pipeline of Generalized Manifold Adversarial Attack (GMAA) to achieve a better attack performance by expanding the attack range. Specifically, this expansion lies on two aspects -- GMAA not only expands the target to be attacked from one to many to encourage a good generalization ability for the generated adversarial examples, but it also expands the latter from discrete points to manifold by leveraging the domain knowledge that face expression change can be continuous, which enhances the attack effect as a data augmentation mechanism did. Moreover, we further design a dual supervision with local and global constraints as a minor contribution to improve the visual quality of the generated adversarial examples. We demonstrate the effectiveness of our method based on extensive experiments, and reveal that GMAA promises a semantic continuous adversarial space with a higher generalization ability and visual quality.

Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation
Wang, Haochen and Du, Xiaodan and Li, Jiahao and Yeh, Raymond A. and Shakhnarovich, Greg



Research question: How to repurpose pretrained 2D models for 3D data generation.
Motivation: By applying a diffusion model together with the chain rule, 2D scores can be aggregated into a 3D score, enabling 3D generation.
Method: Apply the gradients of a diffusion model on a differentiable renderer (instantiated as a voxel radiance field) and back-propagate them through its Jacobian.
Results: The technical challenge of distribution mismatch that arises in this setting is resolved, and experiments on several off-the-shelf diffusion image generative models yield good results.

A diffusion model learns to predict a vector field of gradients. We propose to apply chain rule on the learned gradients, and back-propagate the score of a diffusion model through the Jacobian of a differentiable renderer, which we instantiate to be a voxel radiance field. This setup aggregates 2D scores at multiple camera viewpoints into a 3D score, and repurposes a pretrained 2D model for 3D data generation. We identify a technical challenge of distribution mismatch that arises in this application, and propose a novel estimation mechanism to resolve it. We run our algorithm on several off-the-shelf diffusion image generative models, including the recently released Stable Diffusion trained on the large-scale LAION dataset.
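The chaining itself reduces to the chain rule. With a toy linear renderer x = A θ per viewpoint, the Jacobian is A, and the aggregated 3D score is the sum of back-propagated 2D scores. A minimal NumPy sketch under that linear assumption (real renderers and score networks are nonlinear, and all names here are illustrative):

```python
import numpy as np

def sjc_gradient(theta, render_mats, score_fn):
    """Aggregate 2D scores from several viewpoints into a gradient on 3D params.

    Each viewpoint is modeled as a linear renderer x = A @ theta, so the chain
    rule gives d/dtheta = A.T @ score(x); summing over views yields the
    aggregated 3D score.
    """
    grad = np.zeros_like(theta)
    for A in render_mats:              # one matrix per camera viewpoint
        x = A @ theta                  # "render" the scene parameters
        grad += A.T @ score_fn(x)      # back-propagate the 2D score through J = A
    return grad

rng = np.random.default_rng(0)
theta = rng.normal(size=8)                           # toy 3D scene parameters
views = [rng.normal(size=(5, 8)) for _ in range(3)]  # toy per-view Jacobians
score = lambda x: -x                                 # score of a standard Gaussian
g = sjc_gradient(theta, views, score)
```

With the toy Gaussian score, ascending this gradient pulls every rendered view toward the 2D distribution's mode, which is the mechanism the method exploits with a pretrained diffusion score instead.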

Generating Part-Aware Editable 3D Shapes Without 3D Supervision
Tertikas, Konstantinos and Paschalidou, Despoina and Pan, Boxiao and Park, Jeong Joon and Uy, Mikaela Angelina and Emiris, Ioannis and Avrithis, Yannis and Guibas, Leonidas



Research question: How to generate high-quality 3D shapes with local control and editing capability.
Motivation: Existing methods can generate high-quality 3D shapes but lack local control and editing capability, limiting their use in content creation applications.
Method: PartNeRF, a novel part-aware generative model that requires no explicit 3D supervision. It generates objects as a set of locally defined NeRFs augmented with affine transformations. A hard assignment of rays to parts ensures that each ray's color is determined by a single NeRF, enabling independent manipulation and editing of different parts.
Results: Evaluations on various ShapeNet categories show that PartNeRF generates editable 3D objects with improved fidelity compared with previous part-based generative methods that require 3D supervision or rely on NeRFs.

Impressive progress in generative models and implicit representations gave rise to methods that can generate 3D shapes of high quality. However, being able to locally control and edit shapes is another essential property that can unlock several content creation applications. Local control can be achieved with part-aware models, but existing methods require 3D supervision and cannot produce textures. In this work, we devise PartNeRF, a novel part-aware generative model for editable 3D shape synthesis that does not require any explicit 3D supervision. Our model generates objects as a set of locally defined NeRFs, augmented with an affine transformation. This enables several editing operations such as applying transformations on parts, mixing parts from different objects etc. To ensure distinct, manipulable parts we enforce a hard assignment of rays to parts that makes sure that the color of each ray is only determined by a single NeRF. As a result, altering one part does not affect the appearance of the others. Evaluations on various ShapeNet categories demonstrate the ability of our model to generate editable 3D objects of improved fidelity, compared to previous part-based generative approaches that require 3D supervision or models relying on NeRFs.

High-Fidelity Facial Avatar Reconstruction From Monocular Video With Generative Priors
Bai, Yunpeng and Fan, Yanbo and Wang, Xuan and Zhang, Yong and Sun, Jingxiang and Yuan, Chun and Shan, Ying



Research question: How to reconstruct a high-fidelity facial avatar from a monocular video.
Motivation: Complex facial dynamics and the missing 3D information in monocular videos make high-fidelity facial avatar reconstruction a significant research challenge.
Method: A new Neural Radiance Field (NeRF)-based method for facial avatar reconstruction that exploits a 3D-aware generative prior. Unlike existing works that depend on a conditional deformation field for dynamic modeling, a personalized generative prior is learned, formulated as a local, low-dimensional subspace in the latent space of a 3D-GAN.
Results: The personalized generative prior can be constructed efficiently from a small set of facial images of a given individual. Once learned, it enables photo-realistic novel view rendering, and face reenactment can be realized by navigating the latent space. Compared with existing works, the method obtains superior novel view synthesis results and faithful face reenactment performance.

High-fidelity facial avatar reconstruction from a monocular video is a significant research problem in computer graphics and computer vision. Recently, Neural Radiance Field (NeRF) has shown impressive novel view rendering results and has been considered for facial avatar reconstruction. However, the complex facial dynamics and missing 3D information in monocular videos raise significant challenges for faithful facial reconstruction. In this work, we propose a new method for NeRF-based facial avatar reconstruction that utilizes 3D-aware generative prior. Different from existing works that depend on a conditional deformation field for dynamic modeling, we propose to learn a personalized generative prior, which is formulated as a local and low dimensional subspace in the latent space of 3D-GAN. We propose an efficient method to construct the personalized generative prior based on a small set of facial images of a given individual. After learning, it allows for photo-realistic rendering with novel views, and the face reenactment can be realized by performing navigation in the latent space. Our proposed method is applicable for different driven signals, including RGB images, 3DMM coefficients, and audio. Compared with existing works, we obtain superior novel view synthesis results and faithfully face reenactment performance. The code is available here https://github.com/bbaaii/HFA-GP.

Restoration of Hand-Drawn Architectural Drawings Using Latent Space Mapping With Degradation Generator
Choi, Nakkwan and Lee, Seungjae and Lee, Yongsik and Yang, Seungjoon



Research question: How to restore hand-drawn drawings of wooden built heritage.
Motivation: Hand-drawn drawings contain the most important original information but are often severely degraded over time.
Method: A novel restoration method based on vector quantized variational autoencoders. Latent space representations of drawings and noise are learned and used both to map noisy drawings to clean drawings for restoration and to generate authentic noisy drawings for data augmentation.
Results: Applied to drawings archived by the Cultural Heritage Administration, the restored drawings show significant quality improvement and allow more accurate interpretation of the information.

This work presents the restoration of drawings of wooden built heritage. Hand-drawn drawings contain the most important original information but are often severely degraded over time. A novel restoration method based on the vector quantized variational autoencoders is presented. Latent space representations of drawings and noise are learned, which are used to map noisy drawings to clean drawings for restoration and to generate authentic noisy drawings for data augmentation. The proposed method is applied to the drawings archived in the Cultural Heritage Administration. Restored drawings show significant quality improvement and allow more accurate interpretations of information.

DiffusionRig: Learning Personalized Priors for Facial Appearance Editing
Ding, Zheng and Zhang, Xuaner and Xia, Zhihao and Jebe, Lars and Tu, Zhuowen and Zhang, Xiuming



Research question: How to learn person-specific facial priors from a small number (e.g., 20) of portrait photos of the same person.
Motivation: Such priors make it possible to edit a specific person's facial appearance, such as expression and lighting, while preserving their identity and high-frequency facial details.
Method: The approach, dubbed DiffusionRig, is a diffusion model conditioned on, or "rigged by," crude 3D face models estimated from single in-the-wild images.
Results: Qualitative and quantitative experiments show that DiffusionRig outperforms existing approaches in both identity preservation and photorealism.

We address the problem of learning person-specific facial priors from a small number (e.g., 20) of portrait photos of the same person. This enables us to edit this specific person's facial appearance, such as expression and lighting, while preserving their identity and high-frequency facial details. Key to our approach, which we dub DiffusionRig, is a diffusion model conditioned on, or "rigged by," crude 3D face models estimated from single in-the-wild images by an off-the-shelf estimator. On a high level, DiffusionRig learns to map simplistic renderings of 3D face models to realistic photos of a given person. Specifically, DiffusionRig is trained in two stages: It first learns generic facial priors from a large-scale face dataset and then person-specific priors from a small portrait photo collection of the person of interest. By learning the CGI-to-photo mapping with such personalized priors, DiffusionRig can "rig" the lighting, facial expression, head pose, etc. of a portrait photo, conditioned only on coarse 3D models while preserving this person's identity and other high-frequency characteristics. Qualitative and quantitative experiments show that DiffusionRig outperforms existing approaches in both identity preservation and photorealism. Please see the project website: https://diffusionrig.github.io for the supplemental material, video, code, and data.

Delving StyleGAN Inversion for Image Editing: A Foundation Latent Space Viewpoint
Liu, Hongyu and Song, Yibing and Chen, Qifeng



Research question: How to embed an input image into the W, W+, and F spaces via StyleGAN inversion while simultaneously maintaining image fidelity and meaningful manipulation.
Motivation: Existing GAN inversion methods mainly explore the W+ and F spaces to improve reconstruction fidelity, overlooking W, the foundation latent space of StyleGAN.
Method: First find a proper latent code in the foundation latent space W, introducing contrastive learning to align W with the image space for proper latent code discovery. Then leverage a cross-attention encoder to transform the obtained latent code in W into W+ and F.
Results: Experiments show that exploring the foundation latent space W improves the representation ability of latent codes in W+ and features in F, yielding state-of-the-art reconstruction fidelity and editability on standard benchmarks.

GAN inversion and editing via StyleGAN maps an input image into the embedding spaces (W, W^+, and F) to simultaneously maintain image fidelity and meaningful manipulation. From latent space W to extended latent space W^+ to feature space F in StyleGAN, the editability of GAN inversion decreases while its reconstruction quality increases. Recent GAN inversion methods typically explore W^+ and F rather than W to improve reconstruction fidelity while maintaining editability. As W^+ and F are derived from W that is essentially the foundation latent space of StyleGAN, these GAN inversion methods focusing on W^+ and F spaces could be improved by stepping back to W. In this work, we propose to first obtain the proper latent code in foundation latent space W. We introduce contrastive learning to align W and the image space for proper latent code discovery. Then, we leverage a cross-attention encoder to transform the obtained latent code in W into W^+ and F, accordingly. Our experiments show that our exploration of the foundation latent space W improves the representation ability of latent codes in W^+ and features in F, which yields state-of-the-art reconstruction fidelity and editability results on the standard benchmarks. Project page: https://kumapowerliu.github.io/CLCAE.

GlassesGAN: Eyewear Personalization Using Synthetic Appearance Discovery and Targeted Subspace Modeling
Plesh, Richard and Peer, Peter and Struc, Vitomir



Research question: Develop a novel eyewear image editing framework that enables custom design of glasses.
Motivation: Current image editing frameworks fall short in output-image quality, edit realism, and continuous multi-style editing capability.
Method: GlassesGAN, a new image editing framework for custom glasses design. A Targeted Subspace Modelling (TSM) procedure, based on a novel mechanism for (synthetic) appearance discovery in the latent space of a pretrained GAN generator, constructs an eyeglasses-specific (latent) subspace for the editing framework to use. An appearance-constrained subspace initialization (SI) technique additionally centers the latent representation of the input image in a well-defined part of the constructed subspace to improve the reliability of the learned edits.
Results: GlassesGAN is tested on two high-resolution datasets (CelebA-HQ and SiblingsDB-HQf) and compared with three state-of-the-art baselines (InterfaceGAN, GANSpace, and MaskGAN). It convincingly outperforms all competing techniques while offering functionality, such as fine-grained multi-style editing, that none of the competitors provide. The source code for GlassesGAN is publicly released.

We present GlassesGAN, a novel image editing framework for custom design of glasses, that sets a new standard in terms of output-image quality, edit realism, and continuous multi-style edit capability. To facilitate the editing process with GlassesGAN, we propose a Targeted Subspace Modelling (TSM) procedure that, based on a novel mechanism for (synthetic) appearance discovery in the latent space of a pre-trained GAN generator, constructs an eyeglasses-specific (latent) subspace that the editing framework can utilize. Additionally, we also introduce an appearance-constrained subspace initialization (SI) technique that centers the latent representation of the given input image in the well-defined part of the constructed subspace to improve the reliability of the learned edits. We test GlassesGAN on two (diverse) high-resolution datasets (CelebA-HQ and SiblingsDB-HQf) and compare it to three state-of-the-art baselines, i.e., InterfaceGAN, GANSpace, and MaskGAN. The reported results show that GlassesGAN convincingly outperforms all competing techniques, while offering functionality (e.g., fine-grained multi-style editing) not available with any of the competitors. The source code for GlassesGAN is made publicly available.

Parametric Implicit Face Representation for Audio-Driven Facial Reenactment
Huang, Ricong and Lai, Peiwen and Qin, Yipeng and Li, Guanbin



Research question: Resolve the trade-off between interpretability and expressive power in audio-driven facial reenactment.
Motivation: Existing works employ either explicit intermediate face representations (e.g., 2D facial landmarks or 3D face models) or implicit ones (e.g., Neural Radiance Fields), and thus trade controllability against result quality.
Method: Propose a novel parametric implicit face representation and, on top of it, an audio-driven facial reenactment framework that is both controllable and able to generate high-quality talking heads. The parametric implicit representation parameterizes the implicit representation with the interpretable parameters of 3D face models, taking the best of both explicit and implicit methods. Several new techniques further improve the framework's three components: incorporating contextual information into the audio-to-expression-parameter encoding; using conditional image synthesis to parameterize the implicit representation, implemented with an innovative tri-plane structure for efficient learning; and formulating facial reenactment as a conditional image inpainting problem with a novel data augmentation technique to improve model generalizability.
Results: Extensive experiments demonstrate that the method generates more realistic results than previous approaches, with greater fidelity to speakers' identities and talking styles.

Audio-driven facial reenactment is a crucial technique that has a range of applications in film-making, virtual avatars and video conferences. Existing works either employ explicit intermediate face representations (e.g., 2D facial landmarks or 3D face models) or implicit ones (e.g., Neural Radiance Fields), thus suffering from the trade-offs between interpretability and expressive power, hence between controllability and quality of the results. In this work, we break these trade-offs with our novel parametric implicit face representation and propose a novel audio-driven facial reenactment framework that is both controllable and can generate high-quality talking heads. Specifically, our parametric implicit representation parameterizes the implicit representation with interpretable parameters of 3D face models, thereby taking the best of both explicit and implicit methods. In addition, we propose several new techniques to improve the three components of our framework, including i) incorporating contextual information into the audio-to-expression parameters encoding; ii) using conditional image synthesis to parameterize the implicit representation and implementing it with an innovative tri-plane structure for efficient learning; iii) formulating facial reenactment as a conditional image inpainting problem and proposing a novel data augmentation technique to improve model generalizability. Extensive experiments demonstrate that our method can generate more realistic results than previous methods with greater fidelity to the identities and talking styles of speakers.

Re-IQA: Unsupervised Learning for Image Quality Assessment in the Wild
Saha, Avinab and Mishra, Sandeep and Bovik, Alan C.



Research question: Automatic perceptual image quality assessment is a daily challenge that affects billions of internet and social media users.
Motivation: To advance research in this field, a Mixture of Experts approach trains two separate encoders to learn high-level content and low-level image-quality features in an unsupervised setting.
Method: The unique novelty of the approach is its ability to generate low-level representations of image quality that are complementary to the high-level features representing image content. The framework used to train the two encoders is called Re-IQA.
Results: The complementary low- and high-level image representations obtained from the Re-IQA framework are used to train a linear regression model that maps image representations to ground-truth quality scores. The method achieves state-of-the-art performance on multiple large-scale image quality assessment databases containing both real and synthetic distortions, demonstrating how deep neural networks can be trained in an unsupervised setting to produce perceptually relevant representations.

Automatic Perceptual Image Quality Assessment is a challenging problem that impacts billions of internet and social media users daily. To advance research in this field, we propose a Mixture of Experts approach to train two separate encoders to learn high-level content and low-level image quality features in an unsupervised setting. The unique novelty of our approach is its ability to generate low-level representations of image quality that are complementary to high-level features representing image content. We refer to the framework used to train the two encoders as Re-IQA. For Image Quality Assessment in the Wild, we deploy the complementary low and high-level image representations obtained from the Re-IQA framework to train a linear regression model, which is used to map the image representations to the ground truth quality scores (see Figure 1). Our method achieves state-of-the-art performance on multiple large-scale image quality assessment databases containing both real and synthetic distortions, demonstrating how deep neural networks can be trained in an unsupervised setting to produce perceptually relevant representations. We conclude from our experiments that the low and high-level features obtained are indeed complementary and positively impact the performance of the linear regressor. A public release of all the codes associated with this work will be made available on GitHub.

Catch Missing Details: Image Reconstruction With Frequency Augmented Variational Autoencoder
Lin, Xinmiao and Li, Yikang and Hsiao, Jenhao and Ho, Chiuman and Kong, Yu



Research question: In existing VQ-VAE models, reconstruction quality degrades rapidly as the compression rate rises.
Motivation: Higher compression rates cause greater loss of visual signal on the high-frequency spectrum, which reflects the fine detail in pixel space.
Method: Propose a Frequency Complement Module (FCM) architecture that captures the missing frequency information to improve reconstruction quality, and integrate it into the VQ-VAE structure to form the Frequency Augmented VAE (FA-VAE). A Dynamic Spectrum Loss (DSL) is also introduced to balance the various frequencies dynamically.
Results: Extensive reconstruction experiments on multiple benchmark datasets show that FA-VAE restores details more faithfully than existing methods. With a Cross-attention Autoregressive Transformer (CAT), it also achieves better semantic alignment and generation quality on image-text generation tasks.

The popular VQ-VAE models reconstruct images through learning a discrete codebook but suffer from a significant issue in the rapid quality degradation of image reconstruction as the compression rate rises. One major reason is that a higher compression rate induces more loss of visual signals on the higher frequency spectrum, which reflect the details in pixel space. In this paper, a Frequency Complement Module (FCM) architecture is proposed to capture the missing frequency information for enhancing reconstruction quality. The FCM can be easily incorporated into the VQ-VAE structure, and we refer to the new model as Frequency Augmented VAE (FA-VAE). In addition, a Dynamic Spectrum Loss (DSL) is introduced to guide the FCMs to balance between various frequencies dynamically for optimal reconstruction. FA-VAE is further extended to the text-to-image synthesis task, and a Cross-attention Autoregressive Transformer (CAT) is proposed to obtain more precise semantic attributes in texts. Extensive reconstruction experiments with different compression rates are conducted on several benchmark datasets, and the results demonstrate that the proposed FA-VAE is able to restore details more faithfully than SOTA methods. CAT also shows improved generation quality with better image-text semantic alignment.
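The claim that compression discards mainly high-frequency signal can be probed with a plain 2D FFT. The sketch below is a generic high-pass spectral penalty in the spirit of the DSL, not the paper's actual loss; the radial cutoff is an assumed knob rather than anything FA-VAE specifies.

```python
import numpy as np

def high_freq_loss(x, y, cutoff=0.25):
    """Compare a reconstruction y to an image x only on frequency
    components above a radial cutoff, i.e. the detail band that a
    heavily compressed codebook tends to drop. x, y: (H, W) arrays."""
    h, w = x.shape
    fy_, fx_ = np.meshgrid(np.fft.fftfreq(h), np.fft.fftfreq(w), indexing="ij")
    mask = np.sqrt(fx_**2 + fy_**2) > cutoff   # keep only high frequencies
    dx = np.abs(np.fft.fft2(x))                # magnitude spectra
    dy = np.abs(np.fft.fft2(y))
    return np.abs(dx[mask] - dy[mask]).mean()
```

A blurred copy of an image scores a strictly positive penalty against the original, while the original against itself scores zero, which is the behavior a frequency-complement objective exploits.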

RaBit: Parametric Modeling of 3D Biped Cartoon Characters With a Topological-Consistent Dataset
Luo, Zhongjin and Cai, Shengcai and Dong, Jinguo and Ming, Ruibo and Qiu, Liangdong and Zhan, Xiaohang and Han, Xiaoguang



Research question: How to effectively generate visually plausible 3D cartoon characters.
Motivation: Although existing learning-based methods have achieved unprecedented accuracy and efficiency in 3D real-human digitization, none focuses on modeling 3D biped cartoon characters, which are also in great demand in gaming and filmmaking.
Method: Introduce 3DBiCar, the first large-scale dataset of 3D biped cartoon characters, and RaBit, the corresponding parametric model. The dataset contains 1,500 high-quality, topologically consistent 3D textured models handcrafted by professional artists. Built on this data, RaBit combines a SMPL-like linear blend-shape model with a StyleGAN-based neural UV-texture generator, expressing shape, pose, and texture simultaneously.
Results: Applications including single-view reconstruction, sketch-based modeling, and 3D cartoon animation demonstrate the practicality of 3DBiCar and RaBit; experiments further verify the method's effectiveness both qualitatively and quantitatively.

Assisting people in efficiently producing visually plausible 3D characters has always been a fundamental research topic in computer vision and computer graphics. Recent learning-based approaches have achieved unprecedented accuracy and efficiency in the area of 3D real human digitization. However, none of the prior works focus on modeling 3D biped cartoon characters, which are also in great demand in gaming and filming. In this paper, we introduce 3DBiCar, the first large-scale dataset of 3D biped cartoon characters, and RaBit, the corresponding parametric model. Our dataset contains 1,500 topologically consistent high-quality 3D textured models which are manually crafted by professional artists. Built upon the data, RaBit is thus designed with a SMPL-like linear blend shape model and a StyleGAN-based neural UV-texture generator, simultaneously expressing the shape, pose, and texture. To demonstrate the practicality of 3DBiCar and RaBit, various applications are conducted, including single-view reconstruction, sketch-based modeling, and 3D cartoon animation. For the single-view reconstruction setting, we find a straightforward global mapping from input images to the output UV-based texture maps tends to lose detailed appearances of some local parts (e.g., nose, ears). Thus, a part-sensitive texture reasoner is adopted to make all important local areas perceived. Experiments further demonstrate the effectiveness of our method both qualitatively and quantitatively. 3DBiCar and RaBit are available at gaplab.cuhk.edu.cn/projects/RaBit.

Next3D: Generative Neural Texture Rasterization for 3D-Aware Head Avatars
Sun, Jingxiang and Wang, Xuan and Wang, Lizhen and Li, Xiaoyu and Zhang, Yong and Zhang, Hongwen and Liu, Yebin



Research question: How to synthesize high-fidelity, multi-view-consistent facial images from unstructured 2D images using a 3D Morphable Model (3DMM).
Motivation: For fine-grained control of facial attributes, current 3D-GAN methods either offer precise expression control but cannot handle topological changes caused by hair and accessories, or can model varied topologies but generalize poorly because of unconstrained deformation fields.
Method: Propose a novel 3D-GAN framework for unsupervised learning of high-quality, 3D-consistent facial avatars from unstructured 2D images. To achieve both deformation accuracy and topological flexibility, a 3D representation called Generative Texture-Rasterized Tri-planes is introduced: it learns generative neural textures on top of a parametric mesh template and rasterizes them onto three orthogonal view feature planes, forming a tri-plane feature representation for volume rendering.
Results: Experiments show state-of-the-art 3D-aware synthesis quality and animation ability. Moreover, serving as a 3D prior, this animatable 3D representation boosts multiple applications, including one-shot facial avatars and 3D-aware stylization.

3D-aware generative adversarial networks (GANs) synthesize high-fidelity and multi-view-consistent facial images using only collections of single-view 2D imagery. Towards fine-grained control over facial attributes, recent efforts incorporate 3D Morphable Face Model (3DMM) to describe deformation in generative radiance fields either explicitly or implicitly. Explicit methods provide fine-grained expression control but cannot handle topological changes caused by hair and accessories, while implicit ones can model varied topologies but have limited generalization caused by the unconstrained deformation fields. We propose a novel 3D GAN framework for unsupervised learning of generative, high-quality and 3D-consistent facial avatars from unstructured 2D images. To achieve both deformation accuracy and topological flexibility, we propose a 3D representation called Generative Texture-Rasterized Tri-planes. The proposed representation learns Generative Neural Textures on top of parametric mesh templates and then projects them into three orthogonal-viewed feature planes through rasterization, forming a tri-plane feature representation for volume rendering. In this way, we combine both fine-grained expression control of mesh-guided explicit deformation and the flexibility of implicit volumetric representation. We further propose specific modules for modeling the mouth interior, which is not taken into account by 3DMM. Our method demonstrates state-of-the-art 3D-aware synthesis quality and animation ability through extensive experiments. Furthermore, serving as 3D prior, our animatable 3D representation boosts multiple applications including one-shot facial avatars and 3D-aware stylization.

Linking Garment With Person via Semantically Associated Landmarks for Virtual Try-On
Yan, Keyu and Gao, Tingwei and Zhang, Hui and Xie, Chengjun



Research question: This paper proposes a novel virtual try-on algorithm, SAL-VTON, which links the garment with the person via semantically associated landmarks to alleviate misalignment.
Motivation: Existing virtual try-on techniques suffer from misalignment when modeling the overall deformation of garment and person.
Method: SAL-VTON links garment and person by locating a series of landmark pairs with the same local semantics on the in-shop garment image and the try-on image. These landmark pairs then effectively model the local semantic association between garment and person, compensating for the misalignment in the overall deformation.
Results: Experimental results show that SAL-VTON handles misalignment and outperforms existing methods both qualitatively and quantitatively. A new landmark dataset with a unified landmark-labeling rule for diverse garment styles is also proposed.

In this paper, a novel virtual try-on algorithm, dubbed SAL-VTON, is proposed, which links the garment with the person via semantically associated landmarks to alleviate misalignment. The semantically associated landmarks are a series of landmark pairs with the same local semantics on the in-shop garment image and the try-on image. Based on the semantically associated landmarks, SAL-VTON effectively models the local semantic association between garment and person, making up for the misalignment in the overall deformation of the garment. The outcome is achieved with a three-stage framework: 1) the semantically associated landmarks are estimated using the landmark localization model; 2) taking the landmarks as input, the warping model explicitly associates the corresponding parts of the garment and person for obtaining the local flow, thus refining the alignment in the global flow; 3) finally, a generator consumes the landmarks to better capture local semantics and control the try-on results. Moreover, we propose a new landmark dataset with a unified labelling rule of landmarks for diverse styles of garments. Extensive experimental results on popular datasets demonstrate that SAL-VTON can handle misalignment and outperform state-of-the-art methods both qualitatively and quantitatively. The dataset is available on https://modelscope.cn/datasets/damo/SAL-HG/summary.

ConZIC: Controllable Zero-Shot Image Captioning by Sampling-Based Polishing
Zeng, Zequn and Zhang, Hao and Lu, Ruiying and Wang, Dongsheng and Chen, Bo and Wang, Zhengjue



Research question: How to improve the diversity and inference speed of zero-shot image captioning (IC) and address its controllability.
Motivation: Existing zero-shot image captioning is effective, but its autoregressive generation and gradient-directed search mechanism limit caption diversity and inference speed, and controllability is not considered.
Method: Propose ConZIC, a controllable zero-shot image captioning framework whose core is GibbsBERT, a novel sampling-based non-autoregressive language model that can generate and continuously polish every word.
Results: Experiments show superior performance on both zero-shot IC and controllable zero-shot IC; in particular, ConZIC achieves about 5x faster generation and about 1.5x higher diversity scores than ZeroCap, with accurate generation under different control signals.

Zero-shot capability has been considered as a new revolution of deep learning, letting machines work on tasks without curated training data. As a good start and the only existing outcome of zero-shot image captioning (IC), ZeroCap abandons supervised training and sequentially searching every word in the caption using the knowledge of large-scale pre-trained models. Though effective, its autoregressive generation and gradient-directed searching mechanism limit the diversity of captions and inference speed, respectively. Moreover, ZeroCap does not consider the controllability issue of zero-shot IC. To move forward, we propose a framework for Controllable Zero-shot IC, named ConZIC. The core of ConZIC is a novel sampling-based non-autoregressive language model named GibbsBERT, which can generate and continuously polish every word. Extensive quantitative and qualitative results demonstrate the superior performance of our proposed ConZIC for both zero-shot IC and controllable zero-shot IC. Especially, ConZIC achieves about 5x faster generation speed than ZeroCap, and about 1.5x higher diversity scores, with accurate generation given different control signals.
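The "generate and continuously polish every word" idea can be sketched as a Gibbs-style sweep over caption positions. The toy below is a greedy variant (ConZIC samples; here we take the argmax for determinism), and `score_fn` and `vocab` are hypothetical stand-ins for GibbsBERT's masked conditional plus any control signal, not the paper's implementation.

```python
def gibbs_polish(caption, score_fn, vocab, sweeps=3):
    """Gibbs-style caption polishing: repeatedly revisit every position
    and replace the word there with the candidate the scorer likes best,
    conditioned on the rest of the sentence.

    caption : list of words (the current draft)
    score_fn(tokens, i, w) : score of word w at position i given tokens
    vocab : candidate words to consider at each position
    """
    caption = list(caption)
    for _ in range(sweeps):
        for i in range(len(caption)):
            # non-autoregressive: any position may be revised at any sweep
            caption[i] = max(vocab, key=lambda w: score_fn(caption, i, w))
    return caption
```

Because every position is revisited, early mistakes can be corrected by later sweeps, which autoregressive left-to-right decoding cannot do.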

EDGE: Editable Dance Generation From Music
Tseng, Jonathan and Castellon, Rodrigo and Liu, Karen



Research question: How to effectively generate physically plausible dance motions that match the music?
Motivation: Existing dance generation methods often struggle to produce physically plausible dances and offer only limited editing functionality.
Method: Propose EDGE, a transformer-based diffusion model paired with Jukebox, a strong music feature extractor, enabling fine-grained editing and generation of dance motion.
Results: Extensive experiments and user studies show that EDGE outperforms existing methods at generating physically plausible dances that match the music.

Dance is an important human art form, but creating new dances can be difficult and time-consuming. In this work, we introduce Editable Dance GEneration (EDGE), a state-of-the-art method for editable dance generation that is capable of creating realistic, physically-plausible dances while remaining faithful to the input music. EDGE uses a transformer-based diffusion model paired with Jukebox, a strong music feature extractor, and confers powerful editing capabilities well-suited to dance, including joint-wise conditioning, and in-betweening. We introduce a new metric for physical plausibility, and evaluate dance quality generated by our method extensively through (1) multiple quantitative metrics on physical plausibility, alignment, and diversity benchmarks, and more importantly, (2) a large-scale user study, demonstrating a significant improvement over previous state-of-the-art methods. Qualitative samples from our model can be found at our website.

HumanGen: Generating Human Radiance Fields With Explicit Priors
Jiang, Suyi and Jiang, Haoran and Wang, Ziyu and Luo, Haimin and Chen, Wenzheng and Xu, Lan



Research question: How to generate high-quality human radiance fields with detailed geometry and 360° realistic free-view rendering.
Motivation: High-quality human radiance field generation remains challenging, partly because existing methods adopt only limited human-related priors.
Method: Propose HumanGen, a novel 3D human generation scheme that, through the design of an "anchor image", explicitly marries 3D human generation with various priors from a 2D generator and a 3D reconstructor. A hybrid feature representation built on the anchor image bridges HumanGen's latent space with the existing 2D generator, and a pronged design disentangles the generation of geometry and appearance. With the aid of the anchor image, a 3D reconstructor is adapted for fine-grained detail synthesis, and a two-stage blending scheme boosts appearance generation.
Results: Extensive experiments show HumanGen is highly effective for state-of-the-art 3D human generation in terms of geometry detail, texture quality, and free-view performance. Notably, HumanGen can also incorporate various off-the-shelf 2D latent editing methods, seamlessly lifting them into 3D.

Recent years have witnessed the tremendous progress of 3D GANs for generating view-consistent radiance fields with photo-realism. Yet, high-quality generation of human radiance fields remains challenging, partially due to the limited human-related priors adopted in existing methods. We present HumanGen, a novel 3D human generation scheme with detailed geometry and 360° realistic free-view rendering. It explicitly marries the 3D human generation with various priors from the 2D generator and 3D reconstructor of humans through the design of "anchor image". We introduce a hybrid feature representation using the anchor image to bridge the latent space of HumanGen with the existing 2D generator. We then adopt a pronged design to disentangle the generation of geometry and appearance. With the aid of the anchor image, we adapt a 3D reconstructor for fine-grained details synthesis and propose a two-stage blending scheme to boost appearance generation. Extensive experiments demonstrate our effectiveness for state-of-the-art 3D human generation regarding geometry details, texture quality, and free-view performance. Notably, HumanGen can also incorporate various off-the-shelf 2D latent editing methods, seamlessly lifting them into 3D.

Towards Practical Plug-and-Play Diffusion Models
Go, Hyojun and Lee, Yunsung and Kim, Jin-Young and Lee, Seunghyun and Jeong, Myeongho and Lee, HyunSeung and Choi, Seungtaek



Research question: Diffusion-based generative models have achieved remarkable success in image generation, but directly using off-the-shelf guidance models fails because they perform poorly on noisy inputs.
Motivation: The current practice is to fine-tune guidance models on labeled data corrupted with noise, which has two problems: a single guidance model can hardly handle inputs with widely varying noise, and collecting labeled datasets hinders scaling to more tasks.
Method: This paper proposes a novel strategy that leverages multiple experts, each specialized in a particular noise range, to guide the reverse diffusion process at the corresponding timesteps. To avoid managing multiple networks and relying on labeled data, a practical guidance framework, Practical Plug-And-Play (PPAP), is presented, which leverages parameter-efficient fine-tuning and data-free knowledge transfer.
Results: ImageNet class-conditional generation experiments show the method successfully guides diffusion with few trainable parameters and no labeled data. Furthermore, image classifiers, depth estimators, and semantic segmentation models can guide the publicly available GLIDE model through this framework in a plug-and-play manner.

Diffusion-based generative models have achieved remarkable success in image generation. Their guidance formulation allows an external model to plug-and-play control the generation process for various tasks without fine-tuning the diffusion model. However, the direct use of publicly available off-the-shelf models for guidance fails due to their poor performance on noisy inputs. For that, the existing practice is to fine-tune the guidance models with labeled data corrupted with noises. In this paper, we argue that this practice has limitations in two aspects: (1) performing on inputs with extremely various noises is too hard for a single guidance model; (2) collecting labeled datasets hinders scaling up for various tasks. To tackle the limitations, we propose a novel strategy that leverages multiple experts where each expert is specialized in a particular noise range and guides the reverse process of the diffusion at its corresponding timesteps. However, as it is infeasible to manage multiple networks and utilize labeled data, we present a practical guidance framework termed Practical Plug-And-Play (PPAP), which leverages parameter-efficient fine-tuning and data-free knowledge transfer. We exhaustively conduct ImageNet class conditional generation experiments to show that our method can successfully guide diffusion with small trainable parameters and no labeled data. Finally, we show that image classifiers, depth estimators, and semantic segmentation models can guide publicly available GLIDE through our framework in a plug-and-play manner. Our code is available at https://github.com/riiid/PPAP.
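The multi-expert idea reduces to a routing rule over the timestep axis. The sketch below assumes equal-width noise bands, which is an illustrative simplification; the paper's actual partition of timesteps may differ.

```python
def route_expert(t, num_steps, experts):
    """Pick the guidance expert whose noise band covers timestep t.

    t         : current diffusion timestep, 0 <= t < num_steps
    num_steps : total number of diffusion timesteps
    experts   : list ordered from the low-noise band (small t) to the
                high-noise band (large t); each entry stands in for a
                guidance network specialized to that noise range
    """
    band = num_steps / len(experts)
    # clamp so t == num_steps - 1 still maps to the last expert
    return experts[min(int(t // band), len(experts) - 1)]
```

During sampling, each reverse step would call `route_expert(t, T, experts)` and apply only that expert's gradient, so no single network has to cope with the full noise spectrum.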

Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models
Wu, Qiucheng and Liu, Yujian and Zhao, Handong and Kale, Ajinkya and Bui, Trung and Yu, Tong and Lin, Zhe and Zhang, Yang and Chang, Shiyu



Research question: Explore whether diffusion models possess an inherent disentanglement capability, i.e., whether image style can be modified without changing the semantic content.
Motivation: Disentanglement is an important property of image generative models, allowing the same modification parameters to generalize across different images.
Method: Change the style description in the input text embedding while fixing the Gaussian random noises introduced during the denoising process, and observe whether the generated image can be shifted toward the target style without affecting the semantic content.
Results: Stable diffusion models are found to have this disentanglement capability. A simple, lightweight image-editing algorithm is further proposed that requires no fine-tuning of the diffusion model itself: only the mixing weights of the two text embeddings are optimized for style matching and content preservation. Experiments show the method can modify a wide range of attributes and outperforms diffusion-model-based image-editing algorithms that require fine-tuning.

Generative models have been widely studied in computer vision. Recently, diffusion models have drawn substantial attention due to the high quality of their generated images. A key desired property of image generative models is the ability to disentangle different attributes, which should enable modification towards a style without changing the semantic content, and the modification parameters should generalize to different images. Previous studies have found that generative adversarial networks (GANs) are inherently endowed with such disentanglement capability, so they can perform disentangled image editing without re-training or fine-tuning the network. In this work, we explore whether diffusion models are also inherently equipped with such a capability. Our finding is that for stable diffusion models, by partially changing the input text embedding from a neutral description (e.g., "a photo of person") to one with style (e.g., "a photo of person with smile") while fixing all the Gaussian random noises introduced during the denoising process, the generated images can be modified towards the target style without changing the semantic content. Based on this finding, we further propose a simple, light-weight image editing algorithm where the mixing weights of the two text embeddings are optimized for style matching and content preservation. This entire process only involves optimizing over around 50 parameters and does not fine-tune the diffusion model itself. Experiments show that the proposed method can modify a wide range of attributes, with the performance outperforming diffusion-model-based image-editing algorithms that require fine-tuning. The optimized weights generalize well to different images. Our code is publicly available at https://github.com/UCSB-NLP-Chang/DiffusionDisentanglement.
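The "around 50 parameters" being optimized are essentially per-token mixing weights between the neutral and the style text embeddings. A minimal sketch of that interpolation (the per-token weighting scheme is an assumption for illustration; the diffusion model itself stays frozen) is:

```python
import numpy as np

def mix_embeddings(e_neutral, e_style, w):
    """Per-token interpolation between the neutral text embedding and
    the style text embedding.

    e_neutral, e_style : (T, D) token embeddings of the two prompts
    w                  : (T,) mixing weights in [0, 1], the only
                         parameters being optimized
    """
    w = np.clip(np.asarray(w, dtype=float), 0.0, 1.0)[:, None]
    return (1.0 - w) * e_neutral + w * e_style
```

With w = 0 the sampler reproduces the neutral image, with w = 1 the fully styled one; the editing algorithm searches the weights in between for the best style match that still preserves content.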

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
Zhang, Wenxuan and Cun, Xiaodong and Wang, Xuan and Zhang, Yong and Shen, Xi and Guo, Yu and Shan, Ying and Wang, Fei



Research question: Generating talking-head videos from a face image and speech audio still faces many challenges, such as unnatural head motion, distorted expressions, and identity modification.
Motivation: These issues mainly arise from learning from coupled 2D motion fields, while explicitly using 3D information instead leads to stiff expressions and incoherent video.
Method: We propose SadTalker, which generates 3D motion coefficients (head pose, expression) of a 3DMM from audio and implicitly modulates a novel 3D-aware face renderer for talking-head generation. To learn realistic motion coefficients, the connections between audio and the different types of motion coefficients are modeled explicitly and separately: ExpNet learns accurate facial expressions from audio by distilling both coefficients and 3D-rendered faces, while PoseVAE, a conditional variational autoencoder, synthesizes head motion in different styles. Finally, the generated 3D motion coefficients are mapped to the unsupervised 3D keypoint space of the proposed face renderer to synthesize the final video.
Results: Extensive experiments show the superiority of our method in terms of motion and video quality.

Generating talking head videos through a face image and a piece of speech audio still contains many challenges, i.e., unnatural head movement, distorted expression, and identity modification. We argue that these issues are mainly caused by learning from the coupled 2D motion fields. On the other hand, explicitly using 3D information also suffers problems of stiff expression and incoherent video. We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face renderer for talking head generation. To learn the realistic motion coefficients, we explicitly model the connections between audio and different types of motion coefficients individually. Precisely, we present ExpNet to learn the accurate facial expression from audio by distilling both coefficients and 3D-rendered faces. As for the head pose, we design PoseVAE via a conditional VAE to synthesize head motion in different styles. Finally, the generated 3D motion coefficients are mapped to the unsupervised 3D keypoints space of the proposed face renderer to synthesize the final video. We conducted extensive experiments to show the superiority of our method in terms of motion and video quality.

Scaling Up GANs for Text-to-Image Synthesis
Kang, Minguk and Zhu, Jun-Yan and Zhang, Richard and Park, Jaesik and Shechtman, Eli and Paris, Sylvain and Park, Taesung



Research question: Can GANs be scaled up to benefit from large datasets like LAION?
Motivation: The recent success of text-to-image synthesis has attracted wide public attention and marked a fundamental shift in the preferred architecture for generative image models.
Method: We introduce GigaGAN, a new GAN architecture that far exceeds this limit, demonstrating that GANs remain a viable option for text-to-image synthesis.
Results: GigaGAN is orders of magnitude faster at inference, synthesizing a 512px image in only 0.13 seconds. It can also synthesize high-resolution images, e.g., a 16-megapixel image in 3.66 seconds. Finally, GigaGAN supports various latent-space editing applications such as latent interpolation, style mixing, and vector arithmetic operations.

The recent success of text-to-image synthesis has taken the world by storm and captured the general public's imagination. From a technical standpoint, it also marked a drastic change in the favored architecture to design generative image models. GANs used to be the de facto choice, with techniques like StyleGAN. With DALL-E 2, auto-regressive and diffusion models became the new standard for large-scale generative models overnight. This rapid shift raises a fundamental question: can we scale up GANs to benefit from large datasets like LAION? We find that naively increasing the capacity of the StyleGAN architecture quickly becomes unstable. We introduce GigaGAN, a new GAN architecture that far exceeds this limit, demonstrating GANs as a viable option for text-to-image synthesis. GigaGAN offers three major advantages. First, it is orders of magnitude faster at inference time, taking only 0.13 seconds to synthesize a 512px image. Second, it can synthesize high-resolution images, for example, 16-megapixel images in 3.66 seconds. Finally, GigaGAN supports various latent space editing applications such as latent interpolation, style mixing, and vector arithmetic operations.

A Soma Segmentation Benchmark in Full Adult Fly Brain
Liu, Xiaoyu and Hu, Bo and Li, Mingxing and Huang, Wei and Zhang, Yueyi and Xiong, Zhiwei



Research question: How to reconstruct neurons of the full adult fly brain from high-resolution electron microscopy (EM) data, in particular the accurate distribution and morphology of neural cell bodies (somas).
Motivation: Because no EM dataset is specifically annotated for somas, existing deep-learning methods cannot directly provide accurate soma distribution and morphology; moreover, the sheer size of EM data makes full-brain neuron reconstruction extremely time-consuming.
Method: First, a high-resolution EM dataset with fine-grained 3D manual soma annotations is produced. Based on this dataset, an efficient two-stage deep-learning algorithm is proposed to predict the precise locations and boundaries of 3D soma instances. Finally, a parallelized, high-throughput data-processing pipeline is deployed to run the algorithm over the full brain.
Results: Quantitative and qualitative benchmark comparisons on the test set validate the superiority of the method, and preliminary statistics of the reconstructed somas in the full adult fly brain are reported from a biological perspective.

Neuron reconstruction in a full adult fly brain from high-resolution electron microscopy (EM) data is regarded as a cornerstone for neuroscientists to explore how neurons inspire intelligence. As the central part of neurons, somas in the full brain indicate the origin of neurogenesis and neural functions. However, due to the absence of EM datasets specifically annotated for somas, existing deep learning-based neuron reconstruction methods cannot directly provide accurate soma distribution and morphology. Moreover, full brain neuron reconstruction remains extremely time-consuming due to the unprecedentedly large size of EM data. In this paper, we develop an efficient soma reconstruction method for obtaining accurate soma distribution and morphology information in a full adult fly brain. To this end, we first make a high-resolution EM dataset with fine-grained 3D manual annotations on somas. Relying on this dataset, we propose an efficient, two-stage deep learning algorithm for predicting accurate locations and boundaries of 3D soma instances. Further, we deploy a parallelized, high-throughput data processing pipeline for executing the above algorithm on the full brain. Finally, we provide quantitative and qualitative benchmark comparisons on the test set to validate the superiority of the proposed method, as well as preliminary statistics of the reconstructed somas in the full adult fly brain from the biological perspective. We release our code and dataset at https://github.com/liuxy1103/EMADS.

DiffPose: Toward More Reliable 3D Pose Estimation
Gong, Jia and Foo, Lin Geng and Fan, Zhipeng and Ke, Qiuhong and Rahmani, Hossein and Liu, Jun



Research question: Monocular 3D human pose estimation often exhibits high uncertainty and indeterminacy due to its inherent ambiguity and occlusion.
Motivation: Inspired by the effectiveness of diffusion models at generating high-quality images from noise, we explore a novel pose estimation framework (DiffPose) that formulates 3D pose estimation as a reverse diffusion process.
Method: DiffPose incorporates novel designs to facilitate the diffusion process for 3D pose estimation: a pose-specific initialization of pose uncertainty distributions, a Gaussian Mixture Model-based forward diffusion process, and a context-conditioned reverse diffusion process.
Results: The proposed DiffPose significantly outperforms existing methods on the widely used pose estimation benchmarks Human3.6M and MPI-INF-3DHP.

Monocular 3D human pose estimation is quite challenging due to the inherent ambiguity and occlusion, which often lead to high uncertainty and indeterminacy. On the other hand, diffusion models have recently emerged as an effective tool for generating high-quality images from noise. Inspired by their capability, we explore a novel pose estimation framework (DiffPose) that formulates 3D pose estimation as a reverse diffusion process. We incorporate novel designs into our DiffPose to facilitate the diffusion process for 3D pose estimation: a pose-specific initialization of pose uncertainty distributions, a Gaussian Mixture Model-based forward diffusion process, and a context-conditioned reverse diffusion process. Our proposed DiffPose significantly outperforms existing methods on the widely used pose estimation benchmarks Human3.6M and MPI-INF-3DHP. Project page: https://gongjia0208.github.io/Diffpose/.
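The GMM-based forward process can be sketched as standard forward diffusion with mixture noise in place of a single unit Gaussian. The mixture parameters below are placeholders for a distribution fit to pose uncertainty, and the schedule is the generic DDPM form, not DiffPose's exact design.

```python
import numpy as np

def gmm_forward_diffuse(pose, alpha_bar_t, means, weights, rng=None):
    """Forward-diffuse a 3D pose toward an indeterminate state.

    pose        : (J, 3) array of joint positions
    alpha_bar_t : cumulative noise-schedule product at timestep t
    means       : per-component noise means (hypothetical GMM fit)
    weights     : mixture weights summing to 1
    """
    if rng is None:
        rng = np.random.default_rng()
    # draw the noise from a Gaussian mixture instead of N(0, I)
    k = rng.choice(len(weights), p=weights)
    eps = rng.normal(loc=means[k], scale=1.0, size=pose.shape)
    return np.sqrt(alpha_bar_t) * pose + np.sqrt(1.0 - alpha_bar_t) * eps
```

At alpha_bar_t = 1 the pose is untouched; as alpha_bar_t decreases, the sample drifts toward the mixture that encodes pose uncertainty, which the reverse process then learns to invert.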

Lift3D: Synthesize 3D Training Data by Lifting 2D GAN to 3D Generative Radiance Field
Li, Leheng and Lian, Qing and Wang, Luozhou and Ma, Ningning and Chen, Ying-Cong



Research question: How to synthesize training data with 3D generative models to improve performance on 3D vision tasks?
Motivation: Existing NeRF-based 3D GANs struggle to generate high-resolution, photorealistic data that matches real-world scenes, owing to their generation pipelines and the lack of explicit 3D supervision.
Method: Propose Lift3D, an inverted 2D-to-3D generation framework that lifts a well-disentangled 2D GAN to a 3D object NeRF, providing explicit 3D information for the generated objects and thus accurate 3D annotations for downstream tasks.
Results: Augmentation experiments on autonomous driving datasets show that this data generation framework effectively improves the performance of 3D object detectors.

This work explores the use of 3D generative models to synthesize training data for 3D vision tasks. The key requirements of the generative models are that the generated data should be photorealistic to match the real-world scenarios, and the corresponding 3D attributes should be aligned with given sampling labels. However, we find that the recent NeRF-based 3D GANs hardly meet the above requirements due to their designed generation pipeline and the lack of explicit 3D supervision. In this work, we propose Lift3D, an inverted 2D-to-3D generation framework to achieve the data generation objectives. Lift3D has several merits compared to prior methods: (1) Unlike previous 3D GANs, whose output resolution is fixed after training, Lift3D can generalize to any camera intrinsics with higher-resolution and photorealistic output. (2) By lifting well-disentangled 2D GAN to 3D object NeRF, Lift3D provides explicit 3D information of generated objects, thus offering accurate 3D annotations for downstream tasks. We evaluate the effectiveness of our framework by augmenting autonomous driving datasets. Experimental results demonstrate that our data generation framework can effectively improve the performance of 3D object detectors. Code: len-li.github.io/lift3d-web

LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation
Zheng, Guangcong and Zhou, Xianpan and Li, Xuewei and Qi, Zhongang and Shan, Ying and Li, Xi



Research question: How to perform layout-to-image generation for complex multi-object scenes with strong control over both the global layout map and every detailed object.
Motivation: Existing diffusion models succeed at image synthesis, but achieving stronger control in layout-to-image generation for complex multi-object scenes remains challenging.
Method: Propose LayoutDiffusion, a diffusion model that constructs structural image patches with region information and transforms the patched image into a special layout to be fused with the normal layout in a unified form, overcoming the difficult multimodal fusion of image and layout. A Layout Fusion Module (LFM) and Object-aware Cross Attention (OaCA) are further designed to precisely control the spatially related information.
Results: Experiments show that LayoutDiffusion outperforms previous methods on FID and CAS by a relative 46.35% and 26.70% on COCO-stuff, and by 44.29% and 41.82% on VG.

Recently, diffusion models have achieved great success in image synthesis. However, when it comes to the layout-to-image generation where an image often has a complex scene of multiple objects, how to make strong control over both the global layout map and each detailed object remains a challenging task. In this paper, we propose a diffusion model named LayoutDiffusion that can obtain higher generation quality and greater controllability than the previous works. To overcome the difficult multimodal fusion of image and layout, we propose to construct a structural image patch with region information and transform the patched image into a special layout to fuse with the normal layout in a unified form. Moreover, Layout Fusion Module (LFM) and Object-aware Cross Attention (OaCA) are proposed to model the relationship among multiple objects and designed to be object-aware and position-sensitive, allowing for precisely controlling the spatially related information. Extensive experiments show that our LayoutDiffusion outperforms the previous SOTA methods on FID and CAS by a relative 46.35% and 26.70% on COCO-stuff, and 44.29% and 41.82% on VG. Code is available at https://github.com/ZGCTroy/LayoutDiffusion.

BBDM: Image-to-Image Translation With Brownian Bridge Diffusion Models
Li, Bo and Xue, Kaitao and Liu, Bin and Lai, Yu-Kun



Research question: This paper addresses image-to-image translation, an important and challenging problem in computer vision and image processing.
Motivation: Existing diffusion models show strong potential for high-quality image synthesis and image-to-image translation, but suffer noticeably from the gap between distinct domains.
Method: A novel image-to-image translation method based on the Brownian Bridge Diffusion Model (BBDM) is proposed, which models image-to-image translation as a stochastic Brownian bridge process and learns the mapping between two domains directly through a bidirectional diffusion process rather than a conditional generation process.
Results: Experimental results show that the proposed BBDM achieves competitive performance on various benchmarks, by both visual inspection and measurable metrics.

Image-to-image translation is an important and challenging problem in computer vision and image processing. Diffusion models (DMs) have shown great potential for high-quality image synthesis, and have gained competitive performance on the task of image-to-image translation. However, most of the existing diffusion models treat image-to-image translation as conditional generation processes, and suffer heavily from the gap between distinct domains. In this paper, a novel image-to-image translation method based on the Brownian Bridge Diffusion Model (BBDM) is proposed, which models image-to-image translation as a stochastic Brownian Bridge process, and learns the translation between two domains directly through the bidirectional diffusion process rather than a conditional generation process. To the best of our knowledge, it is the first work that proposes Brownian Bridge diffusion process for image-to-image translation. Experimental results on various benchmarks demonstrate that the proposed BBDM model achieves competitive performance through both visual inspection and measurable metrics.

Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting
Wang, Su and Saharia, Chitwan and Montgomery, Ceslee and Pont-Tuset, Jordi and Noy, Shai and Pellegrini, Stefano and Onoe, Yasumasa and Laszlo, Sarah and Fleet, David J. and Soricut, Radu and Baldridge, Jason and Norouzi, Mohammad and Anderson, Peter and Chan, William



Research question: How to generate edits that are faithful to the text prompt while remaining consistent with the input image?
Motivation: Supporting creative applications requires a technique for text-guided image editing.
Method: Imagen Editor, a cascaded diffusion model, is built by incorporating an object detector that proposes inpainting masks during training.
Results: Experimental results show that Imagen Editor's edits are faithful to the text prompts, perform well on objects, attributes, and scenes, and outperform models such as DALL-E 2 and Stable Diffusion.

Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to the input text prompt, while consistent with the input image. We present Imagen Editor, a cascaded diffusion model, built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by incorporating object detectors for proposing inpainting masks during training. In addition, text-guided image inpainting captures fine details in the input image by conditioning the cascaded pipeline on the original high resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object-masking during training leads to across-the-board improvements in text-image alignment -- such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion -- and, as a cohort, these models are better at object-rendering than text-rendering, and handle material/color/size attributes better than count/shape attributes.

OSRT: Omnidirectional Image Super-Resolution With Distortion-Aware Transformer
Yu, Fanghua and Wang, Xintao and Cao, Mingdeng and Li, Gen and Shan, Ying and Dong, Chao



Research question: How to increase the resolution of omnidirectional images for a richer immersive experience.
Motivation: Existing methods address the insufficient resolution of omnidirectional images through super-resolution on equirectangular projection (ERP) images, but they ignore the geometric properties of ERP in the degradation process, and their models hardly generalize to real ERP images.
Method: We propose Fisheye downsampling, which mimics the real-world imaging process and synthesizes more realistic low-resolution samples, and design a distortion-aware Transformer (OSRT) to modulate ERP distortions continuously and adaptively.
Results: Without a cumbersome process, OSRT outperforms previous methods by about 0.2 dB in PSNR. We also propose a convenient data augmentation strategy that synthesizes pseudo ERP images from plain images; this simple strategy alleviates the over-fitting of large networks and significantly boosts ODI super-resolution performance. Extensive experiments demonstrate the state-of-the-art performance of our OSRT.

Omnidirectional images (ODIs) have obtained lots of research interest for immersive experiences. Although ODIs require extremely high resolution to capture details of the entire scene, the resolutions of most ODIs are insufficient. Previous methods attempt to solve this issue by image super-resolution (SR) on equirectangular projection (ERP) images. However, they omit geometric properties of ERP in the degradation process, and their models can hardly generalize to real ERP images. In this paper, we propose Fisheye downsampling, which mimics the real-world imaging process and synthesizes more realistic low-resolution samples. Then we design a distortion-aware Transformer (OSRT) to modulate ERP distortions continuously and self-adaptively. Without a cumbersome process, OSRT outperforms previous methods by about 0.2dB on PSNR. Moreover, we propose a convenient data augmentation strategy, which synthesizes pseudo ERP images from plain images. This simple strategy can alleviate the over-fitting problem of large networks and significantly boost the performance of ODI SR. Extensive experiments have demonstrated the state-of-the-art performance of our OSRT.

KD-DLGAN: Data Limited Image Generation via Knowledge Distillation
Cui, Kaiwen and Yu, Yingchen and Zhan, Fangneng and Liao, Shengcai and Lu, Shijian and Xing, Eric P.



Research question: How to improve the image generation quality of Generative Adversarial Networks (GANs) with limited training data.
Motivation: With limited training data, the GAN discriminator often suffers from severe overfitting, which degrades generation quality and especially generation diversity.
Method: KD-GAN, a knowledge-distillation-based generation framework, introduces pre-trained vision-language models to train effective data-limited image generation models. It has two innovative designs: aggregated generative knowledge distillation, which mitigates discriminator overfitting by challenging the discriminator with harder learning tasks and distilling more generalizable knowledge from the pre-trained models; and correlated generative knowledge distillation, which improves generation diversity by distilling and preserving the diverse image-text correlations within the pre-trained models.
Results: Extensive experiments show that KD-GAN achieves superior image generation across multiple benchmarks, with consistent and substantial gains over the state of the art.

Generative Adversarial Networks (GANs) rely heavily on large-scale training data for training high-quality image generation models. With limited training data, the GAN discriminator often suffers from severe overfitting which directly leads to degraded generation especially in generation diversity. Inspired by the recent advances in knowledge distillation (KD), we propose KD-GAN, a knowledge-distillation based generation framework that introduces pre-trained vision-language models for training effective data-limited image generation models. KD-GAN consists of two innovative designs. The first is aggregated generative KD that mitigates the discriminator overfitting by challenging the discriminator with harder learning tasks and distilling more generalizable knowledge from the pre-trained models. The second is correlated generative KD that improves the generation diversity by distilling and preserving the diverse image-text correlation within the pre-trained models. Extensive experiments over multiple benchmarks show that KD-GAN achieves superior image generation with limited training data. In addition, KD-GAN complements the state-of-the-art with consistent and substantial performance gains. Note that codes will be released.

HouseDiffusion: Vector Floorplan Generation via a Diffusion Model With Discrete and Continuous Denoising
Shabani, Mohammad Amin and Hosseini, Sepidehsadat and Furukawa, Yasutaka



Research question: This paper proposes a novel vector floorplan generation approach that denoises the 2D coordinates of room/door corners with a diffusion model.
Motivation: Existing vector floorplan generation methods can neither precisely invert the continuous forward process nor establish geometric incident relationships such as parallelism, orthogonality, and corner-sharing.
Method: The proposed denoising model employs a Transformer architecture at its core, controls the attention masks based on the input graph constraint, and directly generates vector-graphics floorplans via a discrete and continuous denoising process.
Results: Evaluation on the RPLAN dataset shows significant improvements on all metrics over the state-of-the-art methods, while the approach is also capable of generating non-Manhattan structures and controlling the exact number of corners per room.

The paper presents a novel approach for vector-floorplan generation via a diffusion model, which denoises 2D coordinates of room/door corners with two inference objectives: 1) a single-step noise as the continuous quantity to precisely invert the continuous forward process; and 2) the final 2D coordinate as the discrete quantity to establish geometric incident relationships such as parallelism, orthogonality, and corner-sharing. Our task is graph-conditioned floorplan generation, a common workflow in floorplan design. We represent a floorplan as 1D polygonal loops, each of which corresponds to a room or a door. Our diffusion model employs a Transformer architecture at the core, which controls the attention masks based on the input graph-constraint and directly generates vector-graphics floorplans via a discrete and continuous denoising process. We have evaluated our approach on RPLAN dataset. The proposed approach makes significant improvements in all the metrics against the state-of-the-art with significant margins, while being capable of generating non-Manhattan structures and controlling the exact number of corners per room. We will share all our code and models.

Zero-Shot Text-to-Parameter Translation for Game Character Auto-Creation
Zhao, Rui and Li, Wei and Hu, Zhipeng and Li, Lincheng and Zou, Zhengxia and Shi, Zhenwei and Fan, Changjie



Research question: How to achieve zero-shot text-driven game character auto-creation.
Motivation: Existing in-game character auto-creation systems are mostly image-driven, optimizing facial parameters so that the rendered character resembles a reference photo, which limits personalization and customization.
Method: A novel text-to-parameter translation method (T2P) is proposed that leverages large-scale pre-trained multi-modal CLIP and neural rendering to search both continuous and discrete facial parameters in a unified framework.
Results: Experimental results show that T2P generates high-quality, vivid game characters and outperforms other state-of-the-art text-to-3D generation methods on both objective and subjective evaluations.

Recent popular Role-Playing Games (RPGs) saw the great success of character auto-creation systems. The bone-driven face model controlled by continuous parameters (like the position of bones) and discrete parameters (like the hairstyles) makes it possible for users to personalize and customize in-game characters. Previous in-game character auto-creation systems are mostly image-driven, where facial parameters are optimized so that the rendered character looks similar to the reference face photo. This paper proposes a novel text-to-parameter translation method (T2P) to achieve zero-shot text-driven game character auto-creation. With our method, users can create a vivid in-game character with arbitrary text description without using any reference photo or editing hundreds of parameters manually. In our method, taking the power of large-scale pre-trained multi-modal CLIP and neural rendering, T2P searches both continuous facial parameters and discrete facial parameters in a unified framework. Due to the discontinuous parameter representation, previous methods have difficulty in effectively learning discrete facial parameters. T2P, to our best knowledge, is the first method that can handle the optimization of both discrete and continuous parameters. Experimental results show that T2P can generate high-quality and vivid game characters with given text prompts. T2P outperforms other SOTA text-to-3D generation methods on both objective evaluations and subjective evaluations.

Generating Holistic 3D Human Motion From Speech
Yi, Hongwei and Liang, Hualin and Liu, Yifei and Cao, Qiong and Wen, Yandong and Bolkart, Timo and Tao, Dacheng and Black, Michael J.



Research question: How to generate 3D holistic body motion from human speech.
Motivation: Current models cannot simultaneously generate realistic and diverse facial expressions, body poses, and hand gestures corresponding to speech.
Method: A high-quality dataset of 3D holistic body meshes synchronized with speech is built, and a novel speech-to-motion generation framework is defined in which the face, body, and hands are modeled separately: an autoencoder for the face, and a compositional vector-quantized variational autoencoder (VQ-VAE) for body poses and hand gestures. A cross-conditional autoregressive model is further proposed to generate body poses and hand gestures, yielding coherent and realistic motions.
Results: Experiments and user studies show that the method achieves state-of-the-art performance both qualitatively and quantitatively.

This work addresses the problem of generating 3D holistic body motions from human speech. Given a speech recording, we synthesize sequences of 3D body poses, hand gestures, and facial expressions that are realistic and diverse. To achieve this, we first build a high-quality dataset of 3D holistic body meshes with synchronous speech. We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately. The separated modeling stems from the fact that face articulation strongly correlates with human speech, while body poses and hand gestures are less correlated. Specifically, we employ an autoencoder for face motions, and a compositional vector-quantized variational autoencoder (VQ-VAE) for the body and hand motions. The compositional VQ-VAE is key to generating diverse results. Additionally, we propose a cross conditional autoregressive model that generates body poses and hand gestures, leading to coherent and realistic motions. Extensive experiments and user studies demonstrate that our proposed approach achieves state-of-the-art performance both qualitatively and quantitatively. Our novel dataset and code will be released for research purposes.

Unifying Layout Generation With a Decoupled Diffusion Model
Hui, Mude and Zhang, Zhizheng and Zhang, Xiaoyi and Xie, Wenxuan and Wang, Yuwang and Lu, Yan



Research question: This paper aims to unify the various subtasks of layout generation, including conditional and unconditional generation.
Motivation: Layout generation is a crucial task for reducing the burden of heavy-duty graphic design work, but diverse application scenarios pose the challenge of unifying its subtasks.
Method: A Layout Diffusion Generative Model (LDGM) is proposed that achieves this unification with a single decoupled diffusion model. LDGM views a layout with arbitrary missing or coarse element attributes as an intermediate diffusion state from a complete layout. Since different attributes have their own semantics and characteristics, their diffusion processes are decoupled to improve the diversity of training samples, while the reverse process is learned jointly to exploit global-scope context for generation.
Results: Experimental results show that LDGM outperforms existing layout generation models in both functionality and performance.

Layout generation aims to synthesize realistic graphic scenes consisting of elements with different attributes including category, size, position, and between-element relation. It is a crucial task for reducing the burden on heavy-duty graphic design works for formatted scenes, e.g., publications, documents, and user interfaces (UIs). Diverse application scenarios impose a big challenge in unifying various layout generation subtasks, including conditional and unconditional generation. In this paper, we propose a Layout Diffusion Generative Model (LDGM) to achieve such unification with a single decoupled diffusion model. LDGM views a layout of arbitrary missing or coarse element attributes as an intermediate diffusion status from a completed layout. Since different attributes have their individual semantics and characteristics, we propose to decouple the diffusion processes for them to improve the diversity of training samples and learn the reverse process jointly to exploit global-scope contexts for facilitating generation. As a result, our LDGM can generate layouts either from scratch or conditional on arbitrary available attributes. Extensive qualitative and quantitative experiments demonstrate our proposed LDGM outperforms existing layout generation models in both functionality and performance.

Human Guided Ground-Truth Generation for Realistic Image Super-Resolution
Chen, Du and Liang, Jie and Zhang, Xindong and Liu, Ming and Zeng, Hui and Zhang, Lei



Research question: How to generate ground-truth (GT) images for training realistic image super-resolution (Real-ISR) models is a critical issue.
Motivation: Existing methods mostly take a set of high-resolution (HR) images as GTs and apply various degradations to simulate their low-resolution (LR) counterparts. Despite great progress, this LR-HR pair generation scheme has several limitations. First, the perceptual quality of the HR images may not be high enough, limiting the quality of Real-ISR outputs. Second, existing schemes do not adequately consider human perception in GT generation, and the trained models tend to produce over-smoothed results or unpleasant artifacts.
Method: We propose a human-guided GT generation scheme. We first train multiple image enhancement models to improve the perceptual quality of HR images, so that one LR image has multiple HR counterparts. Human subjects then annotate the high-quality regions of the enhanced HR images as GTs and label regions with unpleasant artifacts as negative samples. A human-guided GT dataset with both positive and negative samples is then constructed, and a loss function is proposed to train the Real-ISR models.
Results: Experiments show that Real-ISR models trained on our dataset produce perceptually more realistic results with fewer artifacts. The dataset and code are available at https://github.com/ChrisDud0257/HGGT.

How to generate the ground-truth (GT) image is a critical issue for training realistic image super-resolution (Real-ISR) models. Existing methods mostly take a set of high-resolution (HR) images as GTs and apply various degradations to simulate their low-resolution (LR) counterparts. Though great progress has been achieved, such an LR-HR pair generation scheme has several limitations. First, the perceptual quality of HR images may not be high enough, limiting the quality of Real-ISR outputs. Second, existing schemes do not consider much human perception in GT generation, and the trained models tend to produce over-smoothed results or unpleasant artifacts. With the above considerations, we propose a human guided GT generation scheme. We first elaborately train multiple image enhancement models to improve the perceptual quality of HR images, and enable one LR image having multiple HR counterparts. Human subjects are then involved to annotate the high quality regions among the enhanced HR images as GTs, and label the regions with unpleasant artifacts as negative samples. A human guided GT image dataset with both positive and negative samples is then constructed, and a loss function is proposed to train the Real-ISR models. Experiments show that the Real-ISR models trained on our dataset can produce perceptually more realistic results with less artifacts. Dataset and codes can be found at https://github.com/ChrisDud0257/HGGT.

SinGRAF: Learning a 3D Generative Radiance Field for a Single Scene
Son, Minjung and Park, Jeong Joon and Guibas, Leonidas and Wetzstein, Gordon



Research question: This paper trains SinGRAF, a 3D-aware generative model, from a few input images of a single scene to generate 3D scenes with varying layouts.
Motivation: Existing 3D generative models require large amounts of training data; by training on a few input images of a single scene, SinGRAF makes better use of limited data.
Method: Building on recent 3D GAN architectures, a novel progressive-scale patch discrimination approach is introduced for training.
Results: Experimental results show that SinGRAF outperforms the closest related works in both quality and diversity.

Generative models have shown great promise in synthesizing photorealistic 3D objects, but they require large amounts of training data. We introduce SinGRAF, a 3D-aware generative model that is trained with a few input images of a single scene. Once trained, SinGRAF generates different realizations of this 3D scene that preserve the appearance of the input while varying scene layout. For this purpose, we build on recent progress in 3D GAN architectures and introduce a novel progressive-scale patch discrimination approach during training. With several experiments, we demonstrate that the results produced by SinGRAF outperform the closest related works in both quality and diversity by a large margin.

Dimensionality-Varying Diffusion Process
Zhang, Han and Feng, Ruili and Yang, Zhantao and Huang, Lianghua and Liu, Yu and Zhang, Yifei and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Cheng, Fan



Research question: Diffusion models require the signal at each step to have the same dimension when generating new data; the authors argue that, given the spatial redundancy of image signals, a high dimensionality need not be maintained throughout the evolution process.
Motivation: Considering the spatial redundancy of image signals, high dimensionality is unnecessary, especially in the early generation phase.
Method: The forward diffusion process is theoretically generalized via signal decomposition. Concretely, an image is decomposed into multiple orthogonal components, and the attenuation of each component is controlled when perturbing the image. As the noise strength increases, the inconsequential components can be diminished, so a lower-dimensional signal can represent the source with barely any information loss.
Results: This approach substantially reduces computational cost and achieves on-par or better synthesis performance than baseline methods on a range of datasets. It also facilitates high-resolution image synthesis and improves the FID of a diffusion model trained on FFHQ at 1024x1024 resolution from 52.40 to 10.46.

Diffusion models, which learn to reverse a signal destruction process to generate new data, typically require the signal at each step to have the same dimension. We argue that, considering the spatial redundancy in image signals, there is no need to maintain a high dimensionality in the evolution process, especially in the early generation phase. To this end, we make a theoretical generalization of the forward diffusion process via signal decomposition. Concretely, we manage to decompose an image into multiple orthogonal components and control the attenuation of each component when perturbing the image. That way, along with the noise strength increasing, we are able to diminish those inconsequential components and thus use a lower-dimensional signal to represent the source, barely losing information. Such a reformulation allows to vary dimensions in both training and inference of diffusion models. Extensive experiments on a range of datasets suggest that our approach substantially reduces the computational cost and achieves on-par or even better synthesis performance compared to baseline methods. We also show that our strategy facilitates high-resolution image synthesis and improves FID of diffusion model trained on FFHQ at 1024x1024 resolution from 52.40 to 10.46. Code is available at https://github.com/damo-vilab/dvdp.
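A minimal numpy sketch of the component-wise attenuation idea, assuming an explicit orthonormal basis and a user-supplied attenuation vector (in the paper these would follow from the diffusion schedule; names and signature are illustrative):

```python
import numpy as np

def attenuated_perturbation(x, basis, attenuation, noise_scale, rng=None):
    """Decompose x over an orthonormal basis, decay each component by its own
    attenuation factor, drop fully-attenuated components, then add noise.
    The surviving coefficients form a lower-dimensional stand-in for x."""
    rng = rng if rng is not None else np.random.default_rng(0)
    coeffs = basis.T @ x              # orthogonal decomposition
    coeffs = attenuation * coeffs     # per-component decay
    kept = coeffs[attenuation > 0]    # dimensionality reduction
    return kept + noise_scale * rng.standard_normal(kept.shape)
```

Once a component's attenuation reaches zero it carries no signal, so discarding its coefficient loses no information while shrinking the state the network must denoise.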

RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation
Anciukevičius, Titas



Research question: Current image diffusion models achieve state-of-the-art performance for both conditional and unconditional image generation, but do not support tasks required for 3D understanding, such as view-consistent 3D generation or single-view object reconstruction.
Motivation: This paper presents RenderDiffusion, the first diffusion model for 3D generation and inference, trained using only monocular 2D supervision.
Method: Central to the method is a novel image denoising architecture that generates and renders an intermediate three-dimensional representation of the scene in each denoising step. This enforces a strong inductive structure in the diffusion process, providing a 3D-consistent representation while requiring only 2D supervision.
Results: RenderDiffusion is evaluated on FFHQ, AFHQ, ShapeNet, and CLEVR, showing competitive performance for 3D scene generation and for inferring 3D scenes from 2D images. In addition, the diffusion-based approach allows 3D scenes to be edited with 2D inpainting.

Diffusion models currently achieve state-of-the-art performance for both conditional and unconditional image generation. However, so far, image diffusion models do not support tasks required for 3D understanding, such as view-consistent 3D generation or single-view object reconstruction. In this paper, we present RenderDiffusion, the first diffusion model for 3D generation and inference, trained using only monocular 2D supervision. Central to our method is a novel image denoising architecture that generates and renders an intermediate three-dimensional representation of a scene in each denoising step. This enforces a strong inductive structure within the diffusion process, providing a 3D consistent representation while only requiring 2D supervision. The resulting 3D representation can be rendered from any view. We evaluate RenderDiffusion on FFHQ, AFHQ, ShapeNet and CLEVR datasets, showing competitive performance for generation of 3D scenes and inference of 3D scenes from 2D images. Additionally, our diffusion-based approach allows us to use 2D inpainting to edit 3D scenes.

Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures
Metzer, Gal and Richardson, Elad and Patashnik, Or and Giryes, Raja and Cohen-Or, Daniel



Research question: How to guide 3D model generation with text?
Motivation: Rapid progress in text-guided image generation in recent years has inspired major breakthroughs in text-guided shape generation.
Method: Score distillation is applied to computationally efficient Latent Diffusion Models built on a pretrained autoencoder, and the NeRF model is brought into the latent space, yielding a Latent-NeRF. A Sketch-Shape is further used as a constraint to directly guide the 3D generation process of the Latent-NeRF.
Results: Experiments show that the combined text and shape guidance increases control over the generation process. Moreover, score distillation can be applied directly to 3D meshes, generating high-quality textures on a given geometry.

Text-guided image generation has progressed rapidly in recent years, inspiring major breakthroughs in text-guided shape generation. Recently, it has been shown that using score distillation, one can successfully text-guide a NeRF model to generate a 3D object. We adapt the score distillation to the publicly available, and computationally efficient, Latent Diffusion Models, which apply the entire diffusion process in a compact latent space of a pretrained autoencoder. As NeRFs operate in image space, a naive solution for guiding them with latent score distillation would require encoding to the latent space at each guidance step. Instead, we propose to bring the NeRF to the latent space, resulting in a Latent-NeRF. Analyzing our Latent-NeRF, we show that while Text-to-3D models can generate impressive results, they are inherently unconstrained and may lack the ability to guide or enforce a specific 3D structure. To assist and direct the 3D generation, we propose to guide our Latent-NeRF using a Sketch-Shape: an abstract geometry that defines the coarse structure of the desired object. Then, we present means to integrate such a constraint directly into a Latent-NeRF. This unique combination of text and shape guidance allows for increased control over the generation process. We also show that latent score distillation can be successfully applied directly on 3D meshes. This allows for generating high-quality textures on a given geometry. Our experiments validate the power of our different forms of guidance and the efficiency of using latent rendering.

Learning Generative Structure Prior for Blind Text Image Super-Resolution
Li, Xiaoming and Zuo, Wangmeng and Loy, Chen Change



Research question: Blind text image super-resolution (SR) is challenging because it must cope with diverse font styles and unknown degradation.
Motivation: Existing methods perform character recognition in parallel to regularize the SR task, through either loss constraints or intermediate feature conditions, but such high-level priors can still fail under severe degradation.
Method: We present a novel prior that focuses more on character structure. Specifically, we learn to encapsulate rich and diverse structures in a StyleGAN and exploit this generative structure prior for restoration. To restrict the generative space of StyleGAN so that it obeys character structure while remaining flexible to different font styles, we store the discrete features for each character in a codebook; the code then drives the StyleGAN to generate high-resolution structural details that aid text SR.
Results: Compared with priors based on character recognition, the proposed structure prior exerts stronger character-specific guidance to restore faithful and precise strokes of a designated character. Extensive experiments on synthetic and real datasets demonstrate the compelling performance of the proposed generative structure prior in facilitating robust text SR.

Blind text image super-resolution (SR) is challenging as one needs to cope with diverse font styles and unknown degradation. To address the problem, existing methods perform character recognition in parallel to regularize the SR task, either through a loss constraint or intermediate feature condition. Nonetheless, the high-level prior could still fail when encountering severe degradation. The problem is further compounded given characters of complex structures, e.g., Chinese characters that combine multiple pictographic or ideographic symbols into a single character. In this work, we present a novel prior that focuses more on the character structure. In particular, we learn to encapsulate rich and diverse structures in a StyleGAN and exploit such generative structure priors for restoration. To restrict the generative space of StyleGAN so that it obeys the structure of characters yet remains flexible in handling different font styles, we store the discrete features for each character in a codebook . The code subsequently drives the StyleGAN to generate high-resolution structural details to aid text SR. Compared to priors based on character recognition, the proposed structure prior exerts stronger character-specific guidance to restore faithful and precise strokes of a designated character. Extensive experiments on synthetic and real datasets demonstrate the compelling performance of the proposed generative structure prior in facilitating robust text SR. Our code is available at https://github.com/csxmli2016/MARCONet.
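The codebook step above can be sketched as a nearest-neighbour quantization, the standard mechanism for discrete codebooks; the function name and L2 metric are assumptions, not necessarily the paper's exact lookup:

```python
import numpy as np

def quantize_to_codebook(feature, codebook):
    """Replace a continuous character feature with its nearest codebook entry;
    the discrete index would then drive the generative structure prior."""
    dists = ((codebook - feature) ** 2).sum(axis=1)  # squared L2 to each entry
    idx = int(np.argmin(dists))
    return idx, codebook[idx]
```

Snapping features to a finite set of codes is what restricts the generative space to valid character structures while leaving font style to the generator.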

LayoutDM: Discrete Diffusion Model for Controllable Layout Generation
Inoue, Naoto and Kikuchi, Kotaro and Simo-Serra, Edgar and Otani, Mayu and Yamaguchi, Kota



Research question: This paper aims to solve a broad range of layout generation tasks in a single model based on discrete state-space diffusion models.
Motivation: Layout generation must handle various optional constraints, such as the type or position of a specific element, which calls for a model that can handle structured layout data.
Method: LayoutDM, a layout generation model based on discrete state-space diffusion models, is proposed. It naturally handles structured layout data and progressively infers a noiseless layout from the initial input through a discrete denoising process.
Results: Experimental results show that LayoutDM successfully generates high-quality layouts and outperforms both task-specific and task-agnostic baselines on several layout tasks.

Controllable layout generation aims at synthesizing plausible arrangement of element bounding boxes with optional constraints, such as type or position of a specific element. In this work, we try to solve a broad range of layout generation tasks in a single model that is based on discrete state-space diffusion models. Our model, named LayoutDM, naturally handles the structured layout data in the discrete representation and learns to progressively infer a noiseless layout from the initial input, where we model the layout corruption process by modality-wise discrete diffusion. For conditional generation, we propose to inject layout constraints in the form of masking or logit adjustment during inference. We show in the experiments that our LayoutDM successfully generates high-quality layouts and outperforms both task-specific and task-agnostic baselines on several layout tasks.
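The logit-adjustment form of constraint injection mentioned in the abstract can be sketched as a masked softmax; the function name and hard-mask behaviour are illustrative assumptions (the paper also supports soft adjustments):

```python
import numpy as np

def constrained_probs(logits, allowed):
    """Turn per-token logits into a sampling distribution that respects a hard
    layout constraint: tokens with allowed=False receive zero probability."""
    masked = np.where(allowed, logits, -np.inf)
    shifted = masked - masked.max(axis=-1, keepdims=True)  # stable softmax
    probs = np.exp(shifted)
    return probs / probs.sum(axis=-1, keepdims=True)
```

Because the constraint acts only on the sampling distribution, the same trained model serves both conditional and unconditional generation.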

PREIM3D: 3D Consistent Precise Image Attribute Editing From a Single Image
Li, Jianhui and Li, Jianmin and Zhang, Haoji and Liu, Shilong and Wang, Zhengyi and Xiao, Zihao and Zheng, Kaiwen and Zhu, Jun



Research question: This paper studies 3D-aware image attribute editing, which has wide practical applications.
Motivation: Although existing methods achieve promising results near the input view, images produced at large camera poses still suffer from 3D inconsistency and imprecise attribute editing.
Method: We train a shared encoder to map all images and propose two novel methods to maintain 3D consistency and subject identity at large camera poses. We also compare the latent space and the inversion manifold of GAN models, finding that editing in the inversion manifold yields better results.
Results: Experimental results show that our method produces more 3D-consistent images and achieves more precise image editing than previous work.

We study the 3D-aware image attribute editing problem in this paper, which has wide applications in practice. Recent methods solved the problem by training a shared encoder to map images into a 3D generator's latent space or by per-image latent code optimization and then edited images in the latent space. Despite their promising results near the input view, they still suffer from the 3D inconsistency of produced images at large camera poses and imprecise image attribute editing, like affecting unspecified attributes during editing. For more efficient image inversion, we train a shared encoder for all images. To alleviate 3D inconsistency at large camera poses, we propose two novel methods, an alternating training scheme and a multi-view identity loss, to maintain 3D consistency and subject identity. As for imprecise image editing, we attribute the problem to the gap between the latent space of real images and that of generated images. We compare the latent space and inversion manifold of GAN models and demonstrate that editing in the inversion manifold can achieve better results in both quantitative and qualitative evaluations. Extensive experiments show that our method produces more 3D consistent images and achieves more precise image editing than previous work. Source code and pretrained models can be found on our project page: https://mybabyyh.github.io/Preim3D.

MaskSketch: Unpaired Structure-Guided Masked Image Generation
Bashkirova, Dina and Lezama, José and Sohn, Kihyuk and Saenko, Kate and Essa, Irfan



Research question: Current image generation methods condition mainly on labels or text prompts, which limits control over the generation result.
Motivation: MaskSketch, an image generation method, is proposed that uses a guiding sketch as an extra conditioning signal during sampling, enabling spatially conditioned generation.
Method: It utilizes a pre-trained masked generative transformer, requiring no model training or paired supervision, and works with input sketches of different levels of abstraction. Observing that the intermediate self-attention maps of a masked generative transformer encode important structural information of the input image, such as scene layout and object shape, a novel structure-based sampling method is proposed to enable structure-guided generation.
Results: Experimental results show that MaskSketch achieves high image realism and good fidelity to the guiding structure. Evaluated on standard benchmark datasets, MaskSketch outperforms state-of-the-art methods for sketch-to-image translation as well as unpaired image-to-image translation.

Recent conditional image generation methods produce images of remarkable diversity, fidelity and realism. However, the majority of these methods allow conditioning only on labels or text prompts, which limits their level of control over the generation result. In this paper, we introduce MaskSketch, an image generation method that allows spatial conditioning of the generation result using a guiding sketch as an extra conditioning signal during sampling. MaskSketch utilizes a pre-trained masked generative transformer, requiring no model training or paired supervision, and works with input sketches of different levels of abstraction. We show that intermediate self-attention maps of a masked generative transformer encode important structural information of the input image, such as scene layout and object shape, and we propose a novel sampling method based on this observation to enable structure-guided generation. Our results show that MaskSketch achieves high image realism and fidelity to the guiding structure. Evaluated on standard benchmark datasets, MaskSketch outperforms state-of-the-art methods for sketch-to-image translation, as well as unpaired image-to-image translation approaches. The code can be found on our project website: https://masksketch.github.io/

Sequential Training of GANs Against GAN-Classifiers Reveals Correlated ''Knowledge Gaps'' Present Among Independently Trained GAN Instances
Pathak, Arkanath and Dufour, Nicholas



Research question: This study explores the "knowledge gaps" present in GAN training and the effect of iteratively training GAN-classifiers and GANs that "fool" those classifiers to fill the gaps.
Motivation: Prior work has shown that knowledge gaps (out-of-distribution artifacts across samples) exist in GAN training, demonstrated by training GAN-classifiers distinct from the co-trained discriminator. This study attempts to fill these gaps by iteratively training GAN-classifiers and classifier-fooling GANs.
Method: Experiments are conducted in two settings: a small DCGAN architecture trained on low-dimensional images (MNIST), and StyleGAN2, a state-of-the-art GAN architecture trained on high-dimensional images (FFHQ). By iteratively training GAN-classifiers and classifier-fooling GANs, the effects on GAN training dynamics, output quality, and GAN-classifier generalization are examined.
Results: The study finds that DCGAN cannot effectively "fool" a held-out GAN-classifier without compromising output quality, whereas StyleGAN2 can fool held-out classifiers with no change in output quality, an effect that persists over multiple rounds of GAN/classifier training and appears to reveal an ordering over optima in the generator parameter space. Finally, different classifier architectures are studied, showing that the architecture of the GAN-classifier strongly influences the set of artifacts it learns.

Modern Generative Adversarial Networks (GANs) generate realistic images remarkably well. Previous work has demonstrated the feasibility of "GAN-classifiers" that are distinct from the co-trained discriminator, and operate on images generated from a frozen GAN. That such classifiers work at all affirms the existence of "knowledge gaps" (out-of-distribution artifacts across samples) present in GAN training. We iteratively train GAN-classifiers and train GANs that "fool" the classifiers (in an attempt to fill the knowledge gaps), and examine the effect on GAN training dynamics, output quality, and GAN-classifier generalization. We investigate two settings, a small DCGAN architecture trained on low dimensional images (MNIST), and StyleGAN2, a SOTA GAN architecture trained on high dimensional images (FFHQ). We find that the DCGAN is unable to effectively fool a held-out GAN-classifier without compromising the output quality. However, StyleGAN2 can fool held-out classifiers with no change in output quality, and this effect persists over multiple rounds of GAN/classifier training which appears to reveal an ordering over optima in the generator parameter space. Finally, we study different classifier architectures and show that the architecture of the GAN-classifier has a strong influence on the set of its learned artifacts.

Lookahead Diffusion Probabilistic Models for Refining Mean Estimation
Zhang, Guoqiang and Niwa, Kenta and Kleijn, W. Bastiaan



Research question: How to improve the accuracy of the mean estimation of the conditional Gaussian distributions in diffusion probabilistic models.
Motivation: Existing diffusion probabilistic models do not exploit the correlation in the deep neural network outputs over subsequent timesteps to refine the mean estimation of the conditional Gaussian distributions.
Method: Lookahead diffusion probabilistic models (LA-DPMs) are proposed, which introduce an additional connection over two consecutive timesteps in the backward process and extrapolate the two most recent estimates of the data sample x to obtain a more accurate estimate.
Results: Experiments show that plugging this additional connection into DDPM, DDIM, DEIS, S-PNDM, and high-order DPM-Solvers yields a significant improvement in Frechet inception distance (FID) score.

We propose lookahead diffusion probabilistic models (LA-DPMs) to exploit the correlation in the outputs of the deep neural networks (DNNs) over subsequent timesteps in diffusion probabilistic models (DPMs) to refine the mean estimation of the conditional Gaussian distributions in the backward process. A typical DPM first obtains an estimate of the original data sample x by feeding the most recent state z_i and index i into the DNN model and then computes the mean vector of the conditional Gaussian distribution for z_ i-1 . We propose to calculate a more accurate estimate for x by performing extrapolation on the two estimates of x that are obtained by feeding (z_ i+1 , i+1) and (z_i, i) into the DNN model. The extrapolation can be easily integrated into the backward process of existing DPMs by introducing an additional connection over two consecutive timesteps, and fine-tuning is not required. Extensive experiments showed that plugging in the additional connection into DDPM, DDIM, DEIS, S-PNDM, and high-order DPM-Solvers leads to a significant performance gain in terms of Frechet inception distance (FID) score. Our implementation is available at https://github.com/guoqiangzhang-x/LA-DPM.
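The extrapolation step can be sketched as follows; `gamma` is a hypothetical fixed gain, whereas the paper's extrapolation coefficients depend on the timestep:

```python
import numpy as np

def lookahead_estimate(x0_hat_i, x0_hat_ip1, gamma=0.1):
    """Refine the estimate of the clean sample x0 at timestep i by
    extrapolating along the change from the previous estimate at i+1.
    x0_hat_i comes from feeding (z_i, i) into the DNN, x0_hat_ip1 from
    (z_{i+1}, i+1); the lookahead pushes the estimate in the direction
    the two successive predictions are already moving."""
    return x0_hat_i + gamma * (x0_hat_i - x0_hat_ip1)
```

Because the correction only combines two existing DNN outputs, it can be bolted onto a pretrained sampler without fine-tuning, as the abstract notes.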

PyramidFlow: High-Resolution Defect Contrastive Localization Using Pyramid Normalizing Flow
Lei, Jiarui and Hu, Xiaobo and Wang, Yue and Liu, Dong



Research question: During industrial production, unforeseen defects may arise in products due to uncontrollable factors. Although unsupervised methods have succeeded at defect localization, the usual use of pre-trained models yields low-resolution outputs, harming visual performance.
Motivation: To address this issue, we propose PyramidFlow, the first fully normalizing-flow method without pre-trained models that enables high-resolution defect localization.
Method: We propose a latent template-based defect contrastive localization paradigm to reduce intra-class variance, as pre-trained models do. In addition, PyramidFlow uses pyramid-like normalizing flows for multi-scale fusion and volume normalization to aid generalization.
Results: Our comprehensive studies on MVTecAD show that the proposed method outperforms comparable algorithms that do not use external priors, and even achieves state-of-the-art performance in the more challenging BTAD scenarios.

During industrial processing, unforeseen defects may arise in products due to uncontrollable factors. Although unsupervised methods have been successful in defect localization, the usual use of pre-trained models results in low-resolution outputs, which damages visual performance. To address this issue, we propose PyramidFlow, the first fully normalizing flow method without pre-trained models that enables high-resolution defect localization. Specifically, we propose a latent template-based defect contrastive localization paradigm to reduce intra-class variance, as the pre-trained models do. In addition, PyramidFlow utilizes pyramid-like normalizing flows for multi-scale fusing and volume normalization to help generalization. Our comprehensive studies on MVTecAD demonstrate the proposed method outperforms the comparable algorithms that do not use external priors, even achieving state-of-the-art performance in more challenging BTAD scenarios.

DF-Platter: Multi-Face Heterogeneous Deepfake Dataset
Narayan, Kartik and Agarwal, Harsh and Thakral, Kartik and Mittal, Surbhi and Vatsa, Mayank and Singh, Richa



Research question: Deepfake detection is gaining importance in the research community, with most work focused on high-quality images and videos. However, today's generation algorithms can produce low-resolution videos, occluded deepfakes, and multi-subject deepfakes.
Motivation: This work emulates the real-world scenario of deepfake generation and spreading, and proposes the DF-Platter dataset, which contains low- and high-resolution deepfakes produced with multiple generation techniques, as well as single- and multi-subject deepfakes, with face images of Indian ethnicity.
Method: The database was prepared over 116 days of continuous use of 32 GPUs, accounting for 1,800 GB of cumulative memory. At over 500 GB in size, the dataset contains 133,260 videos across three sets. We also provide benchmark results under multiple evaluation settings using popular and state-of-the-art deepfake detection models.
Results: Experiments show a substantial performance drop for existing techniques on low-resolution deepfakes and on multi-subject deepfakes. We assert that this database will advance the state of the art by extending deepfake detection algorithms to real-world scenarios. It is available at http://iab-rubric.org/df-platter-database.

Deepfake detection is gaining significant importance in the research community. While most of the research efforts are focused around high-quality images and videos, deepfake generation algorithms today have the capability to generate low-resolution videos, occluded deepfakes, and multiple-subject deepfakes. In this research, we emulate the real-world scenario of deepfake generation and spreading, and propose the DF-Platter dataset, which contains (i) both low-resolution and high-resolution deepfakes generated using multiple generation techniques and (ii) single-subject and multiple-subject deepfakes, with face images of Indian ethnicity. Faces in the dataset are annotated for various attributes such as gender, age, skin tone, and occlusion. The database was prepared over 116 days with continuous usage of 32 GPUs, accounting for 1,800 GB of cumulative memory. At over 500 GB in size, the dataset contains a total of 133,260 videos encompassing three sets. To the best of our knowledge, this is one of the largest datasets containing vast variability and multiple challenges. We also provide benchmark results under multiple evaluation settings using popular and state-of-the-art deepfake detection models. Further, benchmark results under c23 and c40 compression are provided. The results demonstrate a significant performance reduction in the deepfake detection task on low-resolution deepfakes and show that the existing techniques fail drastically on multiple-subject deepfakes. It is our assertion that this database will improve the state-of-the-art by extending the capabilities of deepfake detection algorithms to real-world scenarios. The database is available at: http://iab-rubric.org/df-platter-database.

Robust Unsupervised StyleGAN Image Restoration
Poirier-Ginter, Yohan and Lalonde, Jean-Fran\c{c}ois



Research question: This paper addresses the fact that existing unsupervised image restoration methods must be carefully tuned for each task and degradation level.
Motivation: To make image restoration robust, so that a single set of hyperparameters works across a wide range of degradation levels and handles combinations of several degradations without retuning.
Method: A StyleGAN-based restoration approach relying on a 3-phase progressive latent space extension and a conservative optimizer, which avoids the need for any additional regularization terms.
Results: Extensive experiments demonstrate strong robustness on inpainting, upsampling, denoising, and deartifacting across degradation levels, outperforming other StyleGAN-based inversion techniques. Compared with diffusion-based restoration, the method also yields much more realistic inversion results.

GAN-based image restoration inverts the generative process to repair images corrupted by known degradations. Existing unsupervised methods must be carefully tuned for each task and degradation level. In this work, we make StyleGAN image restoration robust: a single set of hyperparameters works across a wide range of degradation levels. This makes it possible to handle combinations of several degradations, without the need to retune. Our proposed approach relies on a 3-phase progressive latent space extension and a conservative optimizer, which avoids the need for any additional regularization terms. Extensive experiments demonstrate robustness on inpainting, upsampling, denoising, and deartifacting at varying degradation levels, outperforming other StyleGAN-based inversion techniques. Our approach also favorably compares to diffusion-based restoration by yielding much more realistic inversion results. Code will be released upon publication.

Blemish-Aware and Progressive Face Retouching With Limited Paired Data
Xie, Lianxin and Xue, Wen and Xu, Zhen and Wu, Si and Yu, Zhiwen and Wong, Hau-San



Research question: This paper addresses the main challenge in face retouching: removing facial blemishes while preserving the texture details of the input image.
Motivation: Existing face retouching methods require expensive paired training data, which is labor- and time-intensive to produce. This paper therefore proposes a novel method that can distinguish blemishes from facial characteristics such as moles.
Method: The proposed model performs progressive blemish removal in two stages. First, an encoder-decoder module learns to coarsely remove blemishes; the resulting intermediate features are then injected into a generator to enrich local detail. An attention module that explicitly suppresses blemishes further improves performance.
Results: Experiments show significant performance gains on a wide range of face retouching tasks. Moreover, imposing effective regularization on unpaired samples reduces the dependence on paired training data.

Face retouching aims to remove facial blemishes, while at the same time maintaining the textural details of a given input image. The main challenge lies in distinguishing blemishes from the facial characteristics, such as moles. Training an image-to-image translation network with pixel-wise supervision suffers from the problem of expensive paired training data, since professional retouching needs specialized experience and is time-consuming. In this paper, we propose a Blemish-aware and Progressive Face Retouching model, which is referred to as BPFRe. Our framework can be partitioned into two manageable stages to perform progressive blemish removal. Specifically, an encoder-decoder-based module learns to coarsely remove the blemishes at the first stage, and the resulting intermediate features are injected into a generator to enrich local detail at the second stage. We find that explicitly suppressing the blemishes can contribute to an effective collaboration among the components. Toward this end, we incorporate an attention module, which learns to infer a blemish-aware map and further determine the corresponding weights, which are then used to refine the intermediate features transferred from the encoder to the decoder, and from the decoder to the generator. Therefore, BPFRe is able to deliver significant performance gains on a wide range of face retouching tasks. It is worth noting that we reduce the dependence of BPFRe on paired training samples by imposing effective regularization on unpaired ones.
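The blemish-aware refinement of transferred features can be pictured as a soft erasing of the regions the attention module flags as blemishes. The sketch below is illustrative only; the map and the simple multiplicative weighting are assumptions, not BPFRe's exact attention design:

```python
import numpy as np

def blemish_aware_refine(skip_feat: np.ndarray, blemish_map: np.ndarray) -> np.ndarray:
    """Downweight features at locations predicted to be blemishes.

    skip_feat:   features transferred across a skip connection
                 (encoder -> decoder, or decoder -> generator), shape (H, W)
    blemish_map: inferred blemish-aware map in [0, 1], where 1 means
                 "certainly a blemish" (hypothetical convention)
    """
    # Suppress blemish regions so they do not leak into later stages.
    return skip_feat * (1.0 - blemish_map)
```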

Synthesizing Photorealistic Virtual Humans Through Cross-Modal Disentanglement
Ravichandran, Siddarth and Texler, Ond\v{r}ej



Research question: How to synthesize high-quality virtual human faces with accurate lip motion that run in real time.
Motivation: With the growth of virtual domains such as digital assistants and the metaverse, photorealistic depictions of humans are needed, but current deepfake and talking-head generation methods fall short in quality, lip synchronization, and resolution, and cannot run in real time.
Method: An end-to-end framework that uses visemes as an intermediate audio representation and a hierarchical image synthesis strategy for data augmentation, disentangling the modalities that control global head motion.
Results: The method runs in real time and outperforms the current state of the art in quality and lip motion.

Over the last few decades, many aspects of human life have been enhanced with virtual domains, from the advent of digital assistants such as Amazon's Alexa and Apple's Siri to the latest metaverse efforts of the rebranded Meta. These trends underscore the importance of generating photorealistic visual depictions of humans. This has led to the rapid growth of so-called deepfake and talking-head generation methods in recent years. Despite their impressive results and popularity, they usually lack certain qualitative aspects such as texture quality, lips synchronization, or resolution, and practical aspects such as the ability to run in real-time. To allow for virtual human avatars to be used in practical scenarios, we propose an end-to-end framework for synthesizing high-quality virtual human faces capable of speaking with accurate lip motion with a special emphasis on performance. We introduce a novel network utilizing visemes as an intermediate audio representation and a novel data augmentation strategy employing a hierarchical image synthesis approach that allows disentanglement of the different modalities used to control the global head motion. Our method runs in real-time, and is able to deliver superior results compared to the current state-of-the-art.

LipFormer: High-Fidelity and Generalizable Talking Face Generation With a Pre-Learned Facial Codebook
Wang, Jiayu and Zhao, Kang and Zhang, Shiwei and Zhang, Yingya and Shen, Yujun and Zhao, Deli and Zhou, Jingren



Research question: How to generate a talking face video from an input audio sequence, a practical yet challenging task.
Motivation: Most existing methods either fail to capture fine facial details or must train a specific model for each identity. We argue that a codebook pre-learned on high-quality face images can serve as a useful prior for high-fidelity and generalizable talking-head synthesis.
Method: We propose LipFormer, a transformer-based framework that models audio-visual coherence and predicts a lip-code sequence from the input audio features. We also introduce an adaptive face warping module that warps the reference face to the target pose in feature space, easing lip-code prediction under different poses.
Results: Experiments show that LipFormer produces more realistic talking face videos than previous methods and generalizes faithfully to unseen identities.

Generating a talking face video from the input audio sequence is a practical yet challenging task. Most existing methods either fail to capture fine facial details or need to train a specific model for each identity. We argue that a codebook pre-learned on high-quality face images can serve as a useful prior that facilitates high-fidelity and generalizable talking head synthesis. Thanks to the strong capability of the codebook in representing face textures, we simplify the talking face generation task as finding proper lip-codes to characterize the variation of lips during a portrait talking. To this end, we propose LipFormer, a transformer-based framework, to model the audio-visual coherence and predict the lip-codes sequence based on the input audio features. We further introduce an adaptive face warping module, which helps warp the reference face to the target pose in the feature space, to alleviate the difficulty of lip-code prediction under different poses. By this means, LipFormer can make better use of the pre-learned priors in images and is robust to posture change. Extensive experiments show that LipFormer can produce more realistic talking face videos compared to previous methods and faithfully generalize to unseen identities.

UTM: A Unified Multiple Object Tracking Model With Identity-Aware Feature Enhancement
You, Sisi and Yao, Hantao and Bao, Bing-Kun and Xu, Changsheng



Research question: In existing multiple object tracking methods, which consist of object detection, feature embedding, and identity association, the identity association step is independent, so identity-aware knowledge is not used to boost the detection and embedding modules.
Motivation: To overcome this limitation, a novel Unified Tracking Model (UTM) is proposed that builds a positive feedback loop in which the three components benefit each other.
Method: The key idea of UTM is Identity-Aware Feature Enhancement (IAFE), which bridges and strengthens the three components by using identity-aware knowledge to boost detection and embedding. Concretely, IAFE comprises Identity-Aware Boosting Attention (IABA) and Identity-Aware Erasing Attention (IAEA): IABA enhances the regions of the current frame feature that are consistent with identity-aware knowledge, while IAEA suppresses distracting regions in the current frame feature.
Results: Extensive experiments on three benchmarks demonstrate the robustness of UTM.

Recently, Multiple Object Tracking, which consists of object detection, feature embedding, and identity association, has achieved great success. Existing methods apply the three-step or two-step paradigm to generate robust trajectories, where identity association is independent of other components. However, the independent identity association results in the identity-aware knowledge contained in the tracklet not being used to boost the detection and embedding modules. To overcome the limitations of existing methods, we introduce a novel Unified Tracking Model (UTM) to bridge those three components for generating a positive feedback loop with mutual benefits. The key insight of UTM is the Identity-Aware Feature Enhancement (IAFE), which is applied to bridge and benefit these three components by utilizing the identity-aware knowledge to boost detection and embedding. Formally, IAFE contains the Identity-Aware Boosting Attention (IABA) and the Identity-Aware Erasing Attention (IAEA), where IABA enhances the consistent regions between the current frame feature and identity-aware knowledge, and IAEA suppresses the distracted regions in the current frame feature. With better detections and embeddings, higher-quality tracklets can also be generated. Extensive experiments with public and private detections on three benchmarks demonstrate the robustness of UTM.

High-Resolution Image Reconstruction With Latent Diffusion Models From Human Brain Activity
Takagi, Yu and Nishimoto, Shinji



Research question: How to reconstruct visual experiences from human brain activity, and how to understand the connection between computer vision models and the visual system.
Motivation: Although deep generative models have been applied to this task, reconstructing realistic images with high semantic fidelity remains challenging.
Method: A new method based on a diffusion model (DM) that reconstructs images from human brain activity obtained via functional magnetic resonance imaging (fMRI); specifically, it relies on a latent diffusion model (LDM) called Stable Diffusion.
Results: The method reconstructs high-resolution images with high fidelity in a straightforward fashion, without any additional training or fine-tuning of complex deep learning models. The study also quantitatively interprets different LDM components from a neuroscientific perspective. Overall, it offers a promising approach for reconstructing images from human brain activity and a new framework for understanding DMs.

Reconstructing visual experiences from human brain activity offers a unique way to understand how the brain represents the world, and to interpret the connection between computer vision models and our visual system. While deep generative models have recently been employed for this task, reconstructing realistic images with high semantic fidelity is still a challenging problem. Here, we propose a new method based on a diffusion model (DM) to reconstruct images from human brain activity obtained via functional magnetic resonance imaging (fMRI). More specifically, we rely on a latent diffusion model (LDM) termed Stable Diffusion. This model reduces the computational cost of DMs, while preserving their high generative performance. We also characterize the inner mechanisms of the LDM by studying how its different components (such as the latent vector Z, conditioning inputs C, and different elements of the denoising U-Net) relate to distinct brain functions. We show that our proposed method can reconstruct high-resolution images with high fidelity in straightforward fashion, without the need for any additional training and fine-tuning of complex deep-learning models. We also provide a quantitative interpretation of different LDM components from a neuroscientific perspective. Overall, our study proposes a promising method for reconstructing images from human brain activity, and provides a new framework for understanding DMs. Please check out our webpage at https://sites.google.com/view/stablediffusion-withbrain/.

Cross-GAN Auditing: Unsupervised Identification of Attribute Level Similarities and Differences Between Pretrained Generative Models
Olson, Matthew L. and Liu, Shusen and Anirudh, Rushil and Thiagarajan, Jayaraman J. and Bremer, Peer-Timo and Wong, Weng-Keen



Research question: Training generative adversarial networks (GANs) is challenging, especially for complex distributions and with limited data.
Motivation: Auditing trained networks, for example to identify biases or ensure fairness, requires interpretable tools.
Method: A new GAN-comparison approach, Cross-GAN Auditing (xGA), which compares a newly developed GAN against an established baseline GAN and jointly identifies semantic attributes that are common to both, novel to the client GAN, or missing from the client GAN.
Results: Quantitative analyses show that xGA outperforms baseline approaches, and qualitative analyses of GANs trained on a variety of image datasets illustrate the common, novel, and missing attributes that xGA identifies.

Generative Adversarial Networks (GANs) are notoriously difficult to train especially for complex distributions and with limited data. This has driven the need for interpretable tools to audit trained networks, for example, to identify biases or ensure fairness. Existing GAN audit tools are restricted to coarse-grained, model-data comparisons based on summary statistics such as FID or recall. In this paper, we propose an alternative approach that compares a newly developed GAN against a prior baseline. To this end, we introduce Cross-GAN Auditing (xGA) that, given an established "reference" GAN and a newly proposed "client" GAN, jointly identifies semantic attributes that are either common across both GANs, novel to the client GAN, or missing from the client GAN. This provides both users and model developers an intuitive assessment of similarity and differences between GANs. We introduce novel metrics to evaluate attribute-based GAN auditing approaches and use these metrics to demonstrate quantitatively that xGA outperforms baseline approaches. We also include qualitative results that illustrate the common, novel and missing attributes identified by xGA from GANs trained on a variety of image datasets.

SINE: SINgle Image Editing With Text-to-Image Diffusion Models
Zhang, Zhixing and Han, Ligong and Ghosh, Arnab and Metaxas, Dimitris N. and Ren, Jian



Research question: How to use pre-trained diffusion models for single-image editing.
Motivation: When existing diffusion-based image generation methods are applied to single-image editing, information leakage prevents them from preserving the content of the given image while creating the new features described by the language guidance.
Method: A novel model-based guidance built on classifier-free guidance, which distills the knowledge of a model trained on a single image into a pre-trained diffusion model, enabling content creation even with only one given image. A patch-based fine-tuning method is also proposed that effectively helps the model generate images of arbitrary resolution.
Results: Extensive experiments validate the design choices and demonstrate strong editing capabilities, including style change, content addition, and object manipulation.

Recent works on diffusion models have demonstrated a strong capability for conditioning image generation, e.g., text-guided image synthesis. Such success inspires many efforts trying to use large-scale pre-trained diffusion models for tackling a challenging problem--real image editing. Works conducted in this area learn a unique textual token corresponding to several images containing the same object. However, under many circumstances, only one image is available, such as the painting of the Girl with a Pearl Earring. Using existing works on fine-tuning the pre-trained diffusion models with a single image causes severe overfitting issues. The information leakage from the pre-trained diffusion models means the edited result cannot keep the same content as the given image while creating new features depicted by the language guidance. This work aims to address the problem of single-image editing. We propose a novel model-based guidance built upon the classifier-free guidance so that the knowledge from the model trained on a single image can be distilled into the pre-trained diffusion model, enabling content creation even with one given image. Additionally, we propose a patch-based fine-tuning that can effectively help the model generate images of arbitrary resolution. We provide extensive experiments to validate the design choices of our approach and show promising editing capabilities, including changing style, content addition, and object manipulation.

Diffusion-Based Signed Distance Fields for 3D Shape Generation
Shim, Jaehyeok and Kang, Changwoo and Joo, Kyungdon



Research question: This paper proposes a 3D shape generation framework that uses denoising diffusion models with a continuous 3D representation via signed distance fields (SDF).
Motivation: Most existing methods rely on discontinuous forms such as point clouds, whereas our SDF-Diffusion framework generates high-resolution 3D shapes while alleviating memory issues by splitting the generative process into two stages.
Method: A diffusion-based generative model first produces a low-resolution SDF of the 3D shape; conditioned on this estimated low-resolution SDF, a second-stage diffusion model performs super-resolution to produce a high-resolution SDF.
Results: On the ShapeNet dataset, our model shows performance competitive with state-of-the-art methods and applies to the shape completion task without modification.

We propose a 3D shape generation framework (SDF-Diffusion in short) that uses denoising diffusion models with continuous 3D representation via signed distance fields (SDF). Unlike most existing methods that depend on discontinuous forms, such as point clouds, SDF-Diffusion generates high-resolution 3D shapes while alleviating memory issues by separating the generative process into two stages: generation and super-resolution. In the first stage, a diffusion-based generative model generates a low-resolution SDF of 3D shapes. Using the estimated low-resolution SDF as a condition, the second-stage diffusion model performs super-resolution to generate a high-resolution SDF. Our framework can generate a high-fidelity 3D shape despite the extreme spatial complexity. On the ShapeNet dataset, our model shows competitive performance to the state-of-the-art methods and shows applicability on the shape completion task without modification.

CAP-VSTNet: Content Affinity Preserved Versatile Style Transfer
Wen, Linfeng and Gao, Chengying and Zou, Changqing



Research question: This paper addresses content affinity loss (including feature and pixel affinity), the main cause of artifacts in photorealistic and video style transfer.
Motivation: Traditional reversible networks introduce redundant information while preserving content affinity; the proposed CAP-VSTNet framework avoids this, enabling better stylization.
Method: A new framework, named CAP-VSTNet, consisting of a new reversible residual network and an unbiased linear transform module, for versatile style transfer.
Results: Experiments show that CAP-VSTNet produces better qualitative and quantitative results on versatile style transfer than state-of-the-art methods.

Content affinity loss, including feature and pixel affinity, is a main cause of artifacts in photorealistic and video style transfer. This paper proposes a new framework named CAP-VSTNet, which consists of a new reversible residual network and an unbiased linear transform module, for versatile style transfer. This reversible residual network can preserve content affinity while avoiding the redundant information introduced by traditional reversible networks, and hence facilitates better stylization. Empowered by a Matting Laplacian training loss that addresses the pixel affinity loss problem caused by the linear transform, the proposed framework is applicable and effective for versatile style transfer. Extensive experiments show that CAP-VSTNet can produce better qualitative and quantitative results in comparison with the state-of-the-art methods.

FIANCEE: Faster Inference of Adversarial Networks via Conditional Early Exits
Karpikova, Polina and Radionova, Ekaterina and Yaschenko, Anastasia and Spiridonov, Andrei and Kostyushko, Leonid and Fabbricatore, Riccardo and Ivakhnenko, Aleksei



Research question: Generative DNNs are a powerful tool for image synthesis, but they carry a heavy computational load, and output quality is unevenly distributed across images with different characteristics.
Motivation: To address this, we reduce computation by adding early-exit branches and dynamically switching the computational path depending on how difficult the output will be to render.
Method: We apply the approach to two different SOTA models, one generating from semantic maps and one performing cross-reenactment of facial expressions, showing that it can output images under custom lower-quality thresholds.
Results: For a threshold of LPIPS <= 0.1, we cut computation by up to half, which is especially relevant for real-time applications such as face synthesis, where quality loss must be contained while most inputs need fewer computations than the complex instances.

Generative DNNs are a powerful tool for image synthesis, but they are limited by their computational load. On the other hand, given a trained model and a task, e.g. face generation within a range of characteristics, the output image quality will be unevenly distributed among images with different characteristics. It follows that we might restrain the model's complexity on some instances while maintaining a high quality. We propose a method for diminishing computations by adding so-called early exit branches to the original architecture, and dynamically switching the computational path depending on how difficult it will be to render the output. We apply our method on two different SOTA models performing generative tasks: generation from a semantic map, and cross reenactment of face expressions; showing it is able to output images with custom lower quality thresholds. For a threshold of LPIPS <= 0.1, we diminish their computations by up to a half. This is especially relevant for real-time applications such as synthesis of faces, when quality loss needs to be contained, but most of the inputs need fewer computations than the complex instances.
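The early-exit routing can be sketched as follows. All names and the difficulty-score gating criterion are illustrative assumptions; the paper's branches and gating may differ:

```python
def run_with_early_exit(blocks, exit_heads, difficulty_scores, x, threshold):
    """Run backbone blocks in order; after each block, emit the output from
    the attached exit head as soon as the predicted rendering difficulty
    falls at or below `threshold`. Easy inputs thus leave the network early
    and skip the remaining computation.

    blocks:            list of callables, the backbone stages
    exit_heads:        list of callables, one lightweight output head per stage
    difficulty_scores: list of callables estimating difficulty from features
    """
    for block, head, score in zip(blocks, exit_heads, difficulty_scores):
        x = block(x)
        if score(x) <= threshold:
            return head(x)          # easy instance: exit early
    return exit_heads[-1](x)        # hard instance: full-depth output
```

Raising `threshold` trades output quality for computation, which matches the abstract's LPIPS-thresholded quality/compute trade-off.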

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
Ruiz, Nataniel and Li, Yuanzhen and Jampani, Varun and Pritch, Yael and Rubinstein, Michael and Aberman, Kfir



Research question: Current large-scale text-to-image models lack the ability to mimic the appearance of subjects in a given reference set and synthesize novel renditions of them in different contexts.
Motivation: To solve this, we present a new approach for personalizing text-to-image diffusion models.
Method: Our method fine-tunes a pre-trained text-to-image model with a few images of a subject so that it learns to bind a unique identifier to that specific subject. Once the subject is embedded in the model's output domain, the unique identifier can be used to synthesize novel photorealistic images of the subject in different scenes, poses, viewpoints, and lighting conditions.
Results: By leveraging the semantic prior embedded in the model together with a new autogenous class-specific prior-preservation loss, our technique synthesizes the subject under scenes, poses, viewpoints, and lighting conditions that do not appear in the reference images, while preserving the subject's key features. We apply our method to several previously unattainable tasks, including subject recontextualization, text-guided view synthesis, and artistic rendering, and we provide a new dataset and evaluation protocol for this new task of subject-driven generation.

Large text-to-image models achieved a remarkable leap in the evolution of AI, enabling high-quality and diverse synthesis of images from a given text prompt. However, these models lack the ability to mimic the appearance of subjects in a given reference set and synthesize novel renditions of them in different contexts. In this work, we present a new approach for "personalization" of text-to-image diffusion models. Given as input just a few images of a subject, we fine-tune a pretrained text-to-image model such that it learns to bind a unique identifier with that specific subject. Once the subject is embedded in the output domain of the model, the unique identifier can be used to synthesize novel photorealistic images of the subject contextualized in different scenes. By leveraging the semantic prior embedded in the model with a new autogenous class-specific prior preservation loss, our technique enables synthesizing the subject in diverse scenes, poses, views and lighting conditions that do not appear in the reference images. We apply our technique to several previously-unassailable tasks, including subject recontextualization, text-guided view synthesis, and artistic rendering, all while preserving the subject's key features. We also provide a new dataset and evaluation protocol for this new task of subject-driven generation. Project page: https://dreambooth.github.io/
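The prior-preservation idea combines a reconstruction term on the subject images with a second term on generic images of the subject's class, so that binding the identifier does not overwrite the class prior. A minimal sketch, where the squared-error form and the weight `lam` are illustrative assumptions rather than DreamBooth's exact training objective:

```python
import numpy as np

def dreambooth_objective(pred_subj, target_subj, pred_prior, target_prior, lam=1.0):
    """Subject reconstruction loss plus a class-specific prior-preservation term.

    pred_subj / target_subj:   model output and target for the subject images
    pred_prior / target_prior: model output and target for generic class images,
                               regularizing the model against forgetting the class
    lam:                       illustrative weight on the prior term
    """
    recon = np.mean((pred_subj - target_subj) ** 2)
    prior = np.mean((pred_prior - target_prior) ** 2)
    return recon + lam * prior
```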

Inferring and Leveraging Parts From Object Shape for Improving Semantic Image Synthesis
Wei, Yuxiang and Ji, Zhilong and Wu, Xiaohe and Bai, Jinfeng and Zhang, Lei and Zuo, Wangmeng



Research question: How to generate photo-realistic parts from an input semantic map.
Motivation: Despite progress in semantic image synthesis, generating photo-realistic parts from an input semantic map remains challenging.
Method: This paper proposes inferring Parts from Object ShapE (iPOSE) and leveraging them to improve semantic image synthesis. Guided by pre-defined support part maps, a PartNet is learned to predict object part maps.
Results: Experiments show that iPOSE not only generates objects with rich part details but also enables flexible control of image synthesis. Moreover, iPOSE outperforms state-of-the-art methods in both quantitative and qualitative evaluation.

Despite the progress in semantic image synthesis, it remains a challenging problem to generate photo-realistic parts from an input semantic map. Integrating a part segmentation map can undoubtedly benefit image synthesis, but it is bothersome and inconvenient for users to provide one. To improve part synthesis, this paper proposes to infer Parts from Object ShapE (iPOSE) and leverage them for improving semantic image synthesis. However, albeit several part segmentation datasets are available, part annotations are still not provided for many object categories in semantic image synthesis. To circumvent it, we resort to a few-shot regime to learn a PartNet for predicting the object part map with the guidance of pre-defined support part maps. PartNet can be readily generalized to handle a new object category when a small number (e.g., 3) of support part maps for this category are provided. Furthermore, part semantic modulation is presented to incorporate both the inferred part map and the semantic map for image synthesis. Experiments show that our iPOSE not only generates objects with rich part details, but also enables flexible control of the image synthesis. Moreover, our iPOSE performs favorably against the state-of-the-art methods in terms of quantitative and qualitative evaluation. Our code will be publicly available at https://github.com/csyxwei/iPOSE.

Pose-Disentangled Contrastive Learning for Self-Supervised Facial Representation
Liu, Yuanyuan and Wang, Wenbin and Zhan, Yibing and Feng, Shaoze and Liu, Kejun and Chen, Zhe



Research question: Existing contrastive learning for facial representation learning cannot depict the pose details of faces.
Motivation: To overcome this limitation of contrastive learning and improve face understanding performance.
Method: A novel Pose-disentangled Contrastive Learning (PCL) method that learns facial representations through a purpose-built pose-disentangled decoder and a pose-related contrastive learning scheme.
Results: Experiments show that PCL significantly outperforms advanced self-supervised methods and performs strongly on facial expression recognition, face recognition, AU detection, and head pose estimation.

Self-supervised facial representation has recently attracted increasing attention due to its ability to perform face understanding without relying heavily on large-scale annotated datasets. However, analytically, current contrastive-based self-supervised learning (SSL) still performs unsatisfactorily for learning facial representation. More specifically, existing contrastive learning (CL) tends to learn pose-invariant features that cannot depict the pose details of faces, compromising the learning performance. To conquer the above limitation of CL, we propose a novel Pose-disentangled Contrastive Learning (PCL) method for general self-supervised facial representation. Our PCL first devises a pose-disentangled decoder (PDD) with a delicately designed orthogonalizing regulation, which disentangles the pose-related features from the face-aware features; therefore, pose-related and other pose-unrelated facial information could be performed in individual subnetworks and do not affect each other's training. Furthermore, we introduce a pose-related contrastive learning scheme that learns pose-related information based on data augmentation of the same image, which would deliver more effective face-aware representation for various downstream tasks. We conducted linear evaluation on four challenging downstream facial understanding tasks, i.e., facial expression recognition, face recognition, AU detection and head pose estimation. Experimental results demonstrate that PCL significantly outperforms cutting-edge SSL methods. Our code is available at https://github.com/DreamMr/PCL.

Attribute-Preserving Face Dataset Anonymization via Latent Code Optimization
Barattin, Simone and Tzelepis, Christos and Patras, Ioannis and Sebe, Nicu



Research question: This paper addresses how to anonymize an image dataset so that the privacy of the depicted faces is protected while the dataset remains useful for downstream tasks such as training machine learning models.
Motivation: Existing state-of-the-art methods have two main drawbacks: they require the costly training of additional, purpose-trained neural networks, and they fail to preserve in the anonymized images the facial attributes of the originals, whose preservation is crucial for downstream use.
Method: We propose a task-agnostic anonymization procedure that directly optimizes the images' latent representations in the latent space of a pre-trained generative adversarial network (GAN). By optimizing the latent codes directly, we ensure the identity is a desired distance from the original (with an identity obfuscation loss) while preserving facial attributes (using a novel feature-matching loss in FaRL's deep feature space).
Results: A series of qualitative and quantitative experiments shows the method effectively anonymizes image identities while better preserving facial attributes. Code and pre-trained models are publicly available at https://github.com/chi0tzp/FALCO.

This work addresses the problem of anonymizing the identity of faces in a dataset of images, such that the privacy of those depicted is not violated, while at the same time the dataset is useful for downstream tasks such as training machine learning models. To the best of our knowledge, we are the first to explicitly address this issue and deal with two major drawbacks of the existing state-of-the-art approaches, namely that they (i) require the costly training of additional, purpose-trained neural networks, and/or (ii) fail to retain the facial attributes of the original images in the anonymized counterparts, the preservation of which is of paramount importance for their use in downstream tasks. We accordingly present a task-agnostic anonymization procedure that directly optimises the images' latent representation in the latent space of a pre-trained GAN. By optimizing the latent codes directly, we ensure that the identity is a desired distance away from the original (with an identity obfuscation loss), whilst preserving the facial attributes (using a novel feature-matching loss in FaRL's deep feature space). We demonstrate through a series of both qualitative and quantitative experiments that our method is capable of anonymizing the identity of the images whilst--crucially--better-preserving the facial attributes. We make the code and the pre-trained models publicly available at: https://github.com/chi0tzp/FALCO.

Delving Into Discrete Normalizing Flows on SO(3) Manifold for Probabilistic Rotation Modeling
Liu, Yulin and Liu, Haoran and Yin, Yingda and Wang, Yang and Chen, Baoquan and Wang, He



Research question: How to build effective probabilistic models on the 3D rotation manifold SO(3).
Motivation: Due to occlusion and symmetry, rotations in computer vision, graphics, and robotics are often ambiguous, which calls for probabilistic models.
Method: A novel normalizing flow on SO(3) that combines a Mobius-transformation-based coupling layer with a quaternion affine transformation.
Results: Experiments show that these rotation normalizing flows can effectively express arbitrary distributions on SO(3), conditionally build target distributions from input observations, and significantly outperform baselines on both unconditional and conditional tasks.

Normalizing flows (NFs) provide a powerful tool to construct an expressive distribution by a sequence of tractable transformations of a base distribution and form a probabilistic model of underlying data. Rotation, as an important quantity in computer vision, graphics, and robotics, can exhibit many ambiguities when occlusion and symmetry occur and thus demands such probabilistic models. Though much progress has been made for NFs in Euclidean space, there are no effective normalizing flows without discontinuity or many-to-one mapping tailored for the SO(3) manifold. Given the unique non-Euclidean properties of the rotation manifold, adapting the existing NFs to the SO(3) manifold is non-trivial. In this paper, we propose a novel normalizing flow on SO(3) by combining a Mobius transformation-based coupling layer and a quaternion affine transformation. With our proposed rotation normalizing flows, one can not only effectively express arbitrary distributions on SO(3), but also conditionally build the target distribution given input observations. Extensive experiments show that our rotation normalizing flows significantly outperform the baselines on both unconditional and conditional tasks.
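The quaternion affine ingredient can be pictured as an affine map on unit quaternions followed by projection back to the unit 3-sphere, so the output remains a valid rotation representation. This is an illustrative sketch only; the paper's transform is additionally constructed to be invertible with a tractable change-of-variables term:

```python
import numpy as np

def quaternion_affine(q: np.ndarray, A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Apply an affine map to a unit quaternion and renormalize.

    q: unit quaternion, shape (4,), representing a rotation
    A: 4x4 mixing matrix (illustrative)
    b: length-4 offset (illustrative)

    The renormalization keeps the result on the unit 3-sphere, i.e. a
    valid (up to sign) rotation representation.
    """
    v = A @ q + b
    return v / np.linalg.norm(v)
```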

DeepVecFont-v2: Exploiting Transformers To Synthesize Vector Fonts With Higher Quality
Wang, Yuqing and Wang, Yizhi and Yu, Longhui and Zhu, Yuesheng and Lian, Zhouhui



Research question: This paper addresses vector font synthesis in computer vision and computer graphics, in particular the handling of long sequence data and the reliance on image-guided outline refinement post-processing.
Motivation: Although the recently proposed DeepVecFont achieved state-of-the-art performance by exploiting both the image and sequence modalities of vector fonts, its capability for handling long sequence data is limited, and it relies heavily on image-guided outline refinement post-processing. As a result, the vector glyphs it synthesizes still often contain distortions and artifacts and cannot rival human-designed results.
Method: This paper proposes an enhanced version of DeepVecFont with three novel technical contributions. First, Transformers replace RNNs for processing sequential data, together with a relaxation representation for vector outlines, markedly improving the model's capability and stability in synthesizing long, complex outlines. Second, auxiliary points are sampled in addition to control points to precisely align the generated and target Bezier curves or lines. Finally, to mitigate error accumulation in the sequential generation process, a context-based self-refinement module built on another Transformer-based decoder removes artifacts from the initially synthesized glyphs.
Results: Qualitative and quantitative results show that the proposed method effectively resolves the intrinsic problems of the original DeepVecFont and outperforms existing approaches in generating English and Chinese vector fonts with complicated structures and diverse styles.

Vector font synthesis is a challenging and ongoing problem in the fields of Computer Vision and Computer Graphics. The recently-proposed DeepVecFont achieved state-of-the-art performance by exploiting information of both the image and sequence modalities of vector fonts. However, it has limited capability for handling long sequence data and heavily relies on an image-guided outline refinement post-processing. Thus, vector glyphs synthesized by DeepVecFont still often contain some distortions and artifacts and cannot rival human-designed results. To address the above problems, this paper proposes an enhanced version of DeepVecFont mainly by making the following three novel technical contributions. First, we adopt Transformers instead of RNNs to process sequential data and design a relaxation representation for vector outlines, markedly improving the model's capability and stability of synthesizing long and complex outlines. Second, we propose to sample auxiliary points in addition to control points to precisely align the generated and target Bezier curves or lines. Finally, to alleviate error accumulation in the sequential generation process, we develop a context-based self-refinement module based on another Transformer-based decoder to remove artifacts in the initially synthesized glyphs. Both qualitative and quantitative results demonstrate that the proposed method effectively resolves those intrinsic problems of the original DeepVecFont and outperforms existing approaches in generating English and Chinese vector fonts with complicated structures and diverse styles.

Continuous Landmark Detection With 3D Queries
Chandran, Prashanth and Zoss, Gaspard and Gotardo, Paulo and Bradley, Derek



Research question: Existing facial landmark detection networks are limited to a fixed set of landmarks in a dedicated layout and require hand-annotated data with the corresponding landmark configuration for training.
Motivation: Propose the first facial landmark detection network that can predict continuous, unlimited landmarks, allowing the number and location of the desired landmarks to be specified at inference time.
Method: Combine a simple image feature extractor with a queried landmark predictor; the user can specify any continuous query points relative to a 3D template face mesh as input.
Results: Since it is not tied to a fixed set of landmarks, the method can leverage all pre-existing 2D landmark datasets for training, even those with inconsistent landmark configurations. The result is a very powerful facial landmark detector that can be trained once and readily used for numerous applications such as 3D face reconstruction and arbitrary face segmentation; it is even compatible with helmet-mounted cameras, which could vastly simplify face tracking workflows for media and entertainment applications.

Neural networks for facial landmark detection are notoriously limited to a fixed set of landmarks in a dedicated layout, which must be specified at training time. Dedicated datasets must also be hand-annotated with the corresponding landmark configuration for training. We propose the first facial landmark detection network that can predict continuous, unlimited landmarks, allowing the number and location of the desired landmarks to be specified at inference time. Our method combines a simple image feature extractor with a queried landmark predictor, and the user can specify any continuous query points relative to a 3D template face mesh as input. As it is not tied to a fixed set of landmarks, our method is able to leverage all pre-existing 2D landmark datasets for training, even if they have inconsistent landmark configurations. As a result, we present a very powerful facial landmark detector that can be trained once, can be used readily for numerous applications like 3D face reconstruction and arbitrary face segmentation, and is even compatible with helmet-mounted cameras; it could therefore vastly simplify face tracking workflows for media and entertainment applications.

PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360°
An, Sizhe and Xu, Hongyi and Shi, Yichun and Song, Guoxian and Ogras, Umit Y. and Luo, Linjie



Research question: How to train a 3D head generative model on large-scale unstructured images to achieve high-quality, full-360° head image synthesis with diverse appearance and detailed geometry.
Motivation: Existing state-of-the-art 3D GANs for 3D head synthesis are either limited to near-frontal views or struggle to preserve 3D consistency at large view angles.
Method: Propose PanoHead, the first 3D-aware generative model trained only on in-the-wild unstructured images that enables full-360° head image synthesis with diverse appearance and detailed geometry. At its core, the method lifts the representation power of recent 3D GANs and bridges the data alignment gap when training on in-the-wild images with widely distributed views.
Results: The model significantly outperforms previous 3D GANs, generating high-quality 3D heads with accurate geometry and diverse appearances, even handling long wavy and afro hairstyles. In addition, the system can reconstruct a full 3D head from a single input image, enabling personalized realistic 3D avatars.

Synthesis and reconstruction of the 3D human head have gained increasing interest in computer vision and computer graphics recently. Existing state-of-the-art 3D generative adversarial networks (GANs) for 3D human head synthesis are either limited to near-frontal views or hard to preserve 3D consistency in large view angles. We propose PanoHead, the first 3D-aware generative model that enables high-quality view-consistent image synthesis of full heads in 360° with diverse appearance and detailed geometry using only in-the-wild unstructured images for training. At its core, we lift up the representation power of recent 3D GANs and bridge the data alignment gap when training from in-the-wild images with widely distributed views. Specifically, we propose a novel two-stage self-adaptive image alignment for robust 3D GAN training. We further introduce a tri-grid neural volume representation that effectively addresses front-face and back-head feature entanglement rooted in the widely-adopted tri-plane formulation. Our method instills prior knowledge of 2D image segmentation in adversarial learning of 3D neural scene structures, enabling compositable head synthesis in diverse backgrounds. Benefiting from these designs, our method significantly outperforms previous 3D GANs, generating high-quality 3D heads with accurate geometry and diverse appearances, even with long wavy and afro hairstyles, renderable from arbitrary poses. Furthermore, we show that our system can reconstruct full 3D heads from single input images for personalized realistic 3D avatars.

Text2Scene: Text-Driven Indoor Scene Stylization With Part-Aware Details
Hwang, Inwoo and Kim, Hyeonwoo and Kim, YoungMin



Research question: How to automatically create realistic textures for virtual scenes composed of multiple objects?
Motivation: When texturing virtual scenes, existing methods typically apply a single stylization to the entire scene and fail to preserve structural context.
Method: Propose Text2Scene, which, guided by a reference image and text descriptions, adds detailed texture to 3D geometries in a room so that the generated colors respect the hierarchical structure or semantic parts that are often composed of similar materials. Weak semantic cues are first obtained from geometric segmentation; texture details are then added to individual objects so that their projections in image space exhibit feature embeddings aligned with the input embedding.
Results: The method creates detailed and realistic textures for scenes with multiple objects while maintaining structural context. It is the first practical and scalable approach that does not require dedicated datasets of high-quality textures designed by skillful artists.

We propose Text2Scene, a method to automatically create realistic textures for virtual scenes composed of multiple objects. Guided by a reference image and text descriptions, our pipeline adds detailed texture on labeled 3D geometries in the room such that the generated colors respect the hierarchical structure or semantic parts that are often composed of similar materials. Instead of applying flat stylization on the entire scene at a single step, we obtain weak semantic cues from geometric segmentation, which are further clarified by assigning initial colors to segmented parts. Then we add texture details for individual objects such that their projections on image space exhibit feature embedding aligned with the embedding of the input. The decomposition makes the entire pipeline tractable to a moderate amount of computation resources and memory. As our framework utilizes the existing resources of image and text embedding, it does not require dedicated datasets with high-quality textures designed by skillful artists. To the best of our knowledge, it is the first practical and scalable approach that can create detailed and realistic textures of the desired style that maintain structural context for scenes with multiple objects.

TAPS3D: Text-Guided 3D Textured Shape Generation From Pseudo Supervision
Wei, Jiacheng and Wang, Hao and Feng, Jiashi and Lin, Guosheng and Yap, Kim-Hui



Research question: How to generate controllable 3D textured shapes from given textual descriptions.
Motivation: Existing methods require ground-truth caption labels or extensive optimization time; we propose a new framework, TAPS3D, to address these issues.
Method: Based on rendered 2D images, we retrieve relevant words from the CLIP vocabulary and construct pseudo captions using templates, providing high-level semantic supervision for the generated 3D shapes. Meanwhile, to produce fine-grained textures and increase geometry diversity, we adopt low-level image regularization to align fake-rendered images with real ones. At inference time, our model can generate 3D textured shapes from the given text without any additional optimization.
Results: Through extensive experiments, we analyze each of our proposed components and demonstrate the efficacy of our framework in generating high-fidelity 3D textured and text-relevant shapes.

In this paper, we investigate an open research task of generating controllable 3D textured shapes from the given textual descriptions. Previous works either require ground truth caption labeling or extensive optimization time. To resolve these issues, we present a novel framework, TAPS3D, to train a text-guided 3D shape generator with pseudo captions. Specifically, based on rendered 2D images, we retrieve relevant words from the CLIP vocabulary and construct pseudo captions using templates. Our constructed captions provide high-level semantic supervision for generated 3D shapes. Further, in order to produce fine-grained textures and increase geometry diversity, we propose to adopt low-level image regularization to enable fake-rendered images to align with the real ones. During the inference phase, our proposed model can generate 3D textured shapes from the given text without any additional optimization. We conduct extensive experiments to analyze each of our proposed components and show the efficacy of our framework in generating high-fidelity 3D textured and text-relevant shapes.
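The pseudo-caption construction can be sketched as follows, assuming a CLIP-style joint embedding space in which image and word embeddings are comparable by cosine similarity; the template string and function names here are hypothetical illustrations, not the paper's exact templates.

```python
import numpy as np

def retrieve_words(img_embed, vocab_embeds, vocab, k=2):
    # Rank vocabulary words by cosine similarity between their text
    # embeddings and the rendered image's embedding; keep the top k.
    v = vocab_embeds / np.linalg.norm(vocab_embeds, axis=1, keepdims=True)
    g = img_embed / np.linalg.norm(img_embed)
    order = np.argsort(-(v @ g))
    return [vocab[i] for i in order[:k]]

def pseudo_caption(words, template="a photo of a {}"):
    # Fill a fixed template with the retrieved words to form
    # a high-level semantic supervision signal.
    return template.format(" ".join(words))
```

The resulting caption supervises the 3D generator without any ground-truth labels, since both the retrieval and the template are fully automatic.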

Learning Personalized High Quality Volumetric Head Avatars From Monocular RGB Videos
Bai, Ziqian and Tan, Feitong and Huang, Zeng and Sarkar, Kripasindhu and Tang, Danhang and Qiu, Di and Meka, Abhimitra and Du, Ruofei and Dou, Mingsong and Orts-Escolano, Sergio and Pandey, Rohit and Tan, Ping and Beeler, Thabo and Fanello, Sean and Zhang, Yinda



Research question: How to learn a high-quality implicit 3D head avatar from a monocular RGB video captured in the wild.
Motivation: To achieve both fine-grained control and photorealism, the geometry prior and dynamic tracking of a 3DMM are combined with a neural radiance field, which also helps reduce over-smoothing and improve the synthesis of out-of-model expressions.
Method: Propose a method that learns a high-quality implicit 3D head avatar from an in-the-wild monocular RGB video; the hybrid pipeline combines the geometry prior and dynamic tracking of a 3DMM with a neural radiance field to achieve fine-grained control and photorealism.
Results: Experiments show the method reconstructs high-quality avatars with more accurate expression-dependent details, good generalization to out-of-training expressions, and renderings that are superior to other state-of-the-art approaches.

We propose a method to learn a high-quality implicit 3D head avatar from a monocular RGB video captured in the wild. The learnt avatar is driven by a parametric face model to achieve user-controlled facial expressions and head poses. Our hybrid pipeline combines the geometry prior and dynamic tracking of a 3DMM with a neural radiance field to achieve fine-grained control and photorealism. To reduce over-smoothing and improve out-of-model expression synthesis, we propose to predict local features anchored on the 3DMM geometry. These learnt features are driven by 3DMM deformation and interpolated in 3D space to yield the volumetric radiance at a designated query point. We further show that using a Convolutional Neural Network in the UV space is critical in incorporating spatial context and producing representative local features. Extensive experiments show that we are able to reconstruct high-quality avatars, with more accurate expression-dependent details, good generalization to out-of-training expressions, and quantitatively superior renderings compared to other state-of-the-art approaches.

LayoutFormer++: Conditional Graphic Layout Generation via Constraint Serialization and Decoding Space Restriction
Jiang, Zhaoyun and Guo, Jiaqi and Sun, Shizhao and Deng, Huayu and Wu, Zhongkai and Mijovic, Vuksan and Yang, Zijiang James and Lou, Jian-Guang and Zhang, Dongmei



Research question: How to handle diverse user constraints flexibly and uniformly while improving layout generation quality under those constraints.
Motivation: Existing layout generation models lack flexibility in handling user constraints and often sacrifice generation quality to satisfy them.
Method: Propose LayoutFormer++: a constraint serialization scheme represents different user constraints as token sequences in a predefined format; conditional layout generation is formulated as a sequence-to-sequence transformation implemented with a Transformer-based encoder-decoder; and a decoding space restriction strategy prunes the predicted distribution by discarding options that clearly violate user constraints or are likely to produce low-quality layouts, so the model samples from the restricted distribution.
Results: Experiments show that LayoutFormer++ outperforms existing methods in both generation quality and constraint violation.

Conditional graphic layout generation, which generates realistic layouts according to user constraints, is a challenging task that has not been well studied yet. First, there is limited discussion about how to handle diverse user constraints flexibly and uniformly. Second, to make the layouts conform to user constraints, existing work often sacrifices generation quality significantly. In this work, we propose LayoutFormer++ to tackle the above problems. First, to flexibly handle diverse constraints, we propose a constraint serialization scheme, which represents different user constraints as sequences of tokens with a predefined format. Then, we formulate conditional layout generation as a sequence-to-sequence transformation, and leverage an encoder-decoder framework with the Transformer as the basic architecture. Furthermore, to make the layout better meet user requirements without harming quality, we propose a decoding space restriction strategy. Specifically, we prune the predicted distribution by ignoring the options that definitely violate user constraints or are likely to result in low-quality layouts, and make the model sample from the restricted distribution. Experiments demonstrate that LayoutFormer++ outperforms existing approaches on all tasks, in terms of both better generation quality and fewer constraint violations.
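The decoding space restriction amounts to masking invalid options before sampling. A minimal numpy sketch follows, with the constraint checking abstracted into a boolean mask and the function name chosen for illustration:

```python
import numpy as np

def restricted_sample(logits, invalid, rng):
    # Prune the predicted distribution: tokens flagged as invalid get
    # probability zero (logit -inf), then sample from the renormalized
    # remainder. Assumes at least one valid option remains.
    masked = np.where(invalid, -np.inf, logits)
    z = np.exp(masked - masked.max())
    probs = z / z.sum()
    return int(rng.choice(len(logits), p=probs))
```

Because pruning happens at decode time, the model itself needs no retraining to respect a new constraint set.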

Implicit Identity Driven Deepfake Face Swapping Detection
Huang, Baojin and Wang, Zhongyuan and Yang, Jifan and Ai, Jiaxin and Zou, Qin and Wang, Qian and Ye, Dengpan



Research question: This paper considers face swapping detection from the perspective of face identity.
Motivation: Face swapping replaces the target face with the source face, generating fake faces that humans cannot distinguish from real ones. We argue that a fake face contains both an explicit and an implicit identity, corresponding respectively to the identities of the source and target faces during face swapping.
Method: We propose a novel implicit-identity-driven framework for face swapping detection. Specifically, we design an explicit identity contrast (EIC) loss and an implicit identity exploration (IIE) loss, which supervise a CNN backbone to embed face images into an implicit identity space. Under the guidance of EIC, real samples are pulled toward their explicit identities while fake samples are pushed away from theirs. IIE is derived from a margin-based classification loss and encourages fake faces with known target identities to exhibit intra-class compactness and inter-class diversity.
Results: Extensive experiments and visualizations on several datasets demonstrate the generalization of our method over state-of-the-art counterparts.

In this paper, we consider face swapping detection from the perspective of face identity. Face swapping aims to replace the target face with the source face and generate a fake face that humans cannot distinguish from a real one. We argue that the fake face contains an explicit identity and an implicit identity, which correspond respectively to the identities of the source face and the target face during face swapping. Note that the explicit identities of faces can be extracted by regular face recognizers. In particular, the implicit identity of a real face is consistent with its explicit identity. Thus the difference between the explicit and implicit identities of a face facilitates face swapping detection. Following this idea, we propose a novel implicit-identity-driven framework for face swapping detection. Specifically, we design an explicit identity contrast (EIC) loss and an implicit identity exploration (IIE) loss, which supervise a CNN backbone to embed face images into the implicit identity space. Under the guidance of EIC, real samples are pulled closer to their explicit identities, while fake samples are pushed away from their explicit identities. Moreover, IIE is derived from the margin-based classification loss function, which encourages fake faces with known target identities to enjoy intra-class compactness and inter-class diversity. Extensive experiments and visualizations on several datasets demonstrate the generalization of our method against state-of-the-art counterparts.
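A minimal sketch of what the EIC pull/push behavior might look like, assuming cosine distance in the identity embedding space and a hinge margin for fake samples (both choices are assumptions for illustration; the paper's exact formulation may differ):

```python
import numpy as np

def eic_loss(embedding, explicit_id, is_real, margin=0.5):
    # Distance is 1 - cosine similarity. Real samples are pulled toward
    # their explicit identity; fake samples are pushed until they are at
    # least `margin` away from it (hinge).
    cos = float(embedding @ explicit_id /
                (np.linalg.norm(embedding) * np.linalg.norm(explicit_id)))
    dist = 1.0 - cos
    return dist if is_real else max(0.0, margin - dist)
```

The asymmetry is the point: a fake face that embeds close to its source (explicit) identity is exactly the failure case the loss penalizes.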

Logical Consistency and Greater Descriptive Power for Facial Hair Attribute Learning
Wu, Haiyu and Bezold, Grace and Bhatta, Aman and Bowyer, Kevin W.



Research question: This paper addresses shortcomings in facial attribute research, such as overly simple binary classification of facial hair and the lack of logical consistency and completeness.
Motivation: Current facial attribute research uses only simple binary attributes for facial hair, such as beard / no beard, and has problems with logical consistency and completeness.
Method: The authors create a new, more descriptive facial hair annotation scheme and apply it to build a new facial hair attribute dataset, FH37K. They also propose a logically consistent prediction loss (LCPLoss) to aid learning of logical consistency across attributes, and a label compensation training strategy to eliminate the problem of no positive prediction across a set of related attributes.
Results: Using a facial hair attribute classifier trained on FH37K, the authors investigate how facial hair affects face recognition accuracy, including variation across demographic groups. Results show that similarity and difference in facial hairstyle have important effects on the impostor and genuine score distributions in face recognition.

Face attribute research has so far used only simple binary attributes for facial hair; e.g., beard / no beard. We have created a new, more descriptive facial hair annotation scheme and applied it to create a new facial hair attribute dataset, FH37K. Face attribute research so far has also not dealt with logical consistency and completeness. For example, in prior research, an image might be classified as both having no beard and also having a goatee (a type of beard). We show that the test accuracy of previous classification methods on facial hair attribute classification drops significantly if logical consistency of classifications is enforced. We propose a logically consistent prediction loss, LCPLoss, to aid learning of logical consistency across attributes, and also a label compensation training strategy to eliminate the problem of no positive prediction across a set of related attributes. Using an attribute classifier trained on FH37K, we investigate how facial hair affects face recognition accuracy, including variation across demographics. Results show that similarity and difference in facial hairstyle have important effects on the impostor and genuine score distributions in face recognition. The code is at https://github.com/HaiyuWu/facial_hair_logical.
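The logical-consistency requirement can be checked with a simple predicate over incompatible attribute pairs; this is a toy illustration of the constraint being enforced, not the LCPLoss itself, and the attribute names are hypothetical:

```python
def logically_consistent(preds, incompatible):
    # preds maps attribute name -> bool prediction; incompatible lists
    # pairs that cannot both be positive, e.g. ("no_beard", "goatee"),
    # since a goatee is a type of beard.
    return not any(preds.get(a, False) and preds.get(b, False)
                   for a, b in incompatible)
```

At evaluation time, predictions failing such a check can be counted as errors, which is how enforcing consistency lowers the apparent accuracy of prior methods.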

Diffusion Probabilistic Model Made Slim
Yang, Xingyi and Zhou, Daquan and Feng, Jiashi and Wang, Xinchao



Research question: Diffusion probabilistic models (DPMs) achieve visually pleasing results, but their massive computational cost has been a long-standing flaw that greatly limits their application on resource-limited platforms.
Motivation: Although existing methods try to improve DPM efficiency, most focus on accelerating the testing stage while overlooking the models' huge complexity and size.
Method: This paper proposes a lightweight DPM design called Spectral Diffusion (SD). SD incorporates wavelet gating in its architecture to extract frequency-dynamic features at every reverse step, and performs spectrum-aware distillation, inversely weighting the objective by spectrum magnitudes to promote high-frequency recovery.
Results: Experimental results show that SD achieves an 8-18x reduction in computational complexity compared with latent diffusion models on a series of conditional and unconditional image generation tasks while retaining competitive image fidelity.

Despite the visually-pleasing results achieved, the massive computational cost has been a long-standing flaw for diffusion probabilistic models (DPMs), which, in turn, greatly limits their applications on resource-limited platforms. Prior methods towards efficient DPMs, however, have largely focused on accelerating testing yet overlooked their huge complexity and size. In this paper, we make a dedicated attempt to lighten DPMs while striving to preserve their favourable performance. We start by training a small-sized latent diffusion model (LDM) from scratch but observe a significant fidelity drop in the synthetic images. Through a thorough assessment, we find that DPMs are intrinsically biased against high-frequency generation, and learn to recover different frequency components at different time-steps. These properties make compact networks unable to represent frequency dynamics with accurate high-frequency estimation. Towards this end, we introduce a customized design for slim DPMs, which we term Spectral Diffusion (SD), for lightweight image synthesis. SD incorporates wavelet gating in its architecture to enable frequency-dynamic feature extraction at every reverse step, and conducts spectrum-aware distillation to promote high-frequency recovery by inversely weighting the objective based on spectrum magnitudes. Experimental results demonstrate that SD achieves an 8-18x computational complexity reduction compared to latent diffusion models on a series of conditional and unconditional image generation tasks while retaining competitive image fidelity.
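The spectrum-aware distillation can be sketched as a frequency-domain loss whose per-frequency weights are inversely proportional to the teacher's spectrum magnitude, so that weak high-frequency components contribute more. The exact weighting function and normalization below are assumptions; the abstract only states that the objective is inversely weighted by spectrum magnitudes.

```python
import numpy as np

def spectrum_aware_distill_loss(student, teacher, eps=1e-8):
    # Compare student and teacher outputs in the frequency domain.
    # Frequencies where the teacher's spectrum is small (typically the
    # high frequencies of natural images) receive larger weights.
    fs, ft = np.fft.fft2(student), np.fft.fft2(teacher)
    w = 1.0 / (np.abs(ft) + eps)
    w = w / w.sum()
    return float((w * np.abs(fs - ft) ** 2).sum())
```

A plain pixel-space MSE would be dominated by the large low-frequency components, which is exactly the bias this reweighting counteracts.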

Specialist Diffusion: Plug-and-Play Sample-Efficient Fine-Tuning of Text-to-Image Diffusion Models To Learn Any Unseen Style
Lu, Haoming and Tunanyan, Hazarapet and Wang, Kai and Navasardyan, Shant and Wang, Zhangyang and Shi, Humphrey



Research question: How to fine-tune a pre-trained diffusion model with a handful of images (e.g., fewer than 10) so it can generate high-quality images of arbitrary objects in a specific style.
Motivation: Existing diffusion models excel at text-conditioned image synthesis, and personalizing pre-trained diffusion models to generate specific target objects or styles has broad application prospects.
Method: Propose a novel toolkit of fine-tuning techniques, including text-to-image customized data augmentations, a content loss that facilitates content-style disentanglement, and sparse updating that focuses on only a few time steps. With these techniques, a pre-trained diffusion model can be fine-tuned on very few images to generate high-quality images of arbitrary objects in the target style.
Results: Experiments show the method outperforms recent few-shot diffusion personalization alternatives such as Textual Inversion and DreamBooth in learning highly sophisticated styles with ultra-sample-efficient tuning. It can also be integrated on top of Textual Inversion to further boost performance, even on highly unusual styles.

Diffusion models have demonstrated an impressive capability of text-conditioned image synthesis, and broader application horizons are emerging by personalizing those pretrained diffusion models toward generating some specialized target object or style. In this paper, we aim to learn an unseen style by simply fine-tuning a pre-trained diffusion model with a handful of images (e.g., less than 10), so that the fine-tuned model can generate high-quality images of arbitrary objects in this style. Such extremely low-shot fine-tuning is accomplished by a novel toolkit of fine-tuning techniques, including text-to-image customized data augmentations, a content loss to facilitate content-style disentanglement, and sparse updating that focuses on only a few time steps. Our framework, dubbed Specialist Diffusion, is plug-and-play with existing diffusion model backbones and other personalization techniques. We demonstrate that it outperforms the latest few-shot personalization alternatives of diffusion models such as Textual Inversion and DreamBooth, in terms of learning highly sophisticated styles with ultra-sample-efficient tuning. We further show that Specialist Diffusion can be integrated on top of Textual Inversion to boost performance further, even on highly unusual styles. Our codes are available at: https://github.com/Picsart-AI-Research/Specialist-Diffusion

HyperCUT: Video Sequence From a Single Blurry Image Using Unsupervised Ordering
Pham, Bang-Dang and Tran, Phong and Tran, Anh and Pham, Cuong and Nguyen, Rang and Hoai, Minh



Research question: Training image-to-video deblurring models, which recover a sequence of sharp images corresponding to a blurry input image.
Motivation: Training image-to-video deblurring models suffers from frame-ordering ambiguity: both the forward and backward sequences are plausible solutions.
Method: Propose an effective self-supervised ordering scheme that maps each video sequence to a vector in a high-dimensional latent space and defines a hyperplane such that the vectors of the forward and reversed sequences fall on opposite sides; the side then determines the order of the sequence.
Results: Experimental results confirm the effectiveness of the method; the authors also propose a real-image dataset covering a variety of popular domains, including face, hand, and street.

We consider the challenging task of training models for image-to-video deblurring, which aims to recover a sequence of sharp images corresponding to a given blurry image input. A critical issue disturbing the training of an image-to-video model is the ambiguity of the frame ordering since both the forward and backward sequences are plausible solutions. This paper proposes an effective self-supervised ordering scheme that allows training high-quality image-to-video deblurring models. Unlike previous methods that rely on order-invariant losses, we assign an explicit order for each video sequence, thus avoiding the order-ambiguity issue. Specifically, we map each video sequence to a vector in a latent high-dimensional space so that there exists a hyperplane such that for every video sequence, the vectors extracted from it and its reversed sequence are on different sides of the hyperplane. The side of the vectors will be used to define the order of the corresponding sequence. Last but not least, we propose a real-image dataset for the image-to-video deblurring problem that covers a variety of popular domains, including face, hand, and street. Extensive experimental results confirm the effectiveness of our method. Code and data are available at https://github.com/VinAIResearch/HyperCUT.git
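The hyperplane ordering can be sketched as follows: the embeddings of a sequence and of its reversal should land on opposite sides of a learned hyperplane, and the side defines the canonical order. The product-based hinge below is an illustrative choice for enforcing "opposite sides with a margin", not necessarily the paper's exact loss:

```python
import numpy as np

def hypercut_loss(fwd_embed, bwd_embed, w, b, margin=1.0):
    # Signed distances of the forward and reversed sequence embeddings
    # to the hyperplane w.x + b = 0. The loss is zero only when they
    # lie on opposite sides (product negative) with at least `margin`.
    s_f = float(fwd_embed @ w + b)
    s_b = float(bwd_embed @ w + b)
    return max(0.0, margin + s_f * s_b)

def order_bit(embed, w, b):
    # The side of the hyperplane assigns an explicit order label.
    return 1 if float(embed @ w + b) > 0 else 0
```

Because the objective is symmetric in the two sequences, no ground-truth ordering is ever needed; the network simply commits to one consistent convention.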

Document Image Shadow Removal Guided by Color-Aware Background
Zhang, Ling and He, Yinghao and Zhang, Qing and Liu, Zheng and Zhang, Xiaolong and Xiao, Chunxia



Research question: Existing document image shadow removal methods mostly depend on learning and leveraging a constant background (the paper color) from the image, which ignores other background colors such as printed colors and leads to distorted results.
Motivation: Propose a color-aware background extraction network (CBENet) to accurately depict the background colors of the document, and use the predicted spatially varying background as auxiliary information in a background-guided document image shadow removal network (BGShadowNet).
Method: BGShadowNet consists of two stages. In Stage I, a background-constrained decoder is designed to produce a coarse result. In Stage II, the coarse result is refined by a background-based attention module (BAModule) to maintain a consistent appearance and a detail improvement module (DEModule) to enhance texture details.
Results: Experiments show the method outperforms the state of the art on two benchmark datasets.

Existing works on document image shadow removal mostly depend on learning and leveraging a constant background (the color of the paper) from the image. However, the constant background is less representative and frequently ignores other background colors, such as the printed colors, resulting in distorted results. In this paper, we present a color-aware background extraction network (CBENet) for extracting a spatially varying background image that accurately depicts the background colors of the document. Furthermore, we propose a background-guided document image shadow removal network (BGShadowNet) that uses the predicted spatially varying background as auxiliary information and consists of two stages. At Stage I, a background-constrained decoder is designed to promote a coarse result. Then, the coarse result is refined with a background-based attention module (BAModule) to maintain a consistent appearance and a detail improvement module (DEModule) to enhance the texture details at Stage II. Experiments on two benchmark datasets qualitatively and quantitatively validate the superiority of the proposed approach over the state of the art.

CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes From Natural Language
Sanghi, Aditya and Fu, Rao and Liu, Vivian and Willis, Karl D. D. and Shayani, Hooman and Khasahmadi, Amir H. and Sridhar, Srinath and Ritchie, Daniel



Research question: How to generate and edit high-fidelity and diverse 3D shapes from natural language.
Motivation: Existing methods generate 3D shapes with limited fidelity and diversity.
Method: Propose CLIP-Sculptor, a multi-resolution approach that first generates in a low-dimensional latent space and then upscales to higher resolution for improved shape fidelity. For improved shape diversity, it uses a discrete latent space modeled by a transformer conditioned on CLIP's image-text embedding space.
Results: Experiments show that CLIP-Sculptor outperforms state-of-the-art baselines.

Recent works have demonstrated that natural language can be used to generate and edit 3D shapes. However, these methods generate shapes with limited fidelity and diversity. We introduce CLIP-Sculptor, a method to address these constraints by producing high-fidelity and diverse 3D shapes without the need for (text, shape) pairs during training. CLIP-Sculptor achieves this in a multi-resolution approach that first generates in a low-dimensional latent space and then upscales to a higher resolution for improved shape fidelity. For improved shape diversity, we use a discrete latent space which is modeled using a transformer conditioned on CLIP's image-text embedding space. We also present a novel variant of classifier-free guidance, which improves the accuracy-diversity trade-off. Finally, we perform extensive experiments demonstrating that CLIP-Sculptor outperforms state-of-the-art baselines.

VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models
Jain, Ajay and Xie, Amber and Abbeel, Pieter



Research question: How to use diffusion models to generate SVG-exportable vector graphics.
Motivation: Designers frequently use vector formats such as SVG for digital icons, graphics, and stickers, but existing approaches would require large datasets of captioned SVGs, which are not available.
Method: This paper proposes a method based on a text-conditioned diffusion model: by optimizing a differentiable vector graphics rasterizer, it distills abstract semantic knowledge out of a pre-trained diffusion model, and by constraining the vector representation it can also generate coherent pixel art and sketches.
Results: Experimental results show the method produces more coherent vector graphics than prior works that optimize CLIP.

Diffusion models have shown impressive results in text-to-image synthesis. Using massive datasets of captioned images, diffusion models learn to generate raster images of highly diverse objects and scenes. However, designers frequently use vector representations of images like Scalable Vector Graphics (SVGs) for digital icons, graphics and stickers. Vector graphics can be scaled to any size, and are compact. In this work, we show that a text-conditioned diffusion model trained on pixel representations of images can be used to generate SVG-exportable vector graphics. We do so without access to large datasets of captioned SVGs. Instead, inspired by recent work on text-to-3D synthesis, we vectorize a text-to-image diffusion sample and fine-tune with a Score Distillation Sampling loss. By optimizing a differentiable vector graphics rasterizer, our method distills abstract semantic knowledge out of a pretrained diffusion model. By constraining the vector representation, we can also generate coherent pixel art and sketches. Our approach, VectorFusion, produces more coherent graphics than prior works that optimize CLIP, a contrastive image-text model.

Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models
Somepalli, Gowthami and Singla, Vasu and Goldblum, Micah and Geiping, Jonas and Goldstein, Tom



Research question: Are images generated by diffusion models original works, or do they replicate content directly from their training sets?
Motivation: To investigate the potential of diffusion models for commercial art and graphic design.
Method: Use image retrieval frameworks to compare generated images with training samples and detect content replication.
Results: Diffusion models do replicate training data, and training set size affects the rate of content replication.

Cutting-edge diffusion models produce images with high quality and customizability, enabling them to be used for commercial art and graphic design purposes. But do diffusion models create unique works of art, or are they replicating content directly from their training sets? In this work, we study image retrieval frameworks that enable us to compare generated images with training samples and detect when content has been replicated. Applying our frameworks to diffusion models trained on multiple datasets including Oxford flowers, Celeb-A, ImageNet, and LAION, we discuss how factors such as training set size impact rates of content replication. We also identify cases where diffusion models, including the popular Stable Diffusion model, blatantly copy from their training data.
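A retrieval check of this kind reduces to nearest-neighbor search in a feature space. A minimal numpy sketch, with the feature extractor abstracted away and the function name chosen for illustration:

```python
import numpy as np

def top_match(gen_feat, train_feats):
    # Cosine similarity between a generated image's feature vector and
    # every training feature (one row each); return the best-matching
    # index and its score. A score close to 1 flags likely replication.
    g = gen_feat / np.linalg.norm(gen_feat)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = t @ g
    i = int(np.argmax(sims))
    return i, float(sims[i])
```

Thresholding the returned score over many generated samples yields a replication rate that can be compared across training set sizes.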

AltFreezing for More General Video Face Forgery Detection
Wang, Zhendong and Bao, Jianmin and Zhou, Wengang and Wang, Weilun and Li, Houqiang



Research question: Existing face forgery detection models discriminate mainly by detecting spatial or temporal artifacts, but their performance degrades significantly when facing out-of-domain artifacts.
Motivation: To address this, this paper proposes capturing both spatial and temporal artifacts in one model for face forgery detection.
Method: Adopt a novel training strategy called AltFreezing, which divides the weights of a spatiotemporal network into two groups: spatial-related and temporal-related. The two groups are alternately frozen during training so that the model learns both spatial and temporal features for distinguishing real from fake videos. Various video-level data augmentation methods are also introduced to improve the generalization of the forgery detection model.
Results: Extensive experiments show the framework outperforms existing methods in generalizing to unseen manipulations and datasets.

Existing face forgery detection models try to discriminate fake images by detecting only spatial artifacts (e.g., generative artifacts, blending) or mainly temporal artifacts (e.g., flickering, discontinuity). They may experience significant performance degradation when facing out-domain artifacts. In this paper, we propose to capture both spatial and temporal artifacts in one model for face forgery detection. A simple idea is to leverage a spatiotemporal model (3D ConvNet). However, we find that it may easily rely on one type of artifact and ignore the other. To address this issue, we present a novel training strategy called AltFreezing for more general face forgery detection. The AltFreezing aims to encourage the model to detect both spatial and temporal artifacts. It divides the weights of a spatiotemporal network into two groups: spatial- and temporal-related. Then the two groups of weights are alternately frozen during the training process so that the model can learn spatial and temporal features to distinguish real or fake videos. Furthermore, we introduce various video-level data augmentation methods to improve the generalization capability of the forgery detection model. Extensive experiments show that our framework outperforms existing methods in terms of generalization to unseen manipulations and datasets.
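The alternate-freezing schedule can be sketched as a plain SGD step in which only one weight group is updated per iteration. This is a toy illustration on flat parameter lists; the paper operates on the spatial- and temporal-related weights of a 3D ConvNet, and the cycle lengths here are assumed hyperparameters:

```python
def altfreeze_step(params, grads, iteration, lr=0.1,
                   spatial_iters=1, temporal_iters=1):
    # Within each cycle, the first `spatial_iters` iterations update
    # only the spatial group; the remaining `temporal_iters` update
    # only the temporal group. The inactive group is frozen (copied).
    cycle = spatial_iters + temporal_iters
    active = "spatial" if iteration % cycle < spatial_iters else "temporal"
    return {group: ([p - lr * g for p, g in zip(ps, grads[group])]
                    if group == active else list(ps))
            for group, ps in params.items()}
```

Freezing one group at a time prevents the network from leaning entirely on whichever artifact type is easiest, which is the failure mode of naively training a spatiotemporal model.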

MoDi: Unconditional Motion Synthesis From Diverse Data
Raab, Sigal and Leibovitch, Inbal and Li, Peizhuo and Aberman, Kfir and Sorkine-Hornung, Olga and Cohen-Or, Daniel



Research question: How to unconditionally synthesize diverse, high-quality motions from a given distribution.
Motivation: Although neural networks have revolutionized motion synthesis, learning to unconditionally synthesize diverse motions from a given distribution remains challenging.
Method: We present MoDi, a generative model trained in an unsupervised setting on an extremely diverse, unstructured, and unlabeled dataset. During inference, MoDi can synthesize high-quality, diverse motions.
Results: The model yields a well-behaved and highly structured latent space that can be semantically clustered, constituting a strong motion prior that facilitates various applications including semantic editing and crowd animation. We also present an encoder that inverts real motions into MoDi's natural motion manifold, solving various ill-posed challenges such as completion from a prefix and spatial editing. Our qualitative and quantitative experiments outperform recent SOTA techniques.

The emergence of neural networks has revolutionized the field of motion synthesis. Yet, learning to unconditionally synthesize motions from a given distribution remains challenging, especially when the motions are highly diverse. In this work, we present MoDi -- a generative model trained in an unsupervised setting from an extremely diverse, unstructured and unlabeled dataset. During inference, MoDi can synthesize high-quality, diverse motions. Despite the lack of any structure in the dataset, our model yields a well-behaved and highly structured latent space, which can be semantically clustered, constituting a strong motion prior that facilitates various applications including semantic editing and crowd animation. In addition, we present an encoder that inverts real motions into MoDi's natural motion manifold, issuing solutions to various ill-posed challenges such as completion from prefix and spatial editing. Our qualitative and quantitative experiments achieve state-of-the-art results that outperform recent SOTA techniques. Code and trained models are available at https://sigal-raab.github.io/MoDi.

Generative Diffusion Prior for Unified Image Restoration and Enhancement
Fei, Ben and Lyu, Zhaoyang and Pan, Liang and Zhang, Junzhe and Yang, Weidong and Luo, Tianyue and Zhang, Bo and Dai, Bo



Research question: Existing image restoration methods mostly leverage the posterior distribution of natural images, but they usually assume a known degradation process and require supervised training, which limits their adaptation to complex real-world applications.
Motivation: To address these problems, we propose the Generative Diffusion Prior (GDP) to effectively model the posterior distribution in an unsupervised sampling manner.
Method: GDP uses a pre-trained denoising diffusion generative model (DDPM) to solve linear inverse, non-linear, or blind problems. Specifically, GDP systematically explores a conditional guidance protocol shown to be more practical than the commonly used guidance scheme. GDP also optimizes the parameters of the degradation model during denoising, achieving blind image restoration. We further devise hierarchical guidance and patch-based methods, enabling GDP to generate images of arbitrary resolution.
Results: Experiments demonstrate GDP's versatility on multiple image datasets, covering linear problems such as super-resolution, deblurring, inpainting, and colorization, as well as non-linear and blind problems such as low-light enhancement and HDR image recovery. GDP outperforms current leading unsupervised methods on various benchmarks in both reconstruction and perceptual quality, and generalizes well to natural images and arbitrary-sized synthetic images from tasks outside the ImageNet training distribution.

Existing image restoration methods mostly leverage the posterior distribution of natural images. However, they often assume known degradation and also require supervised training, which restricts their adaptation to complex real applications. In this work, we propose the Generative Diffusion Prior (GDP) to effectively model the posterior distributions in an unsupervised sampling manner. GDP utilizes a pre-trained denoising diffusion generative model (DDPM) for solving linear inverse, non-linear, or blind problems. Specifically, GDP systematically explores a protocol of conditional guidance, which is verified to be more practical than the commonly used guidance scheme. Furthermore, GDP excels at optimizing the parameters of the degradation model during the denoising process, achieving blind image restoration. Besides, we devise hierarchical guidance and patch-based methods, enabling GDP to generate images of arbitrary resolution. Experimentally, we demonstrate GDP's versatility on several image datasets for linear problems, such as super-resolution, deblurring, inpainting, and colorization, as well as non-linear and blind issues, such as low-light enhancement and HDR image recovery. GDP outperforms the current leading unsupervised methods on the diverse benchmarks in reconstruction quality and perceptual quality. Moreover, GDP also generalizes well to natural images and synthesized images of arbitrary size from various tasks outside the distribution of the ImageNet training set.

CF-Font: Content Fusion for Few-Shot Font Generation
Wang, Chi and Zhou, Min and Ge, Tiezheng and Jiang, Yuning and Bao, Hujun and Xu, Weiwei



Research question: How to achieve effective few-shot font generation?
Motivation: Existing content-style disentanglement methods may be suboptimal when extracting content features with a single representative font.
Method: Propose a content fusion module (CFM) that projects the content feature into a linear space defined by the content features of basis fonts, taking into account the variation of content features caused by different fonts. The style representation vector of reference images is further optimized via a lightweight iterative style-vector refinement (ISR) strategy.
Results: Evaluated on a dataset of 300 fonts with 6,500 characters each, experimental results show the method outperforms existing state-of-the-art few-shot font generation methods by a large margin.

Content and style disentanglement is an effective way to achieve few-shot font generation. It allows transferring the style of a font image in a source domain to the style defined by a few reference images in a target domain. However, the content feature extracted using a representative font might not be optimal. In light of this, we propose a content fusion module (CFM) to project the content feature into a linear space defined by the content features of basis fonts, which can take the variation of content features caused by different fonts into consideration. Our method also allows optimizing the style representation vector of reference images through a lightweight iterative style-vector refinement (ISR) strategy. Moreover, we treat the 1D projection of a character image as a probability distribution and leverage the distance between two distributions as the reconstruction loss (namely the projected character loss, PCL). Compared to an L2 or L1 reconstruction loss, the distribution distance pays more attention to the global shape of characters. We have evaluated our method on a dataset of 300 fonts with 6.5k characters each. Experimental results verify that our method outperforms existing state-of-the-art few-shot font generation methods by a large margin. The source code can be found at https://github.com/wangchi95/CF-Font.
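The projected character loss can be sketched with numpy: each glyph image is projected onto its two axes, the projections are normalized into probability distributions, and the distributions are compared. The specific use of the Wasserstein-1 distance (which on a 1D line equals the L1 distance between CDFs) is an illustrative assumption; the abstract only specifies a distance between the projected distributions.

```python
import numpy as np

def projected_character_loss(img_a, img_b, eps=1e-8):
    # Project each glyph onto its horizontal and vertical axes,
    # normalize each 1D projection into a distribution, and sum the
    # 1D Wasserstein-1 distances (L1 between the two CDFs).
    loss = 0.0
    for axis in (0, 1):
        pa = img_a.sum(axis=axis).astype(float)
        pb = img_b.sum(axis=axis).astype(float)
        pa /= pa.sum() + eps
        pb /= pb.sum() + eps
        loss += float(np.abs(np.cumsum(pa) - np.cumsum(pb)).sum())
    return loss
```

Unlike a per-pixel L1/L2 loss, a small spatial shift of a stroke changes the projected distributions only slightly, so the loss responds to the glyph's global shape rather than exact pixel alignment.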

3D-Aware Multi-Class Image-to-Image Translation With NeRFs
Li, Senmao and van de Weijer, Joost and Wang, Yaxing and Khan, Fahad Shahbaz and Liu, Meiqin and Yang, Jian



Research question: How to achieve 3D-consistent multi-class image-to-image (3D-aware I2I) translation.
Motivation: Existing 2D image-to-image translation methods produce unrealistic shape and identity changes, and naively applying them to 3D-aware multi-class translation cannot guarantee view consistency.
Method: Decouple learning into two steps: first, train a view-consistent multi-class 3D-aware GAN using a new conditional architecture and an effective training strategy; then, build a 3D-aware I2I translation system on top of the trained GAN, further reducing view-consistency problems with a U-net-like adaptor design, a hierarchical representation constraint, and a relative regularization loss.
Results: Extensive experiments on two datasets show the method successfully performs 3D-aware I2I translation with multi-view consistency.

Recent advances in 3D-aware generative models (3D-aware GANs) combined with Neural Radiance Fields (NeRF) have achieved impressive results. However, no prior work investigates 3D-aware GANs for 3D-consistent multi-class image-to-image (3D-aware I2I) translation. Naively using 2D-I2I translation methods suffers from unrealistic shape/identity change. To perform 3D-aware multi-class I2I translation, we decouple this learning process into a multi-class 3D-aware GAN step and a 3D-aware I2I translation step. In the first step, we propose two novel techniques: a new conditional architecture and an effective training strategy. In the second step, based on the well-trained multi-class 3D-aware GAN architecture, which preserves view-consistency, we construct a 3D-aware I2I translation system. To further reduce the view-consistency problems, we propose several new techniques, including a U-net-like adaptor network design, a hierarchical representation constraint, and a relative regularization loss. In extensive experiments on two datasets, quantitative and qualitative results demonstrate that we successfully perform 3D-aware I2I translation with multi-view consistency.

Seeing Beyond the Brain: Conditional Diffusion Model With Sparse Masked Modeling for Vision Decoding
Chen, Zijiao and Qing, Jiaxin and Xiang, Tiange and Yue, Wan Lin and Zhou, Juan Helen



Research question: How to decode visual stimuli from brain recordings, to deepen our understanding of the human visual system and bridge human and computer vision through brain-computer interfaces.
Motivation: Reconstructing high-quality, semantically correct images from brain recordings is challenging due to the complex underlying representations of brain signals and the scarcity of annotated data.
Method: We propose MinD-Vis, a sparse masked brain modeling framework with a double-conditioned latent diffusion model for human vision decoding. First, we learn an effective self-supervised representation of fMRI data via masked modeling in a large latent space, inspired by the sparse coding of information in the primary visual cortex. Then, by augmenting the latent diffusion model with double conditioning, we show that MinD-Vis can reconstruct highly plausible images with semantically matching details from brain recordings using very few paired annotations.
Results: The model is benchmarked qualitatively and quantitatively; experiments show our method outperforms the state of the art by 66% in semantic mapping (100-way semantic classification) and 41% in generation quality (FID). Exhaustive ablations further analyze the framework.

Decoding visual stimuli from brain recordings aims to deepen our understanding of the human visual system and build a solid foundation for bridging human and computer vision through the Brain-Computer Interface. However, reconstructing high-quality images with correct semantics from brain recordings is a challenging problem due to the complex underlying representations of brain signals and the scarcity of data annotations. In this work, we present MinD-Vis: Sparse Masked Brain Modeling with Double-Conditioned Latent Diffusion Model for Human Vision Decoding. Firstly, we learn an effective self-supervised representation of fMRI data using masked modeling in a large latent space inspired by the sparse coding of information in the primary visual cortex. Then by augmenting a latent diffusion model with double-conditioning, we show that MinD-Vis can reconstruct highly plausible images with semantically matching details from brain recordings using very few paired annotations. We benchmarked our model qualitatively and quantitatively; the experimental results indicate that our method outperformed the state of the art in both semantic mapping (100-way semantic classification) and generation quality (FID) by 66% and 41% respectively. An exhaustive ablation study was also conducted to analyze our framework.
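
The masked-modeling pretraining stage can be illustrated with a toy masking step: hide a large fraction of the input and score reconstruction only on the hidden positions. Sizes and the trivial zero "model" below are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=128)              # stand-in for an fMRI voxel vector
mask_ratio = 0.75                          # high ratio, echoing sparse coding
n_masked = int(mask_ratio * signal.size)
masked_idx = rng.choice(signal.size, size=n_masked, replace=False)

visible = signal.copy()
visible[masked_idx] = 0.0                  # the encoder only sees this

prediction = np.zeros_like(signal)         # trivial "model": predict zeros
loss = np.mean((prediction[masked_idx] - signal[masked_idx]) ** 2)
```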

High-Fidelity Generalized Emotional Talking Face Generation With Multi-Modal Emotion Space Learning
Xu, Chao and Zhu, Junwei and Zhang, Jiangning and Han, Yue and Chu, Wenqing and Tai, Ying and Wang, Chengjie and Xie, Zhifeng and Liu, Yong



Research question: Existing emotional talking-face generation methods lack flexibility in practical applications and cannot handle unseen emotion styles.
Motivation: To address these issues, this paper proposes a more flexible and generalized framework.
Method: Supplement the emotion style in text prompts, and use an aligned multi-modal emotion encoder to embed the text, image, and audio emotion modalities into a unified space, inheriting rich semantic priors from CLIP. An emotion-aware audio-to-3DMM convertor connects the emotion condition with the structural representation of the audio sequence, and a style-based high-fidelity emotional face generator synthesizes realistic identities at arbitrary high resolutions.
Results: Experiments demonstrate the flexibility and generalization of the method in emotion control and its effectiveness for high-quality face synthesis.

Recently, emotional talking face generation has received considerable attention. However, existing methods only adopt one-hot coding, image, or audio as emotion conditions, thus lacking flexible control in practical applications and failing to handle unseen emotion styles due to limited semantics. They either ignore the one-shot setting or the quality of generated faces. In this paper, we propose a more flexible and generalized framework. Specifically, we supplement the emotion style in text prompts and use an Aligned Multi-modal Emotion encoder to embed the text, image, and audio emotion modality into a unified space, which inherits rich semantic prior from CLIP. Consequently, effective multi-modal emotion space learning helps our method support arbitrary emotion modality during testing and could generalize to unseen emotion styles. Besides, an Emotion-aware Audio-to-3DMM Convertor is proposed to connect the emotion condition and the audio sequence to structural representation. A subsequent style-based High-fidelity Emotional Face generator is designed to generate arbitrary high-resolution realistic identities. Our texture generator hierarchically learns flow fields and animated faces in a residual manner. Extensive experiments demonstrate the flexibility and generalization of our method in emotion control and the effectiveness of high-quality face synthesis.

Masked and Adaptive Transformer for Exemplar Based Image Translation
Jiang, Chang and Gao, Fei and Ma, Biao and Lin, Yuhao and Wang, Nannan and Xu, Gang



Research question: This paper proposes a new exemplar-based image translation framework to address the challenge of cross-domain semantic matching.
Motivation: Current advanced image translation methods mainly focus on establishing cross-domain semantic correspondence, but this makes image generation hinge on local style control, so matching errors degrade the generated images.
Method: We propose a masked and adaptive transformer (MAT) to learn accurate cross-domain correspondence and perform context-aware feature augmentation. In addition, source features of the input and global style codes of the exemplar serve as supplementary information for image decoding. We further devise a novel contrastive style learning method to acquire quality-discriminative style representations, which benefits high-quality image generation.
Results: Experiments show that our method, dubbed MATEBIT, performs considerably better than state-of-the-art methods on diverse image translation tasks.

We present a novel framework for exemplar based image translation. Recent advanced methods for this task mainly focus on establishing cross-domain semantic correspondence, which sequentially dominates image generation in the manner of local style control. Unfortunately, cross domain semantic matching is challenging; and matching errors ultimately degrade the quality of generated images. To overcome this challenge, we improve the accuracy of matching on the one hand, and diminish the role of matching in image generation on the other hand. To achieve the former, we propose a masked and adaptive transformer (MAT) for learning accurate cross-domain correspondence, and executing context-aware feature augmentation. To achieve the latter, we use source features of the input and global style codes of the exemplar, as supplementary information, for decoding an image. Besides, we devise a novel contrastive style learning method for acquiring quality-discriminative style representations, which in turn benefit high-quality image generation. Experimental results show that our method, dubbed MATEBIT, performs considerably better than state-of-the-art methods, in diverse image translation tasks.

Imagic: Text-Based Real Image Editing With Diffusion Models
Kawar, Bahjat and Zada, Shiran and Lang, Oran and Tov, Omer and Chang, Huiwen and Dekel, Tali and Mosseri, Inbar and Irani, Michal



Research question: How to apply complex text-based semantic edits to a single real image.
Motivation: Most current methods are limited to specific edit types, synthetically generated images, or multiple input images of a common object, and lack the ability to perform complex semantic edits on real images.
Method: This paper proposes Imagic, which leverages a pre-trained text-to-image diffusion model: it produces a text embedding aligned with both the input image and the target text, and fine-tunes the diffusion model to capture the image-specific appearance, enabling complex semantic edits on a single real image.
Results: Imagic's quality and versatility are demonstrated on numerous inputs from various domains; on the introduced TEdBench image editing benchmark, a user study shows that human raters prefer Imagic over previous leading editing methods.

Text-conditioned image editing has recently attracted considerable interest. However, most methods are currently limited to one of the following: specific editing types (e.g., object overlay, style transfer), synthetically generated images, or requiring multiple input images of a common object. In this paper we demonstrate, for the very first time, the ability to apply complex (e.g., non-rigid) text-based semantic edits to a single real image. For example, we can change the posture and composition of one or multiple objects inside an image, while preserving its original characteristics. Our method can make a standing dog sit down, cause a bird to spread its wings, etc. -- each within its single high-resolution user-provided natural image. Contrary to previous work, our proposed method requires only a single input image and a target text (the desired edit). It operates on real images, and does not require any additional inputs (such as image masks or additional views of the object). Our method, called Imagic, leverages a pre-trained text-to-image diffusion model for this task. It produces a text embedding that aligns with both the input image and the target text, while fine-tuning the diffusion model to capture the image-specific appearance. We demonstrate the quality and versatility of Imagic on numerous inputs from various domains, showcasing a plethora of high quality complex semantic image edits, all within a single unified framework. To better assess performance, we introduce TEdBench, a highly challenging image editing benchmark. We conduct a user study, whose findings show that human raters prefer Imagic to previous leading editing methods on TEdBench.
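
One ingredient of this pipeline, trading off the optimized embedding (faithful to the input image) against the target-text embedding, comes down to linear interpolation between two vectors; the sketch below uses hypothetical placeholder embeddings.

```python
import numpy as np

e_opt = np.array([1.0, 0.0, 0.5])   # optimized embedding (image-aligned), hypothetical
e_tgt = np.array([0.0, 1.0, 0.5])   # target-text embedding, hypothetical

def blend(eta):
    # eta = 0 stays close to the input image; eta = 1 follows the target text.
    return eta * e_tgt + (1 - eta) * e_opt

edited = blend(0.6)
```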

LightPainter: Interactive Portrait Relighting With Freehand Scribble
Mei, Yiqun and Zhang, He and Zhang, Xuaner and Zhang, Jianming and Shu, Zhixin and Wang, Yilin and Wei, Zijun and Yan, Shi and Jung, Hyun Joon and Patel, Vishal M.



Research question: Existing portrait relighting methods lack user interaction and precise lighting control when producing desired lighting effects.
Motivation: Propose LightPainter, a scribble-based relighting system that lets users interactively manipulate portrait lighting effects with ease.
Method: Two conditional neural networks: a delighting module that recovers geometry and albedo, optionally conditioned on skin tone, and a scribble-based module for relighting.
Results: Quantitative and qualitative experiments demonstrate high-quality, flexible portrait lighting editing. A user study against commercial lighting editing tools also shows consistent user preference for the method.

Recent portrait relighting methods have achieved realistic results of portrait lighting effects given a desired lighting representation such as an environment map. However, these methods are not intuitive for user interaction and lack precise lighting control. We introduce LightPainter, a scribble-based relighting system that allows users to interactively manipulate portrait lighting effect with ease. This is achieved by two conditional neural networks, a delighting module that recovers geometry and albedo optionally conditioned on skin tone, and a scribble-based module for relighting. To train the relighting module, we propose a novel scribble simulation procedure to mimic real user scribbles, which allows our pipeline to be trained without any human annotations. We demonstrate high-quality and flexible portrait lighting editing capability with both quantitative and qualitative experiments. User study comparisons with commercial lighting editing tools also demonstrate consistent user preference for our method.

Deep Curvilinear Editing: Commutative and Nonlinear Image Manipulation for Pretrained Deep Generative Model
Aoshima, Takehiro and Matsubara, Takashi



Research question: How to effectively perform semantic edits on generated images.
Motivation: Although deep learning methods such as generative adversarial networks (GANs) can produce high-quality images, they usually have no inherent way of editing generated images semantically.
Method: This study proposes deep curvilinear editing (DeCurvEd), a novel method that determines semantically commuting vector fields on the latent space.
Results: Owing to commutativity, editing multiple attributes depends only on the amounts, not on the order. Experiments show that the nonlinear and commutative nature of DeCurvEd provides higher-quality editing than previous methods.

Semantic editing of images is the fundamental goal of computer vision. Although deep learning methods, such as generative adversarial networks (GANs), are capable of producing high-quality images, they often do not have an inherent way of editing generated images semantically. Recent studies have investigated a way of manipulating the latent variable to determine the images to be generated. However, methods that assume linear semantic arithmetic have certain limitations in terms of the quality of image editing, whereas methods that discover nonlinear semantic pathways provide non-commutative editing, which is inconsistent when applied in different orders. This study proposes a novel method called deep curvilinear editing (DeCurvEd) to determine semantic commuting vector fields on the latent space. We theoretically demonstrate that owing to commutativity, the editing of multiple attributes depends only on the quantities and not on the order. Furthermore, we experimentally demonstrate that compared to previous methods, the nonlinear and commutative nature of DeCurvEd provides higher-quality editing.
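
The commutativity property can be demonstrated numerically: if each nonlinear edit is defined as a straight-line shift in a "flattened" space reached through an invertible map, edits commute regardless of order. The elementwise map below is a toy stand-in for the learned construction.

```python
import numpy as np

def f(z):                      # invertible elementwise nonlinearity (toy choice)
    return np.sinh(z)

def f_inv(w):
    return np.arcsinh(w)

def edit(z, attribute, amount):
    w = f(z)
    w[attribute] += amount     # linear edit in the flattened space
    return f_inv(w)

z = np.array([0.3, -1.2, 0.7])
ab = edit(edit(z.copy(), 0, 0.5), 1, -0.4)   # attribute 0, then attribute 1
ba = edit(edit(z.copy(), 1, -0.4), 0, 0.5)   # attribute 1, then attribute 0
```

Both orders produce the same latent, since the composition reduces to `f_inv(f(z) + 0.5*e0 - 0.4*e1)` either way.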

Learning Semantic-Aware Knowledge Guidance for Low-Light Image Enhancement
Wu, Yuhui and Pan, Chen and Wang, Guoqing and Yang, Yang and Wei, Jiwei and Li, Chongyi and Shen, Heng Tao



Research question: How to improve low-light image quality by improving illumination and producing normal-light images.
Motivation: Most existing methods enhance low-light images in a globally uniform manner, without taking the semantic information of different regions into account.
Method: Propose a novel semantic-aware knowledge-guided framework (SKF) that helps a low-light enhancement model learn rich and diverse priors. Semantic knowledge is incorporated from three aspects: a semantic-aware embedding module that integrates semantic priors in the feature representation space, a semantic-guided color histogram loss that preserves color consistency across instances, and a semantic-guided adversarial loss that produces more natural textures from semantic priors.
Results: Experiments show that models equipped with SKF significantly outperform the baselines on multiple datasets, and SKF generalizes well to different models and scenes.

Low-light image enhancement (LLIE) investigates how to improve illumination and produce normal-light images. The majority of existing methods improve low-light images via a global and uniform manner, without taking into account the semantic information of different regions. Without semantic priors, a network may easily deviate from a region's original color. To address this issue, we propose a novel semantic-aware knowledge-guided framework (SKF) that can assist a low-light enhancement model in learning rich and diverse priors encapsulated in a semantic segmentation model. We concentrate on incorporating semantic knowledge from three key aspects: a semantic-aware embedding module that wisely integrates semantic priors in feature representation space, a semantic-guided color histogram loss that preserves color consistency of various instances, and a semantic-guided adversarial loss that produces more natural textures by semantic priors. Our SKF is appealing in acting as a general framework for the LLIE task. Extensive experiments show that models equipped with the SKF significantly outperform the baselines on multiple datasets and our SKF generalizes to different models and scenes well. The code is available at Semantic-Aware-Low-Light-Image-Enhancement.
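
The semantic-guided color histogram loss can be sketched as comparing intensity histograms only within each semantic region; the segmentation map and images below are synthetic placeholders, and the exact histogram distance used in the paper may differ.

```python
import numpy as np

def region_hist_loss(pred, ref, seg, n_classes=3, bins=8):
    """Sum of per-region squared histogram differences."""
    loss = 0.0
    for c in range(n_classes):
        m = seg == c
        if not m.any():
            continue
        hp, _ = np.histogram(pred[m], bins=bins, range=(0, 1), density=True)
        hr, _ = np.histogram(ref[m], bins=bins, range=(0, 1), density=True)
        loss += np.mean((hp - hr) ** 2)
    return loss

rng = np.random.default_rng(0)
seg = rng.integers(0, 3, size=(16, 16))    # toy semantic map with 3 classes
ref = rng.uniform(size=(16, 16))           # reference (normal-light) image
pred = np.clip(ref + 0.2, 0, 1)            # globally shifted prediction
```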

Generalized Deep 3D Shape Prior via Part-Discretized Diffusion Process
Li, Yuhan and Dou, Yishun and Chen, Xuanhong and Ni, Bingbing and Sun, Yilin and Liu, Yutian and Wang, Fuzhen



Research question: Develop a generalized 3D shape generation prior applicable to multiple 3D tasks, including unconditional shape generation, point cloud completion, and cross-modality shape generation.
Motivation: To precisely capture locally fine-grained shape information, model the inherent structural dependencies among tokens, and suppress high-frequency shape-feature fluctuations.
Method: Use a vector-quantized variational autoencoder (VQ-VAE) with a compact codebook learned from a broad set of task training data to index local geometry, introduce a discrete diffusion generator to model the inherent structural dependencies among tokens, and develop a multi-frequency fusion module (MFM) to suppress high-frequency shape-feature fluctuations.
Results: These designs jointly equip the proposed 3D shape prior with high fidelity, diverse features, and cross-modality alignment, yielding superior performance on various 3D shape generation tasks.

We develop a generalized 3D shape generation prior model, tailored for multiple 3D tasks including unconditional shape generation, point cloud completion, and cross-modality shape generation, etc. On one hand, to precisely capture local fine detailed shape information, a vector quantized variational autoencoder (VQ-VAE) is utilized to index local geometry from a compactly learned codebook based on a broad set of task training data. On the other hand, a discrete diffusion generator is introduced to model the inherent structural dependencies among different tokens. In the meantime, a multi-frequency fusion module (MFM) is developed to suppress high-frequency shape feature fluctuations, guided by multi-frequency contextual information. The above designs jointly equip our proposed 3D shape prior model with high-fidelity, diverse features as well as the capability of cross-modality alignment, and extensive experiments have demonstrated superior performances on various 3D shape generation tasks.
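
The VQ-VAE quantization step at the heart of this pipeline replaces each local feature with its nearest codebook entry, producing the discrete tokens the diffusion generator then models; sizes in this sketch are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 8))        # 32 codes, each 8-dimensional
features = rng.normal(size=(5, 8))         # 5 local geometry features

# Squared Euclidean distance from every feature to every code.
dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)              # discrete token indices
quantized = codebook[tokens]               # quantized features fed downstream
```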

CLIP2Protect: Protecting Facial Privacy Using Text-Guided Makeup via Adversarial Latent Search
Shamshad, Fahad and Naseer, Muzammal and Nandakumar, Karthik



Research question: The success of deep learning in face recognition has raised privacy concerns; how to protect facial privacy without compromising user experience.
Motivation: Existing privacy-enhancing methods fail to generate natural images that protect facial privacy.
Method: Propose a novel two-step approach to facial privacy protection that relies on finding adversarial latent codes in the low-dimensional manifold of a pretrained generative model. The first step inverts the given face image into the latent space and fine-tunes the generative model to accurately reconstruct the image from its latent code, providing a good initialization for generating high-quality faces that resemble the given identity. Subsequently, user-defined makeup text prompts and identity-preserving regularization guide the search for adversarial codes in the latent space.
Results: Experiments show that faces generated by the method have stronger black-box transferability, with an absolute gain of 12.06% over the state-of-the-art facial privacy protection approach on the face verification task. Finally, the method's effectiveness is demonstrated on commercial face recognition systems.

The success of deep learning based face recognition systems has given rise to serious privacy concerns due to their ability to enable unauthorized tracking of users in the digital world. Existing methods for enhancing privacy fail to generate naturalistic images that can protect facial privacy without compromising user experience. We propose a novel two-step approach for facial privacy protection that relies on finding adversarial latent codes in the low-dimensional manifold of a pretrained generative model. The first step inverts the given face image into the latent space and finetunes the generative model to achieve an accurate reconstruction of the given image from its latent code. This step produces a good initialization, aiding the generation of high-quality faces that resemble the given identity. Subsequently, user-defined makeup text prompts and identity-preserving regularization are used to guide the search for adversarial codes in the latent space. Extensive experiments demonstrate that faces generated by our approach have stronger black-box transferability with an absolute gain of 12.06% over the state-of-the-art facial privacy protection approach under the face verification task. Finally, we demonstrate the effectiveness of the proposed approach for commercial face recognition systems. Our code is available at https://github.com/fahadshamshad/Clip2Protect.

PAniC-3D: Stylized Single-View 3D Reconstruction From Portraits of Anime Characters
Chen, Shuhong and Zhang, Kevin and Shi, Yichun and Wang, Heng and Zhu, Yiheng and Song, Guoxian and An, Sizhe and Kristjansson, Janus and Yang, Xiao and Zwicker, Matthias



Research question: How to reconstruct stylized 3D character heads directly from illustrated anime character portraits.
Motivation: The anime-style domain poses unique challenges for single-view reconstruction: compared to natural images of human heads, character portrait illustrations have hair and accessories with more complex and diverse geometry, and are shaded with non-photorealistic contour lines. In addition, there is a lack of 3D model and portrait illustration data suitable for training and evaluating this ambiguous stylized reconstruction task.
Method: The proposed PAniC-3D architecture crosses the illustration-to-3D domain gap with a line-filling model and represents complex geometry with a volumetric radiance field. The system is trained on two large new datasets (11.2k Vroid 3D models, 1k Vtuber portrait illustrations) and evaluated on a novel AnimeRecon benchmark of illustration-to-3D pairs.
Results: PAniC-3D significantly outperforms baseline methods and provides data to establish the task of stylized reconstruction from portrait illustrations.

We propose PAniC-3D, a system to reconstruct stylized 3D character heads directly from illustrated (p)ortraits of (ani)me (c)haracters. Our anime-style domain poses unique challenges to single-view reconstruction; compared to natural images of human heads, character portrait illustrations have hair and accessories with more complex and diverse geometry, and are shaded with non-photorealistic contour lines. In addition, there is a lack of both 3D model and portrait illustration data suitable to train and evaluate this ambiguous stylized reconstruction task. Facing these challenges, our proposed PAniC-3D architecture crosses the illustration-to-3D domain gap with a line-filling model, and represents sophisticated geometries with a volumetric radiance field. We train our system with two large new datasets (11.2k Vroid 3D models, 1k Vtuber portrait illustrations), and evaluate on a novel AnimeRecon benchmark of illustration-to-3D pairs. PAniC-3D significantly outperforms baseline methods, and provides data to establish the task of stylized reconstruction from portrait illustrations.

DCFace: Synthetic Face Generation With Dual Condition Diffusion Model
Kim, Minchul and Liu, Feng and Jain, Anil and Liu, Xiaoming



Research question: Generating synthetic datasets for training face recognition models is challenging because it requires not only creating high-fidelity images but also generating multiple images of the same subject under different factors (e.g., pose, illumination, expression, aging, and occlusion).
Motivation: Previous works use GANs or 3D models to generate synthetic datasets; this work approaches the problem by combining subject appearance (ID) and external factor (style) conditions.
Method: Propose a Dual Condition Face Generator (DCFace) based on a diffusion model, whose novel patch-wise style extractor and time-step-dependent ID loss enable DCFace to precisely control the generation of the same subject's face images under different styles.
Results: Face recognition models trained on synthetic images from DCFace improve verification accuracy over previous works by 6.11% on average on 4 of the 5 test datasets LFW, CFP-FP, CPLFW, AgeDB, and CALFW.

Generating synthetic datasets for training face recognition models is challenging because dataset generation entails more than creating high fidelity images. It involves generating multiple images of the same subjects under different factors (e.g., variations in pose, illumination, expression, aging and occlusion) that follow the real-image conditional distribution. Previous works have studied the generation of synthetic datasets using GAN or 3D models. In this work, we approach the problem from the aspect of combining subject appearance (ID) and external factor (style) conditions. These two conditions provide a direct way to control the inter-class and intra-class variations. To this end, we propose a Dual Condition Face Generator (DCFace) based on a diffusion model. Our novel Patch-wise style extractor and Time-step dependent ID loss enable DCFace to consistently produce face images of the same subject under different styles with precise control. Face recognition models trained on synthetic images from the proposed DCFace provide higher verification accuracies compared to previous works by 6.11% on average in 4 out of 5 test datasets, LFW, CFP-FP, CPLFW, AgeDB and CALFW. Model, code, and synthetic dataset are available at https://github.com/mk-minchul/dcface

Perception-Oriented Single Image Super-Resolution Using Optimal Objective Estimation
Park, Seung Ho and Moon, Young Su and Cho, Nam Ik



Research question: How to optimize single-image super-resolution (SISR) networks for high-contrast outputs.
Motivation: Although SISR networks trained with perceptual and adversarial losses can produce high-contrast outputs, a single perceptual loss is insufficient to accurately restore locally varying shapes in images, often producing unnatural textures or details.
Method: This paper proposes a new SISR framework that applies optimal objectives to each region to generate plausible results across the whole high-resolution output. The framework comprises two models: a predictive model that infers an optimal objective map for a given low-resolution (LR) input, and a generative model that applies a target objective map to produce the corresponding SR output. The generative model is trained over an objective trajectory representing a set of essential objectives, enabling a single network to learn the various SR results corresponding to combined losses on the trajectory. The predictive model is trained on pairs of LR images and corresponding optimal objective maps searched from the objective trajectory.
Results: Experiments on five benchmarks show the method outperforms state-of-the-art perception-driven SR methods in LPIPS, DISTS, PSNR, and SSIM; visual results also demonstrate its superiority in perception-oriented reconstruction. Code is available at https://github.com/seungho-snu/SROOE.

Single-image super-resolution (SISR) networks trained with perceptual and adversarial losses provide high-contrast outputs compared to those of networks trained with distortion-oriented losses, such as L1 or L2. However, it has been shown that using a single perceptual loss is insufficient for accurately restoring locally varying diverse shapes in images, often generating undesirable artifacts or unnatural details. For this reason, combinations of various losses, such as perceptual, adversarial, and distortion losses, have been attempted, yet it remains challenging to find optimal combinations. Hence, in this paper, we propose a new SISR framework that applies optimal objectives for each region to generate plausible results in overall areas of high-resolution outputs. Specifically, the framework comprises two models: a predictive model that infers an optimal objective map for a given low-resolution (LR) input and a generative model that applies a target objective map to produce the corresponding SR output. The generative model is trained over our proposed objective trajectory representing a set of essential objectives, which enables the single network to learn various SR results corresponding to combined losses on the trajectory. The predictive model is trained using pairs of LR images and corresponding optimal objective maps searched from the objective trajectory. Experimental results on five benchmarks show that the proposed method outperforms state-of-the-art perception-driven SR methods in LPIPS, DISTS, PSNR, and SSIM metrics. The visual results also demonstrate the superiority of our method in perception-oriented reconstruction. The code is available at https://github.com/seungho-snu/SROOE.
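
Applying a per-region objective map can be sketched as a pixel-wise interpolation between loss terms; both terms, the map, and the images below are synthetic stand-ins for the learned components.

```python
import numpy as np

rng = np.random.default_rng(0)
sr = rng.uniform(size=(8, 8))              # super-resolved output (toy)
hr = rng.uniform(size=(8, 8))              # ground-truth high-resolution image
t_map = rng.uniform(size=(8, 8))           # predicted objective map in [0, 1]

l_distortion = np.abs(sr - hr)             # distortion-oriented term (L1-style)
l_perceptual = (sr - hr) ** 2              # stand-in for a perceptual term

# Each pixel is scored by its own blend of objectives.
loss = float(np.mean((1 - t_map) * l_distortion + t_map * l_perceptual))
```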

GP-VTON: Towards General Purpose Virtual Try-On via Collaborative Local-Flow Global-Parsing Learning
Xie, Zhenyu and Huang, Zaiyu and Dong, Xin and Zhao, Fuwei and Dong, Haoye and Zhang, Xijin and Zhu, Feida and Liang, Xiaodan



Research question: Existing image-based virtual try-on methods fail to preserve the semantics of different garment parts under challenging inputs (e.g., intricate human poses, difficult garments), and direct warping causes texture distortion.
Motivation: To address these problems and push virtual try-on toward real-world applications.
Method: Propose GP-VTON, a general-purpose virtual try-on framework built on a Local-Flow Global-Parsing (LFGP) warping module and a Dynamic Gradient Truncation (DGT) training strategy. LFGP warps garment parts individually with local flows and assembles the locally warped results via global garment parsing, producing reasonable warped parts and a semantically correct intact garment even under challenging inputs. The DGT strategy dynamically truncates gradients in the overlap region between the warped garment and the preserved area, effectively avoiding the texture-squeezing problem.
Results: Experiments show GP-VTON outperforms existing state-of-the-art methods on two high-resolution benchmarks, and it extends easily to multi-category scenarios with joint training on data from different garment categories.

Image-based Virtual Try-ON aims to transfer an in-shop garment onto a specific person. Existing methods employ a global warping module to model the anisotropic deformation for different garment parts, which fails to preserve the semantic information of different parts when receiving challenging inputs (e.g., intricate human poses, difficult garments). Moreover, most of them directly warp the input garment to align with the boundary of the preserved region, which usually requires texture squeezing to meet the boundary shape constraint and thus leads to texture distortion. The above inferior performance hinders existing methods from real-world applications. To address these problems and take a step towards real-world virtual try-on, we propose a General-Purpose Virtual Try-ON framework, named GP-VTON, by developing an innovative Local-Flow Global-Parsing (LFGP) warping module and a Dynamic Gradient Truncation (DGT) training strategy. Specifically, compared with the previous global warping mechanism, LFGP employs local flows to warp garment parts individually, and assembles the local warped results via the global garment parsing, resulting in reasonable warped parts and a semantic-correct intact garment even with challenging inputs. On the other hand, our DGT training strategy dynamically truncates the gradient in the overlap area and the warped garment is no longer required to meet the boundary constraint, which effectively avoids the texture squeezing problem. Furthermore, our GP-VTON can be easily extended to multi-category scenarios and jointly trained by using data from different garment categories. Extensive experiments on two high-resolution benchmarks demonstrate our superiority over the existing state-of-the-art methods.
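
The gradient-truncation idea can be shown in miniature: zero the loss gradient inside an overlap mask so the warp is not forced to squeeze texture to satisfy the boundary there. All arrays below are illustrative placeholders.

```python
import numpy as np

warped = np.array([[0.2, 0.8],
                   [0.5, 0.1]])            # warped garment region (toy values)
target = np.array([[0.0, 1.0],
                   [0.0, 0.0]])            # boundary-aligned target (toy values)
overlap = np.array([[False, True],
                    [False, False]])       # dynamically detected overlap area

grad = 2.0 * (warped - target)             # gradient of an L2 warping loss
grad[overlap] = 0.0                        # truncate gradients in the overlap
```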

Video Probabilistic Diffusion Models in Projected Latent Space
Yu, Sihyun and Sohn, Kihyuk and Kim, Subin and Shin, Jinwoo



Research question: How to efficiently generate high-resolution, temporally coherent videos?
Motivation: Despite remarkable progress in deep generative models, synthesizing high-resolution, temporally coherent videos remains a challenge due to their high dimensionality and complex spatio-temporal dynamics with large spatial variations.
Method: Propose projected latent video diffusion models (PVDM), a probabilistic diffusion model that learns the video distribution in a low-dimensional latent space and can therefore be trained efficiently on high-resolution videos under limited resources. Specifically, PVDM has two components: (a) an autoencoder that projects a given video into 2D-shaped latent vectors, factorizing the complex cubic structure of video pixels, and (b) a diffusion model architecture specialized for the new factorized latent space, with a training/sampling procedure that synthesizes videos of arbitrary length with a single model.
Results: Experiments on popular video generation datasets show PVDM outperforms previous video synthesis methods; for example, PVDM obtains an FVD score of 639.7 on the UCF-101 long-video (128-frame) generation benchmark, improving on the previous state of the art of 1773.4.

Despite the remarkable progress in deep generative models, synthesizing high-resolution and temporally coherent videos still remains a challenge due to their high-dimensionality and complex temporal dynamics along with large spatial variations. Recent works on diffusion models have shown their potential to solve this challenge, yet they suffer from severe computation- and memory-inefficiency that limit the scalability. To handle this issue, we propose a novel generative model for videos, coined projected latent video diffusion models (PVDM), a probabilistic diffusion model which learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources. Specifically, PVDM is composed of two components: (a) an autoencoder that projects a given video as 2D-shaped latent vectors that factorize the complex cubic structure of video pixels and (b) a diffusion model architecture specialized for our new factorized latent space and the training/sampling procedure to synthesize videos of arbitrary length with a single model. Experiments on popular video generation datasets demonstrate the superiority of PVDM compared with previous video synthesis methods; e.g., PVDM obtains the FVD score of 639.7 on the UCF-101 long video (128 frames) generation benchmark, improving upon the prior state-of-the-art score of 1773.4.
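
The payoff of projecting a cubic video tensor into 2D-shaped latents can be sketched with simple axis projections; the paper uses a learned autoencoder rather than the naive averaging below, so this only illustrates the shapes and the dimensionality saving.

```python
import numpy as np

video = np.random.default_rng(0).normal(size=(16, 32, 32))   # (T, H, W)

z_hw = video.mean(axis=0)    # (H, W): shared spatial content
z_tw = video.mean(axis=1)    # (T, W): dynamics along the width axis
z_th = video.mean(axis=2)    # (T, H): dynamics along the height axis

latent_entries = z_hw.size + z_tw.size + z_th.size   # 2048 entries vs 16384 pixels
```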

NeuWigs: A Neural Dynamic Model for Volumetric Hair Capture and Animation
Wang, Ziyan and Nam, Giljoo and Stuyck, Tuur and Lombardi, Stephen and Cao, Chen and Saragih, Jason and Zollhöfer, Michael and Hodgins, Jessica and Lassner, Christoph



Research question: Capturing and animating human hair are two major challenges in creating realistic avatars for virtual reality.
Motivation: Both problems are highly challenging because hair has complex geometry, appearance, and motion.
Method: This paper presents a two-stage, data-driven approach that models hair independently from the head. The first stage, state compression, learns a low-dimensional latent space of 3D hair states containing motion and appearance via a novel autoencoder-as-a-tracker strategy; multi-view hair segmentation masks combined with a differentiable volumetric renderer better disentangle hair from head in appearance learning. The second stage learns a novel hair dynamics model that performs temporal hair transfer based on the discovered latent codes; to enforce higher stability while driving the dynamics model, the 3D point-cloud autoencoder from the compression stage denoises the hair state.
Results: The model outperforms the state of the art in novel view synthesis and can create novel hair animations without relying on hair observations as a driving signal.

The capture and animation of human hair are two of the major challenges in the creation of realistic avatars for virtual reality. Both problems are highly challenging, because hair has complex geometry and appearance, as well as exhibits challenging motion. In this paper, we present a two-stage approach that models hair independently from the head to address these challenges in a data-driven manner. The first stage, state compression, learns a low-dimensional latent space of 3D hair states containing motion and appearance, via a novel autoencoder-as-a-tracker strategy. To better disentangle the hair and head in appearance learning, we employ multi-view hair segmentation masks in combination with a differentiable volumetric renderer. The second stage learns a novel hair dynamics model that performs temporal hair transfer based on the discovered latent codes. To enforce higher stability while driving our dynamics model, we employ the 3D point-cloud autoencoder from the compression stage for de-noising of the hair state. Our model outperforms the state of the art in novel view synthesis and is capable of creating novel hair animations without having to rely on hair observations as a driving signal.

One-Shot High-Fidelity Talking-Head Synthesis With Deformable Neural Radiance Field
Li, Weichuang and Zhang, Longhao and Wang, Dong and Zhao, Bin and Wang, Zhigang and Chen, Mulin and Zhang, Bang and Wang, Zhongjian and Bo, Liefeng and Li, Xuelong



Research question: How to generate head images that preserve the identity information of a source image while imitating the motion of a driving image?
Motivation: Most previous methods rely primarily on 2D representations and thus inevitably suffer from face distortion when large head rotations are encountered.
Method: This paper proposes HiDe-NeRF, which achieves high-fidelity, free-view talking-head synthesis by representing the 3D dynamic scene as a canonical appearance field and an implicit deformation field.
Results: Experiments show that the proposed method produces better results than previous works.

Talking head generation aims to generate faces that maintain the identity information of the source image and imitate the motion of the driving image. Most pioneering methods rely primarily on 2D representations and thus will inevitably suffer from face distortion when large head rotations are encountered. Recent works instead employ explicit 3D structural representations or implicit neural rendering to improve performance under large pose changes. Nevertheless, the fidelity of identity and expression is not so desirable, especially for novel-view synthesis. In this paper, we propose HiDe-NeRF, which achieves high-fidelity and free-view talking-head synthesis. Drawing on the recently proposed Deformable Neural Radiance Fields, HiDe-NeRF represents the 3D dynamic scene as a canonical appearance field and an implicit deformation field, where the former comprises the canonical source face and the latter models the driving pose and expression. In particular, we improve fidelity from two aspects: (i) to enhance identity expressiveness, we design a generalized appearance module that leverages multi-scale volume features to preserve face shape and details; (ii) to improve expression preciseness, we propose a lightweight deformation module that explicitly decouples the pose and expression to enable precise expression modeling. Extensive experiments demonstrate that our proposed approach can generate better results than previous works. Project page: https://www.waytron.net/hidenerf/

Conditional Image-to-Video Generation With Latent Flow Diffusion Models
Ni, Haomiao and Shi, Changhao and Li, Kai and Huang, Sharon X. and Min, Martin Renqiang



Research question: In conditional image-to-video (cI2V) generation, how to synthesize a new plausible video with realistic spatial appearance and temporal dynamics, starting from a given image and a condition (e.g., an action class label).
Motivation: The key challenge of cI2V lies in simultaneously generating realistic spatial appearance and temporal dynamics corresponding to the given image and condition; existing direct-synthesis methods cannot fully exploit the spatial content of the given image.
Method: This paper proposes conditional image-to-video generation with novel latent flow diffusion models (LFDM), which synthesize an optical-flow sequence in the latent space based on the given condition to warp the given image. Compared to direct-synthesis approaches, LFDM better synthesizes spatial details and temporal motion by fully utilizing the spatial content of the given image and warping it in the latent space according to the generated temporally coherent flow.
Results: Experiments show that LFDM outperforms existing methods on multiple datasets. Moreover, LFDM can be easily adapted to new domains by simply fine-tuning the image decoder.

Conditional image-to-video (cI2V) generation aims to synthesize a new plausible video starting from an image (e.g., a person's face) and a condition (e.g., an action class label like smile). The key challenge of the cI2V task lies in the simultaneous generation of realistic spatial appearance and temporal dynamics corresponding to the given image and condition. In this paper, we propose an approach for cI2V using novel latent flow diffusion models (LFDM) that synthesize an optical flow sequence in the latent space based on the given condition to warp the given image. Compared to previous direct-synthesis-based works, our proposed LFDM can better synthesize spatial details and temporal motion by fully utilizing the spatial content of the given image and warping it in the latent space according to the generated temporally-coherent flow. The training of LFDM consists of two separate stages: (1) an unsupervised learning stage to train a latent flow auto-encoder for spatial content generation, including a flow predictor to estimate latent flow between pairs of video frames, and (2) a conditional learning stage to train a 3D-UNet-based diffusion model (DM) for temporal latent flow generation. Unlike previous DMs operating in pixel space or latent feature space that couples spatial and temporal information, the DM in our LFDM only needs to learn a low-dimensional latent flow space for motion generation, thus being more computationally efficient. We conduct comprehensive experiments on multiple datasets, where LFDM consistently outperforms prior arts. Furthermore, we show that LFDM can be easily adapted to new domains by simply finetuning the image decoder. Our code is available at https://github.com/nihaomiao/CVPR23_LFDM.
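
The core latent-space operation, warping a latent map with a flow field, can be sketched with nearest-neighbor backward warping; real implementations use differentiable bilinear sampling, so this toy version only illustrates the mechanics.

```python
import numpy as np

def warp(latent, flow):
    """Backward-warp a 2D latent map by a (H, W, 2) flow field (nearest neighbor)."""
    h, w = latent.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return latent[src_y, src_x]

latent = np.arange(16.0).reshape(4, 4)
flow = np.zeros((4, 4, 2))
flow[..., 1] = 1.0                         # every pixel samples its right neighbor
warped = warp(latent, flow)                # first row becomes [1, 2, 3, 3]
```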

Towards Universal Fake Image Detectors That Generalize Across Generative Models
Ojha, UtkarshandLi, YuhengandLee, YongJae



Research question: Existing fake image detectors fail to identify fake images produced by newly emerging generative models.
Motivation: Current detection methods train a deep network for real-vs-fake classification, but this approach generalizes poorly when confronted with new generative models.
Method: Perform real-vs-fake classification without learning, i.e., without a feature space explicitly trained to distinguish real from fake images; nearest-neighbor classification and linear probing serve as instantiations of this idea.
Results: In the feature space of a large pretrained vision-language model, even the very simple nearest-neighbor baseline generalizes surprisingly well to fake images from a wide variety of generative models, e.g., improving over the previous state of the art by +15.07 mAP and +25.90% accuracy on unseen diffusion and autoregressive models.

With generative models proliferating at a rapid rate, there is a growing need for general purpose fake image detectors. In this work, we first show that the existing paradigm, which consists of training a deep network for real-vs-fake classification, fails to detect fake images from newer breeds of generative models when trained to detect GAN fake images. Upon analysis, we find that the resulting classifier is asymmetrically tuned to detect patterns that make an image fake. The real class becomes a 'sink' class holding anything that is not fake, including generated images from models not accessible during training. Building upon this discovery, we propose to perform real-vs-fake classification without learning; i.e., using a feature space not explicitly trained to distinguish real from fake images. We use nearest neighbor and linear probing as instantiations of this idea. When given access to the feature space of a large pretrained vision-language model, the very simple baseline of nearest neighbor classification has surprisingly good generalization ability in detecting fake images from a wide variety of generative models; e.g., it improves upon the SoTA by +15.07 mAP and +25.90% acc when tested on unseen diffusion and autoregressive models.
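The nearest-neighbor baseline at the heart of this paper reduces to comparing a query's feature against banks of real and fake features. A toy sketch under stated assumptions: synthetic Gaussian vectors stand in for the CLIP image embeddings the paper actually uses, and `nn_fake_score` is a hypothetical helper, not from the paper.

```python
import numpy as np

def nn_fake_score(feat, real_bank, fake_bank):
    """Score a query feature by nearest-neighbor cosine distance to each bank.

    Returns > 0 when the query is closer to the fake bank (predicted fake).
    The feature extractor (e.g., a pretrained CLIP encoder) is not shown.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    feat, real_bank, fake_bank = map(normalize, (feat, real_bank, fake_bank))
    d_real = 1.0 - (real_bank @ feat).max()  # cosine distance to nearest real
    d_fake = 1.0 - (fake_bank @ feat).max()  # cosine distance to nearest fake
    return d_real - d_fake

rng = np.random.default_rng(0)
real = rng.normal(loc=+1.0, size=(50, 8))   # stand-in "real" feature bank
fake = rng.normal(loc=-1.0, size=(50, 8))   # stand-in "fake" feature bank
query = np.full(8, -1.0)                    # lies near the fake cluster
score = nn_fake_score(query, real, fake)
```

Because the feature space was never trained on real-vs-fake, the same banks can generalize across generative model families, which is the paper's central observation.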

DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model
Kim, GwanghyunandChun, SeYoung



Research question: How to adapt 3D generative models to new domains using only text guidance, while preserving the sample diversity of the source model.
Motivation: Text-guided domain adaptation with CLIP avoids collecting massive target-domain images and camera distribution information, but the deterministic CLIP text encoder causes catastrophic diversity loss, and 3D adaptation additionally suffers from inferior text-image correspondence and poor image quality.
Method: Propose DATID-3D, a domain adaptation method tailored for 3D generative models that uses text-to-image diffusion models to synthesize diverse images per text prompt, fine-tuning a state-of-the-art 3D generator without additional target-domain images or camera information.
Results: The pipeline synthesizes high-resolution, multi-view-consistent images in text-guided target domains, outperforming existing text-guided domain adaptation methods in diversity and text-image correspondence, and enables manipulations such as one-shot instance-selected adaptation and single-view manipulated 3D reconstruction.

Recent 3D generative models have achieved remarkable performance in synthesizing high resolution photorealistic images with view consistency and detailed 3D shapes, but training them for diverse domains is challenging since it requires massive training images and their camera distribution information. Text-guided domain adaptation methods have shown impressive performance on converting the 2D generative model on one domain into the models on other domains with different styles by leveraging the CLIP (Contrastive Language-Image Pre-training), rather than collecting massive datasets for those domains. However, one drawback of these methods is that the sample diversity in the original generative model is not well preserved in the domain-adapted generative models due to the deterministic nature of the CLIP text encoder. Text-guided domain adaptation will be even more challenging for 3D generative models not only because of catastrophic diversity loss, but also because of inferior text-image correspondence and poor image quality. Here we propose DATID-3D, a domain adaptation method tailored for 3D generative models using text-to-image diffusion models that can synthesize diverse images per text prompt without collecting additional images and camera information for the target domain. Unlike 3D extensions of prior text-guided domain adaptation methods, our novel pipeline was able to fine-tune the state-of-the-art 3D generator of the source domain to synthesize high resolution, multi-view consistent images in text-guided targeted domains without additional data, outperforming the existing text-guided domain adaptation methods in diversity and text-image correspondence. Furthermore, we propose and demonstrate diverse 3D image manipulations such as one-shot instance-selected adaptation and single-view manipulated 3D reconstruction to fully enjoy diversity in text.

ShadowDiffusion: When Degradation Prior Meets Diffusion Model for Shadow Removal
Guo, LanqingandWang, ChongandYang, WenhanandHuang, SiyuandWang, YufeiandPfister, HanspeterandWen, Bihan



Research question: Deep learning methods achieve promising results for image shadow removal, but the restored images still suffer from severe boundary artifacts.
Motivation: The boundary artifacts stem from the lack of a degradation prior and insufficient modeling capacity.
Method: Propose a unified diffusion framework that integrates both image and degradation priors for highly effective shadow removal. A shadow degradation model is first formulated, which inspires a novel unrolling diffusion model, ShadowDiffusion, that improves shadow-removal capacity by progressively refining the desired output; the estimated shadow mask is also refined as an auxiliary task.
Results: Extensive experiments on three popular public datasets show a significant PSNR improvement over state-of-the-art methods, from 31.69 dB to 34.73 dB.

Recent deep learning methods have achieved promising results in image shadow removal. However, their restored images still suffer from unsatisfactory boundary artifacts, due to the lack of degradation prior and the deficiency in modeling capacity. Our work addresses these issues by proposing a unified diffusion framework that integrates both the image and degradation priors for highly effective shadow removal. In detail, we first propose a shadow degradation model, which inspires us to build a novel unrolling diffusion model, dubbed ShadowDiffusion. It remarkably improves the model's capacity in shadow removal via progressively refining the desired output with both degradation prior and diffusive generative prior, which by nature can serve as a new strong baseline for image restoration. Furthermore, ShadowDiffusion progressively refines the estimated shadow mask as an auxiliary task of the diffusion generator, which leads to more accurate and robust shadow-free image generation. We conduct extensive experiments on three popular public datasets, including ISTD, ISTD+, and SRD, to validate our method's effectiveness. Compared to the state-of-the-art methods, our model achieves a significant improvement in terms of PSNR, increasing from 31.69 dB to 34.73 dB on the SRD dataset.

FFHQ-UV: Normalized Facial UV-Texture Dataset for 3D Face Reconstruction
Bai, HaoranandKang, DiandZhang, HaoxianandPan, JinshanandBao, Linchao



Research question: How to automatically generate high-quality facial UV-texture maps from a large-scale face image dataset, suitable for rendering realistic 3D face models under various lighting conditions.
Motivation: Existing facial UV-texture datasets are uneven in quality and lack diversity, so an automatic and robust method is needed to produce high-quality facial UV-texture maps from large-scale face image datasets.
Method: Use StyleGAN-based facial image editing to generate multi-view normalized face images from a single input image, then apply an elaborated UV-texture extraction, correction, and completion procedure to produce high-quality UV maps from the normalized images.
Results: Experiments show the method improves reconstruction accuracy over state-of-the-art approaches, and the generated high-quality texture maps are ready for realistic rendering.

We present a large-scale facial UV-texture dataset that contains over 50,000 high-quality texture UV-maps with even illuminations, neutral expressions, and cleaned facial regions, which are desired characteristics for rendering realistic 3D face models under different lighting conditions. The dataset is derived from a large-scale face image dataset namely FFHQ, with the help of our fully automatic and robust UV-texture production pipeline. Our pipeline utilizes the recent advances in StyleGAN-based facial image editing approaches to generate multi-view normalized face images from single-image inputs. An elaborated UV-texture extraction, correction, and completion procedure is then applied to produce high-quality UV-maps from the normalized face images. Compared with existing UV-texture datasets, our dataset has more diverse and higher-quality texture maps. We further train a GAN-based texture decoder as the nonlinear texture basis for parametric fitting based 3D face reconstruction. Experiments show that our method improves the reconstruction accuracy over state-of-the-art approaches, and more importantly, produces high-quality texture maps that are ready for realistic renderings. The dataset, code, and pre-trained texture decoder are publicly available at https://github.com/csbhr/FFHQ-UV.

Master: Meta Style Transformer for Controllable Zero-Shot and Few-Shot Artistic Style Transfer
Tang, HaoandLiu, SonghuaandLin, TianweiandHuang, ShaoliandLi, FuandHe, DongliangandWang, Xinchao



Research question: How to apply Transformer models to artistic style transfer while addressing their excessive parameter count and content distortion.
Motivation: Transformers perform well on artistic style transfer, but the multi-layer structure inflates the number of parameters and the training burden, and directly fusing content and style features tends to distort content.
Method: Propose Master, a novel Transformer that shares parameters across layers, introduces a learnable scaling operation, and adds a meta-learning scheme, thereby reducing parameters, stabilizing training, preserving content-feature similarity, and ensuring stylization quality.
Results: Experiments demonstrate Master's superiority under both zero-shot and few-shot style transfer settings, and it achieves text-guided few-shot style transfer.

Transformer-based models have recently achieved favorable performance in artistic style transfer thanks to their global receptive field and powerful multi-head/layer attention operations. Nevertheless, the over-parameterized multi-layer structure increases parameters significantly and thus presents a heavy burden for training. Moreover, for the task of style transfer, the vanilla Transformer, which fuses content and style features by residual connections, is prone to content-wise distortion. In this paper, we devise a novel Transformer model termed Master specifically for style transfer. On the one hand, in the proposed model, different Transformer layers share a common group of parameters, which (1) reduces the total number of parameters, (2) leads to more robust training convergence, and (3) makes it easy to control the degree of stylization by freely tuning the number of stacked layers during inference. On the other hand, different from the vanilla version, we adopt a learnable scaling operation on content features before content-style feature interaction, which better preserves the original similarity between a pair of content features while ensuring the stylization quality. We also propose a novel meta learning scheme for the proposed model so that it can not only work in the typical setting of arbitrary style transfer, but is also adaptable to the few-shot setting, by fine-tuning only the Transformer encoder layer in the few-shot stage for one specific style. Text-guided few-shot style transfer is achieved for the first time with the proposed framework. Extensive experiments demonstrate the superiority of Master under both zero-shot and few-shot style transfer settings.

Affordance Diffusion: Synthesizing Hand-Object Interactions
Ye, YufeiandLi, XuetingandGupta, AbhinavandDeMello, ShaliniandBirchfield, StanandSong, JiamingandTulsiani, ShubhamandLiu, Sifei



Research question: This paper tackles the synthesis of complex interactions, such as a hand interacting with an object, in image generation.
Motivation: Current image synthesis methods are mostly limited to text- or image-conditioned generation and have limited ability to synthesize complex hand-object interactions.
Method: A two-step generative approach: a LayoutNet first samples an articulation-agnostic hand-object-interaction layout, then a ContentNet synthesizes images of a hand grasping the object given the predicted layout. Both networks are built on a large-scale pretrained diffusion model to exploit its latent representation.
Results: Experiments show the method generalizes better to novel objects and also performs well on out-of-distribution in-the-wild scenes.

Recent successes in image synthesis are powered by large-scale diffusion models. However, most methods are currently limited to either text- or image-conditioned generation for synthesizing an entire image, texture transfer or inserting objects into a user-specified region. In contrast, in this work we focus on synthesizing complex interactions (i.e., an articulated hand) with a given object. Given an RGB image of an object, we aim to hallucinate plausible images of a human hand interacting with it. We propose a two-step generative approach that leverages a LayoutNet that samples an articulation-agnostic hand-object-interaction layout, and a ContentNet that synthesizes images of a hand grasping the object given the predicted layout. Both are built on top of a large-scale pretrained diffusion model to make use of its latent representation. Compared to baselines, the proposed method is shown to generalize better to novel objects and perform surprisingly well on out-of-distribution in-the-wild scenes. The resulting system allows us to predict descriptive affordance information, such as hand articulation and approaching orientation.

Towards Artistic Image Aesthetics Assessment: A Large-Scale Dataset and a New Method
Yi, RanandTian, HaoyuanandGu, ZhihaoandLai, Yu-KunandRosin, PaulL.



Research question: How to assess the aesthetic quality of artistic images, given that existing datasets contain relatively few artworks.
Motivation: Current artistic image aesthetics assessment (AIAA) research relies on large-scale datasets, but the existing datasets are ill-suited to artistic images, so a new approach is needed.
Method: Propose SAAN (Style-specific Art Assessment Network), which effectively extracts and exploits both style-specific and generic aesthetic information to evaluate artistic images, and introduce a large-scale AIAA dataset, the Boldbrush Artistic Image Dataset (BAID).
Results: Experiments show SAAN outperforms existing IAA methods on the proposed BAID dataset; the method and dataset can serve as a foundation for future AIAA work and inspire further research in this field.

Image aesthetics assessment (IAA) is a challenging task due to its highly subjective nature. Most of the current studies rely on large-scale datasets (e.g., AVA and AADB) to learn a general model for all kinds of photography images. However, little light has been shed on measuring the aesthetic quality of artistic images, and the existing datasets only contain relatively few artworks. Such a defect is a great obstacle to the aesthetic assessment of artistic images. To fill the gap in the field of artistic image aesthetics assessment (AIAA), we first introduce a large-scale AIAA dataset: Boldbrush Artistic Image Dataset (BAID), which consists of 60,337 artistic images covering various art forms, with more than 360,000 votes from online users. We then propose a new method, SAAN (Style-specific Art Assessment Network), which can effectively extract and utilize style-specific and generic aesthetic information to evaluate artistic images. Experiments demonstrate that our proposed approach outperforms existing IAA methods on the proposed BAID dataset according to quantitative comparisons. We believe the proposed dataset and method can serve as a foundation for future AIAA works and inspire more research in this field.

Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
Tumanyan, NarekandGeyer, MichalandBagon, ShaiandDekel, Tali



Research question: How to use large-scale text-to-image generative models for real-world content creation while giving users fine-grained control over the generated content.
Motivation: Although large-scale text-to-image models have made breakthrough progress in synthesizing diverse images of highly complex visual concepts, giving users fine-grained control over the generated content remains a key challenge for real-world content creation.
Method: A new framework that takes text-to-image synthesis into the realm of image-to-image translation: given a guidance image and a target text prompt, a pretrained text-to-image diffusion model generates a new image that complies with the target text while preserving the semantic layout of the guidance image. Fine-grained control over the generated structure is achieved by manipulating the model's internal spatial features and self-attention.
Results: The method demonstrates high-quality results on versatile text-guided image translation tasks, including turning sketches, rough drawings, and animations into realistic images, changing the class and appearance of objects in a given image, and modifying global attributes such as lighting and color.

Large-scale text-to-image generative models have been a revolutionary breakthrough in the evolution of generative AI, synthesizing diverse images with highly complex visual concepts. However, a pivotal challenge in leveraging such models for real-world content creation is providing users with control over the generated content. In this paper, we present a new framework that takes text-to-image synthesis to the realm of image-to-image translation -- given a guidance image and a target text prompt as input, our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text, while preserving the semantic layout of the guidance image. Specifically, we observe and empirically demonstrate that fine-grained control over the generated structure can be achieved by manipulating spatial features and their self-attention inside the model. This results in a simple and effective approach, where features extracted from the guidance image are directly injected into the generation process of the translated image, requiring no training or fine-tuning. We demonstrate high-quality results on versatile text-guided image translation tasks, including translating sketches, rough drawings and animations into realistic images, changing the class and appearance of objects in a given image, and modifying global qualities such as lighting and color.
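The structure-preserving trick, injecting the guidance image's spatial features and self-attention into the generation process, can be illustrated with a bare self-attention layer whose queries and keys come from the guidance features while values carry the target appearance. An illustrative numpy sketch, a simplification rather than the paper's implementation, which operates on features inside a pretrained diffusion UNet with no training involved:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def injected_self_attention(guidance_feats, target_values):
    """Self-attention with structure injection (simplified sketch).

    Queries and keys come from the guidance image's features, so the attention
    map (and hence the spatial structure) follows the guidance; values come
    from the translated image's features, which carry the new appearance.
    """
    q = k = guidance_feats                 # (n_tokens, d)
    v = target_values                      # (n_tokens, d)
    scale = 1.0 / np.sqrt(q.shape[-1])
    attn = softmax(q @ k.T * scale)        # (n_tokens, n_tokens), rows sum to 1
    return attn @ v

rng = np.random.default_rng(1)
guide = rng.normal(size=(16, 4))   # tokens from the guidance image
vals = rng.normal(size=(16, 4))    # tokens carrying the target appearance
out = injected_self_attention(guide, vals)
```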

Local 3D Editing via 3D Distillation of CLIP Knowledge
Hyung, JunhaandHwang, SungwonandKim, DaejinandLee, HyunjiandChoo, Jaegul



Research question: How to edit and manipulate 3D content effectively, in particular performing localized edits while preserving visual quality.
Motivation: Existing 3D GANs excel at generating photorealistic 3D content, but editing it with control handles such as semantic maps tends to degrade visual quality, and text-guided editing methods, while promising, often lack locality.
Method: Propose Local Editing NeRF (LENeRF), which requires only text input for fine-grained, localized manipulation. Three add-on modules, the Latent Residual Mapper, the Attention Field Network, and the Deformation Network, are jointly used to edit 3D features locally by estimating a 3D attention field. The 3D attention field is learned without supervision by distilling CLIP's zero-shot mask generation capability to 3D with multi-view guidance.
Results: Experiments show LENeRF achieves clear improvements across diverse tasks, demonstrating its superiority in 3D content editing both quantitatively and qualitatively.

3D content manipulation is an important computer vision task with many real-world applications (e.g., product design, cartoon generation, and 3D Avatar editing). Recently proposed 3D GANs can generate diverse photo-realistic 3D-aware contents using Neural Radiance fields (NeRF). However, manipulation of NeRF still remains a challenging problem since the visual quality tends to degrade after manipulation and suboptimal control handles such as semantic maps are used for manipulations. While text-guided manipulations have shown potential in 3D editing, such approaches often lack locality. To overcome the problems, we propose Local Editing NeRF (LENeRF), which only requires text inputs for fine-grained and localized manipulation. Specifically, we present three add-on modules of LENeRF, the Latent Residual Mapper, the Attention Field Network, and the Deformation Network, which are jointly used for local manipulations of 3D features by estimating a 3D attention field. The 3D attention field is learned in an unsupervised way, by distilling the CLIP's zero-shot mask generation capability to 3D with multi-view guidance. We conduct diverse experiments and thorough evaluations both quantitatively and qualitatively.

3D-Aware Conditional Image Synthesis
Deng, KangleandYang, GengshanandRamanan, DevaandZhu, Jun-Yan



Research question: Propose pix2pix3D, a 3D-aware conditional generative model for controllable photorealistic image synthesis.
Motivation: Existing conditional generative models cannot synthesize corresponding images from different viewpoints and need to be extended to enable explicit 3D user control.
Method: Extend conditional generative models with neural radiance fields so that the model learns to assign a label, color, and density to every 3D point, which enables it to render the image and a pixel-aligned label map simultaneously.
Results: An interactive system is built that lets users edit the label map from different viewpoints and generate corresponding outputs, achieving controllable photorealistic image synthesis.

We propose pix2pix3D, a 3D-aware conditional generative model for controllable photorealistic image synthesis. Given a 2D label map, such as a segmentation or edge map, our model learns to synthesize a corresponding image from different viewpoints. To enable explicit 3D user control, we extend conditional generative models with neural radiance fields. Given widely-available posed monocular image and label map pairs, our model learns to assign a label to every 3D point in addition to color and density, which enables it to render the image and pixel-aligned label map simultaneously. Finally, we build an interactive system that allows users to edit the label map from different viewpoints and generate outputs accordingly.

Spider GAN: Leveraging Friendly Neighbors To Accelerate GAN Training
Asokan, SiddarthandSeelamantula, ChandraSekhar



Research question: Training generative adversarial networks (GANs) stably is a challenging task.
Motivation: Images are more structured than noise, and the generator can exploit this structure to learn a more robust transformation.
Method: Propose training GANs with images as input, without enforcing any pairwise constraints, by identifying closely related datasets, a "friendly neighborhood" of the target distribution, which inspires the moniker Spider GAN. Friendly neighborhoods are identified with a new proximity measure, the signed inception distance (SID).
Results: Experiments show the Spider GAN formulation converges faster, as the generator can discover correspondences even between seemingly unrelated datasets, for instance Tiny-ImageNet and CelebA faces. Cascaded Spider GANs are also demonstrated, where the output distribution of a pretrained GAN generator serves as the input to the subsequent network, effectively transporting one distribution to another stage by stage until the target is learned, a new flavor of transfer learning. The effectiveness of the Spider approach is shown on DCGAN, conditional GAN, PGGAN, StyleGAN2, and StyleGAN3; the proposed method achieves state-of-the-art Frechet Inception Distance (FID) values with one-fifth of the training iterations compared to the baselines on high-resolution small datasets such as MetFaces, Ukiyo-E Faces, and AFHQ-Cats.

Training generative adversarial networks (GANs) stably is a challenging task. The generator in a GAN transforms noise vectors, typically Gaussian distributed, into realistic data such as images. In this paper, we propose a novel approach for training GANs with images as inputs, but without enforcing any pairwise constraints. The intuition is that images are more structured than noise, which the generator can leverage to learn a more robust transformation. The process can be made efficient by identifying closely related datasets, or a "friendly neighborhood" of the target distribution, inspiring the moniker, Spider GAN. To define friendly neighborhoods leveraging proximity between datasets, we propose a new measure called the signed inception distance (SID), inspired by the polyharmonic kernel. We show that the Spider GAN formulation results in faster convergence, as the generator can discover correspondence even between seemingly unrelated datasets, for instance, between Tiny-ImageNet and CelebA faces. Further, we demonstrate cascading Spider GAN, where the output distribution from a pre-trained GAN generator is used as the input to the subsequent network. Effectively, transporting one distribution to another in a cascaded fashion until the target is learnt -- a new flavor of transfer learning. We demonstrate the efficacy of the Spider approach on DCGAN, conditional GAN, PGGAN, StyleGAN2 and StyleGAN3. The proposed approach achieves state-of-the-art Frechet inception distance (FID) values, with one-fifth of the training iterations, in comparison to their baseline counterparts on high-resolution small datasets such as MetFaces, Ukiyo-E Faces and AFHQ-Cats.
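The cascading idea is simply generator composition: each stage consumes the previous stage's output distribution instead of Gaussian noise. A toy numpy sketch in which random affine-plus-tanh maps stand in for trained GAN generators (an illustrative assumption, not the paper's architecture):

```python
import numpy as np

def make_generator(rng, dim):
    """A toy 'generator': a fixed random affine map plus tanh, standing in
    for a trained GAN generator that maps inputs to bounded image-like data."""
    w = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    b = rng.normal(size=dim)
    return lambda x: np.tanh(x @ w + b)

rng = np.random.default_rng(0)
g1 = make_generator(rng, 8)   # pretrained on a "friendly neighbor" dataset
g2 = make_generator(rng, 8)   # Spider generator: consumes g1's outputs, not noise

z = rng.normal(size=(4, 8))   # only the first stage sees Gaussian noise
samples = g2(g1(z))           # cascade: transport g1's distribution to the target
```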

DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation
Shen, ShuaiandZhao, WenliangandMeng, ZibinandLi, WanhuaandZhu, ZhengandZhou, JieandLu, Jiwen



Research question: How to improve generation quality and model generalization simultaneously for high-quality talking head video generation.
Motivation: Existing talking head generation techniques improve either generation quality or model generalization, but few works address both at once, which is essential for practical applications.
Method: This paper introduces the emerging, powerful Latent Diffusion Models and casts talking head generation as an audio-driven, temporally coherent denoising process (DiffTalk). Specifically, instead of using audio signals as the sole driving factor, the control mechanism of the talking face is investigated, and reference face images and landmarks are incorporated as conditions for personality-aware generalized synthesis.
Results: The proposed DiffTalk produces high-quality talking head videos synchronized with the source audio and, more importantly, generalizes naturally to different identities without any further fine-tuning. It can also be easily adapted to higher-resolution synthesis with negligible extra computational cost. Extensive experiments show that DiffTalk effectively synthesizes high-fidelity, audio-driven talking head videos for generalized novel identities.

Talking head synthesis is a promising approach for the video production industry. Recently, a lot of effort has been devoted to this research area to improve the generation quality or enhance the model generalization. However, there are few works able to address both issues simultaneously, which is essential for practical applications. To this end, in this paper, we turn attention to the emerging powerful Latent Diffusion Models, and model the Talking head generation as an audio-driven temporally coherent denoising process (DiffTalk). More specifically, instead of employing audio signals as the single driving factor, we investigate the control mechanism of the talking face, and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis. In this way, the proposed DiffTalk is capable of producing high-quality talking head videos in synchronization with the source audio, and more importantly, it can be naturally generalized across different identities without any further fine-tuning. Additionally, our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost. Extensive experiments show that the proposed DiffTalk efficiently synthesizes high-fidelity audio-driven talking head videos for generalized novel identities. For more video results, please refer to https://sstzal.github.io/DiffTalk/.

SceneComposer: Any-Level Semantic Image Synthesis
Zeng, YuandLin, ZheandZhang, JianmingandLiu, QingandCollomosse, JohnandKuen, JasonandPatel, VishalM.



Research question: Propose a new conditional image synthesis framework that generates images from semantic layouts of any precision level.
Motivation: Existing methods cannot handle semantic layouts with varying precision levels; a flexible and efficient framework is needed.
Method: The framework supports precision levels ranging from pure text to a 2D semantic canvas with precise shapes. Several new techniques are introduced, including a training data collection pipeline, a precision-encoded mask pyramid and a text feature map representation that jointly encode precision level, semantics, and composition information, and a multi-scale guided diffusion model for image synthesis.
Results: Experiments show the method generates high-quality images that follow the layout at the given precision and compares favorably against existing methods.

We propose a new framework for conditional image synthesis from semantic layouts of any precision levels, ranging from pure text to a 2D semantic canvas with precise shapes. More specifically, the input layout consists of one or more semantic regions with free-form text descriptions and adjustable precision levels, which can be set based on the desired controllability. The framework naturally reduces to text-to-image (T2I) at the lowest level with no shape information, and it becomes segmentation-to-image (S2I) at the highest level. By supporting the levels in-between, our framework is flexible in assisting users of different drawing expertise and at different stages of their creative workflow. We introduce several novel techniques to address the challenges coming with this new setup, including a pipeline for collecting training data; a precision-encoded mask pyramid and a text feature map representation to jointly encode precision level, semantics, and composition information; and a multi-scale guided diffusion model to synthesize images. To evaluate the proposed method, we collect a test dataset containing user-drawn layouts with diverse scenes and styles. Experimental results show that the proposed method can generate high-quality images following the layout at given precision, and compares favorably against existing methods. Project page https://zengxianyu.github.io/scenec/

Unsupervised Domain Adaption With Pixel-Level Discriminator for Image-Aware Layout Generation
Xu, ChenchenandZhou, MinandGe, TiezhengandJiang, YuningandXu, Weiwei



Research question: How to generate graphic layouts for advertising posters with deep learning models?
Motivation: The existing dataset exhibits a domain gap between source-domain data (inpainted posters) and target-domain data (clean product images), which degrades the quality of the generated graphic layouts.
Method: Propose PDA-GAN, a new model combining a GAN with unsupervised domain adaptation: a pixel-level discriminator connected to shallow feature maps computes the GAN loss for each input-image pixel, so as to generate advertising-poster graphic layouts consistent with the image content.
Results: Quantitative and qualitative evaluations show that PDA-GAN achieves state-of-the-art performance and generates high-quality, image-aware graphic layouts for advertising posters.

Layout is essential for graphic design and poster generation. Recently, applying deep learning models to generate layouts has attracted increasing attention. This paper focuses on using the GAN-based model conditioned on image contents to generate advertising poster graphic layouts, which requires an advertising poster layout dataset with paired product images and graphic layouts. However, the paired images and layouts in the existing dataset are collected by inpainting and annotating posters, respectively. There exists a domain gap between inpainted posters (source domain data) and clean product images (target domain data). Therefore, this paper combines unsupervised domain adaption techniques to design a GAN with a novel pixel-level discriminator (PD), called PDA-GAN, to generate graphic layouts according to image contents. The PD is connected to the shallow level feature map and computes the GAN loss for each input-image pixel. Both quantitative and qualitative evaluations demonstrate that PDA-GAN can achieve state-of-the-art performances and generate high-quality image-aware graphic layouts for advertising posters.
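The pixel-level discriminator differs from a standard one only in emitting a logit per input-image pixel rather than a single scalar, so the GAN loss is averaged over all spatial positions. A hedged numpy sketch of that per-pixel loss (the exact PDA-GAN loss formulation may differ; this is a generic per-pixel binary cross-entropy):

```python
import numpy as np

def pixel_level_gan_loss(disc_map, is_real):
    """GAN loss averaged over every spatial position.

    `disc_map` holds the pixel-level discriminator's raw logits, one per
    input-image pixel (H, W); a standard discriminator would emit a single
    scalar instead. Sketch only, not the paper's exact formulation.
    """
    probs = 1.0 / (1.0 + np.exp(-disc_map))   # sigmoid, per pixel
    eps = 1e-7
    target = 1.0 if is_real else 0.0
    bce = -(target * np.log(probs + eps) + (1 - target) * np.log(1 - probs + eps))
    return bce.mean()

logits = np.full((8, 8), 4.0)   # discriminator confidently says "real" everywhere
loss_real = pixel_level_gan_loss(logits, is_real=True)
loss_fake = pixel_level_gan_loss(logits, is_real=False)
```

Attaching this loss at a shallow feature map keeps the penalty spatially localized, which is what lets the adaptation act per pixel rather than per image.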

Real-Time 6K Image Rescaling With Rate-Distortion Optimization
Qi, ChenyangandYang, XinandCheng, KaLeongandChen, Ying-CongandChen, Qifeng



Research question: How to embed a high-resolution image into a low-resolution one so that the high-resolution image can be reconstructed.
Motivation: Existing image rescaling methods do not optimize the file size of the low-resolution image, and recent flow-based rescaling methods are not yet real-time for high-resolution (e.g., 6K) image reconstruction.
Method: We propose a new framework, HyperThumbnail, for real-time 6K rate-distortion-aware image rescaling. HyperThumbnail first embeds the high-resolution image into a JPEG low-resolution image (thumbnail) via an encoder with our proposed learnable JPEG quantization module, which optimizes the file size of the embedded low-resolution JPEG image. An efficient decoder then reconstructs a high-fidelity 6K high-resolution image from the low-resolution one in real time.
Results: Extensive experiments show that our framework outperforms previous image rescaling baselines in rate-distortion performance and is much faster than prior work at high-resolution image reconstruction.

The task of image rescaling aims to embed a high-resolution (HR) image into a low-resolution (LR) one that contains embedded information for HR image reconstruction. Existing image rescaling methods do not optimize the LR image file size, and recent flow-based rescaling methods are not yet real-time for HR image reconstruction (e.g., at 6K). To address these two challenges, we propose a novel framework (HyperThumbnail) for real-time 6K rate-distortion-aware image rescaling. Our HyperThumbnail first embeds an HR image into a JPEG LR image (thumbnail) by an encoder with our proposed learnable JPEG quantization module, which optimizes the file size of the embedded LR JPEG image. Then, an efficient decoder reconstructs a high-fidelity HR (6K) image from the LR one in real time. Extensive experiments demonstrate that our framework outperforms previous image rescaling baselines in rate-distortion performance and is much faster than prior work in HR image reconstruction speed.
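The rate-distortion-aware objective balances reconstruction error against the thumbnail's file size. A minimal numpy sketch of that trade-off (`lam` and the MSE distortion term are illustrative assumptions; the real model estimates the bit cost from its learnable JPEG quantization module):

```python
import numpy as np

def rd_loss(hr, hr_hat, bits, num_pixels, lam=0.01):
    """Rate-distortion objective: distortion + lambda * bits-per-pixel.

    Sketch of the trade-off a rate-distortion-aware rescaler optimizes;
    `bits` is the (estimated) size of the LR JPEG thumbnail.
    """
    distortion = np.mean((hr - hr_hat) ** 2)   # MSE between HR and reconstruction
    rate = bits / num_pixels                   # bits per pixel of the thumbnail
    return distortion + lam * rate

hr = np.zeros((16, 16))
# A faithful reconstruction with a small thumbnail beats a poor, bloated one.
good = rd_loss(hr, hr + 0.01, bits=1024, num_pixels=16 * 16)
bad = rd_loss(hr, hr + 0.50, bits=8192, num_pixels=16 * 16)
```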

OmniAvatar: Geometry-Guided Controllable 3D Head Synthesis
Xu, HongyiandSong, GuoxianandJiang, ZihangandZhang, JianfengandShi, YichunandLiu, JingandMa, WanchunandFeng, JiashiandLuo, Linjie



Research question: Develop a novel geometry-guided 3D head synthesis model, trained from in-the-wild unstructured images, that generates diverse identity-preserving 3D heads with dynamic details.
Motivation: Existing 3D head synthesis models cannot offer fully disentangled control over camera pose, facial expression, head shape, and articulated neck and jaw poses, so a new model is needed.
Method: First define a novel semantic signed distance function (SDF) around a head geometry (FLAME), then leverage a 3D-aware GAN framework (EG3D) to synthesize the detailed shape and appearance of full 3D heads in a canonical space, followed by a volume rendering step that outputs into the observation space.
Results: Experiments show the new model synthesizes identity-preserving 3D heads with more compelling dynamic details than existing methods, both qualitatively and quantitatively.

We present OmniAvatar, a novel geometry-guided 3D head synthesis model trained from in-the-wild unstructured images that is capable of synthesizing diverse identity-preserved 3D heads with compelling dynamic details under full disentangled control over camera poses, facial expressions, head shapes, articulated neck and jaw poses. To achieve such high level of disentangled control, we first explicitly define a novel semantic signed distance function (SDF) around a head geometry (FLAME) conditioned on the control parameters. This semantic SDF allows us to build a differentiable volumetric correspondence map from the observation space to a disentangled canonical space from all the control parameters. We then leverage the 3D-aware GAN framework (EG3D) to synthesize detailed shape and appearance of 3D full heads in the canonical space, followed by a volume rendering step guided by the volumetric correspondence map to output into the observation space. To ensure the control accuracy on the synthesized head shapes and expressions, we introduce a geometry prior loss to conform to head SDF and a control loss to conform to the expression code. Further, we enhance the temporal realism with dynamic details conditioned upon varying expressions and joint poses. Our model can synthesize more preferable identity-preserved 3D heads with compelling dynamic details compared to the state-of-the-art methods both qualitatively and quantitatively. We also provide an ablation study to justify many of our system design choices.

LayoutDM: Transformer-Based Diffusion Model for Layout Generation
Chai, ShangandZhuang, LianshengandYan, Fengying



Research question: How to achieve high-quality conditional layout generation with diffusion models?
Motivation: Existing methods based on generative adversarial networks (GANs) and variational auto-encoders (VAEs) have made progress on layout generation, but still leave room for improvement in quality and diversity.
Method: Inspired by the recent success of diffusion models in high-quality image generation, this paper proposes a Transformer-based Layout Diffusion Model (LayoutDM) that instantiates the conditional denoising diffusion probabilistic model (DDPM) with a purely transformer-based architecture: a transformer-based conditional layout denoiser learns the reverse diffusion process to generate samples from noised layout data.
Results: Compared with GANs and VAEs, LayoutDM offers high-quality generation, strong sample diversity, faithful distribution coverage, and stable training. Experiments show the method outperforms state-of-the-art generative models in both quality and diversity.

Automatic layout generation that can synthesize high-quality layouts is an important tool for graphic design in many applications. Though existing methods based on generative models such as Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs) have progressed, they still leave much room for improving the quality and diversity of the results. Inspired by the recent success of diffusion models in generating high-quality images, this paper explores their potential for conditional layout generation and proposes Transformer-based Layout Diffusion Model (LayoutDM) by instantiating the conditional denoising diffusion probabilistic model (DDPM) with a purely transformer-based architecture. Instead of using convolutional neural networks, a transformer-based conditional Layout Denoiser is proposed to learn the reverse diffusion process to generate samples from noised layout data. Benefiting from both the transformer and the DDPM, our LayoutDM has desirable properties such as high-quality generation, strong sample diversity, faithful distribution coverage, and stable training in comparison to GANs and VAEs. Quantitative and qualitative experimental results show that our method outperforms state-of-the-art generative models in terms of quality and diversity.
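LayoutDM's sampling follows the standard DDPM reverse process; only the noise predictor is a conditional layout transformer. A numpy sketch of one reverse step with a stand-in denoiser (the transformer is stubbed out, and treating a layout as a small matrix of box coordinates is an assumption made for illustration):

```python
import numpy as np

def ddpm_reverse_step(x_t, t, eps_pred, betas, rng):
    """One reverse-diffusion step x_t -> x_{t-1} (standard DDPM form).

    `eps_pred` is the denoiser's noise estimate; in LayoutDM this would come
    from the transformer-based conditional layout denoiser (stubbed below).
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
    mean = (x_t - coef * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mean                         # final step: no added noise
    return mean + np.sqrt(betas[t]) * rng.normal(size=x_t.shape)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 100)        # a short linear noise schedule
x = rng.normal(size=(5, 4))                 # 5 layout elements x 4 box coords
for t in reversed(range(100)):
    eps_pred = x * 0.1                      # stand-in for the transformer denoiser
    x = ddpm_reverse_step(x, t, eps_pred, betas, rng)
```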

DualVector: Unsupervised Vector Font Synthesis With Dual-Part Representation
Liu, Ying-TianandZhang, ZhifeiandGuo, Yuan-ChenandFisher, MatthewandWang, ZhaowenandZhang, Song-Hai



Research question: How to improve the automatic generation of fonts for typeface design?
Motivation: Current methods treat glyphs as pixelated images, which produce artifacts when scaling and inevitably lose quality after vectorization; existing vector font synthesis methods either fail to represent shapes concisely or require vector supervision during training.
Method: Propose a novel dual-part representation for vector glyphs, modeling each glyph as a collection of closed "positive" and "negative" path pairs and obtaining the glyph contour by boolean operations on these paths. The representation is first learned from glyph images alone, and a subsequent contour refinement step aligns the contour with an image representation to further enhance details.
Results: The method, named DualVector, outperforms state-of-the-art vector font synthesis methods both quantitatively and qualitatively. The synthesized vector fonts can be easily converted to common digital font formats such as TrueType Font for practical use.

Automatic generation of fonts can be an important aid to typeface design. Many current approaches regard glyphs as pixelated images, which present artifacts when scaling and inevitable quality losses after vectorization. On the other hand, existing vector font synthesis methods either fail to represent the shape concisely or require vector supervision during training. To push the quality of vector font synthesis to the next level, we propose a novel dual-part representation for vector glyphs, where each glyph is modeled as a collection of closed "positive" and "negative" path pairs. The glyph contour is then obtained by boolean operations on these paths. We first learn such a representation only from glyph images and devise a subsequent contour refinement step to align the contour with an image representation to further enhance details. Our method, named DualVector, outperforms state-of-the-art methods in vector font synthesis both quantitatively and qualitatively. Our synthesized vector fonts can be easily converted to common digital font formats like TrueType Font for practical use. The code is released at https://github.com/thuliu-yt16/dualvector.
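The dual-part representation turns glyph rasterization into boolean mask algebra: the union of "positive" paths minus the union of "negative" paths. A numpy sketch using filled disks as stand-ins for rasterized closed paths (an "O"-like glyph with one hole; the real method learns smooth parametric paths, not disks):

```python
import numpy as np

def disk_mask(h, w, cy, cx, r):
    """Filled-disk occupancy mask, a stand-in for a rasterized closed path."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return (ys - cy) ** 2 + (xs - cx) ** 2 <= r ** 2

def compose_glyph(positive_masks, negative_masks):
    """Glyph occupancy = union(positive paths) minus union(negative paths)."""
    h, w = positive_masks[0].shape
    pos = np.zeros((h, w), dtype=bool)
    neg = np.zeros((h, w), dtype=bool)
    for m in positive_masks:
        pos |= m
    for m in negative_masks:
        neg |= m
    return pos & ~neg

# An "O"-like glyph: one positive outer disk, one negative inner disk (the hole).
outer = disk_mask(32, 32, 16, 16, 12)
inner = disk_mask(32, 32, 16, 16, 6)
glyph = compose_glyph([outer], [inner])
```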

GazeNeRF: 3D-Aware Gaze Redirection With Neural Radiance Fields
Ruzzi, Alessandro and Shi, Xiangwei and Wang, Xi and Li, Gengyan and De Mello, Shalini and Chang, Hyung Jin and Zhang, Xucong and Hilliges, Otmar



Research question: This work aims to propose a 3D-aware method for gaze redirection.
Motivation: Existing gaze redirection methods operate on 2D images and struggle to produce 3D-consistent results.
Method: Building on the intuition that the face region and eyeballs are separate 3D structures that move independently, the method leverages recent advances in conditional image-based neural radiance fields and proposes a two-branch architecture that predicts volumetric features for the face and eye regions separately. Rigidly transforming the eye features with a 3D rotation matrix provides fine-grained control over the desired gaze angle, and the redirected image is then obtained via differentiable volume compositing.
Results: Experiments show that this architecture outperforms naively conditioned NeRF baselines as well as previous state-of-the-art 2D gaze redirection methods in both redirection accuracy and identity preservation.

We propose GazeNeRF, a 3D-aware method for the task of gaze redirection. Existing gaze redirection methods operate on 2D images and struggle to generate 3D consistent results. Instead, we build on the intuition that the face region and eye balls are separate 3D structures that move in a coordinated yet independent fashion. Our method leverages recent advancements in conditional image-based neural radiance fields and proposes a two-branch architecture that predicts volumetric features for the face and eye regions separately. Rigidly transforming the eye features via a 3D rotation matrix provides fine-grained control over the desired gaze angle. The final, redirected image is then attained via differentiable volume compositing. Our experiments show that this architecture outperforms naively conditioned NeRF baselines as well as previous state-of-the-art 2D gaze redirection methods in terms of redirection accuracy and identity preservation. Code and models will be released for research purposes.
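The eye-branch features are rigidly transformed by a rotation matrix built from the target gaze angles. A minimal sketch of such a transform on 3D sample points; the pitch-yaw factorization and axis conventions here are assumptions, not the paper's exact parametrization:

```python
import math

def rotation_from_gaze(pitch, yaw):
    """3x3 rotation matrix R = R_y(yaw) @ R_x(pitch), angles in radians."""
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    rx = [[1, 0, 0], [0, cp, -sp], [0, sp, cp]]
    ry = [[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]]
    return [[sum(ry[i][k] * rx[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotate_points(R, points):
    """Apply R to a list of 3D points (e.g., eyeball feature sample locations)."""
    return [[sum(R[i][k] * p[k] for k in range(3)) for i in range(3)] for p in points]

R = rotation_from_gaze(pitch=0.0, yaw=math.pi / 2)
out = rotate_points(R, [[0.0, 0.0, 1.0]])  # a 90-degree yaw swings the forward axis to +x
```

In the full method, the rotated coordinates are used to resample the eye volume features before compositing.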

Realistic Saliency Guided Image Enhancement
Miangoleh, S. Mahdi H. and Bylinskii, Zoya and Kee, Eric and Shechtman, Eli and Aksoy, Ya\u{g}{\i}z



Research question: How to de-emphasize distracting elements and enhance subjects in photographs while keeping the edits realistic.
Motivation: Although recent saliency-guided editing approaches can attenuate or amplify attention, most of them frequently produce unrealistic edits.
Method: Propose a realism loss for saliency-guided image enhancement that maintains high realism across varying image types while attenuating distractors and amplifying objects of interest.
Results: Evaluations with professional photographers confirm the dual objective of realism and effectiveness; the method outperforms recent approaches on their own datasets while requiring a smaller memory footprint and runtime.

Common editing operations performed by professional photographers include the cleanup operations: de-emphasizing distracting elements and enhancing subjects. These edits are challenging, requiring a delicate balance between manipulating the viewer's attention while maintaining photo realism. While recent approaches can boast successful examples of attention attenuation or amplification, most of them also suffer from frequent unrealistic edits. We propose a realism loss for saliency-guided image enhancement to maintain high realism across varying image types, while attenuating distractors and amplifying objects of interest. Evaluations with professional photographers confirm that we achieve the dual objective of realism and effectiveness, and outperform the recent approaches on their own datasets, while requiring a smaller memory footprint and runtime. We thus offer a viable solution for automating image enhancement and photo cleanup operations.

Collaborative Diffusion for Multi-Modal Face Generation and Editing
Huang, Ziqi and Chan, Kelvin C.K. and Jiang, Yuming and Liu, Ziwei



Research question: Existing diffusion models mainly focus on uni-modal control, i.e., the diffusion process is driven by conditions from only one modality. To further unleash users' creativity, the model should be controllable by multiple modalities simultaneously, e.g., generating and editing a face by describing the age (text-driven) while drawing the face shape (mask-driven).
Motivation: Diffusion models driven by different modalities are complementary at the latent denoising steps, where bilateral connections can be established.
Method: Propose Collaborative Diffusion, in which pre-trained uni-modal diffusion models collaborate to achieve multi-modal face generation and editing without re-training. The dynamic diffuser, a meta-network, adaptively produces multi-modal denoising steps by predicting the spatial-temporal influence functions of each pre-trained uni-modal model.
Results: Experiments show that Collaborative Diffusion not only combines the generation capabilities of uni-modal diffusion models but also integrates multiple uni-modal manipulations for multi-modal editing, showing superiority in both image quality and condition consistency.

Diffusion models arise as a powerful generative tool recently. Despite the great progress, existing diffusion models mainly focus on uni-modal control, i.e., the diffusion process is driven by only one modality of condition. To further unleash the users' creativity, it is desirable for the model to be controllable by multiple modalities simultaneously, e.g., generating and editing faces by describing the age (text-driven) while drawing the face shape (mask-driven). In this work, we present Collaborative Diffusion, where pre-trained uni-modal diffusion models collaborate to achieve multi-modal face generation and editing without re-training. Our key insight is that diffusion models driven by different modalities are inherently complementary regarding the latent denoising steps, where bilateral connections can be established upon. Specifically, we propose dynamic diffuser, a meta-network that adaptively hallucinates multi-modal denoising steps by predicting the spatial-temporal influence functions for each pre-trained uni-modal model. Collaborative Diffusion not only collaborates generation capabilities from uni-modal diffusion models, but also integrates multiple uni-modal manipulations to perform multi-modal editing. Extensive qualitative and quantitative experiments demonstrate the superiority of our framework in both image quality and condition consistency.
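The dynamic diffuser predicts per-model influence functions that blend the uni-modal denoisers' predictions at each denoising step. A toy, framework-free sketch of that blending; the stand-in denoisers and the pre-normalized per-pixel weights are illustrative assumptions, not the learned meta-network:

```python
def collaborative_step(noisy, predictors, influence):
    """Blend per-modality denoiser outputs with spatially varying influence weights.
    `predictors` stand in for pre-trained uni-modal denoisers; `influence[m][i]`
    is model m's weight at pixel i (assumed pre-normalized to sum to 1)."""
    outs = [p(noisy) for p in predictors]
    return [sum(influence[m][i] * outs[m][i] for m in range(len(outs)))
            for i in range(len(noisy))]

text_denoiser = lambda x: [v * 0.5 for v in x]   # toy modality A
mask_denoiser = lambda x: [v - 1.0 for v in x]   # toy modality B
w = [[0.5, 0.5, 1.0], [0.5, 0.5, 0.0]]           # per-pixel influence per model
step = collaborative_step([2.0, 4.0, 6.0], [text_denoiser, mask_denoiser], w)
```

Making the weights depend on position and timestep is what lets each modality dominate where and when it is most informative.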

SmartBrush: Text and Shape Guided Object Inpainting With Diffusion Model
Xie, Shaoan and Zhang, Zhifei and Lin, Zhe and Hinz, Tobias and Zhang, Kun



Research question: This paper proposes a new diffusion model, SmartBrush, for completing a missing region with an object under text and shape guidance.
Motivation: Existing inpainting methods can only borrow surrounding information to fill a corrupted image, whereas multi-modal inpainting offers more flexible and useful control: a text prompt can describe an object with richer attributes, and a mask can constrain the shape of the inpainted object.
Method: We propose SmartBrush, a diffusion-based model that incorporates both text and shape guidance to complete missing regions with objects under precise control. To better preserve the background, we introduce a novel training and sampling strategy that augments the diffusion U-Net with object-mask prediction. Finally, we jointly train inpainting with text-to-image generation in a multi-task fashion to leverage more training data.
Results: Experiments show that the model outperforms all baselines in visual quality, mask controllability, and background preservation.

Generic image inpainting aims to complete a corrupted image by borrowing surrounding information, which barely generates novel content. By contrast, multi-modal inpainting provides more flexible and useful controls on the inpainted content, e.g., a text prompt can be used to describe an object with richer attributes, and a mask can be used to constrain the shape of the inpainted object rather than being only considered as a missing area. We propose a new diffusion-based model named SmartBrush for completing a missing region with an object using both text and shape guidance. While previous work such as DALLE-2 and Stable Diffusion can do text-guided inpainting, they do not support shape guidance and tend to modify background texture surrounding the generated object. Our model incorporates both text and shape guidance with precision control. To preserve the background better, we propose a novel training and sampling strategy by augmenting the diffusion U-net with object-mask prediction. Lastly, we introduce a multi-task training strategy by jointly training inpainting with text-to-image generation to leverage more training data. We conduct extensive experiments showing that our model outperforms all baselines in terms of visual quality, mask controllability, and background preservation.

StyleIPSB: Identity-Preserving Semantic Basis of StyleGAN for High Fidelity Face Swapping
Jiang, Diqiong and Song, Dan and Tong, Ruofeng and Tang, Min



Research question: Existing face swapping methods cannot preserve pore-level details and identity when generating high-fidelity face images.
Motivation: To overcome these artifacts, a new set of StyleGAN bases, StyleIPSB, is constructed that preserves pore-level details while maintaining identity.
Method: A series of identity-preserving semantic bases of StyleGAN (StyleIPSB) is built with respect to pose, expression, and illumination and applied within StyleGAN to achieve high-fidelity face swapping.
Results: Experiments show that StyleIPSB effectively preserves pore-level details and identity, achieving state-of-the-art face swapping performance.

Recent research reveals that StyleGAN can generate highly realistic images, inspiring researchers to use pretrained StyleGAN to generate high-fidelity swapped faces. However, existing methods fail to meet the expectations in two essential aspects of high-fidelity face swapping. Their results are blurry without pore-level details and fail to preserve identity for challenging cases. To overcome the above artifacts, we innovatively construct a series of identity-preserving semantic bases of StyleGAN (called StyleIPSB) with respect to pose, expression, and illumination. Each basis of StyleIPSB controls one specific semantic attribute and disentangles with the others. The StyleIPSB constrains style code in the subspace of W+ space to preserve pore-level details. StyleIPSB gives us a novel tool for high-fidelity face swapping, and we propose a three-stage framework for face swapping with StyleIPSB. Firstly, we transform the target facial images' attributes to the source image. We learn the mapping from 3D Morphable Model (3DMM) parameters, which capture the prominent semantic variance, to the coordinates of StyleIPSB that show higher identity-preserving and fidelity. Secondly, to transform detailed attributes which 3DMM does not capture, we learn the residual attribute between the reenacted face and the target face. Finally, the face is blended into the background of the target image. Extensive results and comparisons demonstrate that StyleIPSB can effectively preserve identity and pore-level details. The results of face swapping can achieve state-of-the-art performance. We will release our code at https://github.com/a686432/StyleIPSB.

Discriminator-Cooperated Feature Map Distillation for GAN Compression
Hu, Tie and Lin, Mingbao and You, Lizhou and Chao, Fei and Ji, Rongrong



Research question: Despite excellent performance in image generation, Generative Adversarial Networks (GANs) are notorious for requiring enormous storage and intensive computation.
Motivation: Knowledge distillation has proven effective for exploring low-cost GANs.
Method: This paper proposes a discriminator-cooperated distillation (DCD) method that uses the teacher discriminator as a transformation to drive intermediate results of the student generator to be perceptually close to the corresponding outputs of the teacher generator. To mitigate mode collapse in GAN compression, a collaborative adversarial training paradigm is built in which the teacher discriminator is trained from scratch together with the student generator.
Results: Experiments show that DCD outperforms existing GAN compression methods. For example, after reducing CycleGAN's MACs by over 40x and its parameters by over 80x, the FID metric drops from 61.53 to 48.24, whereas the best prior method reaches only 51.92.

Despite excellent performance in image generation, Generative Adversarial Networks (GANs) are notorious for their requirements of enormous storage and intensive computation. Knowledge distillation has been demonstrated to be particularly efficacious in exploring low-cost GANs. In this paper, we investigate the irreplaceability of the teacher discriminator and present an inventive discriminator-cooperated distillation, abbreviated as DCD, towards refining better feature maps from the generator. In contrast to conventional pixel-to-pixel match methods in feature map distillation, our DCD utilizes the teacher discriminator as a transformation to drive intermediate results of the student generator to be perceptually close to corresponding outputs of the teacher generator. Furthermore, in order to mitigate mode collapse in GAN compression, we construct a collaborative adversarial training paradigm where the teacher discriminator is trained from scratch to co-train with the student generator alongside our DCD. Our DCD shows superior results compared with existing GAN compression methods. For instance, after reducing over 40x MACs and 80x parameters of CycleGAN, we decrease the FID metric from 61.53 to 48.24, while the current SoTA method merely reaches 51.92. This work's source code has been made accessible at https://github.com/poopit/DCD-official.
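The core of DCD replaces pixel-to-pixel feature matching with distances measured after the teacher discriminator's transformation. A toy sketch of that idea; the stand-in "discriminator features" below are illustrative assumptions, not the paper's network:

```python
def disc_features(image):
    """Stand-in for intermediate teacher-discriminator feature maps:
    simple per-stage summaries of a flat image vector."""
    s1 = [v * 2.0 for v in image]      # "layer 1" response
    s2 = [sum(s1) / len(s1)]           # "layer 2" pooled response
    return [s1, s2]

def l1(a, b):
    """Mean absolute difference between two equally sized feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def dcd_loss(student_out, teacher_out):
    """Match student and teacher generator outputs in discriminator feature space,
    rather than pixel space."""
    fs, ft = disc_features(student_out), disc_features(teacher_out)
    return sum(l1(a, b) for a, b in zip(fs, ft))

loss_same = dcd_loss([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
loss_diff = dcd_loss([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
```

Measuring the mismatch after the discriminator's transformation is what makes the supervision perceptual rather than pixel-wise.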

Learning on Gradients: Generalized Artifacts Representation for GAN-Generated Images Detection
Tan, Chuangchuang and Zhao, Yao and Wei, Shikui and Gu, Guanghua and Wei, Yunchao



Research question: How to develop a generalized detector for fake images produced by generative adversarial networks (GANs).
Motivation: Existing image detectors suffer sharp performance drops on unseen domains, while GANs can easily generate realistic fake images, increasing the risk of abuse.
Method: A new detection framework, Learning on Gradients (LGrad), is proposed. It first converts images into gradients using a pretrained CNN model, then feeds these gradients, as generalized artifact representations, into a classifier to determine image authenticity.
Results: Experiments show that gradients serve as effective and robust generalized artifact representations of GAN-generated images; the detector achieves a new state of the art with a notable gain of 11.4%.

Recently, there has been a significant advancement in image generation technology, known as GAN. It can easily generate realistic fake images, leading to an increased risk of abuse. However, most image detectors suffer from sharp performance drops in unseen domains. The key of fake image detection is to develop a generalized representation to describe the artifacts produced by generation models. In this work, we introduce a novel detection framework, named Learning on Gradients (LGrad), designed for identifying GAN-generated images, with the aim of constructing a generalized detector with cross-model and cross-data generalization. Specifically, a pretrained CNN model is employed as a transformation model to convert images into gradients. Subsequently, we leverage these gradients to present the generalized artifacts, which are fed into the classifier to ascertain the authenticity of the images. In our framework, we turn the data-dependent problem into a transformation-model-dependent problem. To the best of our knowledge, this is the first study to utilize gradients as the representation of artifacts in GAN-generated images. Extensive experiments demonstrate the effectiveness and robustness of gradients as generalized artifact representations. Our detector achieves a new state-of-the-art performance with a remarkable gain of 11.4%. The code is released at https://github.com/chuangchuangtan/LGrad.
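LGrad's representation is the gradient of a fixed transformation model's output with respect to the input image. A minimal finite-difference sketch of that idea; the quadratic stand-in score is an assumption for illustration, whereas the paper uses a pretrained CNN with automatic differentiation:

```python
def transform_score(image):
    """Stand-in for a pretrained CNN's scalar output on a flat image vector."""
    return sum(v * v for v in image)

def input_gradient(f, image, eps=1e-5):
    """Finite-difference gradient of f w.r.t. each pixel: the 'artifact map'
    that is fed to the downstream real/fake classifier."""
    grad = []
    for i in range(len(image)):
        bumped = list(image)
        bumped[i] += eps
        grad.append((f(bumped) - f(image)) / eps)
    return grad

g = input_gradient(transform_score, [1.0, -2.0, 0.5])  # ~ [2, -4, 1] for f(x) = sum x_i^2
```

Because the transformation model is fixed, the gradient map depends on the image content rather than on which generator produced it.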

InstructPix2Pix: Learning To Follow Image Editing Instructions
Brooks, Tim and Holynski, Aleksander and Efros, Alexei A.



Research question: How to edit images from human instructions?
Motivation: Existing image editing approaches require manual operation, which is time-consuming and inefficient.
Method: Combine a pretrained language model (GPT-3) and a text-to-image model (Stable Diffusion) to generate a large dataset of image editing examples as training data, and train the conditional diffusion model InstructPix2Pix on it.
Results: InstructPix2Pix edits images in a matter of seconds following user-written instructions and produces compelling results across a diverse range of input images and instructions.

We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image. To obtain training data for this problem, we combine the knowledge of two large pretrained models--a language model (GPT-3) and a text-to-image model (Stable Diffusion)--to generate a large dataset of image editing examples. Our conditional diffusion model, InstructPix2Pix, is trained on our generated data, and generalizes to real images and user-written instructions at inference time. Since it performs edits in the forward pass and does not require per-example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions.

Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis
Wang, Duomin and Deng, Yu and Yin, Zixin and Shum, Heung-Yeung and Wang, Baoyuan



Research question: Propose a novel one-shot talking head synthesis method that achieves disentangled, fine-grained control over lip motion, eye gaze and blink, head pose, and emotional expression.
Motivation: Existing methods cannot effectively disentangle and control these motion factors; the goal is to represent different motions via disentangled latent representations and synthesize talking heads from them with an image generator.
Method: A unified motion feature is first extracted from the driving signal, and each fine-grained motion is then isolated from the unified feature. Motion-specific contrastive learning and regression are introduced for non-emotional motions, and feature-level decorrelation and self-reconstruction for emotional expression, to fully exploit the inherent properties of each motion factor in unstructured video data and achieve disentanglement.
Results: Experiments show high-quality speech and lip-motion synchronization along with precise, disentangled control over multiple additional facial motions, which previous methods can hardly achieve.

We present a novel one-shot talking head synthesis method that achieves disentangled and fine-grained control over lip motion, eye gaze&blink, head pose, and emotional expression. We represent different motions via disentangled latent representations and leverage an image generator to synthesize talking heads from them. To effectively disentangle each motion factor, we propose a progressive disentangled representation learning strategy by separating the factors in a coarse-to-fine manner, where we first extract unified motion feature from the driving signal, and then isolate each fine-grained motion from the unified feature. We introduce motion-specific contrastive learning and regressing for non-emotional motions, and feature-level decorrelation and self-reconstruction for emotional expression, to fully utilize the inherent properties of each motion factor in unstructured video data to achieve disentanglement. Experiments show that our method provides high quality speech&lip-motion synchronization along with precise and disentangled control over multiple extra facial motions, which can hardly be achieved by previous methods.

ReDirTrans: Latent-to-Latent Translation for Gaze and Head Redirection
Jin, Shiwei and Wang, Zhen and Wang, Lei and Bi, Ning and Nguyen, Truong



Research question: How to accurately redirect gaze in high-resolution full-face images while preserving other attributes such as identity, expression, and hairstyle?
Motivation: Existing image synthesis methods focus on redirecting gaze in low-resolution images, which limits their application to high-resolution full-face scenarios.
Method: A portable network, ReDirTrans, is proposed that performs latent-to-latent translation to redirect gaze directions and head orientations in an interpretable manner. It projects only the targeted attributes into embeddings and redirects these embeddings with assigned pitch and yaw values. Both the initial and edited embeddings are then projected back into the original latent space as residuals that modify the input latent vectors by subtraction and addition, representing old-status removal and new-status addition.
Results: Pairing ReDirTrans with a pretrained, fixed e4e-StyleGAN yields ReDirTrans-GAN, which accurately redirects gaze in 1024*1024 full-face images while preserving attributes such as identity, expression, and hairstyle. Using the redirected samples for dataset augmentation also improves the downstream learning-based gaze estimation task.

Learning-based gaze estimation methods require large amounts of training data with accurate gaze annotations. Facing such demanding requirements of gaze data collection and annotation, several image synthesis methods were proposed, which successfully redirected gaze directions precisely given the assigned conditions. However, these methods focused on changing gaze directions of the images that only include eyes or restricted ranges of faces with low resolution (less than 128*128) to largely reduce interference from other attributes such as hairs, which limits application scenarios. To cope with this limitation, we proposed a portable network, called ReDirTrans, achieving latent-to-latent translation for redirecting gaze directions and head orientations in an interpretable manner. ReDirTrans projects input latent vectors into aimed-attribute embeddings only and redirects these embeddings with assigned pitch and yaw values. Then both the initial and edited embeddings are projected back (deprojected) to the initial latent space as residuals to modify the input latent vectors by subtraction and addition, representing old status removal and new status addition. The projection of aimed attributes only and subtraction-addition operations for status replacement essentially mitigate impacts on other attributes and the distribution of latent vectors. Thus, by combining ReDirTrans with a pretrained fixed e4e-StyleGAN pair, we created ReDirTrans-GAN, which enables accurately redirecting gaze in full-face images with 1024*1024 resolution while preserving other attributes such as identity, expression, and hairstyle. Furthermore, we presented improvements for the downstream learning-based gaze estimation task, using redirected samples as dataset augmentation.
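The subtraction-addition editing rule removes the deprojected old-status residual and adds the new one while leaving the rest of the latent code untouched. A toy sketch under assumed operators: the slicing-based `project`/`deproject` below are illustrative stand-ins, not the learned transformations:

```python
def project(w):
    """Toy stand-in: extract the 'gaze embedding' as the first two coordinates."""
    return w[:2]

def deproject(e, dim):
    """Toy stand-in: lift an embedding back to latent space as a residual."""
    return e + [0.0] * (dim - len(e))

def redirect(w, new_gaze):
    """Old-status removal and new-status addition by subtraction and addition:
    w' = w - deproject(project(w)) + deproject(e_new)."""
    old = deproject(project(w), len(w))
    new = deproject(new_gaze, len(w))
    return [wi - oi + ni for wi, oi, ni in zip(w, old, new)]

w = [0.3, -0.1, 0.7, 0.2]          # latent code; trailing dims carry other attributes
w_edit = redirect(w, [0.0, 0.5])   # assign a new pitch/yaw embedding
```

Because the residuals only touch the gaze subspace, the trailing coordinates (standing in for identity, expression, hairstyle) pass through unchanged.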

Controllable Light Diffusion for Portraits
Futschik, David and Ritland, Kelvin and Vecore, James and Fanello, Sean and Orts-Escolano, Sergio and Curless, Brian and S\'ykora, Daniel and Pandey, Rohit



Research question: This paper introduces light diffusion, a novel method to improve lighting in portraits, softening harsh shadows and specular highlights while preserving overall scene illumination.
Motivation: Inspired by professional photographers' diffusers and scrims, the method softens lighting given only a single portrait photo.
Method: A learning-based method is proposed that controls the amount of light diffusion and applies it to in-the-wild portraits. Additionally, a method is designed to synthetically generate plausible external shadows with sub-surface scattering effects while conforming to the shape of the subject's face.
Results: Experiments show that the approach increases the robustness of higher-level vision applications such as albedo estimation, geometry estimation, and semantic segmentation.

We introduce light diffusion, a novel method to improve lighting in portraits, softening harsh shadows and specular highlights while preserving overall scene illumination. Inspired by professional photographers' diffusers and scrims, our method softens lighting given only a single portrait photo. Previous portrait relighting approaches focus on changing the entire lighting environment, removing shadows (ignoring strong specular highlights), or removing shading entirely. In contrast, we propose a learning based method that allows us to control the amount of light diffusion and apply it on in-the-wild portraits. Additionally, we design a method to synthetically generate plausible external shadows with sub-surface scattering effects while conforming to the shape of the subject's face. Finally, we show how our approach can increase the robustness of higher level vision applications, such as albedo estimation, geometry estimation and semantic segmentation.

DiffSwap: High-Fidelity and Controllable Face Swapping via 3D-Aware Masked Diffusion
Zhao, Wenliang and Rao, Yongming and Shi, Weikang and Liu, Zuyan and Zhou, Jie and Lu, Jiwen



Research question: This paper proposes a diffusion-model-based framework for high-fidelity, controllable face swapping.
Motivation: Unlike prior work that fuses source and target face information through carefully designed network architectures and loss functions, this paper reformulates face swapping as a conditional inpainting task performed by a powerful diffusion model guided by the desired face attributes (e.g., identity and landmarks).
Method: DiffSwap, a diffusion-based framework, uses midpoint estimation to efficiently recover a reasonable diffusion result of the swapped face in only 2 steps, which enables identity constraints that improve face swapping quality.
Results: Experiments show state-of-the-art face swapping results both qualitatively and quantitatively, with strong controllability, high fidelity, and shape preservation.

In this paper, we propose DiffSwap, a diffusion model based framework for high-fidelity and controllable face swapping. Unlike previous work that relies on carefully designed network architectures and loss functions to fuse the information from the source and target faces, we reformulate the face swapping as a conditional inpainting task, performed by a powerful diffusion model guided by the desired face attributes (e.g., identity and landmarks). An important issue that makes it nontrivial to apply diffusion models to face swapping is that we cannot perform the time-consuming multi-step sampling to obtain the generated image during training. To overcome this, we propose a midpoint estimation method to efficiently recover a reasonable diffusion result of the swapped face with only 2 steps, which enables us to introduce identity constraints to improve the face swapping quality. Our framework enjoys several favorable properties more appealing than prior arts: 1) Controllable. Our method is based on conditional masked diffusion on the latent space, where the mask and the conditions can be fully controlled and customized. 2) High-fidelity. The formulation of conditional inpainting can fully exploit the generative ability of diffusion models and can preserve the background of target images with minimal artifacts. 3) Shape-preserving. The controllability of our method enables us to use 3D-aware landmarks as the condition during generation to preserve the shape of the source face. Extensive experiments on both FF++ and FFHQ demonstrate that our method can achieve state-of-the-art face swapping results both qualitatively and quantitatively.
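Estimating the clean face during training relies on the standard DDPM identity that recovers an x0 estimate from a noisy sample and the predicted noise; a one-shot estimate of this kind is what allows losses such as the identity constraint to be applied without full multi-step sampling. A minimal numeric sketch with toy values (the schedule value and vectors are assumptions, not the paper's configuration):

```python
import math

def predict_x0(x_t, eps_pred, alpha_bar_t):
    """DDPM identity: x0_hat = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)."""
    s, n = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(xt - n * e) / s for xt, e in zip(x_t, eps_pred)]

# Round-trip check: noise a known x0, then recover it with the true noise.
abar = 0.6
x0 = [0.2, -0.4]
eps = [1.0, -0.5]
x_t = [math.sqrt(abar) * a + math.sqrt(1 - abar) * b for a, b in zip(x0, eps)]
x0_hat = predict_x0(x_t, eps, abar)
```

With the true noise the recovery is exact; with a network's noise prediction it is an estimate, which the paper's 2-step midpoint scheme refines.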

Local Implicit Normalizing Flow for Arbitrary-Scale Image Super-Resolution
Yao, Jie-En and Tsao, Li-Yuan and Lo, Yi-Chen and Tseng, Roy and Chang, Chia-Che and Lee, Chun-Yi



Research question: This paper addresses the ill-posed nature of super-resolution (SR) that existing arbitrary-scale methods ignore, as well as the fixed-scale limitation of flow-based methods.
Motivation: Flow-based methods show promise for the ill-posed SR problem but only support predefined fixed scales, limiting their real-world applicability. Meanwhile, arbitrary-scale SR has gained attention and made great progress, yet prior arbitrary-scale methods ignore the ill-posed problem and train with per-pixel L1 loss, producing blurry SR outputs.
Method: "Local Implicit Normalizing Flow" (LINF) is proposed to model the distribution of texture details under different scaling factors with normalizing flow. LINF can therefore generate photo-realistic high-resolution images with rich texture details at arbitrary scale factors.
Results: Extensive experiments show that LINF achieves state-of-the-art perceptual quality compared with prior arbitrary-scale SR methods.

Flow-based methods have demonstrated promising results in addressing the ill-posed nature of super-resolution (SR) by learning the distribution of high-resolution (HR) images with the normalizing flow. However, these methods can only perform a predefined fixed-scale SR, limiting their potential in real-world applications. Meanwhile, arbitrary-scale SR has gained more attention and achieved great progress. Nonetheless, previous arbitrary-scale SR methods ignore the ill-posed problem and train the model with per-pixel L1 loss, leading to blurry SR outputs. In this work, we propose "Local Implicit Normalizing Flow" (LINF) as a unified solution to the above problems. LINF models the distribution of texture details under different scaling factors with normalizing flow. Thus, LINF can generate photo-realistic HR images with rich texture details in arbitrary scale factors. We evaluate LINF with extensive experiments and show that LINF achieves the state-of-the-art perceptual quality compared with prior arbitrary-scale SR methods.
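The change-of-variables rule that normalizing flows such as LINF build on can be shown with a one-layer affine flow; this is a toy flow for illustration, not LINF's coordinate-conditional architecture:

```python
import math

def affine_flow(z, scale, shift):
    """Forward map x = scale * z + shift; log|det J| = dim * log|scale|."""
    x = [scale * v + shift for v in z]
    logdet = len(z) * math.log(abs(scale))
    return x, logdet

def log_prob_x(x, scale, shift):
    """Change of variables: log p(x) = log N(z; 0, I) - log|det dx/dz|,
    with z obtained by inverting the flow."""
    z = [(v - shift) / scale for v in x]
    log_pz = sum(-0.5 * v * v - 0.5 * math.log(2 * math.pi) for v in z)
    return log_pz - len(x) * math.log(abs(scale))

x, logdet = affine_flow([0.0, 0.0], scale=2.0, shift=1.0)
lp = log_prob_x(x, scale=2.0, shift=1.0)
```

Maximizing such an exact log-likelihood over texture residuals is what lets flow-based SR model the one-to-many mapping instead of averaging it away as L1 loss does.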

Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models
Xu, Jiale and Wang, Xintao and Cheng, Weihao and Cao, Yan-Pei and Shan, Ying and Qie, Xiaohui and Gao, Shenghua



Research question: How to improve the accuracy and faithfulness of text-to-3D optimization methods.
Motivation: Current text-guided 3D optimization methods often fail to generate accurate and faithful 3D structures that conform to the input text, due to the lack of prior knowledge.
Method: Explicit 3D shape priors are introduced into the CLIP-guided 3D optimization process for the first time. A high-quality 3D shape is first generated from the input text in a text-to-shape stage as a 3D shape prior, then used to initialize a neural radiance field that is optimized with the full prompt.
Results: The method generates imaginative 3D content with superior visual quality and shape accuracy, outperforming state-of-the-art methods.

Recent CLIP-guided 3D optimization methods, such as DreamFields and PureCLIPNeRF, have achieved impressive results in zero-shot text-to-3D synthesis. However, due to scratch training and random initialization without prior knowledge, these methods often fail to generate accurate and faithful 3D structures that conform to the input text. In this paper, we make the first attempt to introduce explicit 3D shape priors into the CLIP-guided 3D optimization process. Specifically, we first generate a high-quality 3D shape from the input text in the text-to-shape stage as a 3D shape prior. We then use it as the initialization of a neural radiance field and optimize it with the full prompt. To address the challenging text-to-shape generation task, we present a simple yet effective approach that directly bridges the text and image modalities with a powerful text-to-image diffusion model. To narrow the style domain gap between the images synthesized by the text-to-image diffusion model and shape renderings used to train the image-to-shape generator, we further propose to jointly optimize a learnable text prompt and fine-tune the text-to-image diffusion model for rendering-style image generation. Our method, Dream3D, is capable of generating imaginative 3D content with superior visual quality and shape accuracy compared to state-of-the-art methods. Our project page is at https://bluestyle97.github.io/dream3d/.

DINN360: Deformable Invertible Neural Network for Latitude-Aware 360deg Image Rescaling
Guo, Yichen and Xu, Mai and Jiang, Lai and Sigal, Leonid and Chen, Yunjin



Research question: With the rapid development of virtual reality, 360° images are increasingly popular, but their wide field of view demands high resolution to ensure image quality, making such images difficult to acquire, store, and process.
Motivation: To alleviate this, the first attempt at 360° image rescaling is proposed: downscale a 360° image to a visually valid low-resolution version, then upscale it back to a high-resolution 360° image given the low-resolution variant.
Method: Two 360° image datasets are first analyzed, revealing several characteristics of how 360° images typically vary along their latitudes. Inspired by these findings, a novel deformable invertible neural network (INN), DINN360, is proposed for latitude-aware 360° image rescaling. In DINN360, a deformable INN downscales to the low-resolution image and projects the high-frequency (HF) component into the latent space by adaptively handling the various deformations that occur at different latitude regions. Given the downscaled low-resolution image, a high-quality high-resolution image is reconstructed in a conditional, latitude-aware manner by recovering the structure-related HF component from the latent space.
Results: Extensive experiments on four public datasets show that DINN360 performs considerably better than other state-of-the-art methods for 2x, 4x, and 8x 360° image rescaling.

With the rapid development of virtual reality, 360deg images have gained increasing popularity. Their wide field of view necessitates high resolution to ensure image quality. This, however, makes it harder to acquire, store and even process such 360deg images. To alleviate this issue, we propose the first attempt at 360deg image rescaling, which refers to downscaling a 360deg image to a visually valid low-resolution (LR) counterpart and then upscaling to a high-resolution (HR) 360deg image given the LR variant. Specifically, we first analyze two 360deg image datasets and observe several findings that characterize how 360deg images typically change along their latitudes. Inspired by these findings, we propose a novel deformable invertible neural network (INN), named DINN360, for latitude-aware 360deg image rescaling. In DINN360, a deformable INN is designed to downscale the LR image, and project the high-frequency (HF) component to the latent space by adaptively handling various deformations occurring at different latitude regions. Given the downscaled LR image, the high-quality HR image is then reconstructed in a conditional latitude-aware manner by recovering the structure-related HF component from the latent space. Extensive experiments over four public datasets show that our DINN360 method performs considerably better than other state-of-the-art methods for 2x, 4x and 8x 360deg image rescaling.
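A standard way to account for how equirectangular pixels over-represent the poles is to weight each row by the cosine of its latitude, as in the WS-PSNR metric. A sketch of such a latitude weight map and a weighted error; the fixed cosine weighting is an illustrative convention, whereas the paper learns latitude-dependent deformations:

```python
import math

def latitude_weights(height):
    """cos-latitude weight per row of an equirectangular panorama: rows near
    the equator count fully, rows near the poles shrink toward zero."""
    return [math.cos((j + 0.5) / height * math.pi - math.pi / 2)
            for j in range(height)]

def weighted_mse(a, b, weights):
    """Latitude-weighted mean squared error between two flat row-major images."""
    h = len(weights)
    w = len(a) // h
    num = sum(weights[j] * (a[j * w + i] - b[j * w + i]) ** 2
              for j in range(h) for i in range(w))
    return num / (w * sum(weights))

ws = latitude_weights(180)
```

Evaluating (or training) with such weights keeps polar distortion from dominating the objective.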

Learning Detailed Radiance Manifolds for High-Fidelity and 3D-Consistent Portrait Synthesis From Monocular Image
Deng, Yu and Wang, Baoyuan and Shum, Heung-Yeung



Research question: A key challenge for novel view synthesis of monocular portrait images is 3D consistency under continuous pose variations.
Motivation: Most existing methods rely on 2D generative models, which often leads to obvious 3D-inconsistency artifacts.
Method: A 3D-consistent novel view synthesis approach for monocular portrait images is proposed based on the recently introduced 3D-aware GAN, Generative Radiance Manifolds (GRAM), whose radiance manifold representation shows strong 3D consistency in multi-view image generation of virtual subjects.
Results: Trained on in-the-wild 2D images, the method achieves high-fidelity and 3D-consistent portrait synthesis, largely outperforming prior art.

A key challenge for novel view synthesis of monocular portrait images is 3D consistency under continuous pose variations. Most existing methods rely on 2D generative models which often leads to obvious 3D inconsistency artifacts. We present a 3D-consistent novel view synthesis approach for monocular portrait images based on a recent proposed 3D-aware GAN, namely Generative Radiance Manifolds (GRAM), which has shown strong 3D consistency at multiview image generation of virtual subjects via the radiance manifolds representation. However, simply learning an encoder to map a real image into the latent space of GRAM can only reconstruct coarse radiance manifolds without faithful fine details, while improving the reconstruction fidelity via instance-specific optimization is time-consuming. We introduce a novel detail manifolds reconstructor to learn 3D-consistent fine details on the radiance manifolds from monocular images, and combine them with the coarse radiance manifolds for high-fidelity reconstruction. The 3D priors derived from the coarse radiance manifolds are used to regulate the learned details to ensure reasonable synthesized results at novel views. Trained on in-the-wild 2D images, our method achieves high-fidelity and 3D-consistent portrait synthesis largely outperforming the prior art. Project page: https://yudeng.github.io/GRAMInverter/

Quantitative Manipulation of Custom Attributes on 3D-Aware Image Synthesis
Do, Hoseok and Yoo, Eun Kyung and Kim, Taehyeong and Lee, Chul and Choi, Jin Young



Research question: How to achieve fine-grained control of specific attributes in 3D images, without being limited to a particular object category.
Motivation: Although 3D-based GANs have been successfully applied to render photo-realistic, view-consistent 3D images with various attributes, there has been little research on fine-grained attribute control of 3D images beyond specific object categories.
Method: A novel image manipulation model over 3D-GAN representations is proposed for fine-grained control of specific custom attributes. By extending the latest 3D-based GAN models (e.g., EG3D), the user-friendly quantitative manipulation model enables fine yet normalized 3D control of multi-attribute quantities while preserving view consistency.
Results: Various experiments validate the effectiveness of the proposed method both qualitatively and quantitatively.

While 3D-based GAN techniques have been successfully applied to render photo-realistic 3D images with a variety of attributes while preserving view consistency, there has been little research on how to fine-control 3D images without being limited to a specific category of objects or their properties. To fill this research gap, we propose a novel image manipulation model of 3D-based GAN representations for a fine-grained control of specific custom attributes. By extending the latest 3D-based GAN models (e.g., EG3D), our user-friendly quantitative manipulation model enables a fine yet normalized control of 3D manipulation of multi-attribute quantities while achieving view consistency. We validate the effectiveness of our proposed technique both qualitatively and quantitatively through various experiments.

ObjectStitch: Object Compositing With Diffusion Model
Song, Yizhi and Zhang, Zhifei and Lin, Zhe and Cohen, Scott and Price, Brian and Zhang, Jianming and Kim, Soo Ye and Aliaga, Daniel



Research question: Object compositing based on 2D images is challenging because it typically involves multiple processing stages, such as color harmonization, geometry correction, and shadow generation, to produce realistic results.
Motivation: Annotating training pairs for compositing requires substantial manual effort from professionals and is hardly scalable, so a self-supervised framework leveraging the power of conditional diffusion models is proposed.
Method: The framework addresses the object compositing task holistically in a unified model, transforming the viewpoint, geometry, color, and shadow of the generated object without manual labeling. To preserve the input object's characteristics, a content adaptor is introduced to help maintain categorical semantics and object appearance; a data augmentation method is further adopted to improve the generator's fidelity.
Results: In a user study on various real-world images, the method outperforms relevant baselines in both realism and faithfulness of the composited result images.

Object compositing based on 2D images is a challenging problem since it typically involves multiple processing stages such as color harmonization, geometry correction and shadow generation to generate realistic results. Furthermore, annotating training data pairs for compositing requires substantial manual effort from professionals, and is hardly scalable. Thus, with the recent advances in generative models, in this work, we propose a self-supervised framework for object compositing by leveraging the power of conditional diffusion models. Our framework can holistically address the object compositing task in a unified model, transforming the viewpoint, geometry, color and shadow of the generated object while requiring no manual labeling. To preserve the input object's characteristics, we introduce a content adaptor that helps to maintain categorical semantics and object appearance. A data augmentation method is further adopted to improve the fidelity of the generator. Our method outperforms relevant baselines in both realism and faithfulness of the synthesized result images in a user study on various real-world images.

High-Fidelity 3D GAN Inversion by Pseudo-Multi-View Optimization
Xie, Jiaxin and Ouyang, Hao and Piao, Jingtan and Lei, Chenyang and Chen, Qifeng



Research question: How to synthesize high-fidelity novel views from a single image while preserving the input image's specific details.
Motivation: High-fidelity 3D GAN inversion is inherently challenging due to the geometry-texture trade-off: overfitting to a single-view input image often damages the estimated geometry during latent optimization.
Method: A novel pipeline is proposed based on pseudo-multi-view estimation with visibility analysis. Original textures are kept for the visible parts, and generative priors are leveraged for the occluded parts.
Results: Extensive experiments show better reconstruction and novel view synthesis quality than prior work, even for images with out-of-distribution textures. The pipeline also enables image attribute editing with the inverted latent code and 3D-aware texture modification.

We present a high-fidelity 3D generative adversarial network (GAN) inversion framework that can synthesize photo-realistic novel views while preserving specific details of the input image. High-fidelity 3D GAN inversion is inherently challenging due to the geometry-texture trade-off, where overfitting to a single view input image often damages the estimated geometry during the latent optimization. To solve this challenge, we propose a novel pipeline that builds on the pseudo-multi-view estimation with visibility analysis. We keep the original textures for the visible parts and utilize generative priors for the occluded parts. Extensive experiments show that our approach achieves advantageous reconstruction and novel view synthesis quality over prior work, even for images with out-of-distribution textures. The proposed pipeline also enables image attribute editing with the inverted latent code and 3D-aware texture modification. Our approach enables high-fidelity 3D rendering from a single image, which is promising for various applications of AI-generated 3D content. The source code is at https://github.com/jiaxinxie97/HFGI3D/.

A Hierarchical Representation Network for Accurate and Detailed Face Reconstruction From In-the-Wild Images
Lei, Biwen and Ren, Jianqiang and Feng, Mengyang and Cui, Miaomiao and Xie, Xuansong



Research question: How to achieve accurate and detailed face reconstruction from a single image.
Motivation: Limited by the low-dimensional representational capacity of 3DMM, most 3DMM-based face reconstruction methods fail to recover high-frequency facial details such as wrinkles and dimples.
Method: This paper presents a novel hierarchical representation network (HRN) for accurate face reconstruction. It implements geometry disentanglement and introduces a hierarchical representation for detailed face modeling, while incorporating 3D priors of facial details to improve the accuracy and authenticity of the results.
Results: On two single-view and two multi-view face reconstruction benchmarks, the method outperforms existing approaches in both reconstruction accuracy and visual quality. Finally, a high-quality 3D face dataset, FaceHD-100, is introduced to advance research on high-fidelity face reconstruction.

Limited by the nature of the low-dimensional representational capacity of 3DMM, most of the 3DMM-based face reconstruction (FR) methods fail to recover high-frequency facial details, such as wrinkles, dimples, etc. Some attempt to solve the problem by introducing detail maps or non-linear operations, however, the results are still not vivid. To this end, we in this paper present a novel hierarchical representation network (HRN) to achieve accurate and detailed face reconstruction from a single image. Specifically, we implement the geometry disentanglement and introduce the hierarchical representation to fulfill detailed face modeling. Meanwhile, 3D priors of facial details are incorporated to enhance the accuracy and authenticity of the reconstruction results. We also propose a de-retouching module to achieve better decoupling of the geometry and appearance. It is noteworthy that our framework can be extended to a multi-view fashion by considering detail consistency of different views. Extensive experiments on two single-view and two multi-view FR benchmarks demonstrate that our method outperforms the existing methods in both reconstruction accuracy and visual effects. Finally, we introduce a high-quality 3D face dataset FaceHD-100 to boost the research of high-fidelity face reconstruction. The project homepage is at https://younglbw.github.io/HRN-homepage/.

NeuralLift-360: Lifting an In-the-Wild 2D Photo to a 3D Object With 360° Views
Xu, Dejia and Jiang, Yifan and Wang, Peihao and Fan, Zhiwen and Wang, Yi and Wang, Zhangyang



Research question: How to lift a single image to a 3D object and generate 360° views that correspond well with the reference image.
Motivation: Virtual and augmented reality bring growing demand for 3D content, yet creating high-quality 3D content requires tedious work from human experts.
Method: A novel framework, NeuralLift-360, utilizes a depth-aware neural radiance representation (NeRF) and crafts the scene guided by denoising diffusion models. A ranking loss allows guidance from rough in-the-wild depth estimation, and a CLIP-guided sampling strategy provides coherent guidance.
Results: Experiments show that NeuralLift-360 significantly outperforms existing state-of-the-art baselines.

Virtual reality and augmented reality (XR) bring increasing demand for 3D content generation. However, creating high-quality 3D content requires tedious work from a human expert. In this work, we study the challenging task of lifting a single image to a 3D object and, for the first time, demonstrate the ability to generate a plausible 3D object with 360° views that corresponds well with the given reference image. By conditioning on the reference image, our model can fulfill the everlasting curiosity for synthesizing novel views of objects from images. Our technique sheds light on a promising direction of easing the workflows for 3D artists and XR designers. We propose a novel framework, dubbed NeuralLift-360, that utilizes a depth-aware neural radiance representation (NeRF) and learns to craft the scene guided by denoising diffusion models. By introducing a ranking loss, our NeuralLift-360 can be guided with rough depth estimation in the wild. We also adopt a CLIP-guided sampling strategy for the diffusion prior to provide coherent guidance. Extensive experiments demonstrate that our NeuralLift-360 significantly outperforms existing state-of-the-art baselines. Project page: https://vita-group.github.io/NeuralLift-360/

Learning Neural Proto-Face Field for Disentangled 3D Face Modeling in the Wild
Zhang, Zhenyu and Chen, Renwang and Cao, Weijian and Tai, Ying and Wang, Chengjie



Research question: How to recover 3D faces under extreme conditions of pose, shadow, or appearance?
Motivation: Current generative models, constrained by limited shape assumptions, easily fail to recover 3D faces under such extreme conditions.
Method: This paper presents a novel Neural Proto-face Field (NPF) for unsupervised robust 3D face modeling. NPF disentangles common/specific facial cues, i.e., identity, expression, and scene-specific details, from in-the-wild photo collections, and learns a face prototype to aggregate 3D-consistent identity.
Results: Experiments show that NPF recovers superior or competitive facial shapes and textures compared to state-of-the-art methods.

Generative models show good potential for recovering 3D faces beyond limited shape assumptions. While plausible details and resolutions are achieved, these models easily fail under extreme conditions of pose, shadow or appearance, due to the entangled fitting or lack of multi-view priors. To address this problem, this paper presents a novel Neural Proto-face Field (NPF) for unsupervised robust 3D face modeling. Instead of using constrained images as Neural Radiance Field (NeRF), NPF disentangles the common/specific facial cues, i.e., ID, expression and scene-specific details from in-the-wild photo collections. Specifically, NPF learns a face prototype to aggregate 3D-consistent identity via uncertainty modeling, extracting multi-image priors from a photo collection. NPF then learns to deform the prototype with the appropriate facial expressions, constrained by a loss of expression consistency and personal idiosyncrasies. Finally, NPF is optimized to fit a target image in the collection, recovering specific details of appearance and geometry. In this way, the generative model benefits from multi-image priors and meaningful facial structures. Extensive experiments on benchmarks show that NPF recovers superior or competitive facial shapes and textures, compared to state-of-the-art methods.

Self-Supervised Geometry-Aware Encoder for Style-Based 3D GAN Inversion
Lan, Yushi and Meng, Xuyi and Yang, Shuai and Loy, Chen Change and Dai, Bo



Research question: This paper addresses the challenging problem of 3D face reconstruction and semantic editing: given a face image, predict its latent code so as to faithfully recover its 3D shape and detailed textures.
Motivation: Although 2D StyleGAN has made great progress in 2D face reconstruction and semantic editing, studies extending 2D StyleGAN to 3D faces still lack a generic 3D GAN inversion framework, which limits the applications of 3D face reconstruction and semantic editing.
Method: We devise an effective self-training scheme to constrain the learning of inversion. The learning requires no real-world 2D-3D training pairs and instead learns efficiently from proxy samples generated by a 3D GAN. Beyond a global latent code that captures coarse shape and texture, we augment the generation network with a local branch whose pixel-aligned features faithfully reconstruct face details. We also consider a new pipeline for 3D view-consistent editing.
Results: Extensive experiments show that our method outperforms state-of-the-art inversion methods in both shape and texture reconstruction quality.

StyleGAN has achieved great progress in 2D face reconstruction and semantic editing via image inversion and latent editing. While studies over extending 2D StyleGAN to 3D faces have emerged, a corresponding generic 3D GAN inversion framework is still missing, limiting the applications of 3D face reconstruction and semantic editing. In this paper, we study the challenging problem of 3D GAN inversion where a latent code is predicted given a single face image to faithfully recover its 3D shapes and detailed textures. The problem is ill-posed: innumerable compositions of shape and texture could be rendered to the current image. Furthermore, with the limited capacity of a global latent code, 2D inversion methods cannot preserve faithful shape and texture at the same time when applied to 3D models. To solve this problem, we devise an effective self-training scheme to constrain the learning of inversion. The learning is done efficiently without any real-world 2D-3D training pairs but proxy samples generated from a 3D GAN. In addition, apart from a global latent code that captures the coarse shape and texture information, we augment the generation network with a local branch, where pixel-aligned features are added to faithfully reconstruct face details. We further consider a new pipeline to perform 3D view-consistent editing. Extensive experiments show that our method outperforms state-of-the-art inversion methods in both shape and texture reconstruction quality.

PC2: Projection-Conditioned Point Cloud Diffusion for Single-Image 3D Reconstruction
Melas-Kyriazi, Luke and Rupprecht, Christian and Vedaldi, Andrea



Research question: How to reconstruct the 3D shape of an object from a single RGB image.
Motivation: Reconstructing the 3D shape of an object from a single RGB image is a long-standing problem in computer vision.
Method: A novel approach to single-image 3D reconstruction that generates a sparse point cloud via a conditional denoising diffusion process. Taking a single RGB image and its camera pose as input, it gradually denoises a set of 3D points, initially sampled from a three-dimensional Gaussian, into the shape of the object. The key is a geometrically consistent conditioning process called projection conditioning: at each step of the diffusion process, local image features are projected onto the partially denoised point cloud from the given camera pose.
Results: Experiments show the method not only performs well on synthetic benchmarks but also yields large qualitative improvements on complex real-world data.

Reconstructing the 3D shape of an object from a single RGB image is a long-standing problem in computer vision. In this paper, we propose a novel method for single-image 3D reconstruction which generates a sparse point cloud via a conditional denoising diffusion process. Our method takes as input a single RGB image along with its camera pose and gradually denoises a set of 3D points, whose positions are initially sampled randomly from a three-dimensional Gaussian distribution, into the shape of an object. The key to our method is a geometrically-consistent conditioning process which we call projection conditioning: at each step in the diffusion process, we project local image features onto the partially-denoised point cloud from the given camera pose. This projection conditioning process enables us to generate high-resolution sparse geometries that are well-aligned with the input image and can additionally be used to predict point colors after shape reconstruction. Moreover, due to the probabilistic nature of the diffusion process, our method is naturally capable of generating multiple different shapes consistent with a single input image. In contrast to prior work, our approach not only performs well on synthetic benchmarks but also gives large qualitative improvements on complex real-world data.
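The projection-conditioning step described above can be sketched in a few lines: each partially denoised 3D point is pushed through the pinhole camera and picks up the image feature at the pixel it lands on. This is a minimal NumPy sketch with nearest-neighbor lookup; the function name and the single-view, camera-frame assumptions are illustrative, not the paper's implementation.

```python
import numpy as np

def projection_condition(points, feat_map, K):
    """Project 3D points into the image with pinhole intrinsics K and
    gather a per-point feature by nearest-neighbor lookup in feat_map.

    points:   (N, 3) 3D points in camera coordinates (z > 0)
    feat_map: (H, W, C) dense image features (e.g. from a 2D backbone)
    K:        (3, 3) camera intrinsic matrix
    Returns (N, 3 + C): each point concatenated with its pixel feature,
    which conditions the next denoising step on the input view.
    """
    H, W, C = feat_map.shape
    uvw = points @ K.T                     # (N, 3) homogeneous pixel coords
    uv = uvw[:, :2] / uvw[:, 2:3]          # perspective divide
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
    point_feats = feat_map[v, u]           # (N, C) gathered features
    return np.concatenate([points, point_feats], axis=1)
```

A real model would bilinearly interpolate features and feed the result into the point-cloud denoiser; the lookup above only shows how geometry aligns image evidence with the evolving points.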

Evading Forensic Classifiers With Attribute-Conditioned Adversarial Faces
Shamshad, Fahad and Srivatsan, Koushik and Nandakumar, Karthik



Research question: How to generate adversarial fake face images with a specified set of attributes (e.g., hair color, eye size, race, gender) that successfully fool forensic face classifiers.
Motivation: Existing deep-learning-based forensic classifiers can detect whether a face image is synthetic or real with high accuracy, but they are vulnerable to adversarial attacks. Such attacks can evade detection by forensic classifiers, yet they introduce visible noise patterns that careful human scrutiny can reveal, and they assume access to the target model, which is not always possible.
Method: Leverage the state-of-the-art generative adversarial network StyleGAN, whose disentangled representations enable a range of modifications without leaving the manifold of natural images. Adversarial latent codes are searched within StyleGAN's feature space, with the search guided by either a text prompt or a reference image. A meta-learning-based optimization strategy is further proposed to achieve transferable performance on unknown target models.
Results: Experiments show the method produces semantically manipulated adversarial fake faces that are true to the specified attribute set and successfully fool forensic face classifiers, while remaining undetectable by humans.

The ability of generative models to produce highly realistic synthetic face images has raised security and ethical concerns. As a first line of defense against such fake faces, deep learning based forensic classifiers have been developed. While these forensic models can detect whether a face image is synthetic or real with high accuracy, they are also vulnerable to adversarial attacks. Although such attacks can be highly successful in evading detection by forensic classifiers, they introduce visible noise patterns that are detectable through careful human scrutiny. Additionally, these attacks assume access to the target model(s) which may not always be true. Attempts have been made to directly perturb the latent space of GANs to produce adversarial fake faces that can circumvent forensic classifiers. In this work, we go one step further and show that it is possible to successfully generate adversarial fake faces with a specified set of attributes (e.g., hair color, eye size, race, gender, etc.). To achieve this goal, we leverage the state-of-the-art generative model StyleGAN with disentangled representations, which enables a range of modifications without leaving the manifold of natural images. We propose a framework to search for adversarial latent codes within the feature space of StyleGAN, where the search can be guided either by a text prompt or a reference image. We also propose a meta-learning based optimization strategy to achieve transferable performance on unknown target models. Extensive experiments demonstrate that the proposed approach can produce semantically manipulated adversarial fake faces, which are true to the specified attribute set and can successfully fool forensic face classifiers, while remaining undetectable by humans. Code: https://github.com/koushiksrivats/face_attribute_attack.

Handwritten Text Generation From Visual Archetypes
Pippi, Vittorio and Cascianelli, Silvia and Cucchiara, Rita



Research question: How to generate handwritten text images in a writer-specific style, especially for unseen styles, new words, and characters rarely encountered during training.
Motivation: Although generative models can already emulate a writer's style, generalization to rare characters has not been addressed.
Method: We devise a Transformer-based model for few-shot styled handwritten text generation, focusing on robust and informative representations of both text and style. In particular, we propose a novel representation of textual content as a sequence of dense vectors obtained from images of symbols rendered as standard GNU Unifont glyphs, which can be regarded as their visual archetypes. This approach better suits generating characters that are rarely seen during training but may share visual details with frequently observed ones. For style, a robust representation of unseen writers' calligraphy is obtained via dedicated pre-training on a large synthetic dataset.
Results: Quantitative and qualitative results show the approach generates words in unseen styles and with rare characters more faithfully than existing methods relying on independent one-hot character encodings.

Generating synthetic images of handwritten text in a writer-specific style is a challenging task, especially in the case of unseen styles and new words, and even more when these latter contain characters that are rarely encountered during training. While emulating a writer's style has been recently addressed by generative models, the generalization towards rare characters has been disregarded. In this work, we devise a Transformer-based model for Few-Shot styled handwritten text generation and focus on obtaining a robust and informative representation of both the text and the style. In particular, we propose a novel representation of the textual content as a sequence of dense vectors obtained from images of symbols written as standard GNU Unifont glyphs, which can be considered their visual archetypes. This strategy is more suitable for generating characters that, despite having been seen rarely during training, possibly share visual details with the frequently observed ones. As for the style, we obtain a robust representation of unseen writers' calligraphy by exploiting specific pre-training on a large synthetic dataset. Quantitative and qualitative results demonstrate the effectiveness of our proposal in generating words in unseen styles and with rare characters more faithfully than existing approaches relying on independent one-hot encodings of the characters.

Align Your Latents: High-Resolution Video Synthesis With Latent Diffusion Models
Blattmann, Andreas and Rombach, Robin and Ling, Huan and Dockhorn, Tim and Kim, Seung Wook and Fidler, Sanja and Kreis, Karsten



Research question: How to use latent diffusion models (LDMs) for high-quality, high-resolution video generation while avoiding excessive compute demands.
Motivation: Existing approaches to the resource-intensive task of video generation often require massive compute; applying the LDM paradigm to high-resolution video generation can keep output quality high while reducing compute requirements.
Method: First, an LDM is pre-trained on images only; the image generator is then turned into a video generator by introducing a temporal dimension into the latent-space diffusion model and fine-tuning on encoded image sequences, i.e., videos. The diffusion model upsamplers are temporally aligned in the same way, turning them into temporally consistent video super-resolution models.
Results: Validated on real driving videos at 512x1024 resolution, achieving state-of-the-art performance. The approach can also easily leverage off-the-shelf pre-trained image LDMs, since only a temporal alignment model needs training in that case; in this way, the publicly available state-of-the-art text-to-image LDM Stable Diffusion is turned into an efficient and expressive text-to-video model with resolution up to 1280x2048.

Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences, i.e., videos. Similarly, we temporally align diffusion model upsamplers, turning them into temporally consistent video super resolution models. We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling. In particular, we validate our Video LDM on real driving videos of resolution 512x1024, achieving state-of-the-art performance. Furthermore, our approach can easily leverage off-the-shelf pre-trained image LDMs, as we only need to train a temporal alignment model in that case. Doing so, we turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280x2048. We show that the temporal layers trained in this way generalize to different fine-tuned text-to-image LDMs. Utilizing this property, we show the first results for personalized text-to-video generation, opening exciting directions for future content creation. Project page: https://nv-tlabs.github.io/VideoLDM/
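Introducing a temporal dimension into a pretrained image model is largely axis bookkeeping: the frozen spatial layers run per frame by folding time into the batch axis, while the new temporal layers run per spatial location across frames. A minimal reshaping sketch, where the function names and the (B, T, C, H, W) layout are assumptions for illustration:

```python
import numpy as np

def apply_spatial(x, layer):
    """Run a per-frame (spatial) layer by folding time into the batch axis.
    x: (B, T, C, H, W); layer maps (N, C, H, W) -> (N, C, H, W)."""
    B, T, C, H, W = x.shape
    y = layer(x.reshape(B * T, C, H, W))
    return y.reshape(B, T, C, H, W)

def apply_temporal(x, layer):
    """Run a per-location (temporal) layer by folding space into the batch
    axis, so the layer only mixes information across the T frames.
    x: (B, T, C, H, W); layer maps (N, T, C) -> (N, T, C)."""
    B, T, C, H, W = x.shape
    y = x.transpose(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
    y = layer(y)
    return y.reshape(B, H, W, T, C).transpose(0, 3, 4, 1, 2)
```

In the paper's setting the temporal layers (attention or 3D convolutions) are the only newly trained parameters; here any callable with the stated shape contract can stand in for them.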

NULL-Text Inversion for Editing Real Images Using Guided Diffusion Models
Mokady, Ron and Hertz, Amir and Aberman, Kfir and Pritch, Yael and Cohen-Or, Daniel



Research question: How to edit real images with text-guided diffusion models.
Motivation: Large text-guided diffusion models currently provide powerful image generation capabilities, but modifying a real image requires inverting it, together with a meaningful text prompt, into the pretrained model's domain.
Method: This paper proposes an accurate inversion technique with two key novel components: (i) pivotal inversion for diffusion models, which optimizes around a single pivotal noise vector; (ii) null-text optimization, which modifies only the unconditional text embedding used for classifier-free guidance rather than the input text embedding.
Results: Extensive evaluation on a variety of images and prompt edits demonstrates the effectiveness of Null-text inversion for high-fidelity editing of real images.

Recent large-scale text-guided diffusion models provide powerful image generation capabilities. Currently, a massive effort is given to enable the modification of these images using text only as means to offer intuitive and versatile editing tools. To edit a real image using these state-of-the-art tools, one must first invert the image with a meaningful text prompt into the pretrained model's domain. In this paper, we introduce an accurate inversion technique and thus facilitate an intuitive text-based modification of the image. Our proposed inversion consists of two key novel components: (i) Pivotal inversion for diffusion models. While current methods aim at mapping random noise samples to a single input image, we use a single pivotal noise vector for each timestamp and optimize around it. We recognize that a direct DDIM inversion is inadequate on its own, but does provide a rather good anchor for our optimization. (ii) NULL-text optimization, where we only modify the unconditional textual embedding that is used for classifier-free guidance, rather than the input text embedding. This allows for keeping both the model weights and the conditional embedding intact and hence enables applying prompt-based editing while avoiding the cumbersome tuning of the model's weights. Our Null-text inversion, based on the publicly available Stable Diffusion model, is extensively evaluated on a variety of images and various prompt editing, showing high-fidelity editing of real images.
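The null-text optimization component can be illustrated with a toy denoiser: keep the model weights and the conditional embedding frozen, and gradient-descend only the unconditional ("null") embedding until the classifier-free-guided prediction matches the pivotal target for that timestep. Everything below (the linear stand-in for the UNet, the learning rate, the loss) is an illustrative assumption, not the paper's code:

```python
import numpy as np

def cfg_step(z, null_emb, cond_emb, guidance=7.5):
    """One classifier-free-guided prediction. The 'model' here is a fixed
    linear map of latent + embedding -- a stand-in, NOT a diffusion UNet."""
    eps_uncond = 0.5 * z + null_emb
    eps_cond = 0.5 * z + cond_emb
    return eps_uncond + guidance * (eps_cond - eps_uncond)

def optimize_null_embedding(z, target, cond_emb, guidance=7.5,
                            steps=200, lr=0.01):
    """Gradient-descend only the unconditional embedding so the guided
    prediction matches the pivotal DDIM target at this timestep; model
    weights and the conditional embedding stay frozen."""
    null_emb = np.zeros_like(cond_emb)
    for _ in range(steps):
        pred = cfg_step(z, null_emb, cond_emb, guidance)
        # d pred / d null_emb = (1 - guidance); descend on ||pred - target||^2
        null_emb = null_emb - lr * 2.0 * (pred - target) * (1.0 - guidance)
    return null_emb
```

The design point survives the toy setting: because only the null embedding moves, prompt-based editing still works afterwards with the intact conditional pathway.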

Neural Texture Synthesis With Guided Correspondence
Zhou, Yang and Chen, Kaijian and Xiao, Rongjun and Huang, Hui



Research question: This paper aims to re-promote the combination of MRFs and neural networks, i.e., the CNNMRF model, for texture synthesis.
Motivation: Although MRFs are the cornerstone of classical example-based texture synthesis, they have not been fully valued in the deep learning era.
Method: We first propose computing the Guided Correspondence Distance in the nearest-neighbor search, and on this basis define a Guided Correspondence loss to measure the similarity of the output texture to the example.
Results: Experiments show the approach surpasses existing neural methods in both uncontrolled and controlled texture synthesis. More importantly, the Guided Correspondence loss can serve as a general textural loss, e.g., for training generative networks for real-time controlled synthesis and for inversion-based single-image editing. In contrast, existing textural losses, such as the Sliced Wasserstein loss, cannot handle these challenging tasks.

Markov random fields (MRFs) are the cornerstone of classical approaches to example-based texture synthesis. Yet, they are not fully valued in the deep learning era. This paper aims to re-promote the combination of MRFs and neural networks, i.e., the CNNMRF model, for texture synthesis, with two key observations made. We first propose to compute the Guided Correspondence Distance in the nearest neighbor search, based on which a Guided Correspondence loss is defined to measure the similarity of the output texture to the example. Experiments show that our approach surpasses existing neural approaches in uncontrolled and controlled texture synthesis. More importantly, the Guided Correspondence loss can function as a general textural loss in, e.g., training generative networks for real-time controlled synthesis and inversion-based single-image editing. In contrast, existing textural losses, such as the Sliced Wasserstein loss, cannot work on these challenging tasks.
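The idea of folding guidance into the nearest-neighbor search can be sketched as follows: the matching distance adds a penalty on guidance-channel disagreement, so correspondences are steered by, e.g., orientation or progression maps. The distance form, the weighting, and the final averaging here are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def guided_correspondence_loss(out_feats, ex_feats,
                               guide_out=None, guide_ex=None, lam=1.0):
    """Toy guided nearest-neighbor texture loss.

    out_feats: (N, D) features of the synthesized texture
    ex_feats:  (M, D) features of the example texture
    guide_out, guide_ex: optional (N,), (M,) scalar guidance channels
    Each output feature is matched to its nearest example feature under a
    distance that adds lam * (guidance difference)^2, then the mean squared
    feature distance to the matched neighbors is returned.
    """
    d = ((out_feats[:, None, :] - ex_feats[None, :, :]) ** 2).sum(-1)  # (N, M)
    if guide_out is not None:
        d = d + lam * (guide_out[:, None] - guide_ex[None, :]) ** 2
    nn = d.argmin(axis=1)                  # guided nearest neighbor per output
    return float(((out_feats - ex_feats[nn]) ** 2).sum(-1).mean())
```

With lam large, the guidance term dominates the matching, which is exactly how a control map redirects correspondences without changing the feature space itself.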

Hierarchical Fine-Grained Image Forgery Detection and Localization
Guo, Xiao and Liu, Xiaohong and Ren, Zhiyuan and Grosz, Steven and Masi, Iacopo and Liu, Xiaoming



Research question: How to effectively detect and localize image forgery?
Motivation: Forgery attributes differ greatly between CNN-synthesized and image-editing domains, which makes unified image forgery detection and localization (IFDL) challenging.
Method: A hierarchical fine-grained IFDL representation learning approach: forgery attributes of a manipulated image are represented with multiple labels at different levels, and fine-grained classification at these levels exploits the hierarchical dependency between them.
Results: The method demonstrates its effectiveness on 7 different benchmarks for both the IFDL task and forgery attribute classification.

Differences in forgery attributes of images generated in CNN-synthesized and image-editing domains are large, and such differences make a unified image forgery detection and localization (IFDL) challenging. To this end, we present a hierarchical fine-grained formulation for IFDL representation learning. Specifically, we first represent forgery attributes of a manipulated image with multiple labels at different levels. Then we perform fine-grained classification at these levels using the hierarchical dependency between them. As a result, the algorithm is encouraged to learn both comprehensive features and inherent hierarchical nature of different forgery attributes, thereby improving the IFDL representation. Our proposed IFDL framework contains three components: multi-branch feature extractor, localization and classification modules. Each branch of the feature extractor learns to classify forgery attributes at one level, while localization and classification modules segment the pixel-level forgery region and detect image-level forgery, respectively. Lastly, we construct a hierarchical fine-grained dataset to facilitate our study. We demonstrate the effectiveness of our method on 7 different benchmarks, for both tasks of IFDL and forgery attribute classification. Our source code and dataset can be found at https://github.com/CHELSEA234/HiFi_IFDL

Modernizing Old Photos Using Multiple References via Photorealistic Style Transfer
Gunawan, Agus and Kim, Soo Ye and Sim, Hyeonjun and Lee, Jae-Ho and Kim, Munchurl



Research question: How to modernize old photos using multiple references.
Motivation: Current old-photo modernization methods fail to exploit multiple references effectively; we propose a novel multi-reference-based old photo modernization (MROPM) framework to address this.
Method: We propose a novel multi-reference old photo modernization network (MROPM-Net) together with a new synthetic data generation scheme. MROPM-Net stylizes old photos using multiple references via photorealistic style transfer (PST) and further enhances the results to produce modern-looking images, while the synthetic data generation scheme trains the network to effectively utilize multiple references for modernization.
Results: Experiments show the method outperforms other baselines in modernizing real old photos, even though no old photos were used during training. Moreover, the method can appropriately select styles from multiple references for each semantic region of the old photo, further improving modernization performance.

This paper firstly presents old photo modernization using multiple references by performing stylization and enhancement in a unified manner. In order to modernize old photos, we propose a novel multi-reference-based old photo modernization (MROPM) framework consisting of a network MROPM-Net and a novel synthetic data generation scheme. MROPM-Net stylizes old photos using multiple references via photorealistic style transfer (PST) and further enhances the results to produce modern-looking images. Meanwhile, the synthetic data generation scheme trains the network to effectively utilize multiple references to perform modernization. To evaluate the performance, we propose a new old photos benchmark dataset (CHD) consisting of diverse natural indoor and outdoor scenes. Extensive experiments show that the proposed method outperforms other baselines in performing modernization on real old photos, even though no old photos were used during training. Moreover, our method can appropriately select styles from multiple references for each semantic region in the old photo to further improve the modernization performance.

Interactive Cartoonization With Controllable Perceptual Factors
Ahn, Namhyuk and Kwon, Patrick and Back, Jihye and Hong, Kibeom and Kim, Seungkwon



Research question: How to render natural photos into cartoon styles while allowing artists to manipulate the results.
Motivation: Existing deep methods only perform end-to-end translation, preventing artists from manipulating the results.
Method: We propose a novel solution with texture and color editing features based on the cartoon creation process: a model architecture with separate texture and color decoders to decouple these attributes. In the texture decoder, a texture controller lets users control stroke style and abstraction to generate diverse cartoon textures. An HSV color augmentation is further introduced to induce the network to generate consistent color translation.
Results: To the best of our knowledge, this is the first method to control cartoonization at inference time, generating high-quality results compared to baselines.

Cartoonization is a task that renders natural photos into cartoon styles. Previous deep methods have focused only on end-to-end translation, preventing artists from manipulating the results. To tackle this, in this work, we propose a novel solution with editing features of texture and color based on the cartoon creation process. To do that, we design a model architecture to have separate decoders, texture and color, to decouple these attributes. In the texture decoder, we propose a texture controller, which enables a user to control stroke style and abstraction to generate diverse cartoon textures. We also introduce an HSV color augmentation to induce the networks to generate consistent color translation. To the best of our knowledge, our work is the first method to control the cartoonization during the inference step, generating high-quality results compared to baselines.

topic-9

Topic words :  segmentation,  semantic,  supervised,  object,  detection,  labels,  instance,  level

HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation
Ding, Jian and Xue, Nan and Xia, Gui-Song and Schiele, Bernt and Dai, Dengxin



Research question: Existing semantic segmentation models have achieved great success under the independent and identically distributed (i.i.d.) condition, but in real-world applications test data may come from a different domain than the training data; improving model robustness against domain differences is therefore important.
Motivation: This work studies semantic segmentation under the domain generalization setting, where a model is trained only on the source domain and tested on unseen target domains.
Method: We propose a novel hierarchical grouping transformer (HGFormer) that improves robustness by explicitly grouping pixels to form part-level masks and then whole-level masks.
Results: Experiments show that HGFormer yields more robust semantic segmentation results than per-pixel classification methods and flat-grouping transformers, and significantly outperforms previous methods.

Current semantic segmentation models have achieved great success under the independent and identically distributed (i.i.d.) condition. However, in real-world applications, test data might come from a different domain than training data. Therefore, it is important to improve model robustness against domain differences. This work studies semantic segmentation under the domain generalization setting, where a model is trained only on the source domain and tested on the unseen target domain. Existing works show that Vision Transformers are more robust than CNNs and show that this is related to the visual grouping property of self-attention. In this work, we propose a novel hierarchical grouping transformer (HGFormer) to explicitly group pixels to form part-level masks and then whole-level masks. The masks at different scales aim to segment out both parts and a whole of classes. HGFormer combines mask classification results at both scales for class label prediction. We assemble multiple interesting cross-domain settings by using seven public semantic segmentation datasets. Experiments show that HGFormer yields more robust semantic segmentation results than per-pixel classification methods and flat-grouping transformers, and outperforms previous methods significantly. Code will be available at https://github.com/dingjiansw101/HGFormer.

Distilling Vision-Language Pre-Training To Collaborate With Weakly-Supervised Temporal Action Localization
Ju, Chen and Zheng, Kunhao and Liu, Jinxiang and Zhao, Peisen and Zhang, Ya and Chang, Jianlong and Tian, Qi and Wang, Yanfeng



Research question: How to perform weakly-supervised temporal action localization and resolve the incompleteness problem caused by the mismatched optimization objectives of existing methods.
Motivation: Most methods adopt off-the-shelf classification-based pre-training to generate video features for action localization, but the different optimization objectives of classification and localization leave the localization results seriously incomplete.
Method: Distill free action knowledge from vision-language pre-training (VLP) and build a novel distillation-collaboration framework with two branches acting as CBP and VLP respectively, optimized through a dual-branch alternate training strategy.
Results: Extensive experiments and ablation studies on THUMOS14 and ActivityNet1.2 show that the method significantly outperforms state-of-the-art methods.

Weakly-supervised temporal action localization (WTAL) learns to detect and classify action instances with only category labels. Most methods widely adopt the off-the-shelf Classification-Based Pre-training (CBP) to generate video features for action localization. However, the different optimization objectives between classification and localization, make temporally localized results suffer from the serious incomplete issue. To tackle this issue without additional annotations, this paper considers to distill free action knowledge from Vision-Language Pre-training (VLP), as we surprisingly observe that the localization results of vanilla VLP have an over-complete issue, which is just complementary to the CBP results. To fuse such complementarity, we propose a novel distillation-collaboration framework with two branches acting as CBP and VLP respectively. The framework is optimized through a dual-branch alternate training strategy. Specifically, during the B step, we distill the confident background pseudo-labels from the CBP branch; while during the F step, the confident foreground pseudo-labels are distilled from the VLP branch. As a result, the dual-branch complementarity is effectively fused to promote one strong alliance. Extensive experiments and ablation studies on THUMOS14 and ActivityNet1.2 reveal that our method significantly outperforms state-of-the-art methods.

Exploring Structured Semantic Prior for Multi Label Recognition With Incomplete Labels
Ding, Zixuan and Wang, Ao and Chen, Hui and Zhang, Qiang and Liu, Pengzhang and Bao, Yongjun and Yan, Weipeng and Han, Jungong



Research question: Multi-label recognition (MLR) with incomplete labels is very challenging.
Motivation: Although existing vision-language models such as CLIP excel at image-to-label correspondence, they generally overlook the valuable prior of label-to-label correspondence.
Method: Derive a structured semantic prior about label-to-label correspondence via a semantic prior prompter, and propose a novel Semantic Correspondence Prompt Network (SCPNet) to exploit it.
Results: Comprehensive experiments and analyses on several widely used benchmark datasets show the method significantly outperforms existing methods on all datasets, well demonstrating its effectiveness and superiority.

Multi-label recognition (MLR) with incomplete labels is very challenging. Recent works strive to explore the image-to-label correspondence in the vision-language model, i.e., CLIP, to compensate for insufficient annotations. In spite of promising performance, they generally overlook the valuable prior about the label-to-label correspondence. In this paper, we advocate remedying the deficiency of label supervision for the MLR with incomplete labels by deriving a structured semantic prior about the label-to-label correspondence via a semantic prior prompter. We then present a novel Semantic Correspondence Prompt Network (SCPNet), which can thoroughly explore the structured semantic prior. A Prior-Enhanced Self-Supervised Learning method is further introduced to enhance the use of the prior. Comprehensive experiments and analyses on several widely used benchmark datasets show that our method significantly outperforms existing methods on all datasets, well demonstrating the effectiveness and the superiority of our method. Our code will be available at https://github.com/jameslahm/SCPNet.

Instance-Specific and Model-Adaptive Supervision for Semi-Supervised Semantic Segmentation
Zhao, Zhen and Long, Sifan and Pi, Jimin and Wang, Jingdong and Zhou, Luping



Research question: Existing semi-supervised semantic segmentation methods treat all unlabeled data equally and barely consider the differences and training difficulty among unlabeled instances; this work aims to address that.
Motivation: Differentiating unlabeled instances can promote instance-specific supervision that adapts dynamically to the model's evolution.
Method: An instance-specific and model-adaptive supervision method named iMAS. Relying on the model's performance, it evaluates the quantitative hardness of each unlabeled instance with a class-weighted symmetric intersection-over-union and learns from unlabeled data progressively according to the evaluated hardness.
Results: Experimental results show that iMAS achieves remarkable performance gains over current state-of-the-art approaches on segmentation benchmarks under various semi-supervised partition protocols.

Recently, semi-supervised semantic segmentation has achieved promising performance with a small fraction of labeled data. However, most existing studies treat all unlabeled data equally and barely consider the differences and training difficulties among unlabeled instances. Differentiating unlabeled instances can promote instance-specific supervision to adapt to the model's evolution dynamically. In this paper, we emphasize the cruciality of instance differences and propose an instance-specific and model-adaptive supervision for semi-supervised semantic segmentation, named iMAS. Relying on the model's performance, iMAS employs a class-weighted symmetric intersection-over-union to evaluate quantitative hardness of each unlabeled instance and supervises the training on unlabeled data in a model-adaptive manner. Specifically, iMAS learns from unlabeled instances progressively by weighing their corresponding consistency losses based on the evaluated hardness. Besides, iMAS dynamically adjusts the augmentation for each instance such that the distortion degree of augmented instances is adapted to the model's generalization capability across the training course. Not integrating additional losses and training procedures, iMAS can obtain remarkable performance gains against current state-of-the-art approaches on segmentation benchmarks under different semi-supervised partition protocols.
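The hardness-evaluation step can be sketched as a class-weighted symmetric IoU between two predictions for the same unlabeled image (e.g. teacher vs. student): high agreement marks an easy instance whose consistency loss then receives a larger weight. The uniform default weights and the skip rule for classes absent from both maps are assumptions of this sketch:

```python
import numpy as np

def class_weighted_sym_iou(pred_a, pred_b, num_classes, class_weights=None):
    """Class-weighted IoU between two hard segmentation maps; the measure
    is symmetric in its two arguments. A low value flags a 'hard'
    unlabeled instance in the iMAS sense.

    pred_a, pred_b: (H, W) integer label maps
    Returns a scalar in [0, 1].
    """
    if class_weights is None:
        class_weights = np.ones(num_classes)
    ious, weights = [], []
    for c in range(num_classes):
        a, b = pred_a == c, pred_b == c
        union = np.logical_or(a, b).sum()
        if union == 0:
            continue  # class absent from both maps: skip it entirely
        inter = np.logical_and(a, b).sum()
        ious.append(inter / union)
        weights.append(class_weights[c])
    ious, weights = np.array(ious), np.array(weights)
    return float((weights * ious).sum() / weights.sum())
```

In a training loop this scalar would weight the per-instance consistency loss and modulate the augmentation strength, as the abstract describes.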

Mapping Degeneration Meets Label Evolution: Learning Infrared Small Target Detection With Single Point Supervision
Ying, Xinyi and Liu, Li and Wang, Yingqian and Li, Ruojing and Chen, Nuo and Lin, Zaiping and Sheng, Weidong and Zhou, Shilin



Research question: Training a convolutional neural network (CNN) to detect infrared small targets in a fully supervised manner requires a large number of per-pixel annotations, which is highly expensive.
Motivation: To handle this problem, this paper makes the first attempt at infrared small target detection with point-level supervision.
Method: During the training phase supervised by point labels, CNNs are observed to first learn to segment a cluster of pixels near the targets and then gradually converge to predicting the ground-truth point labels. Motivated by this "mapping degeneration" phenomenon, a label evolution framework named label evolution with single point supervision (LESPS) is proposed to progressively expand the point labels by leveraging the intermediate predictions of the CNN.
Results: Experimental results show that CNNs equipped with LESPS can recover target masks from the corresponding point labels, achieving over 70% and 95% of their fully supervised performance in pixel-level intersection over union (IoU) and object-level probability of detection (Pd), respectively.

Training a convolutional neural network (CNN) to detect infrared small targets in a fully supervised manner has gained remarkable research interests in recent years, but is highly labor expensive since a large number of per-pixel annotations are required. To handle this problem, in this paper, we make the first attempt to achieve infrared small target detection with point-level supervision. Interestingly, during the training phase supervised by point labels, we discover that CNNs first learn to segment a cluster of pixels near the targets, and then gradually converge to predict groundtruth point labels. Motivated by this "mapping degeneration" phenomenon, we propose a label evolution framework named label evolution with single point supervision (LESPS) to progressively expand the point label by leveraging the intermediate predictions of CNNs. In this way, the network predictions can finally approximate the updated pseudo labels, and a pixel-level target mask can be obtained to train CNNs in an end-to-end manner. We conduct extensive experiments with insightful visualizations to validate the effectiveness of our method. Experimental results show that CNNs equipped with LESPS can well recover the target masks from corresponding point labels, and can achieve over 70% and 95% of their fully supervised performance in terms of pixel-level intersection over union (IoU) and object-level probability of detection (Pd), respectively. Code is available at https://github.com/XinyiYing/LESPS.
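One evolution round can be sketched as thresholding the CNN's intermediate prediction and growing the current label outward from the annotated point through connected confident pixels. The 4-connected flood fill and the fixed threshold are assumptions of this sketch; the paper's update rule differs in detail:

```python
import numpy as np

def evolve_labels(point_label, prediction, thresh=0.5):
    """One round of label evolution: expand the current (point) label with
    confidently predicted pixels that are 4-connected to the already
    labeled region, so pseudo masks grow outward from annotated points.

    point_label: (H, W) binary current label
    prediction:  (H, W) CNN scores in [0, 1]
    Returns the evolved boolean mask.
    """
    confident = prediction > thresh
    evolved = point_label.astype(bool).copy()
    grew = True
    while grew:  # flood-fill confident pixels connected to the label
        grew = False
        padded = np.pad(evolved, 1)
        neighbor = (padded[:-2, 1:-1] | padded[2:, 1:-1] |
                    padded[1:-1, :-2] | padded[1:-1, 2:])
        new = confident & neighbor & ~evolved
        if new.any():
            evolved |= new
            grew = True
    return evolved
```

Repeating this between training rounds lets the network's own predictions approximate the updated pseudo labels, yielding a pixel-level mask from a single annotated point.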

Delving Into Shape-Aware Zero-Shot Semantic Segmentation
Liu, Xinyu and Tian, Beiwen and Wang, Zhen and Wang, Rui and Sheng, Kehua and Zhang, Bo and Zhao, Hao and Zhou, Guyue



Research problem: How to translate the success of vision-language pretraining to semantic segmentation.
Motivation: Existing vision-language models understand semantics accurately but perform poorly at fine shape delineation and dense prediction.
Method: Drawing on classical spectral methods from the image segmentation literature, the eigenvectors of Laplacian matrices constructed from self-supervised pixel-wise features are leveraged to promote shape awareness.
Results: The method sets new state-of-the-art zero-shot semantic segmentation performance on both Pascal and COCO, with significant margins.

Thanks to the impressive progress of large-scale vision-language pretraining, recent recognition models can classify arbitrary objects in a zero-shot and open-set manner, with a surprisingly high accuracy. However, translating this success to semantic segmentation is not trivial, because this dense prediction task requires not only accurate semantic understanding but also fine shape delineation, and existing vision-language models are trained with image-level language descriptions. To bridge this gap, we pursue shape-aware zero-shot semantic segmentation in this study. Inspired by classical spectral methods in the image segmentation literature, we propose to leverage the eigenvectors of Laplacian matrices constructed with self-supervised pixel-wise features to promote shape-awareness. Although this simple and effective technique does not make use of the masks of seen classes at all, we demonstrate that it outperforms a state-of-the-art shape-aware formulation that aligns ground truth and predicted edges during training. We also delve into the performance gains achieved on different datasets using different backbones and draw several interesting and conclusive observations: the benefits of promoting shape-awareness relate closely to mask compactness and language embedding locality. Finally, our method sets new state-of-the-art performance for zero-shot semantic segmentation on both Pascal and COCO, with significant margins. Code and models will be available at https://github.com/Liuxinyv/SAZS.
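
A minimal version of the spectral step the abstract describes (eigenvectors of a Laplacian built from pixel-wise features) might look like this; the cosine affinity and the normalized-Laplacian variant are assumptions for illustration, not necessarily the paper's exact construction:

```python
import numpy as np

def shape_eigenvectors(feats, k=3):
    """Illustrative spectral step: build an affinity matrix from
    self-supervised per-pixel features, form the normalized graph
    Laplacian, and return its k smallest-eigenvalue eigenvectors,
    whose sign structure tends to follow object shapes."""
    n = feats.shape[0]                                 # n pixels, one feature vector each
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    W = np.clip(f @ f.T, 0, None)                      # cosine affinities, negatives cut
    d = W.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt        # normalized Laplacian
    vals, vecs = np.linalg.eigh(L)                     # eigenvalues in ascending order
    return vecs[:, :k]
```

On a two-cluster feature set, the second eigenvector (the Fiedler vector) takes opposite signs on the two clusters, which is what makes these eigenvectors useful as shape priors.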

Annealing-Based Label-Transfer Learning for Open World Object Detection
Ma, Yuqing and Li, Hainan and Zhang, Zhange and Guo, Jinyang and Zhang, Shanghang and Gong, Ruihao and Liu, Xianglong



Research problem: How to improve open world object detection (OWOD), especially the recognition of unknown objects.
Motivation: Existing OWOD methods select unknown objects manually and, lacking appropriate priors, suffer from uncertainty.
Method: An annealing-based label-transfer framework is proposed: unknown traits propagated to known proposals through convolutional operations are exploited, and a Label-Transfer Learning paradigm together with a Sawtooth Annealing Scheduling strategy rebuilds the decision boundaries of the known and unknown classes, promoting both known and unknown recognition.
Results: On commonly used benchmarks, the method achieves superior detection performance (a 200% unknown mAP improvement with even higher known detection performance) and is the first OWOD method that requires no manual unknown selection.

Open world object detection (OWOD) has attracted extensive attention due to its practicability in the real world. Previous OWOD works manually designed unknown-discover strategies to select unknown proposals from the background, suffering from uncertainties without appropriate priors. In this paper, we claim the learning of object detection could be seen as an object-level feature-entanglement process, where unknown traits are propagated to the known proposals through convolutional operations and could be distilled to benefit unknown recognition without manual selection. Therefore, we propose a simple yet effective Annealing-based Label-Transfer framework, which sufficiently explores the known proposals to alleviate the uncertainties. Specifically, a Label-Transfer Learning paradigm is introduced to decouple the known and unknown features, while a Sawtooth Annealing Scheduling strategy is further employed to rebuild the decision boundaries of the known and unknown classes, thus promoting both known and unknown recognition. Moreover, previous OWOD works neglected the trade-off of known and unknown performance, and we thus introduce a metric called Equilibrium Index to comprehensively evaluate the effectiveness of the OWOD models. To the best of our knowledge, this is the first OWOD work without manual unknown selection. Extensive experiments conducted on the commonly used benchmark validate that our model achieves superior detection performance (200% unknown mAP improvement with even higher known detection performance) compared to other state-of-the-art methods. Our code is available at https://github.com/DIG-Beihang/ALLOW.git.

DeGPR: Deep Guided Posterior Regularization for Multi-Class Cell Detection and Counting
Tyagi, Aayush Kumar and Mohapatra, Chirag and Das, Prasenjit and Makharia, Govind and Mehra, Lalita and AP, Prathosh and Mausam



Research problem: How to detect and count cells in medical images more accurately, particularly under limited data, overlapping objects, multiple cell types, severe class imbalance, and minute differences in cell size/shape.
Motivation: Manual counting is tedious and can lead to inter-observer variation among pathologists, and existing deep-learning-based object detection and counting methods may not transfer directly to cell detection and counting in medical images.
Method: Guided posterior regularization (DeGPR) is proposed, which assists an object detector by guiding it to exploit discriminative features among cells; these features may be provided by pathologists or inferred directly from visual data.
Results: Validated on two publicly available datasets (CoNSeP and MoNuSAC) and on MuCeD, a newly contributed dataset of 55 human duodenum biopsy images for predicting celiac disease. Extensive experiments with three object detection baselines on the three datasets show that DeGPR is model-agnostic and consistently improves the baselines, with up to 9% (absolute) mAP gains.

Multi-class cell detection and counting is an essential task for many pathological diagnoses. Manual counting is tedious and often leads to inter-observer variations among pathologists. While there exist multiple, general-purpose, deep learning-based object detection and counting methods, they may not readily transfer to detecting and counting cells in medical images, due to the limited data, presence of tiny overlapping objects, multiple cell types, severe class-imbalance, minute differences in size/shape of cells, etc. In response, we propose guided posterior regularization DeGPR, which assists an object detector by guiding it to exploit discriminative features among cells. The features may be pathologist-provided or inferred directly from visual data. We validate our model on two publicly available datasets (CoNSeP and MoNuSAC), and on MuCeD, a novel dataset that we contribute. MuCeD consists of 55 biopsy images of the human duodenum for predicting celiac disease. We perform extensive experimentation with three object detection baselines on three datasets to show that DeGPR is model-agnostic, and consistently improves baselines obtaining up to 9% (absolute) mAP gains.

itKD: Interchange Transfer-Based Knowledge Distillation for 3D Object Detection
Cho, Hyeon and Choi, Junyong and Baek, Geonwoo and Hwang, Wonjun



Research problem: Most research on point-cloud-based 3D object detectors focuses solely on developing network architectures to improve accuracy, without considering computational efficiency.
Motivation: To address this, this paper proposes an autoencoder-style framework with channel-wise compression and decompression via interchange transfer-based knowledge distillation.
Method: To learn the map-view features of the teacher network, the teacher and student features are first passed independently through a shared autoencoder, and a compressed-representation loss binds the channel-wise compression knowledge of both networks as a form of regularization. The decompressed features are then transferred in opposite directions to reduce the gap in the interchange reconstructions. Finally, a head attention loss is proposed to match the 3D object detection information extracted by the multi-head self-attention mechanism.
Results: Extensive experiments verify that the method trains lightweight models that align well with the 3D point cloud detection task, and its superiority is demonstrated on the well-known public datasets Waymo and nuScenes.

Point-cloud based 3D object detectors recently have achieved remarkable progress. However, most studies are limited to the development of network architectures for improving only their accuracy without consideration of the computational efficiency. In this paper, we first propose an autoencoder-style framework comprising channel-wise compression and decompression via interchange transfer-based knowledge distillation. To learn the map-view feature of a teacher network, the features from teacher and student networks are independently passed through the shared autoencoder; here, we use a compressed representation loss that binds the channel-wise compression knowledge from both student and teacher networks as a kind of regularization. The decompressed features are transferred in opposite directions to reduce the gap in the interchange reconstructions. Lastly, we present a head attention loss to match the 3D object detection information drawn by the multi-head self-attention mechanism. Through extensive experiments, we verify that our method can train a lightweight model that is well aligned with the 3D point cloud detection task, and we demonstrate its superiority using well-known public datasets, e.g., Waymo and nuScenes.

2PCNet: Two-Phase Consistency Training for Day-to-Night Unsupervised Domain Adaptive Object Detection
Kennerley, Mikhail and Wang, Jian-Gang and Veeravalli, Bharadwaj and Tan, Robby T.



Research problem: Object detection at night is a challenging problem due to the absence of night image annotations.
Motivation: Despite several existing domain adaptation methods, high-precision results remain elusive, particularly because of error propagation on small-scale and low-light objects.
Method: This paper proposes 2PCNet, a two-phase consistency unsupervised domain adaptation network: in the first phase, the teacher's high-confidence bounding-box predictions are appended to the student's region proposals, which the teacher re-evaluates in the second phase, producing a combination of high- and low-confidence pseudo-labels. To address errors arising from low-light regions and other night-related attributes, a night-specific augmentation pipeline named NightAug is also proposed.
Results: Experiments show the method outperforms state-of-the-art methods on publicly available datasets by 20%, and also outperforms supervised models trained directly on the target data.

Object detection at night is a challenging problem due to the absence of night image annotations. Despite several domain adaptation methods, achieving high-precision results remains an issue. False-positive error propagation is still observed in methods using the well-established student-teacher framework, particularly for small-scale and low-light objects. This paper proposes a two-phase consistency unsupervised domain adaptation network, 2PCNet, to address these issues. The network employs high-confidence bounding-box predictions from the teacher in the first phase and appends them to the student's region proposals for the teacher to re-evaluate in the second phase, resulting in a combination of high and low confidence pseudo-labels. The night images and pseudo-labels are scaled-down before being used as input to the student, providing stronger small-scale pseudo-labels. To address errors that arise from low-light regions and other night-related attributes in images, we propose a night-specific augmentation pipeline called NightAug. This pipeline involves applying random augmentations, such as glare, blur, and noise, to daytime images. Experiments on publicly available datasets demonstrate that our method achieves superior results to state-of-the-art methods by 20%, and to supervised models trained directly on the target data.

Generating Features With Increased Crop-Related Diversity for Few-Shot Object Detection
Xu, Jingyi and Le, Hieu and Samaras, Dimitris



Research problem: Two-stage object detectors generate object proposals and classify them to detect objects, but these proposals rarely contain the objects perfectly; they overlap with them in many possible ways, exhibiting great variability in proposal difficulty.
Motivation: Training a classifier robust to this crop-related variability requires abundant training data, which is unavailable in few-shot settings. To address this, a novel variational autoencoder (VAE)-based data generation model is proposed that can generate data with increased crop-related diversity.
Method: The main idea is to transform the latent space so that latent codes with different norms represent different crop-related variations, which allows features with increased crop-related difficulty to be generated simply by varying the latent norm. Specifically, each latent code is rescaled so that its norm correlates linearly with the IoU score of the input crop w.r.t. the ground-truth box, where the IoU score serves as a proxy for the crop's difficulty level. The VAE is trained on base classes, and the trained model is then used to generate features for novel classes.
Results: Experimental results show that the generated features consistently improve state-of-the-art few-shot object detection methods on the PASCAL VOC and MS COCO datasets.

Two-stage object detectors generate object proposals and classify them to detect objects in images. These proposals often do not perfectly contain the objects but overlap with them in many possible ways, exhibiting great variability in the difficulty levels of the proposals. Training a robust classifier against this crop-related variability requires abundant training data, which is not available in few-shot settings. To mitigate this issue, we propose a novel variational autoencoder (VAE) based data generation model, which is capable of generating data with increased crop-related diversity. The main idea is to transform the latent space such that latent codes with different norms represent different crop-related variations. This allows us to generate features with increased crop-related diversity in difficulty levels by simply varying the latent norm. In particular, each latent code is rescaled such that its norm linearly correlates with the IoU score of the input crop w.r.t. the ground-truth box. Here the IoU score is a proxy that represents the difficulty level of the crop. We train this VAE model on base classes conditioned on the semantic code of each class and then use the trained model to generate features for novel classes. Our experimental results show that our generated features consistently improve state-of-the-art few-shot object detection methods on the PASCAL VOC and MS COCO datasets.
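
The rescaling step described above can be sketched in a few lines; the norm range here is a made-up choice, since the abstract only specifies that the norm should correlate linearly with the crop's IoU:

```python
import numpy as np

def rescale_latent(z, iou, norm_min=1.0, norm_max=5.0):
    """Illustrative version of the latent-space transform: rescale a latent
    code so its norm maps linearly from the crop's IoU with the ground-truth
    box (the paper's proxy for crop difficulty). The norm range [norm_min,
    norm_max] is an assumption for illustration."""
    target = norm_min + (norm_max - norm_min) * iou   # target norm, linear in IoU
    return z * (target / np.linalg.norm(z))
```

At generation time, sampling latent codes at smaller norms then yields features that mimic harder (low-IoU) crops, which is how varying the norm controls difficulty diversity.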

The Devil Is in the Points: Weakly Semi-Supervised Instance Segmentation via Point-Guided Mask Representation
Kim, Beomyoung and Jeong, Joonhyun and Han, Dongyoon and Hwang, Sung Ju



Research problem: How to achieve high-performance instance segmentation via weakly supervised learning under a limited annotation budget.
Motivation: The main challenge of semi-supervised approaches is the trade-off between false-positive and false-negative instance proposals; weakly supervised learning can effectively leverage budget-friendly point labels as a powerful weak supervision source to resolve this challenge.
Method: A novel learning scheme named weakly semi-supervised instance segmentation (WSSIS) is proposed, considering a dataset setting with a few fully labeled images and many point-labeled images. To handle the hard case where fully labeled data is extremely limited, a MaskRefineNet that refines the noise in rough masks is also proposed.
Results: Extensive experiments on the COCO and BDD100K datasets show that the proposed method achieves results comparable to the fully supervised model even with 50% of the fully labeled COCO data (38.8% vs. 39.7%). With as little as 5% of the fully labeled COCO data, it clearly outperforms the state-of-the-art semi-supervised learning method (33.7% vs. 24.9%). Code is available at https://github.com/clovaai/PointWSSIS.

In this paper, we introduce a novel learning scheme named weakly semi-supervised instance segmentation (WSSIS) with point labels for budget-efficient and high-performance instance segmentation. Namely, we consider a dataset setting consisting of a few fully-labeled images and a lot of point-labeled images. Since the main challenge of semi-supervised approaches derives from the trade-off between false-negative and false-positive instance proposals, we propose a method for WSSIS that can effectively leverage the budget-friendly point labels as a powerful weak supervision source to resolve the challenge. Furthermore, to deal with the hard case where the amount of fully-labeled data is extremely limited, we propose a MaskRefineNet that refines noise in rough masks. We conduct extensive experiments on COCO and BDD100K datasets, and the proposed method achieves promising results comparable to those of the fully-supervised model, even with 50% of the fully labeled COCO data (38.8% vs. 39.7%). Moreover, when using as little as 5% of fully labeled COCO data, our method shows significantly superior performance over the state-of-the-art semi-supervised learning method (33.7% vs. 24.9%). The code is available at https://github.com/clovaai/PointWSSIS.

DynaMask: Dynamic Mask Selection for Instance Segmentation
Li, Ruihuang and He, Chenhang and Li, Shuai and Zhang, Yabin and Zhang, Lei



Research problem: How to segment different object instances effectively, given that low-resolution masks lose rich detail while high-resolution masks incur quadratic computation overhead.
Motivation: Representative instance segmentation methods mostly use a fixed-resolution mask (e.g., a 28x28 grid) for all instances, which makes it hard to preserve detail while controlling computational cost.
Method: This paper proposes dynamically selecting suitable masks for different object proposals. First, a dual-level Feature Pyramid Network (FPN) with adaptive feature aggregation is developed to gradually increase the mask grid resolution, ensuring high-quality object segmentation. Second, to alleviate the increased computation and memory cost of large masks, a Mask Switch Module (MSM) with negligible computational cost selects the most suitable mask resolution for each instance, achieving high efficiency while maintaining high segmentation accuracy.
Results: The resulting method, DynaMask, brings consistent and noticeable performance improvements over other state-of-the-art methods at a moderate computational overhead, without bells and whistles.

The representative instance segmentation methods mostly segment different object instances with a mask of a fixed resolution, e.g., a 28x28 grid. However, a low-resolution mask loses rich details, while a high-resolution mask incurs quadratic computation overhead. It is a challenging task to predict the optimal binary mask for each instance. In this paper, we propose to dynamically select suitable masks for different object proposals. First, a dual-level Feature Pyramid Network (FPN) with adaptive feature aggregation is developed to gradually increase the mask grid resolution, ensuring high-quality segmentation of objects. Specifically, an efficient region-level top-down path (r-FPN) is introduced to incorporate complementary contextual and detailed information from different stages of image-level FPN (i-FPN). Then, to alleviate the increase of computation and memory costs caused by using large masks, we develop a Mask Switch Module (MSM) with negligible computational cost to select the most suitable mask resolution for each instance, achieving high efficiency while maintaining high segmentation accuracy. Without bells and whistles, the proposed method, namely DynaMask, brings consistent and noticeable performance improvements over other state-of-the-arts at a moderate computation overhead. The source code: https://github.com/lslrh/DynaMask.

SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency
Liu, Yang and Zhang, Yao and Wang, Yixin and Zhang, Yang and Tian, Jiang and Shi, Zhongchao and Fan, Jianping and He, Zhiqiang



Research problem: How to improve the convergence speed and performance of Transformer-based object detectors.
Motivation: Current DETR-based methods accelerate convergence with a central-concept spatial prior, but centralizing reference points may deteriorate query saliency and confuse the detector.
Method: A new method, SAP-DETR, is proposed that treats object detection as a transformation from salient points to instance objects. SAP-DETR explicitly initializes a query-specific reference point for each object query, gradually aggregates the queries into instance objects, and then predicts the distance from each side of the bounding box to these points.
Results: Experiments show that SAP-DETR converges markedly faster with competitive performance. Under the standard training scheme, SAP-DETR stably improves on state-of-the-art approaches by 1.0 AP; based on ResNet-DC-101, it achieves 46.9 AP.

Recently, the dominant DETR-based approaches apply a central-concept spatial prior to accelerate Transformer detector convergence. These methods gradually refine the reference points to the center of target objects and imbue object queries with the updated central reference information for spatially conditional attention. However, centralizing reference points may severely deteriorate queries' saliency and confuse detectors due to the indiscriminative spatial prior. To bridge the gap between the reference points of salient queries and Transformer detectors, we propose SAlient Point-based DETR (SAP-DETR) by treating object detection as a transformation from salient points to instance objects. In SAP-DETR, we explicitly initialize a query-specific reference point for each object query, gradually aggregate them into an instance object, and then predict the distance from each side of the bounding box to these points. By rapidly attending to the query-specific reference region and other conditional extreme regions from the image features, SAP-DETR can effectively bridge the gap between the salient point and the query-based Transformer detector with significantly faster convergence. Our extensive experiments have demonstrated that SAP-DETR achieves 1.4 times faster convergence with competitive performance. Under the standard training scheme, SAP-DETR stably promotes the SOTA approaches by 1.0 AP. Based on ResNet-DC-101, SAP-DETR achieves 46.9 AP. The code will be released at https://github.com/liuyang-ict/SAP-DETR.

DeSTSeg: Segmentation Guided Denoising Student-Teacher for Anomaly Detection
Zhang, Xuan and Li, Shiyu and Li, Xi and Huang, Ping and Shan, Jiulong and Chen, Ting



Research problem: How to improve visual anomaly detection by combining a pre-trained teacher network, a denoising student encoder-decoder, and a segmentation network.
Motivation: Existing student-teacher anomaly detection methods only empirically apply constraints on normal data and fuse multi-level information, which limits their effectiveness.
Method: An improved model, DeSTSeg, integrates a pre-trained teacher network, a denoising student encoder-decoder, and a segmentation network into one framework. First, a denoising procedure is introduced to strengthen the constraints on anomalous data; second, a segmentation network trained with rich supervision from synthetic anomaly masks adaptively fuses the multi-level student-teacher features.
Results: Experiments on the industrial inspection benchmark dataset show state-of-the-art performance: 98.6% image-level AUC, 75.8% pixel-level average precision, and 76.4% instance-level average precision.

Visual anomaly detection, an important problem in computer vision, is usually formulated as a one-class classification and segmentation task. The student-teacher (S-T) framework has proved to be effective in solving this challenge. However, previous works based on S-T only empirically applied constraints on normal data and fused multi-level information. In this study, we propose an improved model called DeSTSeg, which integrates a pre-trained teacher network, a denoising student encoder-decoder, and a segmentation network into one framework. First, to strengthen the constraints on anomalous data, we introduce a denoising procedure that allows the student network to learn more robust representations. From synthetically corrupted normal images, we train the student network to match the teacher network feature of the same images without corruption. Second, to fuse the multi-level S-T features adaptively, we train a segmentation network with rich supervision from synthetic anomaly masks, achieving a substantial performance improvement. Experiments on the industrial inspection benchmark dataset demonstrate that our method achieves state-of-the-art performance, 98.6% on image-level AUC, 75.8% on pixel-level average precision, and 76.4% on instance-level average precision.

PIP-Net: Patch-Based Intuitive Prototypes for Interpretable Image Classification
Nauta, Meike and Schlötterer, Jörg and van Keulen, Maurice and Seifert, Christin



Research problem: Existing prototype-based image interpretation models do not align well with human visual perception; the same prototype can refer to different real-world concepts, making interpretations unintuitive.
Motivation: To address this, PIP-Net (Patch-based Intuitive Prototypes Network) is proposed: an interpretable image classification model that learns prototypical parts in a self-supervised fashion that correlate better with human vision.
Method: PIP-Net can be interpreted as a sparse scoring sheet in which the presence of a prototypical part in an image adds evidence for a class. For out-of-distribution data, the model can abstain from making a decision.
Results: The learned prototypes correlate with ground-truth object parts, indicating that PIP-Net closes the "semantic gap" between latent space and pixel space. PIP-Net and its interpretable prototypes thus let users understand the decision-making process in an intuitive, faithful, and semantically meaningful way.

Interpretable methods based on prototypical patches recognize various components in an image in order to explain their reasoning to humans. However, existing prototype-based methods can learn prototypes that are not in line with human visual perception, i.e., the same prototype can refer to different concepts in the real world, making interpretation not intuitive. Driven by the principle of explainability-by-design, we introduce PIP-Net (Patch-based Intuitive Prototypes Network): an interpretable image classification model that learns prototypical parts in a self-supervised fashion which correlate better with human vision. PIP-Net can be interpreted as a sparse scoring sheet where the presence of a prototypical part in an image adds evidence for a class. The model can also abstain from a decision for out-of-distribution data by saying "I haven't seen this before". We only use image-level labels and do not rely on any part annotations. PIP-Net is globally interpretable since the set of learned prototypes shows the entire reasoning of the model. A smaller local explanation locates the relevant prototypes in one image. We show that our prototypes correlate with ground-truth object parts, indicating that PIP-Net closes the "semantic gap" between latent space and pixel space. Hence, our PIP-Net with interpretable prototypes enables users to interpret the decision making process in an intuitive, faithful and semantically meaningful way. Code is available at https://github.com/M-Nauta/PIPNet.

PROB: Probabilistic Objectness for Open World Object Detection
Zohar, Orr and Wang, Kuan-Chieh and Yeung, Serena



Research problem: How to detect objects effectively in the open world, particularly unknown objects.
Motivation: Traditional object detection methods cannot handle unknown objects in the open world, because there is no supervision for distinguishing unknown objects from the background.
Method: A novel probabilistic framework for objectness estimation is proposed, alternating between probability distribution estimation and objectness likelihood maximization of known objects in the embedded feature space, so as to estimate the objectness probability of different proposals.
Results: On open world object detection benchmarks, the method outperforms all existing OWOD methods in both unknown and known object detection.

Open World Object Detection (OWOD) is a new and challenging computer vision task that bridges the gap between classic object detection (OD) benchmarks and object detection in the real world. In addition to detecting and classifying seen/labeled objects, OWOD algorithms are expected to detect novel/unknown objects - which can be classified and incrementally learned. In standard OD, object proposals not overlapping with a labeled object are automatically classified as background. Therefore, simply applying OD methods to OWOD fails as unknown objects would be predicted as background. The challenge of detecting unknown objects stems from the lack of supervision in distinguishing unknown objects and background object proposals. Previous OWOD methods have attempted to overcome this issue by generating supervision using pseudo-labeling - however, unknown object detection has remained low. Probabilistic/generative models may provide a solution for this challenge. Herein, we introduce a novel probabilistic framework for objectness estimation, where we alternate between probability distribution estimation and objectness likelihood maximization of known objects in the embedded feature space - ultimately allowing us to estimate the objectness probability of different proposals. The resulting Probabilistic Objectness transformer-based open-world detector, PROB, integrates our framework into traditional object detection models, adapting them for the open-world setting. Comprehensive experiments on OWOD benchmarks show that PROB outperforms all existing OWOD methods in both unknown object detection ( 2x unknown recall) and known object detection ( mAP). Our code is available at https://github.com/orrzohar/PROB.

AUNet: Learning Relations Between Action Units for Face Forgery Detection
Bai, Weiming and Liu, Yufan and Zhang, Zhipeng and Li, Bing and Hu, Weiming



Research problem: Face forgery detection is increasingly important due to the serious security issues caused by face manipulation techniques.
Motivation: Although existing deep learning methods perform well when the training and testing forgeries come from the same domain, generalizing detectors to forgery methods unseen during training remains challenging.
Method: An Action Units Relation Learning framework is proposed to improve the generality of forgery detection. Specifically, it consists of the Action Units Relation Transformer (ART) and the Tampered AU Prediction (TAP). ART constructs the relations between different action units with an AU-agnostic branch and an AU-specific branch that complement each other and work together to exploit forgery clues. In TAP, AU-related regions are tampered at the image level and challenging pseudo samples are developed at the feature level; the model is then trained to predict the tampered AU regions with the generated location-specific supervision.
Results: Experimental results show the method achieves state-of-the-art performance in both in-dataset and cross-dataset evaluations.

Face forgery detection becomes increasingly crucial due to the serious security issues caused by face manipulation techniques. Recent studies in deepfake detection have yielded promising results when the training and testing face forgeries are from the same domain. However, the problem remains challenging when one tries to generalize the detector to forgeries created by unseen methods during training. Observing that face manipulation may alter the relation between different facial action units (AU), we propose the Action Units Relation Learning framework to improve the generality of forgery detection. Specifically, it consists of the Action Units Relation Transformer (ART) and the Tampered AU Prediction (TAP). The ART constructs the relation between different AUs with an AU-agnostic Branch and an AU-specific Branch, which complement each other and work together to exploit forgery clues. In the Tampered AU Prediction, we tamper AU-related regions at the image level and develop challenging pseudo samples at the feature level. The model is then trained to predict the tampered AU regions with the generated location-specific supervision. Experimental results demonstrate that our method can achieve state-of-the-art performance in both the in-dataset and cross-dataset evaluations.

PolyFormer: Referring Image Segmentation As Sequential Polygon Generation
Liu, Jiang and Ding, Hui and Cai, Zhaowei and Zhang, Yuting and Satzoda, Ravi Kumar and Mahadevan, Vijay and Manmatha, R.



Research problem: This paper addresses referring image segmentation by reformulating it as sequential polygon generation, predicted with a new sequence-to-sequence framework, the Polygon Transformer.
Motivation: Directly predicting pixel-level segmentation masks has accuracy and efficiency limitations, motivating the reformulation of the segmentation problem as sequential polygon generation.
Method: A new sequence-to-sequence framework, the Polygon Transformer (PolyFormer), takes a sequence of image patches and text query tokens as input and autoregressively outputs a sequence of polygon vertices. For more accurate geometric localization, a regression-based decoder is proposed that directly predicts precise floating-point coordinates, without any coordinate quantization error.
Results: Experimental results show that PolyFormer outperforms the prior art by a clear margin on the challenging RefCOCO+ and RefCOCOg datasets, with absolute improvements of 5.40% and 4.52%, respectively. It also shows strong generalization when evaluated on referring video segmentation without fine-tuning, achieving a competitive 61.5% J&F on the Ref-DAVIS17 dataset.

In this work, instead of directly predicting the pixel-level segmentation masks, the problem of referring image segmentation is formulated as sequential polygon generation, and the predicted polygons can be later converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens as input, and outputs a sequence of polygon vertices autoregressively. For more accurate geometric localization, we propose a regression-based decoder, which predicts the precise floating-point coordinates directly, without any coordinate quantization error. In the experiments, PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52% absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets. It also shows strong generalization ability when evaluated on the referring video segmentation task without fine-tuning, e.g., achieving competitive 61.5% J&F on the Ref-DAVIS17 dataset.
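
The final polygon-to-mask conversion mentioned above is a generic rasterization step; a minimal even-odd-rule implementation (not necessarily what PolyFormer uses internally) looks like:

```python
import numpy as np

def polygon_to_mask(vertices, h, w):
    """Generic polygon rasterization (even-odd rule), illustrating how a
    predicted vertex sequence can be turned into a binary segmentation mask.
    `vertices` is a list of (x, y) floating-point polygon vertices."""
    ys, xs = np.mgrid[0:h, 0:w]
    px, py = xs + 0.5, ys + 0.5              # sample at pixel centers
    inside = np.zeros((h, w), dtype=bool)
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        crosses = (y1 <= py) != (y2 <= py)   # edge straddles the scanline
        with np.errstate(divide="ignore", invalid="ignore"):
            x_at = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
        # toggle parity where the edge crosses to the right of the pixel center
        inside ^= crosses & (px < x_at)
    return inside.astype(np.uint8)
```

Because the vertices are floating point, this step preserves the decoder's quantization-free coordinates until the very last rasterization.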

Interactive Segmentation As Gaussian Process Classification
Zhou, Minghao and Wang, Hong and Zhao, Qian and Li, Yuexiang and Huang, Yawen and Meng, Deyu and Zheng, Yefeng



Research problem: This paper addresses the problem that existing deep-learning-based click-based interactive segmentation (IS) methods do not fully and explicitly utilize and propagate the click information.
Motivation: Although existing deep learning methods achieve decent results on click-based interactive segmentation, they do not fully and explicitly exploit and propagate the click information, leading to unsatisfactory segmentation results, even at clicked points.
Method: The click-based IS task is formulated as a Gaussian process (GP)-based pixel-wise binary classification model on each image. To solve this model, amortized variational inference is used to approximate the intractable GP posterior in a data-driven manner, and the approximated posterior is then decoupled into double-space forms for efficient sampling with linear complexity. On top of this, a GP classification framework named GPCIS is constructed, integrated with the deep kernel learning mechanism for more flexibility.
Results: Comprehensive experiments on several benchmarks, with quantitative and qualitative comparisons to representative methods, demonstrate the merits of GPCIS as well as its good generality and high efficiency.

Click-based interactive segmentation (IS) aims to extract the target objects under user interaction. For this task, most of the current deep learning (DL)-based methods mainly follow the general pipelines of semantic segmentation. Albeit achieving promising performance, they do not fully and explicitly utilize and propagate the click information, inevitably leading to unsatisfactory segmentation results, even at clicked points. Against this issue, in this paper, we propose to formulate the IS task as a Gaussian process (GP)-based pixel-wise binary classification model on each image. To solve this model, we utilize amortized variational inference to approximate the intractable GP posterior in a data-driven manner and then decouple the approximated GP posterior into double space forms for efficient sampling with linear complexity. Then, we correspondingly construct a GP classification framework, named GPCIS, which is integrated with the deep kernel learning mechanism for more flexibility. The main specificities of the proposed GPCIS lie in: 1) Under the explicit guidance of the derived GP posterior, the information contained in clicks can be finely propagated to the entire image and then boost the segmentation; 2) The accuracy of predictions at clicks has good theoretical support. These merits of GPCIS as well as its good generality and high efficiency are substantiated by comprehensive experiments on several benchmarks, as compared with representative methods both quantitatively and qualitatively. Codes will be released at https://github.com/zmhhmz/GPCIS_CVPR2023.

Efficient Mask Correction for Click-Based Interactive Image Segmentation
Du, Fei and Yuan, Jianlong and Wang, Zhibin and Wang, Fan



Research problem: How to perform interactive image segmentation effectively via clicks.
Motivation: Existing click-based interactive image segmentation methods run the whole segmentation network after every click, which is inefficient.
Method: An efficient method is proposed that corrects the mask with a lightweight mask correction network. A click-guided self-attention module and a click-guided correlation module are also introduced to effectively exploit the click information and boost performance.
Results: The new method outperforms existing methods in both performance and efficiency.

The goal of click-based interactive image segmentation is to extract target masks with the input of positive/negative clicks. Every time a new click is placed, existing methods run the whole segmentation network to obtain a corrected mask, which is inefficient since several clicks may be needed to reach satisfactory accuracy. To this end, we propose an efficient method to correct the mask with a lightweight mask correction network. The whole network maintains a low computational cost from the second click, even if we have a large backbone. However, a simple correction network with limited capacity is not likely to achieve comparable performance with a classic segmentation network. Thus, we propose a click-guided self-attention module and a click-guided correlation module to effectively exploit the click information to boost performance. First, several templates are selected based on the semantic similarity with click features. Then the self-attention module propagates the template information to other pixels, while the correlation module directly uses the templates to obtain target outlines. With the efficient architecture and two click-guided modules, our method shows preferable performance and efficiency compared to existing methods. The code will be released at https://github.com/feiaxyt/EMC-Click.

Ambiguity-Resistant Semi-Supervised Learning for Dense Object Detection
Liu, Chang and Zhang, Weiming and Lin, Xiangru and Zhang, Wei and Tan, Xiao and Han, Junyu and Li, Xiaomao and Ding, Errui and Wang, Jingdong



Research problem: With basic semi-supervised object detection techniques, one-stage detectors obtain limited gains compared with two-stage detectors.
Motivation: Experiments show that this stems from two kinds of ambiguity: selection ambiguity and assignment ambiguity.
Method: To tackle these problems, an Ambiguity-Resistant Semi-supervised Learning (ARSL) method is proposed. Specifically, to alleviate selection ambiguity, Joint-Confidence Estimation (JCE) is proposed to jointly quantify the classification and localization quality of pseudo-labels; for assignment ambiguity, Task-Separation Assignment (TSA) is introduced to assign labels based on pixel-level predictions rather than unreliable pseudo boxes.
Results: Comprehensive experiments demonstrate that ARSL effectively mitigates the ambiguities and achieves state-of-the-art semi-supervised object detection performance on MS COCO and PASCAL VOC.

With basic Semi-Supervised Object Detection (SSOD) techniques, one-stage detectors generally obtain limited gains compared with two-stage detectors. We experimentally find that the root lies in two kinds of ambiguities: (1) Selection ambiguity, where selected pseudo labels are less accurate, since classification scores cannot properly represent the localization quality. (2) Assignment ambiguity, where samples are matched with improper labels in pseudo-label assignment, as the strategy is misguided by missed objects and inaccurate pseudo boxes. To tackle these problems, we propose an Ambiguity-Resistant Semi-supervised Learning (ARSL) method for one-stage detectors. Specifically, to alleviate the selection ambiguity, Joint-Confidence Estimation (JCE) is proposed to jointly quantify the classification and localization quality of pseudo labels. As for the assignment ambiguity, Task-Separation Assignment (TSA) is introduced to assign labels based on pixel-level predictions rather than unreliable pseudo boxes. It employs a 'divide-and-conquer' strategy and separately exploits positives for the classification and localization tasks, which is more robust to the assignment ambiguity. Comprehensive experiments demonstrate that ARSL effectively mitigates the ambiguities and achieves state-of-the-art SSOD performance on MS COCO and PASCAL VOC. Codes can be found at https://github.com/PaddlePaddle/PaddleDetection.
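
The joint-confidence selection that JCE motivates can be illustrated with a toy sketch; the multiplicative fusion and the threshold value are assumptions, since the abstract does not give the exact formula:

```python
def joint_confidence(cls_score, loc_quality):
    """Sketch of the joint-confidence idea behind JCE: score a pseudo label
    by combining classification probability with a predicted localization
    quality (e.g. an IoU estimate), instead of the classification score
    alone. The exact fusion used in ARSL may differ."""
    return cls_score * loc_quality

def select_pseudo_labels(preds, thresh=0.5):
    """Keep predictions whose joint confidence clears the threshold."""
    return [p for p in preds if joint_confidence(p["cls"], p["loc"]) >= thresh]
```

A confidently classified but poorly localized box then gets filtered out, which is exactly the selection-ambiguity failure mode the paper targets.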

SFD2: Semantic-Guided Feature Detection and Description
Xue, Fei and Budvytis, Ignas and Cipolla, Roberto



Research question: How to improve the efficiency and accuracy of visual localization, especially under challenging conditions in large-scale environments.
Motivation: Existing methods mainly rely on extracting large numbers of often redundant, locally reliable features, which limits efficiency and accuracy, especially in large-scale environments under challenging conditions.
Method: We propose to extract globally reliable features by implicitly embedding high-level semantics into the detection and description processes. Specifically, our semantic-aware detector detects keypoints from reliable regions (e.g., buildings, traffic lanes) and implicitly suppresses unreliable regions (e.g., sky, cars), without relying on explicit semantic labels.
Results: Experiments show that our model outperforms previous local features on the long-term, large-scale Aachen Day-Night and RobotCar-Seasons visual localization datasets, and is about 2x and 3x faster than advanced matchers when using 2k and 4k keypoints respectively, while retaining competitive accuracy.

Visual localization is a fundamental task for various applications including autonomous driving and robotics. Prior methods focus on extracting large amounts of often redundant locally reliable features, resulting in limited efficiency and accuracy, especially in large-scale environments under challenging conditions. Instead, we propose to extract globally reliable features by implicitly embedding high-level semantics into both the detection and description processes. Specifically, our semantic-aware detector is able to detect keypoints from reliable regions (e.g. building, traffic lane) and suppress unreliable areas (e.g. sky, car) implicitly instead of relying on explicit semantic labels. This boosts the accuracy of keypoint matching by reducing the number of features sensitive to appearance changes and avoiding the need for additional segmentation networks at test time. Moreover, our descriptors are augmented with semantics and have stronger discriminative ability, providing more inliers at test time. Particularly, experiments on long-term large-scale visual localization Aachen Day-Night and RobotCar-Seasons datasets demonstrate that our model outperforms previous local features and gives competitive accuracy to advanced matchers but is about 2 and 3 times faster when using 2k and 4k keypoints, respectively.

Semi-Supervised Stereo-Based 3D Object Detection via Cross-View Consensus
Wu, Wenhao and Wong, Hau-San and Wu, Si



Research question: How to exploit limited annotated data together with abundant unannotated data for stereo-based 3D object detection.
Motivation: Although stereo-based 3D object detection shows great potential for low-cost deployment, its strong performance relies on high-quality manual annotations, which are hard to obtain in practice.
Method: We propose a semi-supervised scheme that generates pseudo annotations from a temporal-aggregated teacher model, which accumulates knowledge from the student model. A cross-view disparity consistency constraint between teacher and student improves the stability and accuracy of depth estimation, and a cross-view agreement strategy reduces pseudo-annotation noise.
Results: Extensive experiments on the KITTI 3D dataset show that the method effectively leverages large amounts of unannotated stereo images and significantly improves detection results.

Stereo-based 3D object detection, which aims at detecting 3D objects with stereo cameras, shows great potential in low-cost deployment compared to LiDAR-based methods and excellent performance compared to monocular-based algorithms. However, the impressive performance of stereo-based 3D object detection comes at the huge cost of high-quality manual annotations, which are hardly attainable for any given scene. Semi-supervised learning, in which limited annotated data and numerous unannotated data are required to achieve a satisfactory model, is a promising method to address the problem of data deficiency. In this work, we propose to achieve semi-supervised learning for stereo-based 3D object detection through pseudo annotation generation from a temporal-aggregated teacher model, which temporally accumulates knowledge from a student model. To facilitate a more stable and accurate depth estimation, we introduce Temporal-Aggregation-Guided (TAG) disparity consistency, a cross-view disparity consistency constraint between the teacher model and the student model for robust and improved depth estimation. To mitigate noise in pseudo annotation generation, we propose a cross-view agreement strategy, in which pseudo annotations should attain a high degree of agreement between 3D and 2D views, as well as between binocular views. We perform extensive experiments on the KITTI 3D dataset to demonstrate our proposed method's capability in leveraging a huge amount of unannotated stereo images to attain significantly improved detection results.

SCPNet: Semantic Scene Completion on Point Cloud
Xia, Zhaoyang and Liu, Youquan and Li, Xin and Zhu, Xinge and Ma, Yuexin and Li, Yikang and Hou, Yuenan and Qiao, Yu



Research question: Training deep models for semantic scene completion is challenging due to the sparse and incomplete input, the large number of objects of diverse scales, and the inherent label noise of moving objects.
Motivation: To address these problems, we propose three solutions: 1) redesigning the completion network; 2) distilling rich knowledge from a multi-frame model; 3) rectifying the completion labels.
Method: We design a novel completion network composed of several multi-path blocks that aggregate multi-scale features, free of lossy downsampling operations. We also design a new knowledge distillation objective, Dense-to-Sparse Knowledge Distillation (DSKD), which transfers dense, relation-based semantic knowledge from a multi-frame teacher to a single-frame student, significantly improving the single-frame model's representation learning. In addition, we propose a simple yet effective label rectification strategy that uses off-the-shelf panoptic segmentation labels to remove the traces of dynamic objects from completion labels, greatly improving the performance of deep models, especially for moving objects.
Results: We conduct extensive experiments on two public semantic scene completion benchmarks, SemanticKITTI and SemanticPOSS. SCPNet ranks 1st on the SemanticKITTI semantic scene completion challenge, surpassing the competitive S3CNet by 7.2 mIoU, and also outperforms previous completion algorithms on SemanticPOSS. Moreover, our method achieves competitive results on SemanticKITTI semantic segmentation, showing that knowledge learned in scene completion benefits the segmentation task.

Training deep models for semantic scene completion is challenging due to the sparse and incomplete input, a large quantity of objects of diverse scales as well as the inherent label noise for moving objects. To address the above-mentioned problems, we propose the following three solutions: 1) Redesigning the completion network. We design a novel completion network, which consists of several Multi-Path Blocks (MPBs) to aggregate multi-scale features and is free from the lossy downsampling operations. 2) Distilling rich knowledge from the multi-frame model. We design a novel knowledge distillation objective, dubbed Dense-to-Sparse Knowledge Distillation (DSKD). It transfers the dense, relation-based semantic knowledge from the multi-frame teacher to the single-frame student, significantly improving the representation learning of the single-frame model. 3) Completion label rectification. We propose a simple yet effective label rectification strategy, which uses off-the-shelf panoptic segmentation labels to remove the traces of dynamic objects in completion labels, greatly improving the performance of deep models especially for those moving objects. Extensive experiments are conducted on two public semantic scene completion benchmarks, i.e., SemanticKITTI and SemanticPOSS. Our SCPNet ranks 1st on the SemanticKITTI semantic scene completion challenge and surpasses the competitive S3CNet by 7.2 mIoU. SCPNet also outperforms previous completion algorithms on the SemanticPOSS dataset. Besides, our method also achieves competitive results on SemanticKITTI semantic segmentation tasks, showing that knowledge learned in scene completion is beneficial to the segmentation task.
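The label rectification step can be illustrated on a toy flat voxel list: wherever an off-the-shelf panoptic segmentation flags a voxel as belonging to a dynamic object, the completion label is reset to empty. This is a simplified sketch of the idea; the voxel layout, label values, and `empty` convention are hypothetical.

```python
def rectify_labels(completion_labels, dynamic_mask, empty=0):
    """Erase the traces that moving objects leave in completion labels:
    every voxel flagged as dynamic by panoptic segmentation is reset
    to the 'empty' label (a simplified version of SCPNet's rectification)."""
    return [empty if dyn else lab
            for lab, dyn in zip(completion_labels, dynamic_mask)]

labels = [3, 3, 7, 7, 0]                       # 7 = trace smeared by a moving car
dynamic = [False, False, True, True, False]    # from panoptic segmentation
clean = rectify_labels(labels, dynamic)
```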

Optimal Proposal Learning for Deployable End-to-End Pedestrian Detection
Song, Xiaolin and Chen, Binghui and Li, Pengyu and He, Jun-Yan and Wang, Biao and Geng, Yifeng and Xie, Xuansong and Zhang, Honggang



Research question: How to train an end-to-end pedestrian detection model that eliminates Non-Maximum Suppression (NMS) post-processing.
Motivation: Although several methods have been explored, most still suffer from long training times and complex deployment, preventing their use in practical industrial applications.
Method: We propose an Optimal Proposal Learning (OPL) framework for deployable end-to-end pedestrian detection. Specifically, we use a lightweight CNN-based detector and introduce two novel modules: a Coarse-to-Fine (C2F) learning strategy that proposes precise positive proposals for Ground-Truth (GT) instances by reducing the ambiguity of sample assignment and output in training and testing, and a Completed Proposal Network (CPN) that produces extra information compensation to further recall hard pedestrian samples.
Results: Extensive experiments on CrowdHuman, TJU-Ped, and Caltech show that the proposed OPL method significantly outperforms competing methods.

End-to-end pedestrian detection focuses on training a pedestrian detection model via discarding the Non-Maximum Suppression (NMS) post-processing. Though a few methods have been explored, most of them still suffer from longer training time and more complex deployment, and cannot be deployed in actual industrial applications. In this paper, we intend to bridge this gap and propose an Optimal Proposal Learning (OPL) framework for deployable end-to-end pedestrian detection. Specifically, we achieve this goal by using a lightweight CNN-based detector and introducing two novel modules: a Coarse-to-Fine (C2F) learning strategy for proposing precise positive proposals for the Ground-Truth (GT) instances by reducing the ambiguity of sample assignment/output in training/testing respectively, and a Completed Proposal Network (CPN) for producing extra information compensation to further recall the hard pedestrian samples. Extensive experiments are conducted on CrowdHuman, TJU-Ped and Caltech, and the results show that our proposed OPL method significantly outperforms the competing methods.

Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection
Zhang, Chen and Li, Guorong and Qi, Yuankai and Wang, Shuhui and Qing, Laiyun and Huang, Qingming and Yang, Ming-Hsuan



Research question: Weakly supervised video anomaly detection aims to identify abnormal events in videos using only video-level labels.
Motivation: Recently, two-stage self-training methods have achieved significant improvements by self-generating pseudo labels and using them to refine anomaly scores. Since the pseudo labels play a crucial role, we propose an enhancement framework that exploits their completeness and uncertainty properties for effective self-training.
Method: We first design a multi-head classification module (each head acts as a classifier) with a diversity loss that maximizes the distribution differences of predicted pseudo labels across heads, encouraging the generated pseudo labels to cover as many abnormal events as possible. We then design an iterative uncertainty-aware pseudo-label refinement strategy that improves not only the initial pseudo labels but also the updated ones obtained by the desired classifier in the second stage.
Results: Extensive experimental results show that the proposed method outperforms state-of-the-art approaches on the UCF-Crime, TAD, and XD-Violence benchmark datasets.

Weakly supervised video anomaly detection aims to identify abnormal events in videos using only video-level labels. Recently, two-stage self-training methods have achieved significant improvements by self-generating pseudo labels and self-refining anomaly scores with these labels. As the pseudo labels play a crucial role, we propose an enhancement framework by exploiting completeness and uncertainty properties for effective self-training. Specifically, we first design a multi-head classification module (each head serves as a classifier) with a diversity loss to maximize the distribution differences of predicted pseudo labels across heads. This encourages the generated pseudo labels to cover as many abnormal events as possible. We then devise an iterative uncertainty pseudo label refinement strategy, which improves not only the initial pseudo labels but also the updated ones obtained by the desired classifier in the second stage. Extensive experimental results demonstrate the proposed method performs favorably against state-of-the-art approaches on the UCF-Crime, TAD, and XD-Violence benchmark datasets.
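The diversity objective can be sketched as the negative mean pairwise L1 distance between the heads' snippet-level predictions, so that minimizing the loss pushes the heads' pseudo labels apart. This is an illustrative formulation only; the paper's exact loss and head outputs may differ.

```python
def diversity_loss(head_probs):
    """head_probs: H lists of T per-snippet anomaly probabilities.
    Returns minus the mean pairwise L1 distance between heads,
    so gradient descent on it maximizes cross-head disagreement."""
    H, T = len(head_probs), len(head_probs[0])
    total, pairs = 0.0, 0
    for i in range(H):
        for j in range(i + 1, H):
            total += sum(abs(a - b)
                         for a, b in zip(head_probs[i], head_probs[j])) / T
            pairs += 1
    return -total / pairs

identical = [[0.1, 0.9, 0.3]] * 3                 # heads collapsed together
spread = [[0.1, 0.9, 0.3],
          [0.8, 0.2, 0.6],
          [0.4, 0.5, 0.9]]                        # heads disagree
```

Identical heads give a loss of zero, while diverse heads give a strictly lower (more negative) loss, which is what the training signal rewards.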

Full or Weak Annotations? An Adaptive Strategy for Budget-Constrained Annotation Campaigns
Tejero, Javier Gamazo and Zinkernagel, Martin S. and Wolf, Sebastian and Sznitman, Raphael and Márquez-Neila, Pablo



Research question: How to allocate an annotation budget for a new machine learning task, particularly in image segmentation applications.
Motivation: Manually annotating relevant image content is expensive, time-consuming, and requires expertise. Although advances in weakly supervised and transfer learning allow segmentation models to benefit from various annotation types, for any new domain application the dataset builder still needs to define a strategy for distributing full segmentations and other weak annotations.
Method: We propose a new approach for determining the annotation strategy of a segmentation dataset: given a fixed budget, it estimates what proportions of segmentation and classification annotations should be collected. To this end, the method sequentially determines these proportions for budget fractions by modeling the expected improvement of the final segmentation model.
Results: Our experiments show that, across many different annotation budgets and datasets, the method yields annotation performance very close to optimal.

Annotating new datasets for machine learning tasks is tedious, time-consuming, and costly. For segmentation applications, the burden is particularly high as manual delineations of relevant image content are often extremely expensive or can only be done by experts with domain-specific knowledge. Thanks to developments in transfer learning and training with weak supervision, segmentation models can now also greatly benefit from annotations of different kinds. However, for any new domain application looking to use weak supervision, the dataset builder still needs to define a strategy to distribute full segmentation and other weak annotations. Doing so is challenging, however, as it is a priori unknown how to distribute an annotation budget for a given new dataset. To this end, we propose a novel approach to determine annotation strategies for segmentation datasets, estimating what proportion of segmentation and classification annotations should be collected given a fixed budget. To do so, our method sequentially determines proportions of segmentation and classification annotations to collect for budget-fractions by modeling the expected improvement of the final segmentation model. We show in our experiments that our approach yields annotations that perform very close to the optimal for a number of different annotation budgets and datasets.

Leveraging Hidden Positives for Unsupervised Semantic Segmentation
Seong, HyunSeok and Moon, WonJun and Lee, SuBeen and Heo, Jae-Pil



Research question: How to improve unsupervised semantic segmentation performance while ensuring task-specific training guidance and local semantic consistency.
Motivation: Although recent work using a vision transformer (ViT) backbone performs well, it still lacks consideration of task-specific training guidance and local semantic consistency.
Method: We leverage contrastive learning by mining hidden positives to learn rich semantic relationships and ensure semantic consistency in local regions. Specifically, two types of global hidden positives, task-agnostic and task-specific, are discovered for each anchor based on feature similarities defined by a pre-trained backbone and a segmentation head in training, respectively. Gradually increasing the contribution of the latter induces the model to capture task-specific semantic features. In addition, a gradient propagation strategy is introduced to learn semantic consistency between adjacent patches.
Results: The method achieves new state-of-the-art results on the COCO-stuff, Cityscapes, and Potsdam-3 datasets.

Dramatic demand for manpower to label pixel-level annotations triggered the advent of unsupervised semantic segmentation. Although the recent work employing the vision transformer (ViT) backbone shows exceptional performance, there is still a lack of consideration for task-specific training guidance and local semantic consistency. To tackle these issues, we leverage contrastive learning by excavating hidden positives to learn rich semantic relationships and ensure semantic consistency in local regions. Specifically, we first discover two types of global hidden positives, task-agnostic and task-specific ones for each anchor based on the feature similarities defined by a fixed pre-trained backbone and a segmentation head-in-training, respectively. A gradual increase in the contribution of the latter induces the model to capture task-specific semantic features. In addition, we introduce a gradient propagation strategy to learn semantic consistency between adjacent patches, under the inherent premise that nearby patches are highly likely to possess the same semantics. Specifically, we add the loss propagating to local hidden positives, semantically similar nearby patches, in proportion to the predefined similarity scores. With these training schemes, our proposed method achieves new state-of-the-art (SOTA) results in COCO-stuff, Cityscapes, and Potsdam-3 datasets. Our code is available at: https://github.com/hynnsk/HP.

ALSO: Automotive Lidar Self-Supervision by Occupancy Estimation
Boulch, Alexandre and Sautier, Corentin and Michele, Björn and Puy, Gilles and Marlet, Renaud



Research question: This paper proposes a new self-supervised method for pre-training deep perception models that operate on point clouds.
Motivation: Most current pre-training methods require large amounts of labeled data, whereas the proposed method can learn useful representations without any annotations.
Method: The perception model is pre-trained on a pretext task of reconstructing the surface on which the 3D points were sampled, with the underlying latent vectors used as input to the perception head.
Results: Experiments show that, on various autonomous driving datasets and for both semantic segmentation and object detection, the method learns useful representations and outperforms existing approaches.

We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds. The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled, and to use the underlying latent vectors as input to the perception head. The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information that can be used to boost an actual perception task. This principle has a very simple formulation, which makes it both easy to implement and widely applicable to a large range of 3D sensors and deep networks performing semantic segmentation or object detection. In fact, it supports a single-stream pipeline, as opposed to most contrastive learning approaches, allowing training on limited resources. We conducted extensive experiments on various autonomous driving datasets, involving very different kinds of lidars, for both semantic segmentation and object detection. The results show the effectiveness of our method to learn useful representations without any annotation, compared to existing approaches. The code is available at github.com/valeoai/ALSO

Object Detection With Self-Supervised Scene Adaptation
Zhang, Zekun and Hoai, Minh



Research question: How to improve the performance of a trained object detector on scenes with fixed camera perspectives.
Motivation: On scenes with fixed camera perspectives, existing object detectors leave room for improvement.
Method: We propose a self-supervised adaptation approach that adapts the detector via cross-teaching, using pseudo ground-truth labels generated by the detector itself and an object tracker. Exploiting background equivariance, artifact-free object mixup serves as data augmentation and accurate background extraction as an additional input modality.
Results: Experiments show that the method improves the average precision of the original detector, outperforming previous state-of-the-art self-supervised domain-adaptive object detection methods by a large margin.

This paper proposes a novel method to improve the performance of a trained object detector on scenes with fixed camera perspectives based on self-supervised adaptation. Given a specific scene, the trained detector is adapted using pseudo-ground truth labels generated by the detector itself and an object tracker in a cross-teaching manner. When the camera perspective is fixed, our method can utilize the background equivariance by proposing artifact-free object mixup as a means of data augmentation, and utilize accurate background extraction as an additional input modality. We also introduce a large-scale and diverse dataset for the development and evaluation of scene-adaptive object detection. Experiments on this dataset show that our method can improve the average precision of the original detector, outperforming the previous state-of-the-art self-supervised domain adaptive object detection methods by a large margin. Our dataset and code are published at https://github.com/cvlab-stonybrook/scenes100.

DeepLSD: Line Segment Detection and Refinement With Deep Image Gradients
Pautrat, Rémi and Barath, Daniel and Larsson, Viktor and Oswald, Martin R. and Pollefeys, Marc



Research question: How to make line segment detectors accurate and robust while allowing them to be trained without ground-truth lines.
Motivation: Traditional gradient-based line segment detectors are fast and accurate but lack robustness on noisy images and under challenging conditions, while learned detectors are more repeatable but less accurate and biased toward wireframe lines.
Method: We propose DeepLSD, a new line segment detector combining traditional and learned approaches: a deep network generates a line attraction field, which is converted into surrogate image-gradient magnitudes and angles and then fed to any existing handcrafted line segment detector. We also propose a new optimization tool that refines line segments based on the attraction field and vanishing points.
Results: The method's performance is demonstrated on low-level line detection metrics as well as on downstream tasks across multiple challenging datasets.

Line segments are ubiquitous in our human-made world and are increasingly used in vision tasks. They are complementary to feature points thanks to their spatial extent and the structural information they provide. Traditional line detectors based on the image gradient are extremely fast and accurate, but lack robustness in noisy images and challenging conditions. Their learned counterparts are more repeatable and can handle challenging images, but at the cost of a lower accuracy and a bias towards wireframe lines. We propose to combine traditional and learned approaches to get the best of both worlds: an accurate and robust line detector that can be trained in the wild without ground truth lines. Our new line segment detector, DeepLSD, processes images with a deep network to generate a line attraction field, before converting it to a surrogate image gradient magnitude and angle, which is then fed to any existing handcrafted line detector. Additionally, we propose a new optimization tool to refine line segments based on the attraction field and vanishing points. This refinement improves the accuracy of current deep detectors by a large margin. We demonstrate the performance of our method on low-level line detection metrics, as well as on several downstream tasks using multiple challenging datasets. The source code and models are available at https://github.com/cvg/DeepLSD.
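The field-to-gradient conversion can be sketched per pixel: the attraction vector's length drives a surrogate gradient magnitude that peaks on lines, and its direction gives the gradient angle. The fall-off function below is an illustrative choice under stated assumptions, not the paper's exact formulation.

```python
import math

def field_to_gradient(vec):
    """Convert one attraction-field vector (pixel offset to the closest
    line point) into a surrogate gradient magnitude and angle.
    Magnitude decays with distance to the line; the angle points
    toward the line, mimicking an image gradient across an edge."""
    dx, dy = vec
    dist = math.hypot(dx, dy)
    magnitude = 1.0 / (1.0 + dist)   # strongest response on the line itself
    angle = math.atan2(dy, dx)       # surrogate gradient orientation
    return magnitude, angle

on_line = field_to_gradient((0.0, 0.0))  # pixel lying on a line
far = field_to_gradient((3.0, 4.0))      # pixel 5 px away from a line
```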

Learning Common Rationale To Improve Self-Supervised Representation for Fine-Grained Visual Recognition Problems
Shu, Yangyang and van den Hengel, Anton and Liu, Lingqiao



Research question: Existing self-supervised learning methods perform poorly on fine-grained visual recognition (FGVR) tasks because their optimization objectives are not suited to capturing the subtle differences in FGVR.
Motivation: To address this, we propose learning an additional screening mechanism that identifies cues common across instances and classes, i.e., common rationales, to improve FGVR performance.
Method: We exploit the GradCAM induced from the SSL objective, so that a common rationale detector can be learned without any pre-trained object parts or saliency detectors. Specifically, we fit the GradCAM with a branch of limited fitting capacity, enabling it to capture the common rationales and discard less common discriminative patterns.
Results: Extensive experimental results on four visual tasks show that the method brings significant improvements across different evaluation settings.

Self-supervised learning (SSL) strategies have demonstrated remarkable performance in various recognition tasks. However, both our preliminary investigation and recent studies suggest that they may be less effective in learning representations for fine-grained visual recognition (FGVR) since many features helpful for optimizing SSL objectives are not suitable for characterizing the subtle differences in FGVR. To overcome this issue, we propose learning an additional screening mechanism to identify discriminative clues commonly seen across instances and classes, dubbed as common rationales in this paper. Intuitively, common rationales tend to correspond to the discriminative patterns from the key parts of foreground objects. We show that a common rationale detector can be learned by simply exploiting the GradCAM induced from the SSL objective without using any pre-trained object parts or saliency detectors, making it seamless to integrate with the existing SSL process. Specifically, we fit the GradCAM with a branch with limited fitting capacity, which allows the branch to capture the common rationales and discard the less common discriminative patterns. At the test stage, the branch generates a set of spatial weights to selectively aggregate features representing an instance. Extensive experimental results on four visual tasks demonstrate that the proposed method can lead to a significant improvement in different evaluation settings.

Co-Salient Object Detection With Uncertainty-Aware Group Exchange-Masking
Wu, Yang and Song, Huihui and Liu, Bo and Zhang, Kaihua and Liu, Dong



Research question: The traditional co-salient object detection (CoSOD) task segments the common salient objects in a group of relevant images, but existing CoSOD models lack robustness when irrelevant images appear in the test group, which hinders their use in real-world applications.
Motivation: To address this, this paper proposes a group exchange-masking (GEM) strategy for learning a robust CoSOD model.
Method: Taking two image groups containing different types of salient objects as input, GEM first selects a set of images from each group via a proposed learning-based strategy and then exchanges them. The feature extraction module accounts for both the uncertainty caused by the irrelevant images and the group consensus among the remaining relevant images. A latent-variable generator branch built from a conditional variational autoencoder produces uncertainty-based global stochastic features, while a CoSOD transformer branch captures correlation-based local features carrying group consistency information. Finally, the outputs of the two branches are concatenated and fed into a transformer-based decoder to produce robust co-saliency predictions.
Results: Extensive evaluations of co-saliency detection with and without irrelevant images demonstrate that the method outperforms a variety of state-of-the-art approaches.

The traditional definition of co-salient object detection (CoSOD) task is to segment the common salient objects in a group of relevant images. Existing CoSOD models by default adopt the group consensus assumption. This brings about a model robustness defect under the condition of irrelevant images in the testing image group, which hinders the use of CoSOD models in real-world applications. To address this issue, this paper presents a group exchange-masking (GEM) strategy for robust CoSOD model learning. With two groups of images containing different types of salient objects as input, the GEM first selects a set of images from each group by the proposed learning-based strategy, then these images are exchanged. The proposed feature extraction module considers both the uncertainty caused by the irrelevant images and group consensus in the remaining relevant images. We design a latent variable generator branch which is made of a conditional variational autoencoder to generate uncertainty-based global stochastic features. A CoSOD transformer branch is devised to capture the correlation-based local features that contain the group consistency information. At last, the outputs of the two branches are concatenated and fed into a transformer-based decoder, producing robust co-saliency prediction. Extensive evaluations on co-saliency detection with and without irrelevant images demonstrate the superiority of our method over a variety of state-of-the-art methods.
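The exchange step can be sketched with plain lists: swap a few images between two groups so that each group receives "irrelevant" members. Random selection below stands in for the paper's learning-based selection strategy, and the group contents are hypothetical placeholders.

```python
import random

def group_exchange(group_a, group_b, k, seed=0):
    """Swap k randomly chosen items between two image groups, injecting
    irrelevant images into each for robustness training."""
    rng = random.Random(seed)
    idx_a = rng.sample(range(len(group_a)), k)
    idx_b = rng.sample(range(len(group_b)), k)
    a, b = list(group_a), list(group_b)
    for i, j in zip(idx_a, idx_b):
        a[i], b[j] = b[j], a[i]
    return a, b

cats = ["cat1", "cat2", "cat3", "cat4"]
dogs = ["dog1", "dog2", "dog3", "dog4"]
new_cats, new_dogs = group_exchange(cats, dogs, k=1)
```

After the exchange, each group contains exactly one image from the other class, which the downstream model must learn to treat as irrelevant.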

Extracting Class Activation Maps From Non-Discriminative Features As Well
Chen, Zhaozheng and Sun, Qianru



Research question: Extracting class activation maps (CAMs) from a classification model often yields poor coverage of foreground objects: only the discriminative region (e.g., the "head" of a "sheep") is recognized, while the rest (e.g., the "legs") is mistaken for background.
Motivation: The crux is that the weights of the classifier (used to compute the CAM) capture only the discriminative features of objects. We address this by introducing a new CAM computation method that explicitly captures non-discriminative features as well, expanding the CAM to cover whole objects.
Method: We omit the last pooling layer of the classification model and cluster all local features of an object class, where "local" means "at a spatial pixel position". We call the resulting K cluster centers local prototypes; they represent local semantics such as the "head", "legs", and "body" of a "sheep". Given a new image of the class, we compare its unpooled features to every prototype, derive K similarity matrices, and aggregate them into a heatmap (our CAM). The CAM thus captures all local features of the class without discrimination.
Results: We evaluate the method on the challenging weakly supervised semantic segmentation (WSSS) task, plugging it into multiple state-of-the-art WSSS methods such as MCTformer and AMN by simply replacing their original CAMs with ours. Extensive experiments on standard WSSS benchmarks (PASCAL VOC and MS COCO) show the superiority of our method: consistent improvements with little computational overhead.

Extracting class activation maps (CAM) from a classification model often results in poor coverage on foreground objects, i.e., only the discriminative region (e.g., the "head" of "sheep") is recognized and the rest (e.g., the "leg" of "sheep") mistakenly as background. The crux behind is that the weight of the classifier (used to compute CAM) captures only the discriminative features of objects. We tackle this by introducing a new computation method for CAM that explicitly captures non-discriminative features as well, thereby expanding CAM to cover whole objects. Specifically, we omit the last pooling layer of the classification model, and perform clustering on all local features of an object class, where "local" means "at a spatial pixel position". We call the resultant K cluster centers local prototypes, which represent local semantics like the "head", "leg", and "body" of "sheep". Given a new image of the class, we compare its unpooled features to every prototype, derive K similarity matrices, and then aggregate them into a heatmap (i.e., our CAM). Our CAM thus captures all local features of the class without discrimination. We evaluate it in the challenging tasks of weakly-supervised semantic segmentation (WSSS), and plug it into multiple state-of-the-art WSSS methods, such as MCTformer and AMN, by simply replacing their original CAM with ours. Our extensive experiments on standard WSSS benchmarks (PASCAL VOC and MS COCO) show the superiority of our method: consistent improvements with little computational overhead.
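The prototype-based CAM can be sketched directly: compare each unpooled local feature to every prototype by cosine similarity and aggregate the K similarity maps into one heatmap. Max-aggregation here is a simplification of the paper's aggregation, and the 2-D features and prototypes are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def prototype_cam(features, prototypes):
    """features: H x W grid of C-dim local feature vectors.
    prototypes: K local prototypes from clustering.
    Each pixel's CAM value is its best similarity to any prototype,
    so non-discriminative parts matching any prototype also light up."""
    return [[max(cosine(f, p) for p in prototypes) for f in row]
            for row in features]

feats = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 1.0], [-1.0, 0.0]]]
protos = [[1.0, 0.0], [0.0, 2.0]]   # e.g. "head" and "leg" prototypes
cam = prototype_cam(feats, protos)
```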

Towards Professional Level Crowd Annotation of Expert Domain Data
Wang, Pei and Vasconcelos, Nuno



Research question: Image recognition in expert domains usually requires fine-grained labeling, which is costly and limits dataset sizes and the accuracy of learning systems.
Motivation: To address this, we consider crowdsourcing the annotation of expert data and propose a new approach based on semi-supervised learning and human filtering.
Method: We propose a human-in-the-loop semi-supervised learning method (SSL-HF) in which crowdsource workers act as filters of pseudo labels, replacing the unreliable confidence thresholding used by state-of-the-art semi-supervised learning methods.
Results: Experiments show that SSL-HF significantly outperforms various alternative approaches on several benchmarks.

Image recognition on expert domains is usually fine-grained and requires expert labeling, which is costly. This limits dataset sizes and the accuracy of learning systems. To address this challenge, we consider annotating expert data with crowdsourcing. This is denoted as PrOfeSsional lEvel cRowd (POSER) annotation. A new approach, based on semi-supervised learning (SSL) and denoted as SSL with human filtering (SSL-HF) is proposed. It is a human-in-the-loop SSL method, where crowd-source workers act as filters of pseudo-labels, replacing the unreliable confidence thresholding used by state-of-the-art SSL methods. To enable annotation by non-experts, classes are specified implicitly, via positive and negative sets of examples and augmented with deliberative explanations, which highlight regions of class ambiguity. In this way, SSL-HF leverages the strong low-shot learning and confidence estimation ability of humans to create an intuitive but effective labeling experience. Experiments show that SSL-HF significantly outperforms various alternative approaches in several benchmarks.

Semi-Weakly Supervised Object Kinematic Motion Prediction
Liu, Gengxin and Sun, Qian and Huang, Haibin and Ma, Chongyang and Guo, Yulan and Yi, Li and Huang, Hui and Hu, Ruizhen



Research question: This paper addresses 3D object kinematic motion prediction, i.e., identifying the mobile parts of an object and their corresponding motion parameters.
Motivation: Large variations in the topological structure and geometric details of 3D objects, together with the lack of large-scale labeled data, make deep-learning-based approaches to this task challenging.
Method: The paper tackles kinematic motion prediction in a semi-weakly supervised manner. First, it leverages existing large-scale semantic segmentation datasets and object part segmentation methods; second, a graph neural network learns the mapping between hierarchical part-level segmentation and mobile part parameters, which is further refined via geometric alignment.
Results: Experiments show significant performance gains for kinematic motion prediction on 3D partial scans.

Given a 3D object, kinematic motion prediction aims to identify the mobile parts as well as the corresponding motion parameters. Due to the large variations in both topological structure and geometric details of 3D objects, this remains a challenging task, and the lack of large scale labeled data also constrains the performance of deep learning based approaches. In this paper, we tackle the task of object kinematic motion prediction in a semi-weakly supervised manner. Our key observations are two-fold. First, although 3D datasets with fully annotated motion labels are limited, there are existing datasets and methods for object part semantic segmentation at large scale. Second, semantic part segmentation and mobile part segmentation are not always consistent, but it is possible to detect the mobile parts from the underlying 3D structure. Towards this end, we propose a graph neural network to learn the map between hierarchical part-level segmentation and mobile parts parameters, which are further refined based on geometric alignment. This network can be first trained on the PartNet-Mobility dataset with fully labeled mobility information and then applied on the PartNet dataset with fine-grained and hierarchical part-level segmentation. The network predictions yield a large scale of 3D objects with pseudo labeled mobility information and can further be used for weakly-supervised learning with pre-existing segmentation. Our experiments show there are significant performance boosts with the augmented data for the previous method designed for kinematic motion prediction on 3D partial scans.

Improving Robustness of Semantic Segmentation to Motion-Blur Using Class-Centric Augmentation
Aakanksha and Rajagopalan, A.N.



Research question: Improving semantic segmentation performance on blurred images.
Motivation: Existing research focuses mainly on segmentation of sharp images and rarely handles blurred ones.
Method: We propose a Class-Centric Motion-Blur Augmentation (CCMBA) strategy that uses segmentation map annotations to generate space-variant blur, enabling the network to simultaneously learn semantic segmentation for clean images, images with egomotion blur, and images with dynamic scene blur.
Results: The method performs well for both CNN- and Vision Transformer-based semantic segmentation networks on the PASCAL VOC and Cityscapes datasets, and shows improved generalization to complex real-world blur on the commonly used deblurring datasets GoPro and REDS.

Semantic segmentation involves classifying each pixel into one of a pre-defined set of object/stuff classes. Such a fine-grained detection and localization of objects in the scene is challenging by itself. The complexity increases manifold in the presence of blur. With cameras becoming increasingly light-weight and compact, blur caused by motion during capture time has become unavoidable. Most research has focused on improving segmentation performance for sharp clean images and the few works that deal with degradations, consider motion-blur as one of many generic degradations. In this work, we focus exclusively on motion-blur and attempt to achieve robustness for semantic segmentation in its presence. Based on the observation that segmentation annotations can be used to generate synthetic space-variant blur, we propose a Class-Centric Motion-Blur Augmentation (CCMBA) strategy. Our approach involves randomly selecting a subset of semantic classes present in the image and using the segmentation map annotations to blur only the corresponding regions. This enables the network to simultaneously learn semantic segmentation for clean images, images with egomotion blur, as well as images with dynamic scene blur. We demonstrate the effectiveness of our approach for both CNN and Vision Transformer-based semantic segmentation networks on PASCAL VOC and Cityscapes datasets. We also illustrate the improved generalizability of our method to complex real-world blur by evaluating on the commonly used deblurring datasets GoPro and REDS.
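The augmentation can be sketched as masked blending: blur the whole image, then keep the blurred pixels only where the segmentation map matches the chosen classes. A box blur stands in for the motion-blur kernels used in the paper, and the toy image and label values are hypothetical.

```python
def box_blur(img, k=3):
    """Naive k x k box blur with edge clamping (a stand-in for the
    space-variant motion-blur kernels used by CCMBA)."""
    H, W = len(img), len(img[0])
    r = k // 2
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), H - 1)
                    xx = min(max(x + dx, 0), W - 1)
                    acc += img[yy][xx]
            out[y][x] = acc / (k * k)
    return out

def class_centric_blur(img, seg, classes_to_blur):
    """Blur only the pixels whose segmentation label is in
    classes_to_blur, leaving the remaining regions sharp."""
    blurred = box_blur(img)
    return [[blurred[y][x] if seg[y][x] in classes_to_blur else img[y][x]
             for x in range(len(img[0]))] for y in range(len(img))]

img = [[0.0] * 5 for _ in range(5)]
img[2][2] = 9.0                                     # one bright pixel
seg = [[1 if x < 3 else 0 for x in range(5)] for _ in range(5)]
aug = class_centric_blur(img, seg, {1})             # blur class 1 only
```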

SMAE: Few-Shot Learning for HDR Deghosting With Saturation-Aware Masked Autoencoders
Yan, Qingsen and Zhang, Song and Chen, Weiye and Tang, Hao and Zhu, Yu and Sun, Jinqiu and Van Gool, Luc and Zhang, Yanning



Research question: How to generate high-quality High Dynamic Range (HDR) images from limited data.
Motivation: Most methods based on Deep Neural Networks (DNNs) require large amounts of training data with ground truth, which is tedious and time-consuming to collect. A few studies have attempted to generate satisfactory HDR images with limited training data, but modern DNNs easily overfit when trained on only a few images.
Method: We propose a novel semi-supervised approach, SSHDR, that realizes few-shot HDR imaging via two training stages: we first generate the content of saturated regions with a self-supervised mechanism and then address ghosting via an iterative semi-supervised learning framework.
Results: Experiments show that SSHDR outperforms state-of-the-art methods both quantitatively and qualitatively across various datasets, achieving appealing HDR visualization with few labeled samples.

Generating a high-quality High Dynamic Range (HDR) image from dynamic scenes has recently been extensively studied by exploiting Deep Neural Networks (DNNs). Most DNNs-based methods require a large amount of training data with ground truth, requiring tedious and time-consuming work. Few-shot HDR imaging aims to generate satisfactory images with limited data. However, it is difficult for modern DNNs to avoid overfitting when trained on only a few images. In this work, we propose a novel semi-supervised approach to realize few-shot HDR imaging via two stages of training, called SSHDR. Unlike previous methods that directly recover content and remove ghosts simultaneously, which is hard to achieve optimally, we first generate content of saturated regions with a self-supervised mechanism and then address ghosts via an iterative semi-supervised learning framework. Concretely, considering that saturated regions can be regarded as masking Low Dynamic Range (LDR) input regions, we design a Saturated Mask AutoEncoder (SMAE) to learn a robust feature representation and reconstruct a non-saturated HDR image. We also propose an adaptive pseudo-label selection strategy to pick high-quality HDR pseudo-labels in the second stage to avoid the effect of mislabeled samples. Experiments demonstrate that SSHDR outperforms state-of-the-art methods quantitatively and qualitatively within and across different datasets, achieving appealing HDR visualization with few labeled samples.

Weakly Supervised Semantic Segmentation via Adversarial Learning of Classifier and Reconstructor
Kweon, Hyeokjun and Yoon, Sung-Hoon and Yoon, Kuk-Jin



Research question: In weakly supervised semantic segmentation, Class Activation Maps (CAMs) usually fail to cover the whole object and are activated on irrelevant regions.
Motivation: To address this, we propose a new weakly supervised semantic segmentation framework based on adversarial learning of a classifier and an image reconstructor.
Method: We simultaneously train two models: a classifier that generates CAMs decomposing the image into segments, and a reconstructor that measures the inferability between segments. If a segment can be reconstructed from the other segments, that segment is imprecise.
Results: We verify the advantages of the framework in extensive ablation studies. Our method achieves new state-of-the-art performance on both PASCAL VOC 2012 and MS COCO 2014.

In Weakly Supervised Semantic Segmentation (WSSS), Class Activation Maps (CAMs) usually 1) do not cover the whole object and 2) are activated on irrelevant regions. To address the issues, we propose a novel WSSS framework via adversarial learning of a classifier and an image reconstructor. When an image is perfectly decomposed into class-wise segments, information (i.e., color or texture) of a single segment could not be inferred from the other segments. Therefore, inferability between the segments can represent the preciseness of segmentation. We quantify the inferability as a reconstruction quality of one segment from the other segments. If one segment could be reconstructed from the others, then the segment would be imprecise. To bring this idea into WSSS, we simultaneously train two models: a classifier generating CAMs that decompose an image into segments and a reconstructor that measures the inferability between the segments. As in GANs, while being alternately trained in an adversarial manner, the two networks provide positive feedback to each other. We verify the superiority of the proposed framework with extensive ablation studies. Our method achieves new state-of-the-art performances on both PASCAL VOC 2012 and MS COCO 2014. The code is available at https://github.com/sangrockEG/ACR.

Foundation Model Drives Weakly Incremental Learning for Semantic Segmentation
Yu, Chaohui and Zhou, Qiang and Li, Jingliang and Yuan, Jianlong and Wang, Zhibin and Wang, Fan



Research question: How to effectively exploit image-level labels to learn new classes while avoiding forgetting old ones.
Motivation: Existing weakly incremental learning for semantic segmentation methods have achieved some success, but image-level labels cannot provide enough detail to locate each segment, which limits their performance.
Method: We propose a novel and data-efficient framework for weakly incremental learning for semantic segmentation (FMWISS), including pre-training-based co-segmentation that distills knowledge from foundation models to generate dense pseudo labels, a teacher-student architecture that optimizes the noisy pseudo masks, and memory-based copy-paste augmentation that addresses catastrophic forgetting of old classes.
Results: Extensive experiments on the Pascal VOC and COCO datasets demonstrate the superior performance of the framework; e.g., FMWISS achieves 70.7% and 73.3% in the 15-5 VOC setting, outperforming the state-of-the-art method by 3.4% and 6.1%, respectively.

Modern incremental learning for semantic segmentation methods usually learn new categories based on dense annotations. Although these methods achieve promising results, pixel-by-pixel labeling is costly and time-consuming. Weakly incremental learning for semantic segmentation (WILSS) is a novel and attractive task, which aims at learning to segment new classes from cheap and widely available image-level labels. Despite the comparable results, the image-level labels cannot provide details to locate each segment, which limits the performance of WILSS. This inspires us to consider how to improve and effectively utilize the supervision of new classes given image-level labels while avoiding forgetting old ones. In this work, we propose a novel and data-efficient framework for WILSS, named FMWISS. Specifically, we propose pre-training based co-segmentation to distill the knowledge of complementary foundation models for generating dense pseudo labels. We further optimize the noisy pseudo masks with a teacher-student architecture, where a plug-in teacher is optimized with a proposed dense contrastive loss. Moreover, we introduce memory-based copy-paste augmentation to improve the catastrophic forgetting problem of old classes. Extensive experiments on Pascal VOC and COCO datasets demonstrate the superior performance of our framework, e.g., FMWISS achieves 70.7% and 73.3% in the 15-5 VOC setting, outperforming the state-of-the-art method by 3.4% and 6.1%, respectively.

ACSeg: Adaptive Conceptualization for Unsupervised Semantic Segmentation
Li, Kehan and Wang, Zhennan and Cheng, Zesen and Yu, Runyi and Zhao, Yian and Song, Guoli and Liu, Chang and Yuan, Li and Chen, Jie



Research question: How to exploit the pixel-level semantic relationships captured by pre-trained vision models for effective unsupervised semantic segmentation.
Motivation: Pre-trained vision models can capture pixel-level semantic relationships, but turning these relationships into semantically consistent pixel groups or regions remains challenging.
Method: We propose an adaptive conceptualization approach (ACSeg) that encodes concepts as learnable prototypes and designs an Adaptive Concept Generator (ACG) that adapts these prototypes to the informative concepts of each image. Considering the varying scene complexity of different images, we also propose a modularity loss to optimize the ACG independently of the concept number.
Results: Experimental results show that the method outperforms existing techniques on unsupervised semantic segmentation.

Recently, self-supervised large-scale visual pre-training models have shown great promise in representing pixel-level semantic relationships, significantly promoting the development of unsupervised dense prediction tasks, e.g., unsupervised semantic segmentation (USS). The extracted relationship among pixel-level representations typically contains rich class-aware information that semantically identical pixel embeddings in the representation space gather together to form sophisticated concepts. However, leveraging the learned models to ascertain semantically consistent pixel groups or regions in the image is non-trivial since over-/under-clustering overwhelms the conceptualization procedure under various semantic distributions of different images. In this work, we investigate the pixel-level semantic aggregation in self-supervised ViT pre-trained models as image segmentation and propose the Adaptive Conceptualization approach for USS, termed ACSeg. Concretely, we explicitly encode concepts into learnable prototypes and design the Adaptive Concept Generator (ACG), which adaptively maps these prototypes to informative concepts for each image. Meanwhile, considering the scene complexity of different images, we propose the modularity loss to optimize ACG independent of the concept number based on estimating the intensity of pixel pairs belonging to the same concept. Finally, we turn the USS task into classifying the discovered concepts in an unsupervised manner. Extensive experiments with state-of-the-art results demonstrate the effectiveness of the proposed ACSeg.

Similarity Metric Learning for RGB-Infrared Group Re-Identification
Xiong, Jianghao and Lai, Jianhuang



Research question: This paper addresses cross-modality group re-identification (G-ReID), in particular RGB-infrared (RGB-IR) matching.
Motivation: Existing work mainly studies RGB-based problems; the RGB-IR cross-modality matching problem has not been studied yet.
Method: We propose a metric learning method for RGB-IR G-ReID, Closest Permutation Matching (CPM). Each group is modeled as a set of single-person features extracted by MPANet, and the Closest Permutation Distance (CPD) measures the similarity between two feature sets. CPD is invariant to order changes of group members, which solves the layout-change problem in G-ReID. We further introduce G-ReID without person labels: in this weakly supervised case, we design a Relation-aware Module (RAM) that exploits visual context and relations among group members to produce an ordering of features within each group, forming a group representation robust to modality change.
Results: Extensive experiments on the new large-scale RGB-IR G-ReID dataset CM-Group demonstrate the effectiveness of the proposed models and the complexity of CM-Group.

Group re-identification (G-ReID) aims to re-identify a group of people that is observed from non-overlapping camera systems. The existing literature has mainly addressed RGB-based problems, but RGB-infrared (RGB-IR) cross-modality matching problem has not been studied yet. In this paper, we propose a metric learning method Closest Permutation Matching (CPM) for RGB-IR G-ReID. We model each group as a set of single-person features which are extracted by MPANet, then we propose the metric Closest Permutation Distance (CPD) to measure the similarity between two sets of features. CPD is invariant with order changes of group members so that it solves the layout change problem in G-ReID. Furthermore, we introduce the problem of G-ReID without person labels. In the weak-supervised case, we design the Relation-aware Module (RAM) that exploits visual context and relations among group members to produce a modality-invariant order of features in each group, with which group member features within a set can be sorted to form a robust group representation against modality change. To support the study on RGB-IR G-ReID, we construct a new large-scale RGB-IR G-ReID dataset CM-Group. The dataset contains 15,440 RGB images and 15,506 infrared images of 427 groups and 1,013 identities. Extensive experiments on the new dataset demonstrate the effectiveness of the proposed models and the complexity of CM-Group. The code and dataset are available at: https://github.com/WhollyOat/CM-Group.
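The order-invariance of CPD can be sketched with a brute-force minimum over member permutations (a toy illustration with assumed 2-D features; the paper's features come from MPANet):

```python
from itertools import permutations

import numpy as np

def closest_permutation_distance(feats_a, feats_b):
    """Order-invariant distance between two groups of member features.

    Each argument is an (n, d) array of per-person features; the distance
    is the minimum, over all orderings of group B, of the summed Euclidean
    distances between matched members. Brute force, so only practical for
    small groups; a Hungarian assignment would scale better.
    """
    n = len(feats_a)
    best = float("inf")
    for perm in permutations(range(n)):
        d = sum(np.linalg.norm(feats_a[i] - feats_b[perm[i]]) for i in range(n))
        best = min(best, d)
    return best

a = np.array([[0.0, 0.0], [1.0, 1.0]])
b = np.array([[1.0, 1.0], [0.0, 0.0]])   # same members, shuffled order
```

Because the minimum runs over all permutations, shuffling the members of either group leaves the distance unchanged, which is the property that handles layout changes.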

PromptCAL: Contrastive Affinity Learning via Auxiliary Prompts for Generalized Novel Category Discovery
Zhang, Sheng and Khan, Salman and Shen, Zhiqiang and Naseer, Muzammal and Chen, Guangyi and Khan, Fahad Shahbaz



Research question: Existing semi-supervised learning models mostly fail to learn from unlabeled data drawn from novel semantic classes because of their closed-set assumption.
Motivation: To address this challenge, we target Generalized Novel Category Discovery (GNCD), which aims to categorize unlabeled training data from both known and novel classes by leveraging information from partially labeled known classes.
Method: We propose a two-stage contrastive affinity learning method with auxiliary visual prompts, dubbed PromptCAL, which discovers reliable pairwise sample affinities to learn better semantic clustering of known and novel classes.
Results: Experiments show that PromptCAL discovers novel classes more effectively even with limited annotations and surpasses the current state-of-the-art in overall accuracy (e.g., by nearly 11% on CUB-200 and 9% on ImageNet-100).

Although existing semi-supervised learning models achieve remarkable success in learning with unannotated in-distribution data, they mostly fail to learn on unlabeled data sampled from novel semantic classes due to their closed-set assumption. In this work, we target a pragmatic but under-explored Generalized Novel Category Discovery (GNCD) setting. The GNCD setting aims to categorize unlabeled training data coming from known and novel classes by leveraging the information of partially labeled known classes. We propose a two-stage Contrastive Affinity Learning method with auxiliary visual Prompts, dubbed PromptCAL, to address this challenging problem. Our approach discovers reliable pairwise sample affinities to learn better semantic clustering of both known and novel classes for the class token and visual prompts. First, we propose a discriminative prompt regularization loss to reinforce semantic discriminativeness of prompt-adapted pre-trained vision transformer for refined affinity relationships. Besides, we propose contrastive affinity learning to calibrate semantic representations based on our iterative semi-supervised affinity graph generation method for semantically-enhanced supervision. Extensive experimental evaluation demonstrates that our PromptCAL method is more effective in discovering novel classes even with limited annotations and surpasses the current state-of-the-art on generic and fine-grained benchmarks (e.g., with nearly 11% gain on CUB-200, and 9% on ImageNet-100) on overall accuracy. Our code will be released to the public.

Camouflaged Instance Segmentation via Explicit De-Camouflaging
Luo, Naisong and Pan, Yuwen and Sun, Rui and Zhang, Tianzhu and Xiong, Zhiwei and Wu, Feng



Research question: This paper addresses camouflaged instance segmentation: predicting instance-level masks of camouflaged objects, typically wild animals whose appearance is adapted to match their surroundings.
Motivation: Previous instance segmentation methods perform poorly on this task because they are easily disturbed by deceptive camouflage.
Method: We propose a novel De-camouflaging Network (DCNet), comprising a pixel-level camouflage decoupling module and an instance-level camouflage suppression module.
Results: Experiments show that DCNet outperforms state-of-the-art CIS methods on two benchmarks, e.g., with more than 5% average-precision gains on the COD10K and NC4K datasets.

Camouflaged Instance Segmentation (CIS) aims at predicting the instance-level masks of camouflaged objects, which are usually the animals in the wild adapting their appearance to match the surroundings. Previous instance segmentation methods perform poorly on this task as they are easily disturbed by the deceptive camouflage. To address these challenges, we propose a novel De-camouflaging Network (DCNet) including a pixel-level camouflage decoupling module and an instance-level camouflage suppression module. The proposed DCNet enjoys several merits. First, the pixel-level camouflage decoupling module can extract camouflage characteristics based on the Fourier transformation. Then a difference attention mechanism is proposed to eliminate the camouflage characteristics while reserving target object characteristics in the pixel feature. Second, the instance-level camouflage suppression module can aggregate rich instance information from pixels by use of instance prototypes. To mitigate the effect of background noise during segmentation, we introduce some reliable reference points to build a more robust similarity measurement. With the aid of these two modules, our DCNet can effectively model de-camouflaging and achieve accurate segmentation for camouflaged instances. Extensive experimental results on two benchmarks demonstrate that our DCNet performs favorably against state-of-the-art CIS methods, e.g., with more than 5% performance gains on COD10K and NC4K datasets in average precision.

Multi-Mode Online Knowledge Distillation for Self-Supervised Visual Representation Learning
Song, Kaiyou and Xie, Jin and Zhang, Shan and Luo, Zimeng



Research question: How to improve the visual representation learning performance of small models.
Motivation: Existing self-supervised visual representation learning methods can be boosted by knowledge distillation, but most of them transfer knowledge from a static pre-trained teacher to a student.
Method: We propose a Multi-mode Online Knowledge Distillation method (MOKD), in which two different models learn collaboratively in a self-supervised manner. MOKD consists of two modes: self-distillation, where each model performs self-supervised learning independently, and cross-distillation, which enables knowledge interaction between the models. In cross-distillation, a cross-attention feature search strategy enhances the semantic feature alignment between the models.
Results: Experiments show that two heterogeneous models benefit from MOKD and outperform their independently trained baselines. MOKD also surpasses existing self-supervised knowledge distillation methods, for both the student and the teacher model.

Self-supervised learning (SSL) has made remarkable progress in visual representation learning. Some studies combine SSL with knowledge distillation (SSL-KD) to boost the representation learning performance of small models. In this study, we propose a Multi-mode Online Knowledge Distillation method (MOKD) to boost self-supervised visual representation learning. Different from existing SSL-KD methods that transfer knowledge from a static pre-trained teacher to a student, in MOKD, two different models learn collaboratively in a self-supervised manner. Specifically, MOKD consists of two distillation modes: self-distillation and cross-distillation modes. Among them, self-distillation performs self-supervised learning for each model independently, while cross-distillation realizes knowledge interaction between different models. In cross-distillation, a cross-attention feature search strategy is proposed to enhance the semantic feature alignment between different models. As a result, the two models can absorb knowledge from each other to boost their representation learning performance. Extensive experimental results on different backbones and datasets demonstrate that two heterogeneous models can benefit from MOKD and outperform their independently trained baseline. In addition, MOKD also outperforms existing SSL-KD methods for both the student and teacher models.
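The interplay of the two distillation modes can be sketched as a combined objective (a loose, assumed formulation for illustration: the self-supervised losses are taken as given scalars and the cross term is a symmetric KL between the two models' output distributions, with an assumed weight `alpha`; MOKD's actual losses and cross-attention search are more involved):

```python
import numpy as np

def softmax(z, t=1.0):
    """Temperature-scaled softmax over a 1-D logit vector."""
    z = np.asarray(z, dtype=float) / t
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def mokd_loss(logits_a, logits_b, self_loss_a, self_loss_b, alpha=1.0):
    # Each model keeps its own self-supervised ("self-distillation") loss;
    # a symmetric KL term couples the two models ("cross-distillation").
    pa, pb = softmax(logits_a), softmax(logits_b)
    cross = 0.5 * (kl(pa, pb) + kl(pb, pa))
    return self_loss_a + self_loss_b + alpha * cross

# Identical predictions: the cross term vanishes, leaving only the self losses.
loss = mokd_loss(np.array([1.0, 2.0]), np.array([1.0, 2.0]), 0.3, 0.4)
```

When the two models disagree, the cross term grows, pulling their predictions toward each other while each still pursues its own self-supervised objective.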

ScarceNet: Animal Pose Estimation With Scarce Annotations
Li, Chen and Lee, Gim Hee



Research question: This paper tackles animal pose estimation, a task left under-explored because of the lack of labeled data.
Motivation: Existing methods cannot handle animal pose estimation effectively when only a small set of labeled data and unlabeled images are available.
Method: We propose ScarceNet, a pseudo-label-based approach that generates artificial labels for the unlabeled images. Pseudo labels generated by a model trained on the small labeled set are generally noisy and can hurt performance when used directly for training. We therefore first apply the small-loss trick to select reliable pseudo labels, and then make full use of the unlabeled data through an agreement-based re-labeling of reusable samples and a consistency constraint.
Results: On the challenging AP-10K dataset, the method outperforms existing semi-supervised approaches by a large margin. On the TigDog dataset, with only very few annotations available, it also achieves better performance than domain-adaptation-based methods.

Animal pose estimation is an important but under-explored task due to the lack of labeled data. In this paper, we tackle the task of animal pose estimation with scarce annotations, where only a small set of labeled data and unlabeled images are available. At the core of the solution to this problem setting is the use of the unlabeled data to compensate for the lack of well-labeled animal pose data. To this end, we propose the ScarceNet, a pseudo label-based approach to generate artificial labels for the unlabeled images. The pseudo labels, which are generated with a model trained with the small set of labeled images, are generally noisy and can hurt the performance when directly used for training. To solve this problem, we first use a small-loss trick to select reliable pseudo labels. Although effective, the selection process is improvident since numerous high-loss samples are left unused. We further propose to identify reusable samples from the high-loss samples based on an agreement check. Pseudo labels are re-generated to provide supervision for those reusable samples. Lastly, we introduce a student-teacher framework to enforce a consistency constraint since there are still samples that are neither reliable nor reusable. By combining the reliable pseudo label selection with the reusable sample re-labeling and the consistency constraint, we can make full use of the unlabeled data. We evaluate our approach on the challenging AP-10K dataset, where our approach outperforms existing semi-supervised approaches by a large margin. We also test on the TigDog dataset, where our approach can achieve better performance than domain adaptation based approaches when only very few annotations are available. Our code is available at the project website.
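The small-loss trick for picking reliable pseudo labels can be sketched in a few lines (the keep ratio and the toy losses are assumptions for illustration, not values from the paper):

```python
import numpy as np

def select_reliable(losses, keep_ratio=0.5):
    """Small-loss trick: keep the pseudo-labeled samples whose loss falls
    in the lowest `keep_ratio` fraction, treating them as reliable.
    Returns the selected sample indices in ascending order."""
    k = max(1, int(len(losses) * keep_ratio))
    order = np.argsort(losses)          # smallest loss first
    return np.sort(order[:k])

losses = np.array([0.9, 0.1, 0.5, 0.05])  # per-sample pseudo-label losses
reliable = select_reliable(losses, keep_ratio=0.5)
```

The remaining high-loss samples are not simply discarded in ScarceNet; an agreement check identifies reusable ones among them, which this sketch omits.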

MIANet: Aggregating Unbiased Instance and General Information for Few-Shot Semantic Segmentation
Yang, Yong and Chen, Qiong and Feng, Yuan and Huang, Tianlin



Research question: Existing few-shot segmentation methods follow a meta-learning strategy, extracting instance knowledge from a support set and applying it to segment target objects in a query set. However, the knowledge obtained from the few support samples is insufficient to cope with variable intra-class differences.
Motivation: To address this problem, we propose a Multi-Information Aggregation Network (MIANet) that effectively leverages general knowledge (i.e., semantic word embeddings) and instance information for accurate segmentation.
Method: In MIANet, a General Information Module (GIM) extracts a general class prototype from word embeddings as a supplement to instance information. To this end, we design a triplet loss that treats the general class prototype as an anchor and samples positive-negative pairs from local features in the support set. The computed triplet loss transfers semantic similarities among language identities from the word-embedding space to the visual representation space.
Results: A non-parametric Hierarchical Prior Module (HPM) generates unbiased instance-level information by computing pixel-level similarity between support and query image features, which mitigates the model's bias toward seen training classes and provides multi-scale information. Finally, an Information Fusion Module (IFM) combines the general and instance information to make predictions for the query image. Extensive experiments on PASCAL-5i and COCO-20i show that MIANet achieves superior performance and sets a new state of the art.

Existing few-shot segmentation methods are based on the meta-learning strategy and extract instance knowledge from a support set and then apply the knowledge to segment target objects in a query set. However, the extracted knowledge is insufficient to cope with the variable intra-class differences since the knowledge is obtained from a few samples in the support set. To address the problem, we propose a multi-information aggregation network (MIANet) that effectively leverages the general knowledge, i.e., semantic word embeddings, and instance information for accurate segmentation. Specifically, in MIANet, a general information module (GIM) is proposed to extract a general class prototype from word embeddings as a supplement to instance information. To this end, we design a triplet loss that treats the general class prototype as an anchor and samples positive-negative pairs from local features in the support set. The calculated triplet loss can transfer semantic similarities among language identities from a word embedding space to a visual representation space. To alleviate the model biasing towards the seen training classes and to obtain multi-scale information, we then introduce a non-parametric hierarchical prior module (HPM) to generate unbiased instance-level information via calculating the pixel-level similarity between the support and query image features. Finally, an information fusion module (IFM) combines the general and instance information to make predictions for the query image. Extensive experiments on PASCAL-5i and COCO-20i show that MIANet yields superior performance and sets a new state of the art. Code is available at github.com/Aldrich2y/MIANet.
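The anchor-based triplet loss described above can be sketched as follows (a minimal illustration with assumed 2-D features and an assumed margin; MIANet's actual features and margin are not given in the abstract):

```python
import numpy as np

def prototype_triplet_loss(anchor, positive, negative, margin=0.5):
    """Triplet loss with the general class prototype as the anchor:
    pull same-class support features toward the prototype and push
    other-class features away by at least `margin` (margin assumed)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

proto = np.array([0.0, 0.0])   # general class prototype (anchor)
pos = np.array([0.1, 0.0])     # local support feature of the same class
neg = np.array([2.0, 0.0])     # local support feature of another class
loss = prototype_triplet_loss(proto, pos, neg)
```

When the negative already sits farther than the positive by more than the margin, the loss is zero; otherwise it grows linearly with the violation.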

MarginMatch: Improving Semi-Supervised Learning with Pseudo-Margins
Sosea, Tiberiu and Caragea, Cornelia



Research question: How to exploit unlabeled data in semi-supervised learning to improve model performance in low-data regimes.
Motivation: Existing semi-supervised learning methods rely mainly on the model's prediction confidence on unlabeled data, which can be ineffective when the model's predictions fluctuate.
Method: We propose MarginMatch, a new semi-supervised learning method that combines consistency regularization and pseudo-labeling and analyzes the model's behavior on pseudo-labeled examples as training progresses, ensuring stable model predictions.
Results: Experiments show that MarginMatch brings substantial improvements on four vision benchmarks and two large-scale datasets, especially in low-data regimes: e.g., a 3.25% error-rate reduction on CIFAR-100 with only 25 examples per class and a 4.19% reduction on STL-10 with only 4 examples per class.

We introduce MarginMatch, a new SSL approach combining consistency regularization and pseudo-labeling, with its main novelty arising from the use of unlabeled data training dynamics to measure pseudo-label quality. Instead of using only the model's confidence on an unlabeled example at an arbitrary iteration to decide if the example should be masked or not, MarginMatch also analyzes the behavior of the model on the pseudo-labeled examples as the training progresses, ensuring low fluctuations in the model's predictions from one iteration to another. MarginMatch brings substantial improvements on four vision benchmarks in low data regimes and on two large-scale datasets, emphasizing the importance of enforcing high-quality pseudo-labels. Notably, we obtain an improvement in error rate over the state-of-the-art of 3.25% on CIFAR-100 with only 25 examples per class and of 4.19% on STL-10 using as few as 4 examples per class.
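A pseudo-margin tracked across iterations, as described above, can be sketched like this (a simplified illustration on raw logits with assumed toy values; MarginMatch's exact margin definition and thresholding are in the paper):

```python
import numpy as np

def pseudo_margin(logit_history, pseudo_label):
    """Pseudo-margin of one unlabeled example: at each training iteration,
    the assigned class's logit minus the largest other-class logit,
    averaged over iterations. Low or negative values flag pseudo-labels
    whose predictions fluctuate from one iteration to another."""
    margins = []
    for logits in logit_history:
        others = np.delete(logits, pseudo_label)
        margins.append(logits[pseudo_label] - others.max())
    return float(np.mean(margins))

history = [np.array([2.0, 1.0, 0.0]),   # confidently class 0
           np.array([1.0, 1.5, 0.0])]   # flips toward class 1
m = pseudo_margin(history, pseudo_label=0)
```

An example whose averaged margin falls below a chosen cutoff would be masked out of the unlabeled loss rather than trained on.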

ScaleKD: Distilling Scale-Aware Knowledge in Small Object Detector
Zhu, Yichen and Zhou, Qiqi and Liu, Ning and Xu, Zhiyuan and Ou, Zhicai and Mou, Xiaofeng and Tang, Jian



Research question: Despite the prominent success of general object detection, the performance and efficiency of small object detection (SOD) remain unsatisfactory.
Motivation: Existing methods struggle to balance inference speed against SOD performance, so this paper proposes a novel Scale-aware Knowledge Distillation (ScaleKD) method that transfers knowledge from a complex teacher model to a compact student model.
Method: We design two novel modules to improve the quality of knowledge transfer for SOD: 1) a scale-decoupled feature distillation module that disentangles the teacher's feature representation into multi-scale embeddings, enabling the student model to explicitly mimic small-object features; and 2) a cross-scale assistant that refines the student model's noisy and uninformative bounding-box predictions, which could otherwise mislead the student and impair the efficacy of knowledge distillation. A multi-scale cross-attention layer captures multi-scale semantic information to improve the student model.
Results: We evaluate the proposed method on the COCO and VisDrone datasets with diverse model types (two-stage and one-stage detectors). The results show that ScaleKD achieves superior general detection performance and marked improvements in SOD performance.

Despite the prominent success of general object detection, the performance and efficiency of Small Object Detection (SOD) are still unsatisfactory. Unlike existing works that struggle to balance the trade-off between inference speed and SOD performance, in this paper, we propose a novel Scale-aware Knowledge Distillation (ScaleKD), which transfers knowledge of a complex teacher model to a compact student model. We design two novel modules to boost the quality of knowledge transfer in distillation for SOD: 1) a scale-decoupled feature distillation module that disentangles the teacher's feature representation into multi-scale embeddings, enabling explicit feature mimicking of the student model on small objects; and 2) a cross-scale assistant that refines the noisy and uninformative bounding-box predictions of the student model, which can otherwise mislead the student model and impair the efficacy of knowledge distillation. A multi-scale cross-attention layer is established to capture the multi-scale semantic information and improve the student model. We conduct experiments on COCO and VisDrone datasets with diverse types of models, i.e., two-stage and one-stage detectors, to evaluate our proposed method. Our ScaleKD achieves superior general detection performance and obtains significant improvement in SOD performance.

EFEM: Equivariant Neural Field Expectation Maximization for 3D Object Segmentation Without Scene Supervision
Lei, Jiahui and Deng, Congyue and Schmeckpeper, Karl and Guibas, Leonidas and Daniilidis, Kostas



Research question: This paper aims to develop a simple, effective, and robust geometric algorithm that can segment objects in 3D scenes without annotations or training on scenes.
Motivation: Existing segmentation methods require large amounts of annotated data and complex training procedures, whereas our method exploits single-object shape priors to segment without annotations or scene-level training.
Method: We introduce equivariant shape representations to eliminate the complexity induced by variations in object configuration, and propose a novel EM algorithm that iteratively refines segmentation masks using the equivariant shape prior.
Results: Experiments on the Chairs and Mugs dataset show that our method achieves consistent and robust performance across different scenes where (weakly) supervised methods may fail.

We introduce Equivariant Neural Field Expectation Maximization (EFEM), a simple, effective, and robust geometric algorithm that can segment objects in 3D scenes without annotations or training on scenes. We achieve such unsupervised segmentation by exploiting single object shape priors. We make two novel steps in that direction. First, we introduce equivariant shape representations to this problem to eliminate the complexity induced by the variation in object configuration. Second, we propose a novel EM algorithm that can iteratively refine segmentation masks using the equivariant shape prior. We collect a novel real dataset Chairs and Mugs that contains various object configurations and novel scenes in order to verify the effectiveness and robustness of our method. Experimental results demonstrate that our method achieves consistent and robust performance across different scenes where the (weakly) supervised methods may fail. Code and data are available at https://www.cis.upenn.edu/~leijh/projects/efem.

Learning To Detect and Segment for Open Vocabulary Object Detection
Wang, Tao



Research question: How to leverage vision-language pre-trained models to recognize novel object categories.
Motivation: Existing open-vocabulary object detection mainly focuses on knowledge transfer for proposal classification, which handles novel object categories poorly.
Method: We propose CondHead, a dynamic network design that conditionally parameterizes the network heads on semantic embeddings to better perform box regression and mask segmentation, and thus better detect novel object categories.
Results: CondHead significantly improves open-vocabulary object detection; e.g., it surpasses a RegionClip model by 3.0 detection AP on novel categories with only 1.1% more computation.

Open vocabulary object detection has been greatly advanced by the recent development of vision-language pre-trained models, which help recognize novel objects with only semantic categories. The prior works mainly focus on knowledge transfer to the object proposal classification and employ class-agnostic box and mask prediction. In this work, we propose CondHead, a principled dynamic network design to better generalize the box regression and mask segmentation for the open vocabulary setting. The core idea is to conditionally parameterize the network heads on semantic embedding, so that the model is guided with class-specific knowledge to better detect novel categories. Specifically, CondHead is composed of two streams of network heads: the dynamically aggregated heads and the dynamically generated heads. The former is instantiated with a set of static heads that are conditionally aggregated; these heads are optimized as experts and are expected to learn sophisticated prediction. The latter is instantiated with dynamically generated parameters and encodes general class-specific information. With such a conditional design, the detection model is bridged by the semantic embedding to offer strongly generalizable class-wise box and mask prediction. Our method brings significant improvement to the prior state-of-the-art open vocabulary object detection methods with very minor overhead, e.g., it surpasses a RegionClip model by 3.0 detection AP on novel categories, with only 1.1% more computation.

Box-Level Active Detection
Lyu, Mengyao and Zhou, Jundong and Chen, Hui and Huang, Yijie and Yu, Dongdong and Li, Yaqian and Guo, Yandong and Guo, Yuchen and Xiang, Liuyu and Ding, Guiguang



Research question: How to spend the annotation budget effectively in active learning, reducing redundant labels while improving model performance.
Motivation: Existing active detection benchmarks conduct image-level evaluation, which is biased and gives unrealistic workload estimates.
Method: We propose a box-level active detection framework that controls a box-based budget per cycle, prioritizes informative targets, and avoids redundant labels, enabling fair comparison and efficient application. We also devise a novel pipeline, the Complementary Pseudo Active Strategy (ComPAS), which exploits both human annotations and model intelligence.
Results: In a unified codebase, ComPAS outperforms 10 competitors under four settings. With supervision from labeled data only, it achieves 100% of the fully supervised performance on VOC0712 while annotating merely 19% of the boxes. On the COCO dataset, it improves mAP by up to 4.3% over the second-best method. ComPAS also supports training with the unlabeled pool, where it surpasses 90% of COCO supervised performance with an 85% label reduction.

Active learning selects informative samples for annotation within budget, which has proven efficient recently on object detection. However, the widely used active detection benchmarks conduct image-level evaluation, which is unrealistic in human workload estimation and biased towards crowded images. Furthermore, existing methods still perform image-level annotation, but equally scoring all targets within the same image incurs waste of budget and redundant labels. Having revealed the above problems and limitations, we introduce a box-level active detection framework that controls a box-based budget per cycle, prioritizes informative targets and avoids redundancy for fair comparison and efficient application. Under the proposed box-level setting, we devise a novel pipeline, namely Complementary Pseudo Active Strategy (ComPAS). It exploits both human annotations and the model intelligence in a complementary fashion: an efficient input-end committee queries labels for informative objects only; meanwhile, well-learned targets are identified by the model and compensated with pseudo-labels. ComPAS consistently outperforms 10 competitors under 4 settings in a unified codebase. With supervision from labeled data only, it achieves 100% supervised performance of VOC0712 with merely 19% box annotations. On the COCO dataset, it yields up to 4.3% mAP improvement over the second-best method. ComPAS also supports training with the unlabeled pool, where it surpasses 90% COCO supervised performance with 85% label reduction. Our source code is publicly available at https://github.com/lyumengyao/blad.
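The box-budgeted acquisition step above can be sketched as a simple top-k selection over a pooled box list (the informativeness scores and budget here are assumed toy values; ComPAS's committee-based scoring is more elaborate):

```python
import numpy as np

def select_boxes(scores, budget):
    """Box-level acquisition: rank every candidate box in the unlabeled
    pool by an informativeness score and send only the top `budget`
    boxes to the annotator this cycle, regardless of which image each
    box comes from. Returns the chosen box indices in ascending order."""
    order = np.argsort(scores)[::-1]   # most informative first
    return np.sort(order[:budget])

scores = np.array([0.2, 0.9, 0.4, 0.8, 0.1])  # e.g. predictive uncertainty
picked = select_boxes(scores, budget=2)
```

Because the budget counts boxes rather than images, a crowded image no longer consumes a disproportionate share of annotation effort.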

Generative Semantic Segmentation
Chen, Jiaqi and Lu, Jiachen and Zhu, Xiatian and Zhang, Li



Research question: This paper proposes Generative Semantic Segmentation (GSS), which casts semantic segmentation as an image-conditioned mask generation problem.
Motivation: Conventional per-pixel discriminative learning has limitations for semantic segmentation, so the authors replace it with a latent prior learning process.
Method: By replacing conventional per-pixel discriminative learning with a latent prior learning process, semantic segmentation is cast as image-conditioned mask generation. Specifically, we model the posterior distribution of latent variables given the segmentation mask. To achieve semantic segmentation on a given image, we further introduce a conditioning network.
Results: Extensive experiments show that GSS performs competitively with prior art in the standard semantic segmentation setting, while achieving a new state of the art in the more challenging cross-domain setting.

We present Generative Semantic Segmentation (GSS), a generative learning approach for semantic segmentation. Uniquely, we cast semantic segmentation as an image-conditioned mask generation problem. This is achieved by replacing the conventional per-pixel discriminative learning with a latent prior learning process. Specifically, we model the variational posterior distribution of latent variables given the segmentation mask. To that end, the segmentation mask is expressed with a special type of image (dubbed as maskige). This posterior distribution allows us to generate segmentation masks unconditionally. To achieve semantic segmentation on a given image, we further introduce a conditioning network. It is optimized by minimizing the divergence between the posterior distribution of maskige (i.e., segmentation masks) and the latent prior distribution of input training images. Extensive experiments on standard benchmarks show that our GSS can perform competitively to prior art alternatives in the standard semantic segmentation setting, whilst achieving a new state of the art in the more challenging cross-domain setting.

SDC-UDA: Volumetric Unsupervised Domain Adaptation Framework for Slice-Direction Continuous Cross-Modality Medical Image Segmentation
Shin, Hyungseob and Kim, Hyeongyu and Kim, Sewon and Jun, Yohan and Eo, Taejoon and Hwang, Dosik



Research question: How to use unsupervised domain adaptation (UDA) to achieve slice-direction continuous cross-modality segmentation of medical images.
Motivation: Acquiring pixel-level expert annotations is extremely expensive and laborious in medical imaging, and existing UDA methods cannot guarantee continuity in the slice direction.
Method: We propose SDC-UDA, a simple yet effective volumetric UDA framework that combines intra- and inter-slice self-attentive image translation, uncertainty-constrained pseudo-label refinement, and volumetric self-training.
Results: Validated on multiple publicly available cross-modality medical image segmentation datasets, SDC-UDA achieves state-of-the-art segmentation performance, with superior slice-direction continuity of prediction compared with previous studies.

Recent advances in deep learning-based medical image segmentation studies achieve nearly human-level performance in fully supervised manner. However, acquiring pixel-level expert annotations is extremely expensive and laborious in medical imaging fields. Unsupervised domain adaptation (UDA) can alleviate this problem, which makes it possible to use annotated data in one imaging modality to train a network that can successfully perform segmentation on target imaging modality with no labels. In this work, we propose SDC-UDA, a simple yet effective volumetric UDA framework for Slice-Direction Continuous cross-modality medical image segmentation which combines intra- and inter-slice self-attentive image translation, uncertainty-constrained pseudo-label refinement, and volumetric self-training. Our method is distinguished from previous methods on UDA for medical image segmentation in that it can obtain continuous segmentation in the slice direction, thereby ensuring higher accuracy and potential in clinical practice. We validate SDC-UDA with multiple publicly available cross-modality medical image segmentation datasets and achieve state-of-the-art segmentation performance, not to mention the superior slice-direction continuity of prediction compared to previous studies.

DoNet: Deep De-Overlapping Network for Cytology Instance Segmentation
Jiang, Hao and Zhang, Rushan and Zhou, Yanning and Wang, Yumeng and Chen, Hao



Research question: Cell instance segmentation in cytology images is important for biological analysis and cancer screening, but remains challenging because 1) extensive overlapping translucent cell clusters cause ambiguous boundaries, and 2) mimics and debris are confused with nuclei.
Motivation: We propose a De-overlapping Network (DoNet) with a decompose-and-recombine strategy to address these cell segmentation challenges.
Method: A Dual-path Region Segmentation Module (DRM) explicitly decomposes the cell clusters into intersection and complement regions, which a Semantic Consistency-guided Recombination Module (CRM) then integrates. To further introduce the containment relationship of the nucleus within the cytoplasm, we design a Mask-guided Region Proposal strategy (MRP) that integrates cell attention maps for inner-cell instance prediction.
Results: Experiments show that the proposed DoNet significantly outperforms other state-of-the-art cell instance segmentation methods.

Cell instance segmentation in cytology images has significant importance for biology analysis and cancer screening, while remaining challenging due to 1) the extensive overlapping translucent cell clusters that cause the ambiguous boundaries, and 2) the confusion of mimics and debris as nuclei. In this work, we propose a De-overlapping Network (DoNet) with a decompose-and-recombine strategy. A Dual-path Region Segmentation Module (DRM) explicitly decomposes the cell clusters into intersection and complement regions, followed by a Semantic Consistency-guided Recombination Module (CRM) for integration. To further introduce the containment relationship of the nucleus in the cytoplasm, we design a Mask-guided Region Proposal Strategy (MRP) that integrates the cell attention maps for inner-cell instance prediction. We validate the proposed approach on the ISBI2014 and CPS datasets. Experiments show that our proposed DoNet significantly outperforms other state-of-the-art (SOTA) cell instance segmentation methods. The code is available at https://github.com/DeepDoNet/DoNet.
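The decompose-and-recombine idea behind DRM/CRM can be illustrated with boolean mask operations on a toy overlapping pair (a simplified sketch; the actual modules operate on learned features, not binary masks):

```python
import numpy as np

def decompose(mask_a, mask_b):
    """Split two overlapping translucent-cell masks into the shared
    intersection region and each cell's complement region."""
    inter = mask_a & mask_b
    return inter, mask_a & ~inter, mask_b & ~inter

def recombine(inter, comp_a, comp_b):
    """Reassemble the per-cell instance masks from the decomposed regions."""
    return inter | comp_a, inter | comp_b

a = np.array([[1, 1, 0]], dtype=bool)   # cell A occupies the left two pixels
b = np.array([[0, 1, 1]], dtype=bool)   # cell B overlaps A in the middle
inter, comp_a, comp_b = decompose(a, b)
rec_a, rec_b = recombine(inter, comp_a, comp_b)
```

Segmenting the intersection and complements separately and then recombining them lets the ambiguous overlap region be reasoned about explicitly instead of being split arbitrarily between the two instances.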

DATE: Domain Adaptive Product Seeker for E-Commerce
Li, Haoyuan and Jiang, Hao and Jin, Tao and Li, Mengyan and Chen, Yan and Lin, Zhijie and Zhao, Yang and Zhao, Zhou



Research question: This paper addresses product retrieval and grounding to improve the shopping experience.
Motivation: Owing to the lack of relevant datasets, we collect two large-scale benchmark datasets from the Taobao Mall and Live domains and manually annotate the object bounding boxes in each image, enabling unsupervised domain adaptation.
Method: We design a semantics-aggregated feature extractor to obtain concentrated and comprehensive features, and then present two cooperative seekers that simultaneously perform image retrieval and fine-grained product localization. We further devise a domain aligner to alleviate uni-modal marginal and multi-modal conditional distribution shift between source and target domains, and a pseudo-box generator that dynamically selects reliable instances and generates bounding boxes for further knowledge transfer.
Results: Experiments show that DATE achieves satisfactory performance in fully supervised product retrieval and grounding and in unsupervised domain-adaptive grounding.

Product Retrieval (PR) and Grounding (PG), aiming to seek image and object-level products respectively according to a textual query, have attracted great interest recently for better shopping experience. Owing to the lack of relevant datasets, we collect two large-scale benchmark datasets from Taobao Mall and Live domains with about 474k and 101k image-query pairs for PR, and manually annotate the object bounding boxes in each image for PG. As annotating boxes is expensive and time-consuming, we attempt to transfer knowledge from the annotated domain to the unannotated one for PG, to achieve unsupervised Domain Adaptation (PG-DA). We propose a Domain Adaptive producT sEeker (DATE) framework, regarding PR and PG as Product Seeking problems at different levels, to assist the query date the product. Concretely, we first design a semantics-aggregated feature extractor for each modality to obtain concentrated and comprehensive features for the following efficient retrieval and fine-grained grounding tasks. Then, we present two cooperative seekers to simultaneously search the image for PR and localize the product for PG. Besides, we devise a domain aligner for PG-DA to alleviate uni-modal marginal and multi-modal conditional distribution shift between source and target domains, and design a pseudo box generator to dynamically select reliable instances and generate bounding boxes for further knowledge transfer. Extensive experiments show that our DATE achieves satisfactory performance in fully-supervised PR, PG and unsupervised PG-DA. Our desensitized datasets will be publicly available here https://github.com/Taobao-live/Product-Seeking.

Sparse Multi-Modal Graph Transformer With Shared-Context Processing for Representation Learning of Giga-Pixel Images
Nakhli, Ramin and Moghadam, Puria Azadi and Mi, Haoyang and Farahani, Hossein and Baras, Alexander and Gilks, Blake and Bashashati, Ali



Research question: Processing giga-pixel whole slide histopathology images (WSI) is computationally expensive, and existing multiple instance learning (MIL) methods ignore cell-level information.
Motivation: This paper introduces the new concept of shared-context processing and designs a multi-modal Graph Transformer that uses the cellular graph within the tissue to provide a single representation per patient, exploiting the hierarchical structure of the tissue to enable a dynamic focus between cell-level and tissue-level information.
Method: By splitting images into smaller patches for further processing, the method makes full use of cell-level information, and its superiority is demonstrated on survival-prediction benchmarks.
Results: Experiments show that the method significantly outperforms all state-of-the-art methods on survival prediction, including the hierarchical vision Transformer (ViT). More importantly, it is strongly robust to missing information, achieving the same performance with as little as 20% of the data. On two different cancer datasets, it stratifies patients into low-risk and high-risk groups where other state-of-the-art methods fail to do so.

Processing giga-pixel whole slide histopathology images (WSI) is a computationally expensive task. Multiple instance learning (MIL) has become the conventional approach to process WSIs, in which these images are split into smaller patches for further processing. However, MIL-based techniques ignore explicit information about the individual cells within a patch. In this paper, by defining the novel concept of shared-context processing, we designed a multi-modal Graph Transformer that uses the cellular graph within the tissue to provide a single representation for a patient while taking advantage of the hierarchical structure of the tissue, enabling a dynamic focus between cell-level and tissue-level information. We benchmarked the performance of our model against multiple state-of-the-art methods in survival prediction and showed that ours can significantly outperform all of them including hierarchical vision Transformer (ViT). More importantly, we show that our model is strongly robust to missing information to an extent that it can achieve the same performance with as low as 20% of the data. Finally, in two different cancer datasets, we demonstrated that our model was able to stratify the patients into low-risk and high-risk groups while other state-of-the-art methods failed to achieve this goal. We also publish a large dataset of immunohistochemistry (IHC) images containing 1,600 tissue microarray (TMA) cores from 188 patients along with their survival information, making it one of the largest publicly available datasets in this context.

Sparsely Annotated Semantic Segmentation With Adaptive Gaussian Mixtures
Wu, Linshan and Zhong, Zhun and Fang, Leyuan and He, Xingxin and Liu, Qiang and Ma, Jiayi and Chen, Hao



Research question: This paper addresses sparsely annotated semantic segmentation (SASS), i.e., learning a segmentation model from images with only sparse labels such as points or scribbles.
Motivation: Existing methods mainly introduce low-level affinity or generate pseudo labels to strengthen supervision, while largely ignoring the inherent relation between labeled and unlabeled pixels.
Method: The paper proposes a new SASS framework equipped with an Adaptive Gaussian Mixture Model (AGMM). AGMM provides reliable supervision for unlabeled pixels based on the distributions of labeled and unlabeled pixels. Specifically, Gaussian mixtures are first built from labeled pixels and their relatively similar unlabeled pixels, with the labeled pixels acting as centroids, to model the feature distribution of each class. The reliable information from labeled pixels and the adaptively generated GMM predictions then supervise the training of unlabeled pixels, achieving online, dynamic, and robust self-supervision. In addition, by capturing category-wise Gaussian mixtures, AGMM encourages the model to learn discriminative class decision boundaries in an end-to-end contrastive learning manner.
Results: Experiments on the PASCAL VOC 2012 and Cityscapes datasets show that AGMM establishes new state-of-the-art SASS performance. Code is available at https://github.com/Luffy03/AGMM-SASS.

Sparsely annotated semantic segmentation (SASS) aims to learn a segmentation model by images with sparse labels (i.e., points or scribbles). Existing methods mainly focus on introducing low-level affinity or generating pseudo labels to strengthen supervision, while largely ignoring the inherent relation between labeled and unlabeled pixels. In this paper, we observe that pixels that are close to each other in the feature space are more likely to share the same class. Inspired by this, we propose a novel SASS framework, which is equipped with an Adaptive Gaussian Mixture Model (AGMM). Our AGMM can effectively endow reliable supervision for unlabeled pixels based on the distributions of labeled and unlabeled pixels. Specifically, we first build Gaussian mixtures using labeled pixels and their relatively similar unlabeled pixels, where the labeled pixels act as centroids, for modeling the feature distribution of each class. Then, we leverage the reliable information from labeled pixels and adaptively generated GMM predictions to supervise the training of unlabeled pixels, achieving online, dynamic, and robust self-supervision. In addition, by capturing category-wise Gaussian mixtures, AGMM encourages the model to learn discriminative class decision boundaries in an end-to-end contrastive learning manner. Experimental results conducted on the PASCAL VOC 2012 and Cityscapes datasets demonstrate that our AGMM can establish new state-of-the-art SASS performance. Code is available at https://github.com/Luffy03/AGMM-SASS.
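The centroid-based supervision above reduces, in its simplest form, to posterior inference under per-class Gaussians. Below is a minimal sketch with hypothetical shapes, a single labeled centroid per class, and a fixed shared variance `sigma`; the paper's AGMM adapts its mixtures online, so this is only an illustration of the idea:

```python
import numpy as np

def agmm_soft_labels(feats, centroids, sigma=1.0):
    """Soft labels for unlabeled pixels from class-wise Gaussians.

    feats:     (N, D) features of unlabeled pixels.
    centroids: (C, D) one labeled-pixel centroid per class.
    Returns (N, C) posteriors over classes under equal priors.
    """
    # Squared distance of every pixel feature to every class centroid.
    d2 = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    # Isotropic Gaussian log-likelihoods with shared variance sigma^2.
    logp = -d2 / (2.0 * sigma ** 2)
    logp -= logp.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

# Two 2-D class centroids; one pixel near each, one ambiguous pixel.
centroids = np.array([[0.0, 0.0], [4.0, 0.0]])
feats = np.array([[0.1, 0.0], [3.9, 0.0], [2.0, 0.0]])
post = agmm_soft_labels(feats, centroids)
```

Pixels close to a centroid get near-one-hot soft labels, while the ambiguous pixel midway between the two centroids gets an even split, which is the sense in which the GMM posterior acts as "reliable supervision" only where it is confident.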

InstMove: Instance Motion for Object-Centric Video Segmentation
Liu, Qihao and Wu, Junfeng and Jiang, Yi and Bai, Xiang and Yuille, Alan L. and Bai, Song



Research question: Despite significant efforts, cutting-edge video segmentation methods remain sensitive to occlusion and rapid movement.
Motivation: Current approaches rely mainly on the appearance of object embeddings, which are easily disrupted under occlusion and fast motion.
Method: We propose InstMove, a method that relies mainly on instance-level motion information. This information is free from image feature embeddings and has a physical interpretation, making it more accurate and robust for occluded and fast-moving objects.
Results: Experiments show that instance-level motion is robust and accurate, effectively addressing video segmentation in complex scenarios.

Despite significant efforts, cutting-edge video segmentation methods still remain sensitive to occlusion and rapid movement, due to their reliance on the appearance of objects in the form of object embeddings, which are vulnerable to these disturbances. A common solution is to use optical flow to provide motion information, but essentially it only considers pixel-level motion, which still relies on appearance similarity and hence is often inaccurate under occlusion and fast movement. In this work, we study the instance-level motion and present InstMove, which stands for Instance Motion for Object-centric Video Segmentation. In comparison to pixel-wise motion, InstMove mainly relies on instance-level motion information that is free from image feature embeddings, and features physical interpretations, making it more accurate and robust toward occlusion and fast-moving objects. To better fit in with the video segmentation tasks, InstMove uses instance masks to model the physical presence of an object and learns the dynamic model through a memory network to predict its position and shape in the next frame. With only a few lines of code, InstMove can be integrated into current SOTA methods for three different video segmentation tasks and boost their performance. Specifically, we improve the previous arts by 1.5 AP on OVIS dataset, which features heavy occlusions, and 4.9 AP on YouTubeVIS-Long dataset, which mainly contains fast-moving objects. These results suggest that instance-level motion is robust and accurate, and hence serving as a powerful solution in complex scenarios for object-centric video segmentation.

GRES: Generalized Referring Expression Segmentation
Liu, Chang and Ding, Henghui and Jiang, Xudong



Research question: This paper addresses the limitation that existing Referring Expression Segmentation (RES) handles only single-target expressions, and proposes Generalized Referring Expression Segmentation (GRES), which allows expressions to refer to an arbitrary number of target objects.
Motivation: Existing RES datasets and methods support only single-target expressions, limiting their use in practice.
Method: The authors construct gRefCOCO, the first large-scale GRES dataset, containing multi-target, no-target, and single-target expressions, and design ReLA, a new region-based GRES baseline that adaptively divides the image into regions with sub-instance clues and explicitly models region-region and region-language dependencies.
Results: Experiments show that the proposed ReLA achieves state-of-the-art performance on both the new GRES task and the classic RES task.

Referring Expression Segmentation (RES) aims to generate a segmentation mask for the object described by a given language expression. Existing classic RES datasets and methods commonly support single-target expressions only, i.e., one expression refers to one target object. Multi-target and no-target expressions are not considered. This limits the usage of RES in practice. In this paper, we introduce a new benchmark called Generalized Referring Expression Segmentation (GRES), which extends the classic RES to allow expressions to refer to an arbitrary number of target objects. Towards this, we construct the first large-scale GRES dataset called gRefCOCO that contains multi-target, no-target, and single-target expressions. GRES and gRefCOCO are designed to be well-compatible with RES, facilitating extensive experiments to study the performance gap of the existing RES methods on the GRES task. In the experimental study, we find that one of the big challenges of GRES is complex relationship modeling. Based on this, we propose a region-based GRES baseline ReLA that adaptively divides the image into regions with sub-instance clues, and explicitly models the region-region and region-language dependencies. The proposed approach ReLA achieves new state-of-the-art performance on both the newly proposed GRES and classic RES tasks. The proposed gRefCOCO dataset and method are available at https://henghuiding.github.io/GRES.

Iterative Proposal Refinement for Weakly-Supervised Video Grounding
Cao, Meng and Wei, Fangyun and Xu, Can and Geng, Xiubo and Chen, Long and Zhang, Can and Zou, Yuexian and Shen, Tao and Jiang, Daxin



Research question: This paper addresses Weakly-Supervised Video Grounding (WSVG), i.e., localizing events of interest in untrimmed videos with only video-level annotations.
Motivation: Despite progress, existing weakly supervised video grounding methods suffer from two main drawbacks: 1) lack of explicit correspondence modeling; and 2) partial coverage of complex events.
Method: To address these issues, the authors propose a novel IteRative prOposal refiNement network (IRON). Specifically, two lightweight distillation branches are set up to uncover cross-modal correspondence at the semantic and conceptual levels. Then, an iterative label propagation strategy is devised to prevent the network from focusing excessively on the most discriminative events instead of the whole sentence content.
Results: Extensive experiments and ablation studies on two challenging WSVG datasets attest to the effectiveness of IRON.

Weakly-Supervised Video Grounding (WSVG) aims to localize events of interest in untrimmed videos with only video-level annotations. To date, most of the state-of-the-art WSVG methods follow a two-stage pipeline, i.e., firstly generating potential temporal proposals and then grounding with these proposal candidates. Despite the recent progress, existing proposal generation methods suffer from two drawbacks: 1) lack of explicit correspondence modeling; and 2) partial coverage of complex events. To this end, we propose a novel IteRative prOposal refiNement network (dubbed as IRON) to gradually distill the prior knowledge into each proposal and encourage proposals with more complete coverage. Specifically, we set up two lightweight distillation branches to uncover the cross-modal correspondence on both the semantic and conceptual levels. Then, an iterative Label Propagation (LP) strategy is devised to prevent the network from focusing excessively on the most discriminative events instead of the whole sentence content. Precisely, during each iteration, the proposal with the minimal distillation loss and its adjacent ones are regarded as the positive samples, which refines proposal confidence scores in a cascaded manner. Extensive experiments and ablation studies on two challenging WSVG datasets have attested to the effectiveness of our IRON.

FAC: 3D Representation Learning via Foreground Aware Feature Contrast
Liu, Kangcheng and Xiao, Aoran and Zhang, Xiaoqin and Lu, Shijian and Shao, Ling



Research question: How to make contrastive learning more effective for unsupervised pre-training in 3D scene understanding tasks.
Motivation: Existing contrastive methods are biased toward selecting background points and neglect object awareness and foreground-to-background discrimination, making them less effective.
Method: A general foreground-aware feature contrast (FAC) framework is proposed to learn more effective point cloud representations by constructing more effective and informative contrast pairs.
Results: Experiments show that FAC achieves superior knowledge transfer and data efficiency in various downstream 3D semantic segmentation and object detection tasks.

Contrastive learning has recently demonstrated great potential for unsupervised pre-training in 3D scene understanding tasks. However, most existing work randomly selects point features as anchors while building contrast, leading to a clear bias toward background points that often dominate in 3D scenes. Also, object awareness and foreground-to-background discrimination are neglected, making contrastive learning less effective. To tackle these issues, we propose a general foreground-aware feature contrast (FAC) framework to learn more effective point cloud representations in pre-training. FAC consists of two novel contrast designs to construct more effective and informative contrast pairs. The first is building positive pairs within the same foreground segment where points tend to have the same semantics. The second is that we prevent over-discrimination between 3D segments/objects and encourage foreground-to-background distinctions at the segment level with adaptive feature learning in a Siamese correspondence network, which adaptively learns feature correlations within and across point cloud views effectively. Visualization with point activation maps shows that our contrast pairs capture clear correspondences among foreground regions during pre-training. Quantitative experiments also show that FAC achieves superior knowledge transfer and data efficiency in various downstream 3D semantic segmentation and object detection tasks. All codes, data, and models are available at: https://github.com/KangchengLiu/FAC_Foreground_Aware_Contrast.

SIM: Semantic-Aware Instance Mask Generation for Box-Supervised Instance Segmentation
Li, Ruihuang and He, Chenhang and Zhang, Yabin and Li, Shuai and Chen, Liyi and Zhang, Lei



Research question: How to perform weakly supervised instance segmentation using only bounding box annotations.
Motivation: Most current weakly supervised instance segmentation methods leverage low-level image features as extra supervision without explicitly exploiting the high-level semantic information of objects, which becomes ineffective when foreground objects are similar in appearance to the background or nearby objects.
Method: A new box-supervised instance segmentation approach is proposed by developing a Semantic-aware Instance Mask (SIM) generation paradigm. Instead of relying on local pair-wise affinities among neighboring pixels, a group of category-wise feature centroids serve as prototypes to identify foreground objects and assign them semantic-level pseudo labels. Since semantic-aware prototypes cannot distinguish different instances of the same semantics, a self-correction mechanism is proposed to rectify falsely activated regions while enhancing correct ones. Furthermore, to handle occlusions between objects, the Copy-Paste operation is tailored to the weakly supervised instance segmentation task to augment challenging training data.
Results: Extensive experiments show that the proposed SIM approach outperforms other state-of-the-art methods.

Weakly supervised instance segmentation using only bounding box annotations has recently attracted much research attention. Most of the current efforts leverage low-level image features as extra supervision without explicitly exploiting the high-level semantic information of the objects, which will become ineffective when the foreground objects have similar appearances to the background or other objects nearby. We propose a new box-supervised instance segmentation approach by developing a Semantic-aware Instance Mask (SIM) generation paradigm. Instead of heavily relying on local pair-wise affinities among neighboring pixels, we construct a group of category-wise feature centroids as prototypes to identify foreground objects and assign them semantic-level pseudo labels. Considering that the semantic-aware prototypes cannot distinguish different instances of the same semantics, we propose a self-correction mechanism to rectify the falsely activated regions while enhancing the correct ones. Furthermore, to handle the occlusions between objects, we tailor the Copy-Paste operation for the weakly-supervised instance segmentation task to augment challenging training data. Extensive experimental results demonstrate the superiority of our proposed SIM approach over other state-of-the-art methods. The source code: https://github.com/lslrh/SIM.

VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision
Liu, Mengyin and Jiang, Jie and Zhu, Chao and Yin, Xu-Cheng



Research question: How to detect pedestrians accurately in urban scenes, especially at small scale or under heavy occlusion.
Motivation: Current pedestrian detectors rely mainly on object regions and perform poorly on small-scale or heavily occluded pedestrians. Meanwhile, existing context-aware pedestrian detectors either only learn latent contexts from visual clues or require laborious annotations to obtain explicit semantic contexts.
Method: This paper proposes a novel Vision-Language semantic self-supervision approach for context-aware Pedestrian Detection (VLPD), which explicitly models semantic contexts without extra annotations. First, a self-supervised Vision-Language Semantic (VLS) segmentation method learns both fully supervised pedestrian detection and contextual segmentation via explicit labels of semantic classes self-generated by vision-language models. Second, a self-supervised Prototypical Semantic Contrastive (PSC) learning method better discriminates pedestrians from other classes, based on the more explicit semantic contexts obtained from VLS.
Results: Extensive experiments on popular benchmarks show that the proposed VLPD outperforms previous state-of-the-art methods, particularly in challenging cases such as small scale and heavy occlusion.

Detecting pedestrians accurately in urban scenes is significant for realistic applications like autonomous driving or video surveillance. However, confusing human-like objects often lead to wrong detections, and small scale or heavily occluded pedestrians are easily missed due to their unusual appearances. To address these challenges, only object regions are inadequate, thus how to fully utilize more explicit and semantic contexts becomes a key problem. Meanwhile, previous context-aware pedestrian detectors either only learn latent contexts with visual clues, or need laborious annotations to obtain explicit and semantic contexts. Therefore, we propose in this paper a novel approach via Vision-Language semantic self-supervision for context-aware Pedestrian Detection (VLPD) to model explicitly semantic contexts without any extra annotations. Firstly, we propose a self-supervised Vision-Language Semantic (VLS) segmentation method, which learns both fully-supervised pedestrian detection and contextual segmentation via self-generated explicit labels of semantic classes by vision-language models. Furthermore, a self-supervised Prototypical Semantic Contrastive (PSC) learning method is proposed to better discriminate pedestrians and other classes, based on more explicit and semantic contexts obtained from VLS. Extensive experiments on popular benchmarks show that our proposed VLPD achieves superior performances over the previous state-of-the-arts, particularly under challenging circumstances like small scale and heavy occlusion. Code is available at https://github.com/lmy98129/VLPD.

Unsupervised Object Localization: Observing the Background To Discover Objects
Siméoni, Oriane and Sekkat, Chloé and Puy, Gilles and Vobecký, Antonín



Research question: How to achieve unsupervised object discovery and instance segmentation in images.
Motivation: Advances in self-supervised visual representation learning have paved the way for unsupervised tasks such as object discovery and instance segmentation, yet discovering objects in an image without supervision remains a very hard task.
Method: We propose a new approach that looks for the background instead, so that salient objects emerge as a by-product without any strong assumption about what an object should be. We propose FOUND, a model made of a single 1x1 convolution initialized with coarse background masks extracted from self-supervised patch-based representations.
Results: After fast training and refinement of these seed masks, the model reaches state-of-the-art results on unsupervised saliency detection and object discovery benchmarks. Moreover, our approach yields good results on the unsupervised semantic segmentation retrieval task.

Recent advances in self-supervised visual representation learning have paved the way for unsupervised methods tackling tasks such as object discovery and instance segmentation. However, discovering objects in an image with no supervision is a very hard task; what are the desired objects, when to separate them into parts, how many are there, and of what classes? The answers to these questions depend on the tasks and datasets of evaluation. In this work, we take a different approach and propose to look for the background instead. This way, the salient objects emerge as a by-product without any strong assumption on what an object should be. We propose FOUND, a simple model made of a single conv 1x1 initialized with coarse background masks extracted from self-supervised patch-based representations. After fast training and refining these seed masks, the model reaches state-of-the-art results on unsupervised saliency detection and object discovery benchmarks. Moreover, we show that our approach yields good results in the unsupervised semantic segmentation retrieval task. The code to reproduce our results is available at https://github.com/valeoai/FOUND.
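A "single conv 1x1" over patch features is just a per-location linear map. The sketch below illustrates one way such a layer could be initialized from a coarse background mask, here via least squares; the shapes, random features, and `coarse_bg` mask are all made up, and FOUND itself trains and refines the layer rather than solving it in closed form:

```python
import numpy as np

def conv1x1(features, w, b):
    """A single 1x1 conv = the same linear layer applied per location.

    features: (H, W, D) self-supervised patch features.
    w: (D,) weights, b: scalar bias -> one background logit per location.
    """
    return features @ w + b

rng = np.random.default_rng(1)
H, W, D = 4, 4, 8
feats = rng.normal(size=(H, W, D))            # stand-in patch features
coarse_bg = rng.integers(0, 2, size=(H, W))   # stand-in coarse seed mask

# Fit (w, b) to the coarse background mask by least squares.
X = np.concatenate([feats.reshape(-1, D), np.ones((H * W, 1))], axis=1)
theta, *_ = np.linalg.lstsq(X, coarse_bg.reshape(-1).astype(float), rcond=None)
w, b = theta[:D], theta[D]

bg_prob = 1.0 / (1.0 + np.exp(-conv1x1(feats, w, b)))  # background map
fg_mask = bg_prob < 0.5  # salient objects emerge as the complement
```

The last line mirrors the paper's core idea: once the background is predicted, the foreground mask is simply its complement, with no object-level assumptions.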

Exemplar-FreeSOLO: Enhancing Unsupervised Instance Segmentation With Exemplars
Ishtiak, Taoseef and En, Qing and Guo, Yuhong



Research question: How to exploit unannotated exemplars for instance segmentation so as to ease the annotation burden of model training.
Motivation: To relieve conventional instance segmentation of its need for large amounts of dense annotations, an annotation-free unsupervised instance segmentation approach is proposed.
Method: A novel unsupervised instance segmentation method, Exemplar-FreeSOLO, extracts useful top-down guidance knowledge from a small number of unannotated and unsegmented exemplars.
Results: Experiments show that the method outperforms the state of the art on three image instance segmentation datasets.

Instance segmentation seeks to identify and segment each object from images, which often relies on a large number of dense annotations for model training. To alleviate this burden, unsupervised instance segmentation methods have been developed to train class-agnostic instance segmentation models without any annotation. In this paper, we propose a novel unsupervised instance segmentation approach, Exemplar-FreeSOLO, to enhance unsupervised instance segmentation by exploiting a limited number of unannotated and unsegmented exemplars. The proposed framework offers a new perspective on directly perceiving top-down information without annotations. Specifically, Exemplar-FreeSOLO introduces a novel exemplar knowledge abstraction module to acquire beneficial top-down guidance knowledge for instances using unsupervised exemplar object extraction. Moreover, a new exemplar embedding contrastive module is designed to enhance the discriminative capability of the segmentation model by exploiting the contrastive exemplar-based guidance knowledge in the embedding space. To evaluate the proposed Exemplar-FreeSOLO, we conduct comprehensive experiments and perform in-depth analyses on three image instance segmentation datasets. The experimental results demonstrate that the proposed approach is effective and outperforms the state-of-the-art methods.

Spatiotemporal Self-Supervised Learning for Point Clouds in the Wild
Wu, Yanhao and Zhang, Tong and Ke, Wei and Süsstrunk, Sabine and Salzmann, Mathieu



Research question: How to exploit the spatial and temporal structure of LiDAR point cloud sequences for self-supervised learning (SSL), so that point cloud segmentation models can be pre-trained without manual annotation.
Motivation: Existing SSL methods for point clouds define positive pairs only by augmenting point clusters within a single frame, and thus fail to exploit the temporal nature of LiDAR data.
Method: An SSL strategy is introduced that leverages positive pairs in both the spatial and temporal domains: (i) a point-to-cluster learning strategy that aggregates spatial information to distinguish objects; and (ii) a cluster-to-cluster learning strategy based on unsupervised object tracking that exploits temporal correspondences.
Results: Extensive experiments with self-supervised training on two large-scale LiDAR datasets, followed by transfer to other point cloud segmentation benchmarks, show that the method outperforms state-of-the-art point cloud SSL methods.

Self-supervised learning (SSL) has the potential to benefit many applications, particularly those where manually annotating data is cumbersome. One such situation is the semantic segmentation of point clouds. In this context, existing methods employ contrastive learning strategies and define positive pairs by performing various augmentation of point clusters in a single frame. As such, these methods do not exploit the temporal nature of LiDAR data. In this paper, we introduce an SSL strategy that leverages positive pairs in both the spatial and temporal domains. To this end, we design (i) a point-to-cluster learning strategy that aggregates spatial information to distinguish objects; and (ii) a cluster-to-cluster learning strategy based on unsupervised object tracking that exploits temporal correspondences. We demonstrate the benefits of our approach via extensive experiments performed by self-supervised training on two large-scale LiDAR datasets and transferring the resulting models to other point cloud segmentation benchmarks. Our results evidence that our method outperforms the state-of-the-art point cloud SSL methods.

Semi-Supervised Learning Made Simple With Self-Supervised Clustering
Fini, Enrico and Astolfi, Pietro and Alahari, Karteek and Alameda-Pineda, Xavier and Mairal, Julien and Nabi, Moin and Ricci, Elisa



Research question: How to turn clustering-based self-supervised learning methods into semi-supervised learners.
Motivation: In many real-world scenarios labels are only partially available, motivating semi-supervised methods built on self-supervised principles.
Method: A conceptually simple yet empirically effective approach is proposed that turns clustering-based self-supervised methods such as SwAV or DINO into semi-supervised learners. Specifically, a multi-task framework merges, through a single cross-entropy loss, a supervised objective using ground-truth labels with a self-supervised objective relying on cluster assignments.
Results: Experiments show that the approach is highly effective and achieves state-of-the-art performance on CIFAR100 and ImageNet.

Self-supervised learning models have been shown to learn rich visual representations without requiring human annotations. However, in many real-world scenarios, labels are partially available, motivating a recent line of work on semi-supervised methods inspired by self-supervised principles. In this paper, we propose a conceptually simple yet empirically powerful approach to turn clustering-based self-supervised methods such as SwAV or DINO into semi-supervised learners. More precisely, we introduce a multi-task framework merging a supervised objective using ground-truth labels and a self-supervised objective relying on clustering assignments with a single cross-entropy loss. This approach may be interpreted as imposing the cluster centroids to be class prototypes. Despite its simplicity, we provide empirical evidence that our approach is highly effective and achieves state-of-the-art performance on CIFAR100 and ImageNet.
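The single-cross-entropy merge can be sketched as follows: labeled samples use their ground-truth class as the target, unlabeled samples use their cluster assignment, and both pass through the same loss over shared prototypes (the "cluster centroids as class prototypes" interpretation). This is a minimal numpy sketch with made-up shapes; the actual method operates on SwAV/DINO features and prototypes:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def merged_cross_entropy(logits, labels):
    """One cross-entropy over class/cluster logits.

    logits: (N, K) similarities to K prototypes (= class centroids).
    labels: (N,) int targets; ground-truth class for labeled samples,
            cluster assignment (e.g. from SwAV/DINO) for unlabeled ones.
    """
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
y_sup = np.array([0, 2])   # ground-truth labels for two labeled samples
y_clu = np.array([1, 1])   # cluster assignments for two unlabeled samples
labels = np.concatenate([y_sup, y_clu])
loss = merged_cross_entropy(logits, labels)
```

Because supervised and self-supervised targets share one loss and one prototype matrix, no extra loss-balancing hyperparameter is needed, which is part of the method's appeal.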

Harmonious Teacher for Cross-Domain Object Detection
Deng, Jinhong and Xu, Dongli and Li, Wen and Duan, Lixin



Research question: This paper addresses self-training for cross-domain object detection, i.e., how to improve the quality of pseudo labels.
Motivation: Existing self-training approaches achieve promising results in cross-domain object detection, but the quality of the pseudo labels directly determines how well the model trains.
Method: A new Harmonious Teacher approach is proposed that improves pseudo-label quality by regularizing the consistency of classification and localization scores when training the detection model, and adopts a sample reweighing strategy based on this consistency to improve the ranking of predictions.
Results: Experiments show that the method outperforms state-of-the-art baselines in various cross-domain scenarios, validating the effectiveness of Harmonious Teacher.

Self-training approaches recently achieved promising results in cross-domain object detection, where people iteratively generate pseudo labels for unlabeled target domain samples with a model, and select high-confidence samples to refine the model. In this work, we reveal that the consistency of classification and localization predictions are crucial to measure the quality of pseudo labels, and propose a new Harmonious Teacher approach to improve the self-training for cross-domain object detection. In particular, we first propose to enhance the quality of pseudo labels by regularizing the consistency of the classification and localization scores when training the detection model. The consistency losses are defined for both labeled source samples and the unlabeled target samples. Then, we further remold the traditional sample selection method by a sample reweighing strategy based on the consistency of classification and localization scores to improve the ranking of predictions. This allows us to fully exploit all instance predictions from the target domain without abandoning valuable hard examples. Without bells and whistles, our method shows superior performance in various cross-domain scenarios compared with the state-of-the-art baselines, which validates the effectiveness of our Harmonious Teacher. Our codes will be available at https://github.com/kinredon/Harmonious-Teacher.
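One way to read the consistency-based reweighing is as a per-prediction weight that is high only when the classification score and a localization-quality score (e.g., a predicted IoU) agree and are both high. The `harmony_weight` below is a hypothetical illustration of that reading, not the paper's exact formulation:

```python
import numpy as np

def harmony_weight(cls_score, loc_score, beta=2.0):
    """Weight a pseudo box by classification/localization consistency.

    The weight is high only when the two scores agree and are both
    high; a confident but poorly localized box is down-weighted.
    """
    cls_score = np.asarray(cls_score, dtype=float)
    loc_score = np.asarray(loc_score, dtype=float)
    agreement = 1.0 - np.abs(cls_score - loc_score)  # consistency term
    quality = (cls_score * loc_score) ** 0.5         # geometric mean
    return (agreement ** beta) * quality

w_good = harmony_weight(0.9, 0.85)  # consistent and high quality
w_bad = harmony_weight(0.9, 0.2)    # confident but poorly localized
```

Reweighing instead of hard thresholding keeps every target-domain prediction in play, which matches the paper's goal of not abandoning valuable hard examples.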

Semi-Supervised Video Inpainting With Cycle Consistency Constraints
Wu, Zhiliang and Xuan, Hanyu and Sun, Changchang and Guan, Weili and Zhang, Kang and Yan, Yan



Research question: How to accomplish video inpainting with only a small amount of annotated data.
Motivation: Existing deep learning methods require large amounts of annotated masks, but annotation is time-consuming and costly, limiting practical application.
Method: A semi-supervised video inpainting framework is proposed: starting from a single annotated mask, a mask prediction network decides the regions to be inpainted in the next frame, while a completion network generates the inpainted content of the current frame. A cycle consistency loss constrains the training parameters of the two networks so that they regularize each other, improving overall model performance.
Results: Experiments on a dataset simulating real-world corrupted videos demonstrate the superiority of the method on the video inpainting task; although trained in a semi-supervised manner, it achieves performance comparable to fully supervised methods.

Deep learning-based video inpainting has yielded promising results and gained increasing attention from researchers. Generally, these methods usually assume that the corrupted region masks of each frame are known and easily obtained. However, the annotation of these masks are labor-intensive and expensive, which limits the practical application of current methods. Therefore, we expect to relax this assumption by defining a new semi-supervised inpainting setting, making the networks have the ability of completing the corrupted regions of the whole video using the annotated mask of only one frame. Specifically, in this work, we propose an end-to-end trainable framework consisting of completion network and mask prediction network, which are designed to generate corrupted contents of the current frame using the known mask and decide the regions to be filled of the next frame, respectively. Besides, we introduce a cycle consistency loss to regularize the training parameters of these two networks. In this way, the completion network and the mask prediction network can constrain each other, and hence the overall performance of the trained model can be maximized. Furthermore, due to the natural existence of prior knowledge (e.g., corrupted contents and clear borders), current video inpainting datasets are not suitable in the context of semi-supervised video inpainting. Thus, we create a new dataset by simulating the corrupted video of real-world scenarios. Extensive experimental results are reported to demonstrate the superiority of our model in the video inpainting task. Remarkably, although our model is trained in a semi-supervised manner, it can achieve comparable performance as fully-supervised methods.

Out-of-Candidate Rectification for Weakly Supervised Semantic Segmentation
Cheng, Zesen and Qiao, Pengchong and Li, Kehan and Li, Siheng and Wei, Pengxu and Ji, Xiangyang and Yuan, Li and Liu, Chang and Chen, Jie



Research question: The Out-of-Candidate (OC) error-prediction problem in weakly supervised semantic segmentation.
Motivation: Existing methods frequently make such mispredictions because they fail to exploit the fact that OC predictions contradict the image-level class tags and are therefore easy to detect.
Method: This paper proposes a group ranking-based Out-of-Candidate Rectification (OCR) mechanism. First, the semantic categories are adaptively split into In-Candidate (IC) and Out-of-Candidate (OC) groups for each OC pixel according to their prior annotation correlation and posterior prediction correlation. Then, a differentiable rectification loss forces OC pixels to shift to the IC group.
Results: Incorporating OCR with baseline models such as AffinityNet, SEAM, and MCTformer yields remarkable performance gains on the Pascal VOC and MS COCO datasets with negligible extra training overhead, demonstrating the effectiveness and generality of OCR.

Weakly supervised semantic segmentation is typically inspired by class activation maps, which serve as pseudo masks with class-discriminative regions highlighted. Although tremendous efforts have been made to recall precise and complete locations for each class, existing methods still commonly suffer from the unsolicited Out-of-Candidate (OC) error predictions that do not belong to the label candidates, which could be avoidable since the contradiction with image-level class tags is easy to be detected. In this paper, we develop a group ranking-based Out-of-Candidate Rectification (OCR) mechanism in a plug-and-play fashion. Firstly, we adaptively split the semantic categories into In-Candidate (IC) and OC groups for each OC pixel according to their prior annotation correlation and posterior prediction correlation. Then, we derive a differentiable rectification loss to force OC pixels to shift to the IC group. Incorporating OCR with seminal baselines (e.g., AffinityNet, SEAM, MCTformer), we can achieve remarkable performance gains on both Pascal VOC (+3.2%, +3.3%, +0.8% mIoU) and MS COCO (+1.0%, +1.3%, +0.5% mIoU) datasets with negligible extra training overhead, which justifies the effectiveness and generality of OCR.
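In its simplest reading, the rectification loss pushes each OC pixel's probability mass back onto the In-Candidate group, i.e., the classes present in the image-level tags. The sketch below is a hypothetical simplification of the paper's group-ranking formulation, using a plain log-likelihood over the IC group:

```python
import numpy as np

def ocr_rectification_loss(logits, candidate_mask):
    """Penalize probability mass on Out-of-Candidate classes.

    logits:         (N, C) per-pixel class logits.
    candidate_mask: (C,) bool, True for In-Candidate classes, i.e.
                    classes present in the image-level tags.
    The loss is -log of the total In-Candidate probability, which is
    minimized when no mass falls on Out-of-Candidate classes.
    """
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    p_ic = p[:, candidate_mask].sum(axis=1)
    return -np.mean(np.log(p_ic + 1e-12))

logits = np.array([[2.0, 1.0, 3.0]])    # class 2 is Out-of-Candidate
mask = np.array([True, True, False])    # image-level tags: classes 0, 1
loss = ocr_rectification_loss(logits, mask)
```

Because the loss only sums softmax probabilities, it is differentiable in the logits, matching the plug-and-play use with existing baselines.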

Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection
Wang, Luting and Liu, Yi and Du, Penghui and Ding, Zihan and Liao, Yue and Qi, Qiaosong and Chen, Biaolong and Liu, Si



Research question: This paper addresses open-vocabulary object detection, i.e., how to give object detectors the generalizability to detect objects described by arbitrary text queries.
Motivation: Existing methods adopt knowledge distillation to extract knowledge from Pretrained Vision-and-Language Models (PVLMs) and transfer it to detectors, but due to non-adaptive proposal cropping and single-level feature mimicking, they suffer from information destruction during knowledge extraction and from inefficient knowledge transfer.
Method: This paper proposes an Object-Aware Distillation Pyramid (OADP) framework, comprising an Object-Aware Knowledge Extraction (OAKE) module and a Distillation Pyramid (DP) mechanism. When extracting object knowledge from PVLMs, the former adaptively transforms object proposals and adopts object-aware mask attention to obtain precise and complete object knowledge. The latter introduces global and block distillation for more comprehensive knowledge transfer, compensating for the relation information missing from object distillation.
Results: Extensive experiments show significant improvements over current methods. On the MS-COCO dataset in particular, the OADP framework reaches 35.6 mAP^N_50, surpassing the current state-of-the-art method by 3.3 mAP^N_50. Code is provided anonymously in the supplementary materials.

Open-vocabulary object detection aims to provide object detectors trained on a fixed set of object categories with the generalizability to detect objects described by arbitrary text queries. Previous methods adopt knowledge distillation to extract knowledge from Pretrained Vision-and-Language Models (PVLMs) and transfer it to detectors. However, due to the non-adaptive proposal cropping and single-level feature mimicking processes, they suffer from information destruction during knowledge extraction and inefficient knowledge transfer. To remedy these limitations, we propose an Object-Aware Distillation Pyramid (OADP) framework, including an Object-Aware Knowledge Extraction (OAKE) module and a Distillation Pyramid (DP) mechanism. When extracting object knowledge from PVLMs, the former adaptively transforms object proposals and adopts object-aware mask attention to obtain precise and complete knowledge of objects. The latter introduces global and block distillation for more comprehensive knowledge transfer to compensate for the missing relation information in object distillation. Extensive experiments show that our method achieves significant improvement compared to current methods. Especially on the MS-COCO dataset, our OADP framework reaches 35.6 mAP^N_50, surpassing the current state-of-the-art method by 3.3 mAP^N_50. Code is anonymously provided in the supplementary materials.

Vision Transformers Are Good Mask Auto-Labelers
Lan, Shiyi and Yang, Xitong and Yu, Zhiding and Wu, Zuxuan and Alvarez, Jose M. and Anandkumar, Anima



Research question: To propose a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations.
Motivation: Current instance segmentation models require large amounts of annotated data, and manual mask annotation is costly and time-consuming; automatically generating high-quality masks can reduce this need.
Method: We propose the Mask Auto-Labeler (MAL) framework, which takes box-cropped images as input and conditionally generates their mask pseudo-labels.
Results: Experiments show that the auto-generated masks approach human annotations in quality; instance segmentation models trained on MAL-generated masks nearly match their fully supervised counterparts, with the best model reaching 44.1% mAP and outperforming existing state-of-the-art box-supervised methods.

We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations. MAL takes box-cropped images as inputs and conditionally generates their mask pseudo-labels. We show that Vision Transformers are good mask auto-labelers. Our method significantly reduces the gap between auto-labeling and human annotation regarding mask quality. Instance segmentation models trained using the MAL-generated masks can nearly match the performance of their fully-supervised counterparts, retaining up to 97.4% performance of fully supervised models. The best model achieves 44.1% mAP on COCO instance segmentation (test-dev 2017), outperforming state-of-the-art box-supervised methods by significant margins. Qualitative results indicate that masks produced by MAL are, in some cases, even better than human annotations.

MAESTER: Masked Autoencoder Guided Segmentation at Pixel Resolution for Accurate, Self-Supervised Subcellular Structure Recognition
Xie, Ronald and Pang, Kuan and Bader, Gary D. and Wang, Bo



Research question: How to segment cellular images accurately given the intrinsic variability in the morphology of biological structures.
Motivation: Because of this morphological variability, complete manual segmentation of large datasets is unfeasible. Supervised methods have been proposed to automate segmentation, but they typically rely on manually generated ground truth, which is especially challenging and time-consuming to produce in biology because it requires domain expertise. Moreover, these methods generalize poorly, requiring additional manual labels for each dataset and use case.
Method: We introduce MAESTER (Masked AutoEncoder guided SegmenTation at pixEl Resolution), a self-supervised method for accurate subcellular structure segmentation at pixel resolution. MAESTER treats segmentation as a representation learning and clustering problem: it learns semantically meaningful token representations of multi-pixel image patches while maintaining a sufficiently large field of view for contextual learning. We also develop a cover-and-stride inference strategy to achieve pixel-level subcellular structure segmentation.
Results: We evaluated MAESTER on a publicly available volumetric electron microscopy (VEM) dataset of primary mouse pancreatic islet beta cells and achieved upwards of a 29.1% improvement over the state of the art under the same evaluation criteria. Furthermore, our results are competitive with supervised methods trained on the same task, closing the gap between self-supervised and supervised approaches. MAESTER shows promise for alleviating the critical bottleneck of ground-truth generation in imaging-related data analysis, thereby greatly increasing the rate of biological discovery.

Accurate segmentation of cellular images remains an elusive task due to the intrinsic variability in morphology of biological structures. Complete manual segmentation is unfeasible for large datasets, and while supervised methods have been proposed to automate segmentation, they often rely on manually generated ground truths which are especially challenging and time consuming to generate in biology due to the requirement of domain expertise. Furthermore, these methods have limited generalization capacity, requiring additional manual labels to be generated for each dataset and use case. We introduce MAESTER (Masked AutoEncoder guided SegmenTation at pixEl Resolution), a self-supervised method for accurate, subcellular structure segmentation at pixel resolution. MAESTER treats segmentation as a representation learning and clustering problem. Specifically, MAESTER learns semantically meaningful token representations of multi-pixel image patches while simultaneously maintaining a sufficiently large field of view for contextual learning. We also develop a cover-and-stride inference strategy to achieve pixel-level subcellular structure segmentation. We evaluated MAESTER on a publicly available volumetric electron microscopy (VEM) dataset of primary mouse pancreatic islets beta cells and achieved upwards of 29.1% improvement over state-of-the-art under the same evaluation criteria. Furthermore, our results are competitive against supervised methods trained on the same tasks, closing the gap between self-supervised and supervised approaches. MAESTER shows promise for alleviating the critical bottleneck of ground truth generation for imaging related data analysis and thereby greatly increasing the rate of biological discovery.

OCELOT: Overlapped Cell on Tissue Dataset for Histopathology
Ryu, Jeongun and Puche, Aaron Valero and Shin, JaeWoong and Park, Seonwook and Brattoli, Biagio and Lee, Jinhee and Jung, Wonkyung and Cho, SooIck and Paeng, Kyunghyun and Ock, Chan-Young and Yoo, Donggeun and Pereira, Sérgio



Research question: How to detect cells more accurately, in particular by accounting for tissue-level structures as well as cell morphology and the surrounding context.
Motivation: There is a lack of datasets that jointly consider cells and tissue, which limits the development of cell detection models in computational pathology.
Method: Propose and publicly release the OCELOT dataset, containing images from multiple organs with overlapping cell and tissue annotations. In addition, propose multi-task learning approaches that learn the cell and tissue tasks simultaneously.
Results: Experiments on 3 datasets show that, compared with a model trained only for cell detection, the proposed multi-task learning approach improves F1-score by up to 6.79% on the OCELOT test set.

Cell detection is a fundamental task in computational pathology that can be used for extracting high-level medical information from whole-slide images. For accurate cell detection, pathologists often zoom out to understand the tissue-level structures and zoom in to classify cells based on their morphology and the surrounding context. However, there is a lack of efforts to reflect such behaviors by pathologists in the cell detection models, mainly due to the lack of datasets containing both cell and tissue annotations with overlapping regions. To overcome this limitation, we propose and publicly release OCELOT, a dataset purposely dedicated to the study of cell-tissue relationships for cell detection in histopathology. OCELOT provides overlapping cell and tissue annotations on images acquired from multiple organs. Within this setting, we also propose multi-task learning approaches that benefit from learning both cell and tissue tasks simultaneously. When compared against a model trained only for the cell detection task, our proposed approaches improve cell detection performance on 3 datasets: proposed OCELOT, public TIGER, and internal CARP datasets. On the OCELOT test set in particular, we show up to a 6.79% improvement in F1-score. We believe the contributions of this paper, including the release of the OCELOT dataset at https://lunit-io.github.io/research/publications/ocelot are a crucial starting point toward the important research direction of incorporating cell-tissue relationships in computational pathology.

Self-Supervised Image-to-Point Distillation via Semantically Tolerant Contrastive Loss
Mahmoud, Anas and Hu, Jordan S.K. and Kuai, Tianshu and Harakeh, Ali and Paull, Liam and Waslander, Steven L.



Research question: How to distill rich self-supervised image features via contrastive learning while addressing the main challenges of image-to-point representation learning on autonomous driving datasets.
Motivation: Image-to-point representation learning for autonomous driving datasets faces two main challenges: 1) the abundance of self-similarity causes contrastive losses to push away semantically similar points and image regions, disturbing the local semantic structure of the learned representations; 2) severe class imbalance, as pretraining is dominated by over-represented classes.
Method: We propose a novel semantically tolerant image-to-point contrastive loss to address the self-similarity problem: it takes into account the semantic distance between positive and negative image regions to minimize contrasting semantically similar points and image regions. In addition, we address class imbalance by designing a class-agnostic balanced loss that approximates the degree of class imbalance through a sample-to-samples semantic similarity measure.
Results: Our experiments show that the semantically tolerant contrastive loss with class balancing improves state-of-the-art 2D-to-3D representation learning in all 3D semantic segmentation evaluation settings, and outperforms state-of-the-art 2D-to-3D representation learning methods across all types of 2D self-supervised pretrained models.

An effective framework for learning 3D representations for perception tasks is distilling rich self-supervised image features via contrastive learning. However, image-to-point representation learning for autonomous driving datasets faces two main challenges: 1) the abundance of self-similarity, which results in the contrastive losses pushing away semantically similar point and image regions and thus disturbing the local semantic structure of the learned representations, and 2) severe class imbalance as pretraining gets dominated by over-represented classes. We propose to alleviate the self-similarity problem through a novel semantically tolerant image-to-point contrastive loss that takes into consideration the semantic distance between positive and negative image regions to minimize contrasting semantically similar point and image regions. Additionally, we address class imbalance by designing a class-agnostic balanced loss that approximates the degree of class imbalance through an aggregate sample-to-samples semantic similarity measure. We demonstrate that our semantically-tolerant contrastive loss with class balancing improves state-of-the-art 2D-to-3D representation learning in all evaluation settings on 3D semantic segmentation. Our method consistently outperforms state-of-the-art 2D-to-3D representation learning frameworks across a wide range of 2D self-supervised pretrained models.
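The core idea of tolerating semantically similar negatives can be sketched as an InfoNCE-style loss in which each negative image region is down-weighted by its semantic similarity to the positive region. The weighting scheme below (`1 - similarity`, clipped) is an illustrative choice, not the paper's exact formulation, and all features are assumed L2-normalized.

```python
import numpy as np

def semantically_tolerant_nce(point_f, pos_img_f, neg_img_f, tau=0.1):
    """Toy image-to-point InfoNCE loss: negatives that are semantically close
    to the positive region are contrasted less strongly. Shapes: point_f and
    pos_img_f are (N, D); neg_img_f is (M, D); all rows L2-normalized."""
    pos_sim = (point_f * pos_img_f).sum(-1)        # (N,)
    neg_sim = point_f @ neg_img_f.T                # (N, M)
    # Semantic similarity between the positive region and each negative:
    sem_sim = pos_img_f @ neg_img_f.T              # (N, M)
    weight = np.clip(1.0 - sem_sim, 0.0, 1.0)      # tolerate similar negatives
    pos_exp = np.exp(pos_sim / tau)
    neg_exp = (weight * np.exp(neg_sim / tau)).sum(-1)
    return float(-np.log(pos_exp / (pos_exp + neg_exp)).mean())
```

A negative region identical to the positive gets weight zero, so it no longer repels the point feature, which is exactly the self-similarity failure mode the loss is designed to avoid.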

ProtoCon: Pseudo-Label Refinement via Online Clustering and Prototypical Consistency for Efficient Semi-Supervised Learning
Nassar, Islam and Hayat, Munawar and Abbasnejad, Ehsan and Rezatofighi, Hamid and Haffari, Gholamreza



Research question: This paper addresses label scarcity in semi-supervised learning, in particular the setting where high-confidence pseudo-labeling methods usually underperform.
Motivation: Existing semi-supervised learning methods rely mainly on high-confidence pseudo-labels for training, but this often works poorly when labels are scarce.
Method: This paper proposes ProtoCon, a new semi-supervised learning method that refines pseudo-labels by leveraging the information of their nearest neighbours. Neighbours are identified, as training proceeds, via online clustering in an embedding space trained with a prototypical loss that encourages well-formed clusters. The online nature of ProtoCon allows it to use the label history of the entire dataset within one training cycle to refine labels in the next cycle without storing image embeddings, so it scales seamlessly to larger datasets.
Results: Experimental results show that ProtoCon achieves significant gains and faster convergence over state-of-the-art methods on 5 datasets, including CIFARs, ImageNet and DomainNet.

Confidence-based pseudo-labeling is among the dominant approaches in semi-supervised learning (SSL). It relies on including high-confidence predictions made on unlabeled data as additional targets to train the model. We propose ProtoCon, a novel SSL method aimed at the less-explored label-scarce SSL where such methods usually underperform. ProtoCon refines the pseudo-labels by leveraging their nearest neighbours' information. The neighbours are identified as the training proceeds using an online clustering approach operating in an embedding space trained via a prototypical loss to encourage well-formed clusters. The online nature of ProtoCon allows it to utilise the label history of the entire dataset in one training cycle to refine labels in the following cycle without the need to store image embeddings. Hence, it can seamlessly scale to larger datasets at a low cost. Finally, ProtoCon addresses the poor training signal in the initial phase of training (due to fewer confident predictions) by introducing an auxiliary self-supervised loss. It delivers significant gains and faster convergence over state-of-the-art across 5 datasets, including CIFARs, ImageNet and DomainNet.
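The neighbour-based refinement step can be sketched as blending each sample's predicted class distribution with the mean distribution of its online cluster. The blend coefficient `alpha` is an assumed illustrative parameter, and the cluster assignments stand in for ProtoCon's online clustering.

```python
import numpy as np

def refine_pseudo_labels(probs, cluster_ids, alpha=0.5):
    """Toy ProtoCon-style refinement: smooth each sample's class distribution
    with the mean distribution of its cluster, so noisy individual
    predictions are corrected by their neighbours."""
    probs = np.asarray(probs, dtype=float)
    refined = probs.copy()
    for c in np.unique(cluster_ids):
        idx = np.where(cluster_ids == c)[0]
        cluster_mean = probs[idx].mean(axis=0)
        refined[idx] = alpha * probs[idx] + (1 - alpha) * cluster_mean
    # Renormalize for safety; rows already sum to one if the inputs do.
    return refined / refined.sum(axis=1, keepdims=True)
```

A hard pseudo-label would then be the argmax of the refined distribution, accepted only if its confidence clears the usual threshold.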

BKinD-3D: Self-Supervised 3D Keypoint Discovery From Multi-View Videos
Sun, Jennifer J. and Karashchuk, Lili and Dravid, Amil and Ryou, Serim and Fereidooni, Sonia and Tuthill, John C. and Katsaggelos, Aggelos and Brunton, Bingni W. and Gkioxari, Georgia and Kennedy, Ann and Yue, Yisong and Perona, Pietro



Research question: How to perform effective self-supervised keypoint discovery in 3D space, so as to estimate 3D poses without annotations.
Motivation: Manual pose annotations are expensive and time-consuming to obtain, while existing keypoint discovery methods usually process only single 2D views and cannot operate in 3D space.
Method: Propose a new method for self-supervised 3D keypoint discovery from multi-view videos of behaving agents, without any 2D or 3D keypoint or bounding box supervision. The method uses an encoder-decoder architecture with a 3D volumetric heatmap, trained to reconstruct spatiotemporal differences across multiple views, together with joint-length constraints on a learned 3D skeleton of the subject.
Results: Keypoints are discovered in videos of humans and rats without manual supervision, demonstrating the potential of 3D keypoint discovery for studying behavior.

Quantifying motion in 3D is important for studying the behavior of humans and other animals, but manual pose annotations are expensive and time-consuming to obtain. Self-supervised keypoint discovery is a promising strategy for estimating 3D poses without annotations. However, current keypoint discovery approaches commonly process single 2D views and do not operate in the 3D space. We propose a new method to perform self-supervised keypoint discovery in 3D from multi-view videos of behaving agents, without any keypoint or bounding box supervision in 2D or 3D. Our method, BKinD-3D, uses an encoder-decoder architecture with a 3D volumetric heatmap, trained to reconstruct spatiotemporal differences across multiple views, in addition to joint length constraints on a learned 3D skeleton of the subject. In this way, we discover keypoints without requiring manual supervision in videos of humans and rats, demonstrating the potential of 3D keypoint discovery for studying behavior.

Boosting Semi-Supervised Learning by Exploiting All Unlabeled Data
Chen, Yuhao and Tan, Xin and Zhao, Borui and Chen, Zhaowei and Song, Renjie and Liang, Jiajun and Lu, Xuequan



Research question: Existing semi-supervised learning methods waste complicated examples, since all pseudo-labels have to pass a high threshold to filter out noisy ones.
Motivation: To better leverage all unlabeled examples, two novel techniques are proposed: Entropy Meaning Loss (EML) and Adaptive Negative Learning (ANL).
Method: EML incorporates the prediction distribution of non-target classes into the optimization objective to avoid competition with the target class, thereby generating more high-confidence predictions for pseudo-label selection. ANL introduces additional negative pseudo-labels for all unlabeled data to leverage low-confidence examples. Neither technique introduces any additional parameters or hyperparameters.
Results: Extensive experiments on several common semi-supervised learning benchmarks (CIFAR-10/100, SVHN, STL-10 and ImageNet) show that FullMatch exceeds FixMatch by a large margin. Combined with FlexMatch (an advanced FixMatch-based framework), it achieves state-of-the-art performance.

Semi-supervised learning (SSL) has attracted enormous attention due to its vast potential of mitigating the dependence on large labeled datasets. The latest methods (e.g., FixMatch) use a combination of consistency regularization and pseudo-labeling to achieve remarkable successes. However, these methods all suffer from the waste of complicated examples since all pseudo-labels have to be selected by a high threshold to filter out noisy ones. Hence, the examples with ambiguous predictions will not contribute to the training phase. For better leveraging all unlabeled examples, we propose two novel techniques: Entropy Meaning Loss (EML) and Adaptive Negative Learning (ANL). EML incorporates the prediction distribution of non-target classes into the optimization objective to avoid competition with target class, and thus generating more high-confidence predictions for selecting pseudo-label. ANL introduces the additional negative pseudo-label for all unlabeled data to leverage low-confidence examples. It adaptively allocates this label by dynamically evaluating the top-k performance of the model. EML and ANL do not introduce any additional parameter and hyperparameter. We integrate these techniques with FixMatch, and develop a simple yet powerful framework called FullMatch. Extensive experiments on several common SSL benchmarks (CIFAR-10/100, SVHN, STL-10 and ImageNet) demonstrate that FullMatch exceeds FixMatch by a large margin. Integrated with FlexMatch (an advanced FixMatch-based framework), we achieve state-of-the-art performance. Source code is available at https://github.com/megvii-research/FullMatch.
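The negative-pseudo-label idea can be sketched as follows: classes outside a sample's top-k predictions are treated as negatives, and the model is pushed to assign them low probability via a `-log(1 - p)` term. The mechanism by which ANL chooses k adaptively (from the model's top-k performance) is omitted; `k` is passed in directly here.

```python
import numpy as np

def negative_learning_loss(probs, k):
    """Toy ANL-style loss: for each sample, every class outside its top-k
    predictions receives a negative pseudo-label, penalized by -log(1 - p).
    `probs` has shape (N, C) with rows summing to one."""
    probs = np.asarray(probs, dtype=float)
    n, c = probs.shape
    order = np.argsort(-probs, axis=1)          # classes sorted by confidence
    neg_mask = np.zeros_like(probs, dtype=bool)
    rows = np.arange(n)[:, None]
    neg_mask[rows, order[:, k:]] = True         # everything outside top-k
    eps = 1e-8
    return float(-(np.log(1.0 - probs[neg_mask] + eps)).mean())
```

Even a sample whose top prediction is too uncertain to pass the confidence threshold still contributes a training signal through its negative labels.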

CHMATCH: Contrastive Hierarchical Matching and Robust Adaptive Threshold Boosted Semi-Supervised Learning
Wu, Jianlong and Yang, Haozhe and Gan, Tian and Ding, Ning and Jiang, Feijun and Nie, Liqiang



Research question: With few labeled samples, the existing semi-supervised methods FixMatch and FlexMatch suffer from thresholds that are either fixed or adaptive but unstable, and from indiscriminative feature representations.
Motivation: Propose a new CHMatch method that learns robust adaptive thresholds and discriminative features through contrastive hierarchical matching.
Method: First, a memory-bank based robust threshold learning strategy selects highly confident samples; second, the structured information in hierarchical labels is fully exploited to learn an accurate affinity graph for contrastive learning.
Results: Experimental results show that CHMatch achieves very stable and superior results on several commonly used benchmarks. For example, on CIFAR-100 with only 4 and 25 labeled samples per class, CHMatch reduces the error rate by 8.44% and 9.02% over FlexMatch, respectively.

The recently proposed FixMatch and FlexMatch have achieved remarkable results in the field of semi-supervised learning. But these two methods go to two extremes as FixMatch and FlexMatch use a pre-defined constant threshold for all classes and an adaptive threshold for each category, respectively. By only investigating consistency regularization, they also suffer from unstable results and indiscriminative feature representation, especially under the situation of few labeled samples. In this paper, we propose a novel CHMatch method, which can learn robust adaptive thresholds for instance-level prediction matching as well as discriminative features by contrastive hierarchical matching. We first present a memory-bank based robust threshold learning strategy to select highly-confident samples. In the meantime, we make full use of the structured information in the hierarchical labels to learn an accurate affinity graph for contrastive learning. CHMatch achieves very stable and superior results on several commonly-used benchmarks. For example, CHMatch achieves 8.44% and 9.02% error rate reduction over FlexMatch on CIFAR-100 under WRN-28-2 with only 4 and 25 labeled samples per class, respectively.
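One plausible reading of a memory-bank based adaptive threshold is sketched below: keep the most recent max-confidences in a FIFO bank and set the threshold so that a fixed fraction of them would be accepted. The quantile rule and `accept_ratio` parameter are assumptions for illustration, not CHMatch's exact strategy.

```python
import numpy as np
from collections import deque

class MemoryBankThreshold:
    """Toy adaptive threshold backed by a FIFO memory bank of recent
    max-confidence values; the threshold is the (1 - accept_ratio) quantile
    of the bank, so roughly accept_ratio of recent samples would pass."""
    def __init__(self, capacity=1024, accept_ratio=0.5):
        self.bank = deque(maxlen=capacity)   # old entries are evicted
        self.accept_ratio = accept_ratio

    def update(self, max_confidences):
        self.bank.extend(float(c) for c in max_confidences)

    def threshold(self):
        if not self.bank:
            return 1.0  # accept nothing until the bank is populated
        return float(np.quantile(list(self.bank), 1.0 - self.accept_ratio))
```

Unlike a fixed threshold, this tracks the model's confidence level over training: the bar rises as the model becomes more confident and falls when it is uncertain.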

Content-Aware Token Sharing for Efficient Semantic Segmentation With Vision Transformers
Lu, Chenyang and de Geus, Daan and Dubbelman, Gijs



Research question: This paper aims to improve the computational efficiency of semantic segmentation networks that use Vision Transformers.
Motivation: Existing token reduction methods mainly target the efficiency of ViT-based image classification networks, and are not directly applicable to semantic segmentation.
Method: This paper proposes Content-aware Token Sharing (CTS), which exploits redundancy by predicting whether image patches contain the same semantic class and letting such patches share a single token.
Results: Experiments show that with Content-aware Token Sharing, the number of processed tokens can be reduced by up to 44% without diminishing segmentation quality.

This paper introduces Content-aware Token Sharing (CTS), a token reduction approach that improves the computational efficiency of semantic segmentation networks that use Vision Transformers (ViTs). Existing works have proposed token reduction approaches to improve the efficiency of ViT-based image classification networks, but these methods are not directly applicable to semantic segmentation, which we address in this work. We observe that, for semantic segmentation, multiple image patches can share a token if they contain the same semantic class, as they contain redundant information. Our approach leverages this by employing an efficient, class-agnostic policy network that predicts if image patches contain the same semantic class, and lets them share a token if they do. With experiments, we explore the critical design choices of CTS and show its effectiveness on the ADE20K, Pascal Context and Cityscapes datasets, various ViT backbones, and different segmentation decoders. With Content-aware Token Sharing, we are able to reduce the number of processed tokens by up to 44%, without diminishing the segmentation quality.
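The sharing mechanism can be sketched in one dimension: walk over the patch sequence and let consecutive patches share one averaged token whenever a policy predicts they belong to the same semantic class. Real CTS uses a learned, class-agnostic policy network over patch windows; the greedy 1-D grouping and caller-supplied `same_class_pred` below are simplifications.

```python
import numpy as np

def share_tokens(patch_feats, same_class_pred):
    """Toy content-aware token sharing: consecutive patches that the policy
    predicts to be the same class are merged into one averaged token.
    Returns the reduced token array and each patch's token assignment."""
    groups, current = [], [0]
    for i in range(1, len(patch_feats)):
        if same_class_pred(patch_feats[current[0]], patch_feats[i]):
            current.append(i)          # share the running token
        else:
            groups.append(current)
            current = [i]
    groups.append(current)
    tokens = np.stack([patch_feats[g].mean(axis=0) for g in groups])
    assignment = np.empty(len(patch_feats), dtype=int)
    for t, g in enumerate(groups):
        assignment[g] = t
    return tokens, assignment
```

The transformer then processes only `tokens`, and `assignment` scatters each token's output back to all the patches that shared it.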

Incrementer: Transformer for Class-Incremental Semantic Segmentation With Knowledge Distillation Focusing on Old Class
Shang, Chao and Li, Hongliang and Meng, Fanman and Wu, Qingbo and Qiu, Heqian and Wang, Lanxiao



Research question: How to learn new classes while maintaining the ability to segment old ones, and how to address catastrophic forgetting.
Motivation: Existing methods are mostly based on convolutional networks and prevent forgetting through knowledge distillation, but this requires adding extra convolutional layers to predict new classes, and the distillation ignores the distinction between regions of old and new classes, limiting the learning of new classes.
Method: A new Transformer framework, Incrementer, is proposed for class-incremental semantic segmentation: new classes are learned simply by adding new class tokens to the transformer decoder. On top of it, a new knowledge distillation scheme focuses distillation on old-class regions, reducing the old model's constraints on new-class learning and improving plasticity. A class deconfusion strategy is further proposed to alleviate overfitting to new classes and confusion between similar classes.
Results: Experimental results show that the method outperforms the state of the art by a large margin (5-15 absolute points on both Pascal VOC and ADE20k). The authors hope Incrementer can serve as a new strong pipeline for class-incremental semantic segmentation.

Class-incremental semantic segmentation aims to incrementally learn new classes while maintaining the capability to segment old ones, and suffers catastrophic forgetting since the old-class labels are unavailable. Most existing methods are based on convolutional networks and prevent forgetting through knowledge distillation, which (1) need to add additional convolutional layers to predict new classes, and (2) ignore to distinguish different regions corresponding to old and new classes during knowledge distillation and roughly distill all the features, thus limiting the learning of new classes. Based on the above observations, we propose a new transformer framework for class-incremental semantic segmentation, dubbed Incrementer, which only needs to add new class tokens to the transformer decoder for new-class learning. Based on the Incrementer, we propose a new knowledge distillation scheme that focuses on the distillation in the old-class regions, which reduces the constraints of the old model on the new-class learning, thus improving the plasticity. Moreover, we propose a class deconfusion strategy to alleviate the overfitting to new classes and the confusion of similar classes. Our method is simple and effective, and extensive experiments show that our method outperforms the SOTAs by a large margin (5-15 absolute point boosts on both Pascal VOC and ADE20k). We hope that our Incrementer can serve as a new strong pipeline for class-incremental semantic segmentation.
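Old-class-focused distillation can be sketched as a KL divergence between teacher and student over the old classes, averaged only on pixels inside an old-class region mask (e.g. pixels the old model assigns to an old class). Restricting the student's softmax to the old channels is a simplification of Incrementer's scheme, not its exact formulation.

```python
import numpy as np

def old_region_distillation(student_logits, teacher_logits, old_region_mask):
    """Toy masked distillation: KL(teacher || student) over the old classes,
    averaged only on masked pixels, so new-class regions stay unconstrained.
    teacher_logits: (n_old, H, W); student_logits: (C, H, W) with C >= n_old;
    old_region_mask: boolean (H, W)."""
    def softmax(x):
        e = np.exp(x - x.max(axis=0, keepdims=True))
        return e / e.sum(axis=0, keepdims=True)

    n_old = teacher_logits.shape[0]
    t_prob = softmax(teacher_logits)
    s_prob = softmax(student_logits[:n_old])   # old channels only
    if not old_region_mask.any():
        return 0.0
    kl = (t_prob * (np.log(t_prob + 1e-8) - np.log(s_prob + 1e-8))).sum(axis=0)
    return float(kl[old_region_mask].mean())
```

Pixels outside the mask contribute nothing, so the new class tokens are free to claim those regions without fighting the old model.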

Zero-Shot Object Counting
Xu, Jingyi and Le, Hieu and Nguyen, Vu and Ranjan, Viresh and Samaras, Dimitris



Research question: How to count objects of arbitrary classes at test time, in particular without human-annotated exemplars.
Motivation: Existing methods require human-annotated exemplars as input, which are often unavailable for novel classes, especially in autonomous systems. We therefore propose zero-shot object counting (ZSC), a new setting where only the class name is available at test time.
Method: Starting from the class name, we propose a method to accurately identify optimal patches that can then be used as counting exemplars. Specifically, we first construct a class prototype to select patches likely to contain the target objects, i.e., class-relevant patches. We further introduce a model that quantitatively measures how suitable an arbitrary patch is as a counting exemplar. Applying this model to all candidate patches, we select the most suitable ones as exemplars for counting.
Results: Experimental results on the recent class-agnostic counting dataset FSC-147 validate the effectiveness of our method.

Class-agnostic object counting aims to count object instances of an arbitrary class at test time. It is challenging but also enables many potential applications. Current methods require human-annotated exemplars as inputs which are often unavailable for novel categories, especially for autonomous systems. Thus, we propose zero-shot object counting (ZSC), a new setting where only the class name is available during test time. Such a counting system does not require human annotators in the loop and can operate automatically. Starting from a class name, we propose a method that can accurately identify the optimal patches which can then be used as counting exemplars. Specifically, we first construct a class prototype to select the patches that are likely to contain the objects of interest, namely class-relevant patches. Furthermore, we introduce a model that can quantitatively measure how suitable an arbitrary patch is as a counting exemplar. By applying this model to all the candidate patches, we can select the most suitable patches as exemplars for counting. Experimental results on a recent class-agnostic counting dataset, FSC-147, validate the effectiveness of our method.
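The two-stage exemplar selection can be sketched as: shortlist candidate patches by cosine similarity to the class prototype (class-relevant patches), then rank the shortlist with a suitability score and keep the top ones. The caller-supplied `suitability` callable and the shortlist size are stand-ins for the paper's learned error-prediction model.

```python
import numpy as np

def select_exemplar_patches(patch_feats, class_prototype, suitability, m=3):
    """Toy zero-shot exemplar selection: filter candidate patches by
    similarity to the class prototype, then pick the m patches with the
    highest suitability score as counting exemplars."""
    f = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    p = class_prototype / np.linalg.norm(class_prototype)
    relevance = f @ p
    relevant = np.argsort(-relevance)[: max(m * 2, m)]   # class-relevant shortlist
    scores = np.array([suitability(patch_feats[i]) for i in relevant])
    return relevant[np.argsort(-scores)[:m]]
```

The selected patch indices would then be cropped from the image and fed to any class-agnostic counting model as exemplars.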

CORA: Adapting CLIP for Open-Vocabulary Detection With Region Prompting and Anchor Pre-Matching
Wu, Xiaoshi and Zhu, Feng and Zhao, Rui and Li, Hongsheng



Research question: This paper addresses open-vocabulary detection (OVD), i.e., detecting object categories beyond the base categories on which the detector is trained.
Motivation: Existing OVD methods rely on large-scale vision-language pre-trained models, such as CLIP, to recognize novel objects. However, applying these models to detector training faces two core obstacles: 1) the distribution mismatch that arises when applying a vision-language model trained on whole images to region recognition tasks; 2) the difficulty of localizing objects of unseen classes.
Method: To overcome these obstacles, the authors propose CORA, a DETR-style framework that adapts CLIP for open-vocabulary detection via region prompting and anchor pre-matching. Region prompting mitigates the whole-to-region distribution gap by prompting the region features of the CLIP-based region classifier. Anchor pre-matching helps learn generalizable object localization via a class-aware matching mechanism.
Results: On the COCO OVD benchmark, CORA achieves 41.7 AP50 on novel classes, outperforming the previous SOTA by 2.4 AP50 even without extra training data. When extra training data is available, the authors train CORA+, which achieves 43.1 AP50 on the COCO OVD benchmark and 28.1 box APr on the LVIS OVD benchmark.

Open-vocabulary detection (OVD) is an object detection task aiming at detecting objects from novel categories beyond the base categories on which the detector is trained. Recent OVD methods rely on large-scale visual-language pre-trained models, such as CLIP, for recognizing novel objects. We identify the two core obstacles that need to be tackled when incorporating these models into detector training: (1) the distribution mismatch that happens when applying a VL-model trained on whole images to region recognition tasks; (2) the difficulty of localizing objects of unseen classes. To overcome these obstacles, we propose CORA, a DETR-style framework that adapts CLIP for Open-vocabulary detection by Region prompting and Anchor pre-matching. Region prompting mitigates the whole-to-region distribution gap by prompting the region features of the CLIP-based region classifier. Anchor pre-matching helps learning generalizable object localization by a class-aware matching mechanism. We evaluate CORA on the COCO OVD benchmark, where we achieve 41.7 AP50 on novel classes, which outperforms the previous SOTA by 2.4 AP50 even without resorting to extra training data. When extra training data is available, we train CORA+ on both ground-truth base-category annotations and additional pseudo bounding box labels computed by CORA. CORA+ achieves 43.1 AP50 on the COCO OVD benchmark and 28.1 box APr on the LVIS OVD benchmark. The code is available at https://github.com/tgxs002/CORA.

FedSeg: Class-Heterogeneous Federated Learning for Semantic Segmentation
Miao, Jiaxu and Yang, Zongxin and Fan, Leilei and Yang, Yi



Research question: How to perform distributed learning for semantic segmentation while preserving data privacy.
Motivation: Although many federated learning algorithms exist for classification tasks, few works address the more challenging semantic segmentation task, especially under class-heterogeneous federated learning.
Method: We propose FedSeg, a basic federated learning approach for class-heterogeneous semantic segmentation. We first propose a simple but strong modified cross-entropy loss to correct the local optimization and address the foreground-background inconsistency problem. On top of it, we introduce pixel-level contrastive learning to enforce that local pixel embeddings belong to the global semantic space.
Results: Extensive experiments on four semantic segmentation benchmarks (Cityscapes, CamVID, PascalVOC and ADE20k) demonstrate the effectiveness of FedSeg.

Federated Learning (FL) is a distributed learning paradigm that collaboratively learns a global model across multiple clients with data privacy-preserving. Although many FL algorithms have been proposed for classification tasks, few works focus on more challenging semantic segmentation tasks, especially in the class-heterogeneous FL situation. Compared with classification, the issues from heterogeneous FL for semantic segmentation are more severe: (1) Due to the non-IID distribution, different clients may contain inconsistent foreground-background classes, resulting in divergent local updates. (2) Class-heterogeneity for complex dense prediction tasks makes the local optimum of clients farther from the global optimum. In this work, we propose FedSeg, a basic federated learning approach for class-heterogeneous semantic segmentation. We first propose a simple but strong modified cross-entropy loss to correct the local optimization and address the foreground-background inconsistency problem. Based on it, we introduce pixel-level contrastive learning to enforce local pixel embeddings belonging to the global semantic space. Extensive experiments on four semantic segmentation benchmarks (Cityscapes, CamVID, PascalVOC and ADE20k) demonstrate the effectiveness of our FedSeg. We hope this work will attract more attention from the FL community to the challenging semantic segmentation federated learning.

DPF: Learning Dense Prediction Fields With Weak Supervision
Chen, Xiaoxue and Zheng, Yuhang and Zheng, Yupeng and Zhou, Qiang and Zhao, Hao and Zhou, Guyue and Zhang, Ya-Qin



Research question: How to exploit cheap point-level weak supervision for visual scene understanding.
Motivation: Pixel-wise dense annotations are expensive or even impossible to obtain, motivating the use of cheap point-level weak supervision.
Method: Propose a new prediction paradigm that makes predictions for point coordinate queries, named dense prediction fields (DPFs).
Results: Benchmarked on three large-scale public datasets, PascalContext, ADE20k and IIW, DPFs set new state-of-the-art performance on all of them with significant margins.

Nowadays, many visual scene understanding problems are addressed by dense prediction networks. But pixel-wise dense annotations are very expensive (e.g., for scene parsing) or impossible (e.g., for intrinsic image decomposition), motivating us to leverage cheap point-level weak supervision. However, existing pointly-supervised methods still use the same architecture designed for full supervision. In stark contrast to them, we propose a new paradigm that makes predictions for point coordinate queries, as inspired by the recent success of implicit representations, like distance or radiance fields. As such, the method is named as dense prediction fields (DPFs). DPFs generate expressive intermediate features for continuous sub-pixel locations, thus allowing outputs of an arbitrary resolution. DPFs are naturally compatible with point-level supervision. We showcase the effectiveness of DPFs using two substantially different tasks: high-level semantic parsing and low-level intrinsic image decomposition. In these two cases, supervision comes in the form of single-point semantic category and two-point relative reflectance, respectively. As benchmarked by three large-scale public datasets PascalContext, ADE20k and IIW, DPFs set new state-of-the-art performance on all of them with significant margins. Code can be accessed at https://github.com/cxx226/DPF.
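The point-query paradigm can be sketched as bilinear interpolation of a feature map at continuous sub-pixel coordinates followed by a small prediction head, which is what allows outputs at arbitrary resolution. The `head` callable stands in for the paper's prediction module; this is an illustrative sketch, not the DPF architecture.

```python
import numpy as np

def query_dense_prediction_field(feat, points, head):
    """Toy DPF-style query: bilinearly interpolate a (C, H, W) feature map
    at continuous (y, x) coordinates and run a head on each sampled
    feature, yielding predictions at arbitrary sub-pixel locations."""
    C, H, W = feat.shape
    out = []
    for y, x in points:
        y0, x0 = int(np.floor(y)), int(np.floor(x))
        y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
        wy, wx = y - y0, x - x0
        f = ((1 - wy) * (1 - wx) * feat[:, y0, x0]
             + (1 - wy) * wx * feat[:, y0, x1]
             + wy * (1 - wx) * feat[:, y1, x0]
             + wy * wx * feat[:, y1, x1])
        out.append(head(f))
    return np.stack(out)
```

Point-level supervision fits this interface directly: a labeled point supplies the query coordinate, and the loss is applied to the head's output at exactly that location.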

Task-Specific Fine-Tuning via Variational Information Bottleneck for Weakly-Supervised Pathology Whole Slide Image Classification
Li, Honglin and Zhu, Chenglu and Zhang, Yunlong and Sun, Yuxuan and Shui, Zhongyi and Kuang, Wenwei and Zheng, Sunyi and Yang, Lin



Research question: Multiple Instance Learning (MIL) has shown promising results in digital pathology Whole Slide Image (WSI) analysis, but this paradigm still faces performance and generalization problems due to high computational costs and limited supervision of gigapixel WSIs.
Motivation: To deal with the computation problem, previous methods use a frozen model pretrained on ImageNet to obtain representations, but this may lose key information due to the large domain gap, and the lack of image-level training-time augmentation hinders generalization. Although self-supervised learning (SSL) offers viable representation learning schemes, task-specific features obtained by fine-tuning with partial labels remain unexplored.
Method: We propose an efficient WSI fine-tuning framework motivated by the Information Bottleneck theory, which finds the minimal sufficient statistics of a WSI and thus supports fine-tuning the backbone into a task-specific representation using only WSI-level weak labels.
Results: We evaluate the method with various WSI heads on five pathological WSI datasets; the results show significant improvements in both accuracy and generalization compared with previous works.

While Multiple Instance Learning (MIL) has shown promising results in digital Pathology Whole Slide Image (WSI) analysis, such a paradigm still faces performance and generalization problems due to high computational costs and limited supervision of Gigapixel WSIs. To deal with the computation problem, previous methods utilize a frozen model pretrained from ImageNet to obtain representations, however, it may lose key information owing to the large domain gap and hinder the generalization ability without image-level training-time augmentation. Though Self-supervised Learning (SSL) proposes viable representation learning schemes, the downstream task-specific features via partial label tuning are not explored. To alleviate this problem, we propose an efficient WSI fine-tuning framework motivated by the Information Bottleneck theory. The theory enables the framework to find the minimal sufficient statistics of WSI, thus supporting us to fine-tune the backbone into a task-specific representation only depending on WSI-level weak labels. The WSI-MIL problem is further analyzed to theoretically deduce our fine-tuning method. We evaluate the method on five pathological WSI datasets on various WSI heads. The experimental results show significant improvements in both accuracy and generalization compared with previous works. Source code will be available at https://github.com/invoker-LL/WSI-finetuning.

Detecting Everything in the Open World: Towards Universal Object Detection
Wang, Zhenyu and Li, Yali and Chen, Xi and Lim, Ser-Nam and Torralba, Antonio and Zhao, Hengshuang and Wang, Shengjin



Research question: This paper addresses universal object detection, which aims to detect every scene and predict every category.
Motivation: The dependence on human annotations, the limited visual information, and the novel categories in the open world severely restrict the universality of traditional detectors.
Method: We propose UniDetector, a universal object detector able to recognize enormous categories in the open world. The key points are: 1) by aligning image and text spaces, it leverages images from multiple sources and heterogeneous label spaces for training, guaranteeing sufficient information for universal representations; 2) thanks to the abundant information from both vision and language modalities, it generalizes easily to the open world while keeping the balance between seen and unseen classes; 3) the proposed decoupled training manner and probability calibration further promote generalization to novel categories.
Results: These contributions allow UniDetector to detect over 7k categories, the largest measurable category size so far, with only about 500 classes participating in training. On large-vocabulary datasets such as LVIS, ImageNetBoxes and VisualGenome, UniDetector exhibits strong zero-shot generalization, surpassing traditional supervised baselines by more than 4% on average without seeing any corresponding images. On 13 public detection datasets covering various scenes, UniDetector also achieves state-of-the-art performance with only 3% of the training data.

In this paper, we formally address universal object detection, which aims to detect every scene and predict every category. The dependence on human annotations, the limited visual information, and the novel categories in the open world severely restrict the universality of traditional detectors. We propose UniDetector, a universal object detector that has the ability to recognize enormous categories in the open world. The critical points for the universality of UniDetector are: 1) it leverages images of multiple sources and heterogeneous label spaces for training through the alignment of image and text spaces, which guarantees sufficient information for universal representations. 2) it generalizes to the open world easily while keeping the balance between seen and unseen classes, thanks to abundant information from both vision and language modalities. 3) it further promotes the generalization ability to novel categories through our proposed decoupling training manner and probability calibration. These contributions allow UniDetector to detect over 7k categories, the largest measurable category size so far, with only about 500 classes participating in training. Our UniDetector exhibits strong zero-shot generalization ability on large-vocabulary datasets like LVIS, ImageNetBoxes, and VisualGenome - it surpasses the traditional supervised baselines by more than 4% on average without seeing any corresponding images. On 13 public detection datasets with various scenes, UniDetector also achieves state-of-the-art performance with only 3% of the training data.

Look Before You Match: Instance Understanding Matters in Video Object Segmentation
Wang, Junke and Chen, Dongdong and Wu, Zuxuan and Luo, Chong and Tang, Chuanxin and Dai, Xiyang and Zhao, Yucheng and Xie, Yujia and Yuan, Lu and Jiang, Yu-Gang



Research question: In video object segmentation (VOS), how to effectively handle the appearance and viewpoint changes caused by the movement of objects and cameras.
Motivation: Current memory-based methods achieve remarkable results on video object segmentation, but due to the lack of instance understanding, they are sensitive to large appearance variations or viewpoint changes resulting from the movement of objects and cameras.
Method: A two-branch network for video object segmentation is proposed: a query-based instance segmentation branch that captures instance details of the current frame, and a VOS branch that performs spatial-temporal matching with the memory bank. A multi-path fusion block is further introduced to effectively combine the memory readout with multi-scale features from the instance segmentation decoder.
Results: The method achieves state-of-the-art performance on the DAVIS 2016/2017 validation sets, the DAVIS 2017 test-dev set, and the YouTube-VOS 2018/2019 validation sets, clearly outperforming alternative methods.

Exploring dense matching between the current frame and past frames for long-range context modeling, memory-based methods have demonstrated impressive results in video object segmentation (VOS) recently. Nevertheless, due to the lack of instance understanding ability, the above approaches are oftentimes brittle to large appearance variations or viewpoint changes resulted from the movement of objects and cameras. In this paper, we argue that instance understanding matters in VOS, and integrating it with memory-based matching can enjoy the synergy, which is intuitively sensible from the definition of VOS task, i.e., identifying and segmenting object instances within the video. Towards this goal, we present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank. We employ the well-learned object queries from IS branch to inject instance-specific information into the query key, with which the instance-augmented matching is further performed. In addition, we introduce a multi-path fusion block to effectively combine the memory readout with multi-scale features from the instance segmentation decoder, which incorporates high-resolution instance-aware features to produce final segmentation results. Our method achieves state-of-the-art performance on DAVIS 2016/2017 val (92.6% and 87.1%), DAVIS 2017 test-dev (82.8%), and YouTube-VOS 2018/2019 val (86.3% and 86.3%), outperforming alternative methods by clear margins.

Orthogonal Annotation Benefits Barely-Supervised Medical Image Segmentation
Cai, Heng and Li, Shumeng and Qi, Lei and Yu, Qian and Shi, Yinghuan and Gao, Yang



Research question: How to improve the semi-supervised learning performance of 3D medical image segmentation.
Motivation: Compared with 2D images, 3D medical volumes contain information from different directions, such as the transverse, sagittal and coronal planes, which naturally provide complementary views. These complementary views and the intrinsic similarity among adjacent 3D slices inspire us to develop a novel annotation way and a corresponding semi-supervised model for effective segmentation.
Method: We first propose orthogonal annotation, labeling only two orthogonal slices per labeled volume, which greatly relieves the annotation burden. Initial pseudo labels for the sparsely labeled volumes are then obtained via registration. Introducing unlabeled volumes, we propose a dual-network paradigm named Dense-Sparse Co-training (DeSCO) that exploits dense pseudo labels in the early stage and sparse labels in the later stage, while enforcing consistent outputs of the two networks.
Results: Experimental results on three benchmark datasets validate the effectiveness of our method in performance and annotation efficiency. For example, with only 10 annotated slices, our method reaches a Dice score of up to 86.93% on the KiTS19 dataset.

Recent trends in semi-supervised learning have significantly boosted the performance of 3D semi-supervised medical image segmentation. Compared with 2D images, 3D medical volumes involve information from different directions, e.g., transverse, sagittal, and coronal planes, so as to naturally provide complementary views. These complementary views and the intrinsic similarity among adjacent 3D slices inspire us to develop a novel annotation way and its corresponding semi-supervised model for effective segmentation. Specifically, we firstly propose the orthogonal annotation by only labeling two orthogonal slices in a labeled volume, which significantly relieves the burden of annotation. Then, we perform registration to obtain the initial pseudo labels for sparsely labeled volumes. Subsequently, by introducing unlabeled volumes, we propose a dual-network paradigm named Dense-Sparse Co-training (DeSCO) that exploits dense pseudo labels in early stage and sparse labels in later stage and meanwhile forces consistent output of two networks. Experimental results on three benchmark datasets validated our effectiveness in performance and efficiency in annotation. For example, with only 10 annotated slices, our method reaches a Dice up to 86.93% on KiTS19 dataset.
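The annotation scheme itself is easy to picture: of a (D, H, W) volume, only one transverse slice and one orthogonal slice carry labels. The sketch below builds the corresponding boolean supervision mask (labeling a coronal slice as the orthogonal one is an illustrative choice).

```python
import numpy as np

def orthogonal_annotation_mask(volume_shape, transverse_idx, coronal_idx):
    """Toy orthogonal annotation: mark which voxels of a (D, H, W) volume are
    labeled when only one transverse and one orthogonal slice are annotated."""
    D, H, W = volume_shape
    mask = np.zeros((D, H, W), dtype=bool)
    mask[transverse_idx, :, :] = True   # one transverse slice
    mask[:, coronal_idx, :] = True      # one orthogonal (coronal) slice
    return mask
```

Only H*W + D*W - W voxels are labeled (the two planes minus their shared line), which is the sparse supervision that the registration step then propagates into dense pseudo labels.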

Knowledge Distillation for 6D Pose Estimation by Aligning Distributions of Local Predictions
Guo, Shuxuan and Hu, Yinlin and Alvarez, Jose M. and Salzmann, Mathieu



Research question: This paper addresses knowledge distillation for image-based 6D object pose estimation.
Motivation: Although knowledge distillation has achieved great success in many tasks, it has not yet been studied for image-based 6D object pose estimation.
Method: A knowledge distillation method driven by the 6D pose estimation task is proposed, which distills the teacher network's distribution of local predictions into the student network to facilitate its training.
Results: Experimental results show that the method achieves state-of-the-art results with different compact student models and for both keypoint-based and dense prediction-based architectures.

Knowledge distillation facilitates the training of a compact student network by using a deep teacher one. While this has achieved great success in many tasks, it remains completely unstudied for image-based 6D object pose estimation. In this work, we introduce the first knowledge distillation method driven by the 6D pose estimation task. To this end, we observe that most modern 6D pose estimation frameworks output local predictions, such as sparse 2D keypoints or dense representations, and that the compact student network typically struggles to predict such local quantities precisely. Therefore, instead of imposing prediction-to-prediction supervision from the teacher to the student, we propose to distill the teacher's distribution of local predictions into the student network, facilitating its training. Our experiments on several benchmarks show that our distillation method yields state-of-the-art results with different compact student models and for both keypoint-based and dense prediction-based architectures.
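The difference between prediction-to-prediction supervision and distribution alignment can be sketched with a symmetric Chamfer distance between the student's and teacher's 2D keypoint sets: the loss cares only about where the predictions fall as a set, not which prediction matches which. Chamfer is a simple stand-in here; the paper aligns the distributions of local predictions rather than using this exact distance.

```python
import numpy as np

def keypoint_set_alignment_loss(student_kpts, teacher_kpts):
    """Toy set-level distillation loss: symmetric Chamfer distance between
    the student's and teacher's (N, 2) / (M, 2) keypoint sets, so the
    student matches the teacher's prediction distribution, not a fixed
    one-to-one pairing."""
    d2 = ((student_kpts[:, None, :] - teacher_kpts[None, :, :]) ** 2).sum(-1)
    return float(np.sqrt(d2.min(axis=1)).mean() + np.sqrt(d2.min(axis=0)).mean())
```

A permutation of the teacher's keypoints leaves this loss at zero, whereas a naive prediction-to-prediction MSE would wrongly penalize the student for ordering.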

Three Guidelines You Should Know for Universally Slimmable Self-Supervised Learning
Cao, Yun-Hao and Sun, Peiqin and Zhou, Shuchang



Research question: How to achieve universally slimmable self-supervised learning (US3L) for better accuracy-efficiency trade-offs when deploying self-supervised models across different devices.
Motivation: Directly adapting self-supervised learning (SSL) to universally slimmable networks frequently causes training to collapse, so a way to guarantee the success of SSL is needed.
Method: Three loss design guidelines are proposed that ensure temporal consistency from a unified gradient perspective, together with dynamic sampling and group regularization strategies to simultaneously improve training efficiency and accuracy.
Results: Empirically validated on both convolutional neural networks and vision transformers; with only one training run and one copy of weights, the method outperforms various state-of-the-art methods (individually trained or not) on benchmarks including recognition, object detection and instance segmentation.

We propose universally slimmable self-supervised learning (dubbed as US3L) to achieve better accuracy-efficiency trade-offs for deploying self-supervised models across different devices. We observe that direct adaptation of self-supervised learning (SSL) to universally slimmable networks misbehaves as the training process frequently collapses. We then discover that temporal consistent guidance is the key to the success of SSL for universally slimmable networks, and we propose three guidelines for the loss design to ensure this temporal consistency from a unified gradient perspective. Moreover, we propose dynamic sampling and group regularization strategies to simultaneously improve training efficiency and accuracy. Our US3L method has been empirically validated on both convolutional neural networks and vision transformers. With only once training and one copy of weights, our method outperforms various state-of-the-art methods (individually trained or not) on benchmarks including recognition, object detection and instance segmentation.

Learning Articulated Shape With Keypoint Pseudo-Labels From Web Images
Stathopoulos, Anastasis and Pavlakos, Georgios and Han, Ligong and Metaxas, Dimitris N.



Research question: How to learn monocular 3D reconstruction models for articulated objects (e.g., horses, cows, sheep) from only a small number of images annotated with 2D keypoints.
Motivation: Current models require large amounts of annotated data for 3D reconstruction, whereas the proposed approach needs as few as 50-150 labeled images.
Method: Category-specific 2D keypoint estimators are trained and used to generate 2D keypoint pseudo-labels on unlabeled web images; both the labeled and self-labeled sets then train the 3D reconstruction models.
Results: Experiments show that the method effectively exploits web images, improving 3D reconstruction for several articulated object categories beyond the fully-supervised baseline. Moreover, it can quickly bootstrap a model from only a few images labeled with 2D keypoints.

This paper shows that it is possible to learn models for monocular 3D reconstruction of articulated objects (e.g. horses, cows, sheep), using as few as 50-150 images labeled with 2D keypoints. Our proposed approach involves training category-specific keypoint estimators, generating 2D keypoint pseudo-labels on unlabeled web images, and using both the labeled and self-labeled sets to train 3D reconstruction models. It is based on two key insights: (1) 2D keypoint estimation networks trained on as few as 50-150 images of a given object category generalize well and generate reliable pseudo-labels; (2) a data selection mechanism can automatically create a "curated" subset of the unlabeled web images that can be used for training -- we evaluate four data selection methods. Coupling these two insights enables us to train models that effectively utilize web images, resulting in improved 3D reconstruction performance for several articulated object categories beyond the fully-supervised baseline. Our approach can quickly bootstrap a model and requires only a few images labeled with 2D keypoints. This requirement can be easily satisfied for any new object category. To showcase the practicality of our approach for predicting the 3D shape of arbitrary object categories, we annotate 2D keypoints on 250 giraffe and bear images from COCO in just 2.5 hours per category.
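The abstract's second insight, automatically curating a subset of the unlabeled web images, can be sketched with one plausible selection criterion: keep an image only if enough of its predicted keypoints are confident. The paper evaluates four data selection methods; the thresholds, the confidence-count criterion, and the function name below are illustrative assumptions, not the paper's actual mechanisms.

```python
def select_pseudo_labeled(images, conf_threshold=0.6, min_visible=5):
    """Keep web images whose predicted 2D keypoints look reliable,
    i.e., at least `min_visible` keypoints exceed `conf_threshold`.
    `images` maps image id -> list of per-keypoint confidences.
    (Hypothetical selection rule for illustration only.)"""
    selected = []
    for img_id, confs in images.items():
        confident = [c for c in confs if c >= conf_threshold]
        if len(confident) >= min_visible:
            selected.append(img_id)
    return sorted(selected)
```

Images passing the filter join the self-labeled training set; the rest are discarded rather than risked as noisy supervision.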

Matching Is Not Enough: A Two-Stage Framework for Category-Agnostic Pose Estimation
Shi, Min and Huang, Zihao and Ma, Xianzheng and Hu, Xiaowei and Cao, Zhiguo



Research question: How to accurately predict pose keypoints for arbitrary categories?
Motivation: Existing methods rely on image matching for localization, but this one-stage matching paradigm shows inferior accuracy and is prone to noise due to the open-set nature of the task.
Method: A two-stage framework is proposed that treats the keypoints matched in the first stage as similarity-aware position proposals; the model then learns to fetch relevant features to refine the initial proposals. The framework is instantiated with a transformer model tailored for CAPE.
Results: The method surpasses the previous best approach by large margins on the CAPE benchmark MP-100 in both accuracy and efficiency.

Category-agnostic pose estimation (CAPE) aims to predict keypoints for arbitrary categories given support images with keypoint annotations. Existing approaches match the keypoints across the image for localization. However, such a one-stage matching paradigm shows inferior accuracy: the prediction heavily relies on the matching results, which can be noisy due to the open set nature in CAPE. For example, two mirror-symmetric keypoints (e.g., left and right eyes) in the query image can both trigger high similarity on certain support keypoints (eyes), which leads to duplicated or opposite predictions. To calibrate the inaccurate matching results, we introduce a two-stage framework, where matched keypoints from the first stage are viewed as similarity-aware position proposals. Then, the model learns to fetch relevant features to correct the initial proposals in the second stage. We instantiate the framework with a transformer model tailored for CAPE. The transformer encoder incorporates specific designs to improve the representation and similarity modeling in the first matching stage. In the second stage, similarity-aware proposals are packed as queries in the decoder for refinement via cross-attention. Our method surpasses the previous best approach by large margins on CAPE benchmark MP-100 on both accuracy and efficiency. Code available at https://github.com/flyinglynx/CapeFormer

LaserMix for Semi-Supervised LiDAR Semantic Segmentation
Kong, Lingdong and Ren, Jiawei and Pan, Liang and Liu, Ziwei



Research question: How to effectively exploit unlabeled data for LiDAR semantic segmentation.
Motivation: Annotating LiDAR point clouds is costly, which limits the scalability of fully-supervised learning methods.
Method: A semi-supervised method named LaserMix is proposed, which mixes laser beams from different LiDAR scans and encourages the model to make consistent and confident predictions before and after mixing.
Results: Comprehensive experiments on popular LiDAR segmentation datasets (nuScenes, SemanticKITTI, and ScribbleKITTI) demonstrate its effectiveness and superiority. Notably, it achieves results competitive with fully-supervised methods using 2x to 5x fewer labels and improves the supervised-only baseline by a significant relative 10.8%.

Densely annotating LiDAR point clouds is costly, which often restrains the scalability of fully-supervised learning methods. In this work, we study the underexplored semi-supervised learning (SSL) in LiDAR semantic segmentation. Our core idea is to leverage the strong spatial cues of LiDAR point clouds to better exploit unlabeled data. We propose LaserMix to mix laser beams from different LiDAR scans and then encourage the model to make consistent and confident predictions before and after mixing. Our framework has three appealing properties. 1) Generic: LaserMix is agnostic to LiDAR representations (e.g., range view and voxel), and hence our SSL framework can be universally applied. 2) Statistically grounded: We provide a detailed analysis to theoretically explain the applicability of the proposed framework. 3) Effective: Comprehensive experimental analysis on popular LiDAR segmentation datasets (nuScenes, SemanticKITTI, and ScribbleKITTI) demonstrates our effectiveness and superiority. Notably, we achieve competitive results over fully-supervised counterparts with 2x to 5x fewer labels and improve the supervised-only baseline significantly by relatively 10.8%. We hope this concise yet high-performing framework could facilitate future research in semi-supervised LiDAR segmentation. Code is publicly available.
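The mixing operation can be sketched as partitioning two scans into bands of laser inclination (pitch) angle and interleaving them: even bands come from one scan, odd bands from the other. This is a simplified sketch assuming a range-view-style pitch partition; the band count, pitch range, and function names are illustrative, not the paper's exact configuration.

```python
import numpy as np

def laser_mix(points_a, points_b, num_areas=4, pitch_range=(-25.0, 3.0)):
    """Mix two LiDAR scans by alternating inclination-angle bands:
    even bands are taken from scan A, odd bands from scan B.
    Each scan is an (N, 3) array of x, y, z coordinates."""
    lo, hi = pitch_range
    edges = np.linspace(lo, hi, num_areas + 1)

    def pitch(pts):
        # Inclination angle (degrees) of each point above the x-y plane.
        rho = np.linalg.norm(pts[:, :2], axis=1)
        return np.degrees(np.arctan2(pts[:, 2], rho))

    def bands(pts, keep_even):
        idx = np.clip(np.digitize(pitch(pts), edges) - 1, 0, num_areas - 1)
        mask = (idx % 2 == 0) if keep_even else (idx % 2 == 1)
        return pts[mask]

    return np.concatenate([bands(points_a, True), bands(points_b, False)])
```

The consistency objective then asks the model to predict the same labels for each point whether it is seen in its original scan or in the mixed one.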

Category Query Learning for Human-Object Interaction Classification
Xie, Chi and Zeng, Fangao and Hu, Yue and Liang, Shuang and Wei, Yichen



Research question: Propose a new method that improves human-object interaction features by learning category queries.
Motivation: Most previous HOI methods focus on learning better human-object features; this work proposes a novel and complementary approach called category query learning.
Method: Queries are explicitly associated with interaction categories, converted into image-specific category representations via a transformer decoder, and learned through an auxiliary image-level classification task.
Results: The method is validated on three representative HOI baselines and achieves new state-of-the-art results on two benchmarks.

Unlike most previous HOI methods that focus on learning better human-object features, we propose a novel and complementary approach called category query learning. Such queries are explicitly associated to interaction categories, converted to image specific category representation via a transformer decoder, and learnt via an auxiliary image-level classification task. This idea is motivated by an earlier multi-label image classification method, but is for the first time applied for the challenging human-object interaction classification task. Our method is simple, general and effective. It is validated on three representative HOI baselines and achieves new state-of-the-art results on two benchmarks.

MDQE: Mining Discriminative Query Embeddings To Segment Occluded Instances on Challenging Videos
Li, Minghan and Li, Shuai and Xiang, Wangmeng and Zhang, Lei



Research question: Video instance segmentation (VIS) methods struggle on challenging videos with occluded objects and crowded scenes because instance queries fail to encode discriminative instance embeddings, making 'hard' instances difficult to distinguish.
Motivation: To address these issues, mining discriminative query embeddings (MDQE) is proposed to segment occluded instances in challenging videos.
Method: First, the positional embeddings and content features of object queries are initialized by considering spatial contextual information and inter-frame object motion. Second, an inter-instance mask repulsion loss is proposed to distance each instance from its nearby non-target instances.
Results: MDQE is the first per-clip VIS method to achieve state-of-the-art results on challenging videos and competitive performance on simple videos. Specifically, MDQE with ResNet50 achieves 33.0% and 44.5% mask AP on OVIS and YouTube-VIS 2021, respectively.

While impressive progress has been achieved, video instance segmentation (VIS) methods with per-clip input often fail on challenging videos with occluded objects and crowded scenes. This is mainly because instance queries in these methods cannot encode well the discriminative embeddings of instances, making the query-based segmenter difficult to distinguish those 'hard' instances. To address these issues, we propose to mine discriminative query embeddings (MDQE) to segment occluded instances on challenging videos. First, we initialize the positional embeddings and content features of object queries by considering their spatial contextual information and the inter-frame object motion. Second, we propose an inter-instance mask repulsion loss to distance each instance from its nearby non-target instances. The proposed MDQE is the first VIS method with per-clip input that achieves state-of-the-art results on challenging videos and competitive performance on simple videos. In specific, MDQE with ResNet50 achieves 33.0% and 44.5% mask AP on OVIS and YouTube-VIS 2021, respectively. Code of MDQE can be found at https://github.com/MinghanLi/MDQE_CVPR2023.
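The inter-instance mask repulsion idea can be sketched as penalizing each instance's predicted mask probability wherever it falls inside the ground-truth masks of the other instances. This is a deliberately simplified sketch: the averaging scheme, the absence of a "nearby" restriction, and the function signature are assumptions, not the paper's exact loss.

```python
import numpy as np

def mask_repulsion_loss(pred_probs, gt_masks):
    """For each instance, penalize predicted mask probability that leaks
    into the ground-truth masks of *other* instances.
    pred_probs: (K, H, W) floats in [0, 1]; gt_masks: (K, H, W) binary."""
    K = pred_probs.shape[0]
    loss = 0.0
    for i in range(K):
        # Union of every other instance's ground-truth mask.
        others = np.zeros(gt_masks.shape[1:], dtype=bool)
        for j in range(K):
            if j != i:
                others |= gt_masks[j].astype(bool)
        if others.any():
            loss += float(pred_probs[i][others].mean())
    return loss / K
```

A prediction that stays inside its own instance incurs zero repulsion, while leakage into a neighboring instance is penalized in proportion to the leaked probability mass.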

Class Attention Transfer Based Knowledge Distillation
Guo, Ziyao and Yan, Haonan and Li, Hui and Lin, Xiaodong



Research question: Existing knowledge distillation methods perform well on model compression tasks, but it is hard to explain how the transferred knowledge helps improve the student network.
Motivation: Mainstream CNN models classify by identifying class-discriminative regions of the input, and this capacity can be obtained and enhanced by transferring class activation maps.
Method: A knowledge distillation method with both high interpretability and competitive performance is proposed: class attention transfer based knowledge distillation (CAT-KD).
Results: CAT-KD not only improves interpretability and contributes to a better understanding of CNNs, but also achieves state-of-the-art performance on multiple benchmarks.

Previous knowledge distillation methods have shown their impressive performance on model compression tasks, however, it is hard to explain how the knowledge they transferred helps to improve the performance of the student network. In this work, we focus on proposing a knowledge distillation method that has both high interpretability and competitive performance. We first revisit the structure of mainstream CNN models and reveal that possessing the capacity of identifying class discriminative regions of input is critical for CNN to perform classification. Furthermore, we demonstrate that this capacity can be obtained and enhanced by transferring class activation maps. Based on our findings, we propose class attention transfer based knowledge distillation (CAT-KD). Different from previous KD methods, we explore and present several properties of the knowledge transferred by our method, which not only improve the interpretability of CAT-KD but also contribute to a better understanding of CNN. While having high interpretability, CAT-KD achieves state-of-the-art performance on multiple benchmarks. Code is available at: https://github.com/GzyAftermath/CAT-KD.

Weakly Supervised Segmentation With Point Annotations for Histopathology Images via Contrast-Based Variational Model
Zhang, Hongrun and Burrows, Liam and Meng, Yanda and Sculthorpe, Declan and Mukherjee, Abhik and Coupland, Sarah E. and Chen, Ke and Zheng, Yalin



Research question: How to reduce annotation cost while improving image segmentation quality.
Motivation: For histopathology images, where target regions exhibit high morphological variation and irregular shapes, annotated data is expensive and difficult to obtain.
Method: A contrast-based variational model is proposed to generate segmentation results that serve as reliable complementary supervision for training a deep segmentation model.
Results: The method produces more regionally consistent segmentations with smoother boundaries and is more robust to unlabeled 'novel' regions. Experiments on two different histopathology datasets demonstrate its effectiveness and efficiency over previous models.

Image segmentation is a fundamental task in the field of imaging and vision. Supervised deep learning for segmentation has achieved unparalleled success when sufficient training data with annotated labels are available. However, annotation is known to be expensive to obtain, especially for histopathology images where the target regions are usually with high morphology variations and irregular shapes. Thus, weakly supervised learning with sparse annotations of points is promising to reduce the annotation workload. In this work, we propose a contrast-based variational model to generate segmentation results, which serve as reliable complementary supervision to train a deep segmentation model for histopathology images. The proposed method considers the common characteristics of target regions in histopathology images and can be trained in an end-to-end manner. It can generate more regionally consistent and smoother boundary segmentation, and is more robust to unlabeled 'novel' regions. Experiments on two different histology datasets demonstrate its effectiveness and efficiency in comparison to previous models. Code is available at: https://github.com/hrzhang1123/CVM_WS_Segmentation.

Compositor: Bottom-Up Clustering and Compositing for Robust Part and Object Segmentation
He, Ju and Chen, Jieneng and Lin, Ming-Xian and Yu, Qihang and Yuille, Alan L.



Research question: This paper presents a robust approach for joint part and object segmentation.
Motivation: Existing part and object segmentation methods often overlook the integration of information from lower to higher semantic levels; this work addresses the problem via bottom-up clustering.
Method: Part and object segmentation is reformulated as an optimization problem over a hierarchical feature representation comprising pixel-, part-, and object-level embeddings, solved in a bottom-up clustering manner. Pixels are grouped into clusters with the part-level embeddings serving as cluster centers, and object masks are then obtained by compositing the part proposals.
Results: Experiments show state-of-the-art performance on PartImageNet and Pascal-Part, improving part and object mIoU by around 0.9% and 1.3% over previous methods, with robustness to occlusion improved by around 4.4% and 7.1%, respectively.

In this work, we present a robust approach for joint part and object segmentation. Specifically, we reformulate object and part segmentation as an optimization problem and build a hierarchical feature representation including pixel, part, and object-level embeddings to solve it in a bottom-up clustering manner. Pixels are grouped into several clusters where the part-level embeddings serve as cluster centers. Afterwards, object masks are obtained by compositing the part proposals. This bottom-up interaction is shown to be effective in integrating information from lower semantic levels to higher semantic levels. Based on that, our novel approach Compositor produces part and object segmentation masks simultaneously while improving the mask quality. Compositor achieves state-of-the-art performance on PartImageNet and Pascal-Part by outperforming previous methods by around 0.9% and 1.3% on PartImageNet, 0.4% and 1.7% on Pascal-Part in terms of part and object mIoU and demonstrates better robustness against occlusion by around 4.4% and 7.1% on part and object respectively.

ZBS: Zero-Shot Background Subtraction via Instance-Level Background Modeling and Foreground Selection
An, Yongqi and Zhao, Xu and Yu, Tao and Guo, Haiyun and Zhao, Chaoyang and Tang, Ming and Wang, Jinqiao



Research question: How to extract all moving objects in video frames to obtain binary foreground segmentation masks.
Motivation: Although deep learning is widely used for background subtraction, existing unsupervised deep-learning algorithms perform poorly in sophisticated scenarios (e.g., shadows or night lights) and cannot detect objects outside pre-defined categories.
Method: An unsupervised background subtraction algorithm based on zero-shot object detection, called Zero-shot Background Subtraction (ZBS), is proposed. It fully exploits the advantages of zero-shot object detection to build an open-vocabulary instance-level background model; the foreground is then extracted effectively by comparing the detections in new frames against the background model.
Results: ZBS performs well in sophisticated scenarios and supports rich, extensible categories. It also generalizes easily to other tasks, such as abandoned object detection in unseen environments. Experiments show that ZBS surpasses state-of-the-art unsupervised background subtraction methods by 4.70% F-Measure on the CDnet 2014 dataset. The code is released at https://github.com/CASIA-IVA-Lab/ZBS.

Background subtraction (BGS) aims to extract all moving objects in the video frames to obtain binary foreground segmentation masks. Deep learning has been widely used in this field. Compared with supervised-based BGS methods, unsupervised methods have better generalization. However, previous unsupervised deep learning BGS algorithms perform poorly in sophisticated scenarios such as shadows or night lights, and they cannot detect objects outside the pre-defined categories. In this work, we propose an unsupervised BGS algorithm based on zero-shot object detection called Zero-shot Background Subtraction ZBS. The proposed method fully utilizes the advantages of zero-shot object detection to build the open-vocabulary instance-level background model. Based on it, the foreground can be effectively extracted by comparing the detection results of new frames with the background model. ZBS performs well for sophisticated scenarios, and it has rich and extensible categories. Furthermore, our method can easily generalize to other tasks, such as abandoned object detection in unseen environments. We experimentally show that ZBS surpasses state-of-the-art unsupervised BGS methods by 4.70% F-Measure on the CDnet 2014 dataset. The code is released at https://github.com/CASIA-IVA-Lab/ZBS.
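Comparing new-frame detections against an instance-level background model can be sketched as a category-aware overlap test: a detection counts as foreground only if no background instance of the same category sits at (roughly) the same place. The IoU threshold, data layout, and function names below are illustrative assumptions, not the paper's actual implementation.

```python
def iou(a, b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def select_foreground(detections, background_model, iou_thr=0.5):
    """A detection is foreground unless a background instance of the
    same category overlaps it strongly (i.e., the object is static).
    Both inputs are lists of (category, box) tuples."""
    fg = []
    for cat, box in detections:
        static = any(cat == bcat and iou(box, bbox) >= iou_thr
                     for bcat, bbox in background_model)
        if not static:
            fg.append((cat, box))
    return fg
```

A parked car stored in the background model suppresses re-detections of itself, while the same car at a new location, or a new category at the old location, is reported as foreground.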

Siamese DETR
Chen, Zeren and Huang, Gengshi and Li, Wei and Teng, Jianing and Wang, Kun and Shao, Jing and Loy, Chen Change and Sheng, Lu



Research question: How to design a self-supervised pretraining method suited to DETR.
Motivation: Existing self-supervised methods are mainly designed for representation learning with base models such as ResNets or ViTs, and are hard to apply directly to DETR with its task-specific Transformer modules.
Method: Siamese DETR is proposed, a Siamese self-supervised pretraining approach for the Transformer architecture in DETR. View-invariant and detection-oriented representations are learned simultaneously through two complementary tasks, localization and discrimination, in a novel multi-view learning framework. Two self-supervised pretext tasks are designed: (i) Multi-View Region Detection, which learns to localize regions of interest between augmented views of the input; and (ii) Multi-View Semantic Discrimination, which improves object-level discrimination for each region.
Results: Siamese DETR achieves state-of-the-art transfer performance on COCO and PASCAL VOC detection with different DETR variants in all setups. Code is available at https://github.com/Zx55/SiameseDETR.

Recent self-supervised methods are mainly designed for representation learning with the base model, e.g., ResNets or ViTs. They cannot be easily transferred to DETR, with task-specific Transformer modules. In this work, we present Siamese DETR, a Siamese self-supervised pretraining approach for the Transformer architecture in DETR. We consider learning view-invariant and detection-oriented representations simultaneously through two complementary tasks, i.e., localization and discrimination, in a novel multi-view learning framework. Two self-supervised pretext tasks are designed: (i) Multi-View Region Detection aims at learning to localize regions-of-interest between augmented views of the input, and (ii) Multi-View Semantic Discrimination attempts to improve object-level discrimination for each region. The proposed Siamese DETR achieves state-of-the-art transfer performance on COCO and PASCAL VOC detection using different DETR variants in all setups. Code is available at https://github.com/Zx55/SiameseDETR.

Center Focusing Network for Real-Time LiDAR Panoptic Segmentation
Li, Xiaoyan and Zhang, Gang and Wang, Boyue and Hu, Yongli and Yin, Baocai



Research question: How to achieve accurate and real-time LiDAR panoptic segmentation so that autonomous vehicles can comprehensively understand surrounding objects and scenes.
Motivation: Existing proposal-free methods speed up the algorithm, but their effectiveness and efficiency remain limited by the difficulty of modeling non-existent instance centers and by costly center-based clustering modules.
Method: A novel center focusing network (CFNet) is proposed, comprising center focusing feature encoding (CFFE) and a fast center deduplication module (CDM). CFFE explicitly captures the relationships between the original LiDAR points and virtual instance centers by shifting the LiDAR points and filling in center points; CDM leverages the redundantly detected centers to select only one center per instance.
Results: Experiments on the SemanticKITTI and nuScenes panoptic segmentation benchmarks show that CFNet outperforms all existing methods by a large margin and is 1.6 times faster than the most efficient method.

LiDAR panoptic segmentation facilitates an autonomous vehicle to comprehensively understand the surrounding objects and scenes and is required to run in real time. The recent proposal-free methods accelerate the algorithm, but their effectiveness and efficiency are still limited owing to the difficulty of modeling non-existent instance centers and the costly center-based clustering modules. To achieve accurate and real-time LiDAR panoptic segmentation, a novel center focusing network (CFNet) is introduced. Specifically, the center focusing feature encoding (CFFE) is proposed to explicitly understand the relationships between the original LiDAR points and virtual instance centers by shifting the LiDAR points and filling in the center points. Moreover, to leverage the redundantly detected centers, a fast center deduplication module (CDM) is proposed to select only one center for each instance. Experiments on the SemanticKITTI and nuScenes panoptic segmentation benchmarks demonstrate that our CFNet outperforms all existing methods by a large margin and is 1.6 times faster than the most efficient method.
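Selecting one center per instance from redundant candidates can be sketched as a greedy, NMS-style pass: accept candidates in order of confidence and discard any candidate that falls within a radius of an already accepted center. The radius, the greedy scheme, and the function name are assumptions for illustration; the paper's CDM may differ in detail.

```python
import math

def deduplicate_centers(centers, radius=0.5):
    """Keep one center per instance: greedily accept the most confident
    candidates and drop any candidate within `radius` of an accepted one.
    `centers` is a list of (confidence, x, y) tuples."""
    kept = []
    for conf, x, y in sorted(centers, reverse=True):  # confidence descending
        if all(math.hypot(x - kx, y - ky) > radius for _, kx, ky in kept):
            kept.append((conf, x, y))
    return kept
```

Two candidates predicted for the same object collapse to the more confident one, while well-separated candidates survive as distinct instances.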

Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization
Xu, Lian and Ouyang, Wanli and Bennamoun, Mohammed and Boussaid, Farid and Xu, Dan



Research question: Weakly supervised dense object localization (WSDOL) generally relies on Class Activation Mapping (CAM), whose limited ability to handle intra-class variations leads to inaccurate pixel-feature associations and thus inaccurate dense localization maps.
Motivation: To address this, the paper proposes to explicitly construct multi-modal class representations by leveraging Contrastive Language-Image Pre-training (CLIP) to guide dense localization.
Method: A unified transformer framework is proposed to learn two modalities of class-specific tokens: class-specific visual tokens, which capture semantics from the target visual data, and class-specific textual tokens, which exploit class-related language priors from CLIP and provide complementary information for better perceiving intra-class diversity. In addition, the multi-modal class-specific tokens are enriched with sample-specific contexts (visual context and image-language context), making class representation learning more adaptive and further facilitating dense localization.
Results: Extensive experiments show superior WSDOL performance on two multi-label datasets (PASCAL VOC and MS COCO) and one single-label dataset (OpenImages). The dense localization maps also achieve state-of-the-art weakly supervised semantic segmentation (WSSS) results on PASCAL VOC and MS COCO.

Weakly supervised dense object localization (WSDOL) relies generally on Class Activation Mapping (CAM), which exploits the correlation between the class weights of the image classifier and the pixel-level features. Due to the limited ability to address intra-class variations, the image classifier cannot properly associate the pixel features, leading to inaccurate dense localization maps. In this paper, we propose to explicitly construct multi-modal class representations by leveraging the Contrastive Language-Image Pre-training (CLIP), to guide dense localization. More specifically, we propose a unified transformer framework to learn two-modalities of class-specific tokens, i.e., class-specific visual and textual tokens. The former captures semantics from the target visual data while the latter exploits the class-related language priors from CLIP, providing complementary information to better perceive the intra-class diversities. In addition, we propose to enrich the multi-modal class-specific tokens with sample-specific contexts comprising visual context and image-language context. This enables more adaptive class representation learning, which further facilitates dense localization. Extensive experiments show the superiority of the proposed method for WSDOL on two multi-label datasets, i.e., PASCAL VOC and MS COCO, and one single-label dataset, i.e., OpenImages. Our dense localization maps also lead to the state-of-the-art weakly supervised semantic segmentation (WSSS) results on PASCAL VOC and MS COCO.

Decoupled Semantic Prototypes Enable Learning From Diverse Annotation Types for Semi-Weakly Segmentation in Expert-Driven Domains
Reiß



Research question: How to extend existing image segmentation solutions to expert-driven domains such as microscopy applications or medical healthcare?
Motivation: Because domain experts are scarce and their availability for providing pixel-wise annotations is limited, transferring existing segmentation solutions to expert-driven domains remains challenging.
Method: After analyzing the flexibility and scalability of existing training algorithms, a new training method, Decoupled Semantic Prototypes (DSP), is proposed that can learn from annotation types as diverse as image-level, point, bounding-box, and pixel-wise annotations.
Results: An extensive evaluation in the challenging domain of organelle segmentation shows that DSP exploits diverse annotation types far better than existing semi- and semi-weakly supervised segmentation algorithms and yields remarkable accuracy gains.

A vast amount of images and pixel-wise annotations allowed our community to build scalable segmentation solutions for natural domains. However, the transfer to expert-driven domains like microscopy applications or medical healthcare remains difficult as domain experts are a critical factor due to their limited availability for providing pixel-wise annotations. To enable affordable segmentation solutions for such domains, we need training strategies which can simultaneously handle diverse annotation types and are not bound to costly pixel-wise annotations. In this work, we analyze existing training algorithms towards their flexibility for different annotation types and scalability to small annotation regimes. We conduct an extensive evaluation in the challenging domain of organelle segmentation and find that existing semi- and semi-weakly supervised training algorithms are not able to fully exploit diverse annotation types. Driven by our findings, we introduce Decoupled Semantic Prototypes (DSP) as a training method for semantic segmentation which enables learning from annotation types as diverse as image-level-, point-, bounding box-, and pixel-wise annotations and which leads to remarkable accuracy gains over existing solutions for semi-weakly segmentation.

Iterative Next Boundary Detection for Instance Segmentation of Tree Rings in Microscopy Images of Shrub Cross Sections
Gillert, Alexander and Resente, Giulia and Anadon-Rosell, Alba and Wilmking, Martin and von Lukas, Uwe Freiherr



Research question: This paper addresses the detection of tree rings in microscopy images of shrub cross sections.
Motivation: This is a special case of instance segmentation in which existing methods perform inadequately due to the concentric circular ring shape of the objects and the high precision requirements.
Method: A new iterative method, Iterative Next Boundary Detection (INBD), is proposed. It intuitively models the natural growth direction, starting from the center of the shrub cross section and detecting the next ring boundary in each iteration step.
Results: Experiments show that INBD outperforms generic instance segmentation methods and is the only one with a built-in notion of chronological order. The dataset and source code are available at http://github.com/alexander-g/INBD.

We address the problem of detecting tree rings in microscopy images of shrub cross sections. This can be regarded as a special case of the instance segmentation task with several unique challenges such as the concentric circular ring shape of the objects and high precision requirements that result in inadequate performance of existing methods. We propose a new iterative method which we term Iterative Next Boundary Detection (INBD). It intuitively models the natural growth direction, starting from the center of the shrub cross section and detecting the next ring boundary in each iteration step. In our experiments, INBD shows superior performance to generic instance segmentation methods and is the only one with a built-in notion of chronological order. Our dataset and source code are available at http://github.com/alexander-g/INBD.

Universal Instance Perception As Object Discovery and Retrieval
Yan, Bin and Jiang, Yi and Wu, Jiannan and Wang, Dong and Luo, Ping and Yuan, Zehuan and Lu, Huchuan



Research question: How to unify the diverse instance perception tasks so as to exploit large amounts of data more effectively and reduce redundant computation.
Motivation: Current instance perception tasks are split into multiple independent subtasks and lack a unified treatment.
Method: A next-generation universal instance perception model, UNINEXT, is proposed, which reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm and can flexibly perceive different types of objects by simply changing the input prompts.
Results: UNINEXT shows superior performance on 20 challenging benchmarks from 10 instance-level tasks, including classical image-level tasks (object detection and instance segmentation), vision-and-language tasks (referring expression comprehension and segmentation), and six video-level object tracking tasks.

All instance perception tasks aim at finding certain objects specified by some queries such as category names, language expressions, and target annotations, but this complete field has been split into multiple independent subtasks. In this work, we present a universal instance perception model of the next generation, termed UNINEXT. UNINEXT reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm and can flexibly perceive different types of objects by simply changing the input prompts. This unified formulation brings the following benefits: (1) enormous data from different tasks and label vocabularies can be exploited for jointly training general instance-level representations, which is especially beneficial for tasks lacking in training data. (2) the unified model is parameter-efficient and can save redundant computation when handling multiple tasks simultaneously. UNINEXT shows superior performance on 20 challenging benchmarks from 10 instance-level tasks including classical image-level tasks (object detection and instance segmentation), vision-and-language tasks (referring expression comprehension and segmentation), and six video-level object tracking tasks. Code is available at https://github.com/MasterBin-IIAU/UNINEXT.

MCF: Mutual Correction Framework for Semi-Supervised Medical Image Segmentation
Wang, Yongchao and Xiao, Bin and Bi, Xiuli and Li, Weisheng and Gao, Xinbo



Research question: How to improve medical image segmentation accuracy via semi-supervised learning under limited annotation, and how to counter the effect of model cognitive bias on edge-region segmentation.
Motivation: Current semi-supervised medical image segmentation (SSMIS) methods lack designs to handle model cognitive bias, which gradually deepens during training and is difficult to self-correct.
Method: A novel mutual correction framework (MCF) is proposed, introducing two different subnets and exploiting the discrepancies between them to correct the model's cognitive bias. Specifically, a contrastive difference review (CDR) module finds regions of inconsistent prediction and performs review training, and a dynamic competitive pseudo-label generation (DCPLG) module evaluates the subnets' performance in real time to dynamically select the more reliable pseudo-labels.
Results: Experiments on two medical image databases with different modalities (CT and MRI) show superior performance compared to several state-of-the-art methods.

Semi-supervised learning is a promising method for medical image segmentation under limited annotation. However, the model cognitive bias impairs the segmentation performance, especially for edge regions. Furthermore, current mainstream semi-supervised medical image segmentation (SSMIS) methods lack designs to handle model bias. The neural network has a strong learning ability, but the cognitive bias will gradually deepen during the training, and it is difficult to correct itself. We propose a novel mutual correction framework (MCF) to explore network bias correction and improve the performance of SSMIS. Inspired by the plain contrast idea, MCF introduces two different subnets to explore and utilize the discrepancies between subnets to correct cognitive bias of the model. More concretely, a contrastive difference review (CDR) module is proposed to find out inconsistent prediction regions and perform a review training. Additionally, a dynamic competitive pseudo-label generation (DCPLG) module is proposed to evaluate the performance of subnets in real-time, dynamically selecting more reliable pseudo-labels. Experimental results on two medical image databases with different modalities (CT and MRI) show that our method achieves superior performance compared to several state-of-the-art methods. The code will be available at https://github.com/WYC-321/MCF.
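The competitive pseudo-label idea can be sketched as a simple rule: score both subnets on the labeled data (e.g., by Dice), and let the currently better subnet supply the pseudo-labels for the unlabeled data. The granularity (per batch vs. per volume), the Dice criterion, and the function names are assumptions for illustration; the paper's DCPLG module may operate differently.

```python
def dice(pred, gt):
    # Dice overlap between two binary masks given as flat 0/1 lists.
    inter = sum(p * g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 2.0 * inter / total if total else 1.0

def pick_pseudo_labels(pred_a, pred_b, labeled_pred_a, labeled_pred_b, gt):
    """Competitively choose pseudo-labels: whichever subnet scores the
    higher Dice on the current labeled data supplies the pseudo-labels
    (pred_a or pred_b) for the unlabeled data."""
    if dice(labeled_pred_a, gt) >= dice(labeled_pred_b, gt):
        return pred_a
    return pred_b
```

Because the comparison is re-evaluated continually, the source of supervision can switch between subnets as their relative quality changes during training.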

Semantic Human Parsing via Scalable Semantic Transfer Over Multiple Label Domains
Yang, Jie and Wang, Chaoqun and Li, Zhen and Wang, Junle and Zhang, Ruimao



Research question: How to leverage the mutual benefits of data from different label domains (i.e., different levels of label granularity) to train a powerful human parsing network.
Motivation: In practice there are two common application scenarios, universal parsing and dedicated parsing: the former aims to learn homogeneous human representations from multiple label domains and switch predictions by using only different segmentation heads, while the latter aims to learn a domain-specific prediction while distilling semantic knowledge from other domains.
Method: A Scalable Semantic Transfer (SST) training paradigm is proposed that embeds the semantic associations of human body parts across multiple label domains into the human representation learning process.
Results: Experiments demonstrate that SST effectively achieves promising universal human parsing performance and yields significant improvements on three human parsing benchmarks (PASCAL-Person-Part, ATR, and CIHP).

This paper presents Scalable Semantic Transfer (SST), a novel training paradigm, to explore how to leverage the mutual benefits of the data from different label domains (i.e. various levels of label granularity) to train a powerful human parsing network. In practice, two common application scenarios are addressed, termed universal parsing and dedicated parsing, where the former aims to learn homogeneous human representations from multiple label domains and switch predictions by only using different segmentation heads, and the latter aims to learn a specific domain prediction while distilling the semantic knowledge from other domains. The proposed SST has the following appealing benefits: (1) it can capably serve as an effective training scheme to embed semantic associations of human body parts from multiple label domains into the human representation learning process; (2) it is an extensible semantic transfer framework without predetermining the overall relations of multiple label domains, which allows continuously adding human parsing datasets to promote the training. (3) the relevant modules are only used for auxiliary training and can be removed during inference, eliminating the extra reasoning cost. Experimental results demonstrate SST can effectively achieve promising universal human parsing performance as well as impressive improvements compared to its counterparts on three human parsing benchmarks (i.e., PASCAL-Person-Part, ATR, and CIHP). Code is available at https://github.com/yangjie-cv/SST.

RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension
Jin, Lei and Luo, Gen and Zhou, Yiyi and Sun, Xiaoshuai and Jiang, Guannan and Shu, Annan and Ji, Rongrong



Research question: How to reduce the development cost of the referring expression comprehension (REC) task while improving its performance?
Motivation: Existing weakly supervised methods are mostly built on two-stage detection networks, which are computationally heavy and expensive.
Method: A weakly supervised model named RefCLIP is proposed, which formulates REC as an anchor-text matching problem, avoiding the complex post-processing of existing methods, and is optimized with an anchor-based contrastive loss.
Results: On four REC benchmarks, RefCLIP not only improves markedly over existing weakly supervised models, e.g., by 24.87% on RefCOCO, but also runs 5x faster at inference than existing models.

Referring Expression Comprehension (REC) is a task of grounding the referent based on an expression, and its development is greatly limited by expensive instance-level annotations. Most existing weakly supervised methods are built based on two-stage detection networks, which are computationally expensive. In this paper, we resort to the efficient one-stage detector and propose a novel weakly supervised model called RefCLIP. Specifically, RefCLIP redefines weakly supervised REC as an anchor-text matching problem, which can avoid the complex post-processing in existing methods. To achieve weakly supervised learning, we introduce anchor-based contrastive loss to optimize RefCLIP via numerous anchor-text pairs. Based on RefCLIP, we further propose the first model-agnostic weakly supervised training scheme for existing REC models, where RefCLIP acts as a mature teacher to generate pseudo-labels for teaching common REC models. With our careful designs, this scheme can even help existing REC models achieve better weakly supervised performance than RefCLIP, e.g., TransVG and SimREC. To validate our approaches, we conduct extensive experiments on four REC benchmarks, i.e., RefCOCO, RefCOCO+, RefCOCOg and ReferItGame. Experimental results not only report our significant performance gains over existing weakly supervised models, e.g., +24.87% on RefCOCO, but also show the 5x faster inference speed. Project: https://refclip.github.io.
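An anchor-based contrastive loss over anchor-text pairs can be sketched as an InfoNCE objective: each anchor embedding should score highest against its matched expression embedding among all expressions in the batch. The normalization, temperature value, and function name below are common conventions assumed for illustration, not RefCLIP's exact loss.

```python
import numpy as np

def anchor_text_contrastive_loss(anchor_feats, text_feats, tau=0.07):
    """InfoNCE-style loss over matched anchor/text pairs: the i-th anchor
    should score highest against the i-th expression embedding.
    Shapes: (N, D) each; rows are assumed to be matched pairs."""
    a = anchor_feats / np.linalg.norm(anchor_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = a @ t.T / tau                       # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diagonal(log_probs).mean())
```

Correctly matched pairs drive the loss toward zero, while mismatched pairs keep it high, which is the signal used to train the matching without instance-level annotations.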

Boundary-Enhanced Co-Training for Weakly Supervised Semantic Segmentation
Rong, Shenghai and Tu, Bohai and Wang, Zilei and Li, Junjie



Research question: Existing weakly supervised semantic segmentation methods focus on generating accurate and complete class activation maps as pseudo-labels while neglecting the importance of training the segmentation network.
Motivation: The study finds an inconsistency between the quality of the pseudo-labels and the performance of the final segmentation model, with mislabeled pixels lying mainly in boundary regions; the focus of weakly supervised semantic segmentation should therefore shift to robust learning from noisy pseudo-labels.
Method: A boundary-enhanced co-training (BECO) method is proposed for training the segmentation model. Specifically, a co-training paradigm with two interactive networks first improves the learning of uncertain pixels; a boundary-enhanced strategy then constructs artificial boundaries from reliable predictions to boost prediction accuracy in difficult boundary regions.
Results: Experiments show that the method outperforms other state-of-the-art methods on the PASCAL VOC 2012 and MS COCO 2014 datasets.

The existing weakly supervised semantic segmentation (WSSS) methods pay much attention to generating accurate and complete class activation maps (CAMs) as pseudo-labels, while ignoring the importance of training the segmentation networks. In this work, we observe that there is an inconsistency between the quality of the pseudo-labels in CAMs and the performance of the final segmentation model, and the mislabeled pixels mainly lie on the boundary areas. Inspired by these findings, we argue that the focus of WSSS should be shifted to robust learning given the noisy pseudo-labels, and further propose a boundary-enhanced co-training (BECO) method for training the segmentation model. To be specific, we first propose to use a co-training paradigm with two interactive networks to improve the learning of uncertain pixels. Then we propose a boundary-enhanced strategy to boost the prediction of difficult boundary areas, which utilizes reliable predictions to construct artificial boundaries. Benefiting from the design of co-training and boundary enhancement, our method can achieve promising segmentation performance for different CAMs. Extensive experiments on PASCAL VOC 2012 and MS COCO 2014 validate the superiority of our BECO over other state-of-the-art methods.

Uni3D: A Unified Baseline for Multi-Dataset 3D Object Detection
Zhang, Bo and Yuan, Jiakang and Shi, Botian and Chen, Tao and Li, Yikang and Qiao, Yu



Research question: Current 3D object detectors are trained and tested on a single dataset, and their accuracy drops sharply when they are deployed directly on another dataset.
Motivation: Data-level differences and taxonomy-level variations caused by different LiDAR types and data acquisition standards make training a unified 3D detector from multiple datasets a challenging task.
Method: The proposed Uni3D model alleviates the unavoidable data-level and taxonomy-level differences with a simple data-level correction operation and a designed semantic-level coupling-and-recoupling module, respectively. The method is simple, easily combined with many 3D object detection baselines such as PV-RCNN and Voxel-RCNN, and enables them to learn effectively from multiple off-the-shelf 3D datasets, yielding more discriminative and generalizable feature representations.
Results: Experiments show that Uni3D exceeds a series of individual detectors trained on single datasets across many dataset-consolidation settings, with only a 1.04x parameter increase over the selected baseline detector. This work is expected to inspire research on 3D generalization by pushing the limits of perceptual performance.

Current 3D object detection models follow a single dataset-specific training and testing paradigm, which often faces a serious detection accuracy drop when they are directly deployed in another dataset. In this paper, we study the task of training a unified 3D detector from multiple datasets. We observe that this appears to be a challenging task, which is mainly due to that these datasets present substantial data-level differences and taxonomy-level variations caused by different LiDAR types and data acquisition standards. Inspired by such observation, we present a Uni3D which leverages a simple data-level correction operation and a designed semantic-level coupling-and-recoupling module to alleviate the unavoidable data-level and taxonomy-level differences, respectively. Our method is simple and easily combined with many 3D object detection baselines such as PV-RCNN and Voxel-RCNN, enabling them to effectively learn from multiple off-the-shelf 3D datasets to obtain more discriminative and generalizable representations. Experiments are conducted on many dataset consolidation settings. Their results demonstrate that Uni3D exceeds a series of individual detectors trained on a single dataset, with a 1.04x parameter increase over a selected baseline detector. We expect this work will inspire the research of 3D generalization since it will push the limits of perceptual performance. Our code is available at: https://github.com/PJLab-ADG/3DTrans

Devil's on the Edges: Selective Quad Attention for Scene Graph Generation
Jung, Deunsol and Kim, Sanghyun and Kim, WonHwa and Cho, Minsu



Research question: How to construct a semantic graph structure from an image while handling distracting objects and relationships in it.
Motivation: A major challenge of scene graph generation is the large number of irrelevant objects and relations present in images, which severely distract contextual reasoning.
Method: The Selective Quad Attention Network (SQUAT) is proposed, which learns to select relevant object pairs and disambiguate them via diverse contextual interactions.
Results: Experiments show that SQUAT delivers strong performance and robustness, achieving the state of the art on the Visual Genome and Open Images v6 benchmarks.

Scene graph generation aims to construct a semantic graph structure from an image such that its nodes and edges respectively represent objects and their relationships. One of the major challenges for the task lies in the presence of distracting objects and relationships in images; contextual reasoning is strongly distracted by irrelevant objects or backgrounds and, more importantly, a vast number of irrelevant candidate relations. To tackle the issue, we propose the Selective Quad Attention Network (SQUAT) that learns to select relevant object pairs and disambiguate them via diverse contextual interactions. SQUAT consists of two main components: edge selection and quad attention. The edge selection module selects relevant object pairs, i.e., edges in the scene graph, which helps contextual reasoning, and the quad attention module then updates the edge features using both edge-to-node and edge-to-edge cross-attentions to capture contextual information between objects and object pairs. Experiments demonstrate the strong performance and robustness of SQUAT, achieving the state of the art on the Visual Genome and Open Images v6 benchmarks.

NIFF: Alleviating Forgetting in Generalized Few-Shot Object Detection via Neural Instance Feature Forging
Guirguis, Karim and Meier, Johannes and Eskandar, George and Kayser, Matthias and Yang, Bin and Beyerer, Jürgen



Research question: How to reduce AI training's reliance on large amounts of data, particularly when sharing and storing data is problematic.
Motivation: Existing generalized few-shot object detection (G-FSOD) methods require access to images of old (i.e., base) classes in order to learn novel classes, which is impossible when data sharing and storage is problematic.
Method: A data-free knowledge distillation (DFKD) approach is proposed that requires no access to base images. It leverages the statistics of the base model's region-of-interest (RoI) features to forge instance-level features.
Results: The approach dramatically reduces base memory requirements while setting a new standard for G-FSOD on the challenging MS-COCO and PASCAL-VOC benchmarks.

Privacy and memory are two recurring themes in a broad conversation about the societal impact of AI. These concerns arise from the need for huge amounts of data to train deep neural networks. A promise of Generalized Few-shot Object Detection (G-FSOD), a learning paradigm in AI, is to alleviate the need for collecting abundant training samples of novel classes we wish to detect by leveraging prior knowledge from old classes (i.e., base classes). G-FSOD strives to learn these novel classes while alleviating catastrophic forgetting of the base classes. However, existing approaches assume that the base images are accessible, an assumption that does not hold when sharing and storing data is problematic. In this work, we propose the first data-free knowledge distillation (DFKD) approach for G-FSOD that leverages the statistics of the region of interest (RoI) features from the base model to forge instance-level features without accessing the base images. Our contribution is three-fold: (1) we design a standalone lightweight generator with (2) class-wise heads (3) to generate and replay diverse instance-level base features to the RoI head while finetuning on the novel data. This stands in contrast to standard DFKD approaches in image classification, which invert the entire network to generate base images. Moreover, we make careful design choices in the novel finetuning pipeline to regularize the model. We show that our approach can dramatically reduce the base memory requirements, all while setting a new standard for G-FSOD on the challenging MS-COCO and PASCAL-VOC benchmarks.

PartDistillation: Learning Parts From Instance Segmentation
Cho, JangHyun and Krähenbühl, Philipp and Ramanathan, Vignesh



Research question: How to learn part segmentation from object instance labels.
Motivation: Existing instance segmentation models contain a large amount of hidden part information, but this information is typically noisy, incomplete, and inconsistent.
Method: The part information of an instance segmentation model is transferred into a part segmentation model through self-supervised self-training on a large dataset; the resulting model is robust, accurate, and generalizes well.
Results: Evaluated on various part segmentation datasets, the model outperforms supervised part segmentation in zero-shot generalization, and when finetuned on target datasets it also outperforms supervised counterparts and other baselines, especially in the few-shot regime. Moreover, it provides wider coverage of rare parts when evaluated over 10K object classes.

We present a scalable framework to learn part segmentation from object instance labels. State-of-the-art instance segmentation models contain a surprising amount of part information. However, much of this information is hidden from plain view. For each object instance, the part information is noisy, inconsistent, and incomplete. PartDistillation transfers the part information of an instance segmentation model into a part segmentation model through self-supervised self-training on a large dataset. The resulting segmentation model is robust, accurate, and generalizes well. We evaluate the model on various part segmentation datasets. Our model outperforms supervised part segmentation in zero-shot generalization performance by a large margin. When finetuned on target datasets, our model also outperforms its supervised counterpart and other baselines, especially in the few-shot regime. Finally, our model provides a wider coverage of rare parts when evaluated over 10K object classes. Code is at https://github.com/facebookresearch/PartDistillation.

Boosting Video Object Segmentation via Space-Time Correspondence Learning
Zhang, Yurong and Li, Liulei and Wang, Wenguan and Xie, Rong and Song, Li and Zhang, Wenjun



Research question: How to achieve better space-time correspondence matching in video object segmentation (VOS).
Motivation: Current mainstream VOS methods learn only from ground-truth masks and impose no constraint on space-time correspondence matching, a crucial yet commonly ignored issue in the fundamental building block of such methods.
Method: A correspondence-aware training framework is devised that boosts matching-based VOS solutions by explicitly encouraging robust correspondence matching during network learning.
Results: On four widely used benchmarks (DAVIS 2016 & 2017, YouTube-VOS 2018 & 2019), the algorithm delivers solid performance gains on top of well-known matching-based VOS solutions, with no extra annotation cost during training, no speed delay during deployment, and no architectural modification.

Current top-leading solutions for video object segmentation (VOS) typically follow a matching-based regime: for each query frame, the segmentation mask is inferred according to its correspondence to previously processed and the first annotated frames. They simply exploit the supervisory signals from the groundtruth masks for learning mask prediction only, without posing any constraint on the space-time correspondence matching, which, however, is the fundamental building block of such a regime. To alleviate this crucial yet commonly ignored issue, we devise a correspondence-aware training framework, which boosts matching-based VOS solutions by explicitly encouraging robust correspondence matching during network learning. By comprehensively exploring the intrinsic coherence in videos on pixel and object levels, our algorithm reinforces the standard, fully supervised training of mask segmentation with label-free, contrastive correspondence learning. Without requiring extra annotation cost during training, causing speed delay during deployment, or incurring architectural modification, our algorithm provides solid performance gains on four widely used benchmarks, i.e., DAVIS2016&2017 and YouTube-VOS2018&2019, on top of famous matching-based VOS solutions. Our implementation will be released.

You Only Segment Once: Towards Real-Time Panoptic Segmentation
Hu, Jie and Huang, Linyan and Ren, Tianhe and Zhang, Shengchuan and Ji, Rongrong and Cao, Liujuan



Research question: This paper proposes YOSO, a real-time panoptic segmentation framework.
Motivation: To avoid handling the instance and semantic segmentation tasks separately and to reduce computational overhead.
Method: Masks are predicted via dynamic convolutions between panoptic kernels and image feature maps; a feature pyramid aggregator is designed for feature map extraction, and a separable dynamic decoder generates the panoptic kernels.
Results: Experiments show that YOSO delivers performance competitive with state-of-the-art models on COCO, Cityscapes, ADE20K, and Mapillary Vistas, with high efficiency and accuracy.

In this paper, we propose YOSO, a real-time panoptic segmentation framework. YOSO predicts masks via dynamic convolutions between panoptic kernels and image feature maps, in which you only need to segment once for both instance and semantic segmentation tasks. To reduce the computational overhead, we design a feature pyramid aggregator for the feature map extraction, and a separable dynamic decoder for the panoptic kernel generation. The aggregator re-parameterizes interpolation-first modules in a convolution-first way, which significantly speeds up the pipeline without any additional costs. The decoder performs multi-head cross-attention via separable dynamic convolution for better efficiency and accuracy. To the best of our knowledge, YOSO is the first real-time panoptic segmentation framework that delivers competitive performance compared to state-of-the-art models. Specifically, YOSO achieves 46.4 PQ, 45.6 FPS on COCO; 52.5 PQ, 22.6 FPS on Cityscapes; 38.0 PQ, 35.4 FPS on ADE20K; and 34.1 PQ, 7.1 FPS on Mapillary Vistas. Code is available at https://github.com/hujiecpp/YOSO.
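The core "segment once" step above, predicting masks via dynamic convolutions between panoptic kernels and the feature map, reduces in its simplest 1x1 form to a matrix product. A hedged sketch (the 1x1 simplification and shapes are assumptions; YOSO's separable dynamic decoder is more involved):

```python
import numpy as np

def dynamic_conv_masks(feature_map, kernels):
    """Predict mask logits by 1x1 dynamic convolution: one learned kernel
    per panoptic segment applied to a shared feature map.

    feature_map: (C, H, W) image features
    kernels:     (K, C) panoptic kernels, one row per predicted segment
    returns:     (K, H, W) mask logits
    """
    C, H, W = feature_map.shape
    flat = feature_map.reshape(C, H * W)  # flatten the spatial dimensions
    logits = kernels @ flat               # (K, HW): per-kernel response map
    return logits.reshape(-1, H, W)
```

Because both "things" (instances) and "stuff" (semantic regions) are represented by kernel rows, a single pass produces all panoptic masks at once.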

CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection
Ma, Shuailei and Wang, Yuefeng and Wei, Ying and Fan, Jiaqi and Li, Thomas H. and Liu, Hongli and Lv, Fanbing



Research question: How to train a model that can detect both known and unknown objects and incrementally learn to identify the unknown ones.
Motivation: Existing detection frameworks with fixed pseudo-labelling mechanisms suffer from the following problems: (i) detecting unknown objects substantially reduces the model's ability to detect known ones; (ii) the pseudo-labelling mechanism does not adequately exploit the prior knowledge of the inputs; (iii) its fixed selection manner cannot guarantee that the model is trained in the right direction.
Method: A novel solution called CAT (LoCalization and IdentificAtion Cascade Detection Transformer) is proposed, which decouples the detection process via cascade decoding. In addition, a self-adaptive pseudo-labelling mechanism combines model-driven and input-driven pseudo-labelling to adaptively generate robust pseudo-labels for unknown objects, significantly improving CAT's ability to retrieve them.
Results: Comprehensive experiments on the MS-COCO and PASCAL VOC benchmarks show that the model outperforms the state of the art on all metrics for open-world object detection, incremental object detection, and open-set detection.

Open-world object detection (OWOD), as a more general and challenging goal, requires the model trained from data on known objects to detect both known and unknown objects and incrementally learn to identify these unknown objects. The existing works which employ standard detection framework and fixed pseudo-labelling mechanism (PLM) have the following problems: (i) The inclusion of detecting unknown objects substantially reduces the model's ability to detect known ones. (ii) The PLM does not adequately utilize the priori knowledge of inputs. (iii) The fixed selection manner of PLM cannot guarantee that the model is trained in the right direction. We observe that humans subconsciously prefer to focus on all foreground objects and then identify each one in detail, rather than localize and identify a single object simultaneously, for alleviating the confusion. This motivates us to propose a novel solution called CAT: LoCalization and IdentificAtion Cascade Detection Transformer which decouples the detection process via the shared decoder in the cascade decoding way. Meanwhile, we propose a self-adaptive pseudo-labelling mechanism which combines model-driven and input-driven PLM and self-adaptively generates robust pseudo-labels for unknown objects, significantly improving the ability of CAT to retrieve unknown objects. Comprehensive experiments on two benchmark datasets, i.e., MS-COCO and PASCAL VOC, show that our model outperforms the state-of-the-art in terms of all metrics in the task of OWOD, incremental object detection (IOD) and open-set detection.

LOCATE: Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding
Li, Gen and Jampani, Varun and Sun, Deqing and Sevilla-Lara, Laura



Research question: How to identify, by observation, which part of an object affords an action, i.e., affordance grounding.
Motivation: The human ability to learn to use new tools by watching demonstrations is fundamental for intelligent systems that interact with the world.
Method: A framework named LOCATE is proposed that identifies matching object parts across images, transferring knowledge from exocentric images where the object is in use (used for learning) to egocentric images where the object is inactive (used for testing).
Results: Experiments show the approach outperforms state-of-the-art methods by a large margin on both seen and unseen objects.

Humans excel at acquiring knowledge through observation. For example, we can learn to use new tools by watching demonstrations. This skill is fundamental for intelligent systems to interact with the world. A key step to acquire this skill is to identify what part of the object affords each action, which is called affordance grounding. In this paper, we address this problem and propose a framework called LOCATE that can identify matching object parts across images, to transfer knowledge from images where an object is being used (exocentric images used for learning), to images where the object is inactive (egocentric ones used to test). To this end, we first find interaction areas and extract their feature embeddings. Then we learn to aggregate the embeddings into compact prototypes (human, object part, and background), and select the one representing the object part. Finally, we use the selected prototype to guide affordance grounding. We do this in a weakly supervised manner, learning only from image-level affordance and object labels. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods by a large margin on both seen and unseen objects.

Cut and Learn for Unsupervised Object Detection and Instance Segmentation
Wang, Xudong and Girdhar, Rohit and Yu, Stella X. and Misra, Ishan



Research question: This paper proposes Cut-and-LEaRn (CutLER), a simple approach for training unsupervised object detection and segmentation models.
Motivation: Leverage the property of self-supervised models to discover and localize objects without any human labels, and amplify it to train a state-of-the-art localization model.
Method: The proposed MaskCut approach first generates coarse masks for multiple objects in an image; a detector is then learned on these masks with a robust loss function. Performance is further improved by self-training the model on its own predictions.
Results: Compared to prior work, CutLER is simpler, compatible with different detection architectures, and detects multiple objects. As a zero-shot unsupervised detector, it improves detection performance AP_50 by over 2.7x on 11 benchmarks across domains such as video frames, paintings, and sketches. With finetuning on 5% of labels, CutLER serves as a low-shot detector, surpassing MoCo-v2 by 7.3% AP^box and 6.6% AP^mask on COCO.

We propose Cut-and-LEaRn (CutLER), a simple approach for training unsupervised object detection and segmentation models. We leverage the property of self-supervised models to 'discover' objects without supervision and amplify it to train a state-of-the-art localization model without any human labels. CutLER first uses our proposed MaskCut approach to generate coarse masks for multiple objects in an image, and then learns a detector on these masks using our robust loss function. We further improve performance by self-training the model on its predictions. Compared to prior work, CutLER is simpler, compatible with different detection architectures, and detects multiple objects. CutLER is also a zero-shot unsupervised detector and improves detection performance AP_50 by over 2.7x on 11 benchmarks across domains like video frames, paintings, sketches, etc. With finetuning, CutLER serves as a low-shot detector surpassing MoCo-v2 by 7.3% AP^box and 6.6% AP^mask on COCO when training with 5% labels.

Side Adapter Network for Open-Vocabulary Semantic Segmentation
Xu, Mengde and Zhang, Zheng and Wei, Fangyun and Hu, Han and Bai, Xiang



Research question: This paper presents SAN, a new framework for open-vocabulary semantic segmentation.
Motivation: Model the semantic segmentation task as a region recognition problem and build on the pre-trained vision-language model CLIP.
Method: A side network is attached to a frozen CLIP model to predict mask proposals and attention biases; this decoupled design enables CLIP to recognize the class of the mask proposals. The entire network can be trained end-to-end, adapting the side network to the frozen CLIP model and making the predicted mask proposals CLIP-aware.
Results: On multiple semantic segmentation benchmarks, the method significantly outperforms other counterparts, with up to 18x fewer trainable parameters and 19x faster inference. The authors hope the approach serves as a solid baseline and eases future research on open-vocabulary semantic segmentation.

This paper presents a new framework for open-vocabulary semantic segmentation with the pre-trained vision-language model, named SAN. Our approach models the semantic segmentation task as a region recognition problem. A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias which is applied in the CLIP model to recognize the class of masks. This decoupled design helps CLIP recognize the class of mask proposals. Since the attached side network can reuse CLIP features, it can be very light. In addition, the entire network can be trained end-to-end, allowing the side network to be adapted to the frozen CLIP model, which makes the predicted mask proposals CLIP-aware. Our approach is fast, accurate, and only adds a few additional trainable parameters. We evaluate our approach on multiple semantic segmentation benchmarks. Our method significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed. We hope our approach will serve as a solid baseline and help ease future research in open-vocabulary semantic segmentation.

MISC210K: A Large-Scale Dataset for Multi-Instance Semantic Correspondence
Sun, Yixuan and Huang, Yiwen and Guo, Haijing and Zhao, Yuzhou and Wu, Runmin and Yu, Yizhou and Ge, Weifeng and Zhang, Wenqiang



Research question: The existing single-object matching schema makes it hard to discover commonalities within a category and falls short of real-world recognition tasks.
Motivation: To fill this gap, the multi-instance semantic correspondence task is designed, which aims to construct correspondences between multiple objects in an image pair.
Method: A multi-instance semantic correspondence (MISC) dataset named MISC210K is built from the COCO Detection 2017 task in three steps: (1) category selection and data cleaning; (2) keypoint design based on 3D models and object description rules; (3) human-machine collaborative annotation. 34 object classes are selected and 4,812 challenging images are annotated via a well-designed semi-automatic workflow, yielding 218,179 image pairs with instance masks and instance-level keypoint pairs. A dual-path collaborative learning pipeline is designed to jointly train an instance-level co-segmentation task and a fine-grained correspondence task.
Results: Benchmark evaluations and further ablation analyses are provided, along with three proposed future directions. The project is available at https://github.com/YXSUNMADMAX/MISC210K.

Semantic correspondence has opened up a new avenue for object recognition. However, the current single-object matching schema makes it hard to discover commonalities within a category and is far from real-world recognition tasks. To fill this gap, we design the multi-instance semantic correspondence task which aims at constructing the correspondence between multiple objects in an image pair. To support this task, we build a multi-instance semantic correspondence (MISC) dataset from the COCO Detection 2017 task called MISC210K. We construct our dataset in three steps: (1) category selection and data cleaning; (2) keypoint design based on 3D models and object description rules; (3) human-machine collaborative annotation. Following these steps, we select 34 classes of objects with 4,812 challenging images annotated via a well-designed semi-automatic workflow, and finally acquire 218,179 image pairs with instance masks and instance-level keypoint pairs annotated. We design a dual-path collaborative learning pipeline to jointly train the instance-level co-segmentation task and the fine-grained correspondence task. Benchmark evaluation and further ablation results with detailed analysis are provided with three future directions proposed. Our project is available on https://github.com/YXSUNMADMAX/MISC210K.

GrowSP: Unsupervised Semantic Segmentation of 3D Point Clouds
Zhang, Zihui and Yang, Bo and Wang, Bing and Li, Bo



Research question: This paper addresses 3D semantic segmentation from raw point clouds.
Motivation: Existing methods rely primarily on large amounts of human annotation to train neural networks. GrowSP is proposed as the first purely unsupervised method that successfully identifies complex semantic classes for every point in 3D scenes, without any human labels or pretrained models.
Method: 3D semantic elements are discovered via progressive growing of superpoints. The method consists of three major components: 1) a feature extractor that learns per-point features from the input point cloud; 2) a superpoint constructor that progressively grows the sizes of superpoints; 3) a semantic primitive clustering module that groups superpoints into semantic elements for the final semantic segmentation.
Results: Extensive evaluation on multiple datasets demonstrates superior performance over all unsupervised baselines, approaching the classic fully supervised PointNet. The authors hope the work inspires more advanced methods for unsupervised 3D semantic learning.

We study the problem of 3D semantic segmentation from raw point clouds. Unlike existing methods which primarily rely on a large amount of human annotations for training neural networks, we propose the first purely unsupervised method, called GrowSP, to successfully identify complex semantic classes for every point in 3D scenes, without needing any type of human labels or pretrained models. The key to our approach is to discover 3D semantic elements via progressive growing of superpoints. Our method consists of three major components, 1) the feature extractor to learn per-point features from input point clouds, 2) the superpoint constructor to progressively grow the sizes of superpoints, and 3) the semantic primitive clustering module to group superpoints into semantic elements for the final semantic segmentation. We extensively evaluate our method on multiple datasets, demonstrating superior performance over all unsupervised baselines and approaching the classic fully supervised PointNet. We hope our work could inspire more advanced methods for unsupervised 3D semantic learning.

ConQueR: Query Contrast Voxel-DETR for 3D Object Detection
Zhu, Benjin and Wang, Zhe and Shi, Shaoshuai and Xu, Hang and Hong, Lanqing and Li, Hongsheng



Research question: Although DETR-based 3D detectors simplify the detection pipeline and achieve direct sparse predictions, their performance still lags behind dense detectors with post-processing.
Motivation: Most false positives in 3D object detection are caused by the lack of explicit supervision to discriminate locally similar queries.
Method: A simple yet effective sparse 3D detector named Query Contrast Voxel-DETR (ConQueR) is proposed. It constructs positive and negative GT-query pairs for each GT and applies a contrastive loss over feature similarities to enhance positive GT-query pairs against negative ones, eliminating challenging false positives.
Results: ConQueR closes the gap between sparse and dense 3D detectors and reduces false positives by 60%. On the challenging Waymo Open Dataset validation set, single-frame ConQueR achieves 71.6 mAPH/L2, outperforming previous best methods by over 2.0 mAPH/L2.

Although DETR-based 3D detectors simplify the detection pipeline and achieve direct sparse predictions, their performance still lags behind dense detectors with post-processing for 3D object detection from point clouds. DETRs usually adopt a larger number of queries than GTs (e.g., 300 queries vs. 40 objects in Waymo) in a scene, which inevitably incurs many false positives during inference. In this paper, we propose a simple yet effective sparse 3D detector, named Query Contrast Voxel-DETR (ConQueR), to eliminate the challenging false positives, and achieve more accurate and sparser predictions. We observe that most false positives are highly overlapping in local regions, caused by the lack of explicit supervision to discriminate locally similar queries. We thus propose a Query Contrast mechanism to explicitly enhance queries towards their best-matched GTs over all unmatched query predictions. This is achieved by the construction of positive and negative GT-query pairs for each GT, and a contrastive loss to enhance positive GT-query pairs against negative ones based on feature similarities. ConQueR closes the gap of sparse and dense 3D detectors, and reduces 60% false positives. Our single-frame ConQueR achieves 71.6 mAPH/L2 on the challenging Waymo Open Dataset validation set, outperforming previous state-of-the-art methods by over 2.0 mAPH/L2. Code: https://github.com/poodarchu/EFG.
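The Query Contrast mechanism, enhancing the best-matched query against all unmatched queries for each GT, can be written as a softmax contrastive loss over feature similarities. A minimal sketch under assumed shapes and temperature; the paper's construction of positive and negative pairs is richer:

```python
import numpy as np

def query_contrast_loss(query_feats, gt_feat, pos_idx, temperature=0.07):
    """Contrastive loss for one ground-truth object: pull the best-matched
    query (pos_idx) toward the GT feature, push all other queries away.

    query_feats: (Q, D) decoder query embeddings
    gt_feat:     (D,) embedding of the ground-truth object
    pos_idx:     index of the query matched to this GT
    """
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gt_feat / np.linalg.norm(gt_feat)
    sims = (q @ g) / temperature  # (Q,) GT-query cosine similarities
    sims = sims - sims.max()      # numerical stability
    # negative log-probability of the matched query under a softmax
    return float(np.log(np.exp(sims).sum()) - sims[pos_idx])
```

Minimizing this loss sharpens the gap between the matched query and its locally similar neighbors, which is the mechanism the abstract credits for suppressing false positives.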

BoxTeacher: Exploring High-Quality Pseudo Labels for Weakly Supervised Instance Segmentation
Cheng, Tianheng and Wang, Xinggang and Chen, Shaoyu and Zhang, Qian and Liu, Wenyu



Research question: How to obtain high-quality segmentation masks from bounding boxes in weakly supervised instance segmentation in order to improve performance.
Motivation: Existing weakly supervised instance segmentation methods mainly rely on designing heuristic loss functions and overlook the fine segmentation information that box-supervised models can produce.
Method: The BoxTeacher framework is proposed, in which a sophisticated teacher generates high-quality pseudo masks as labels; a noise-aware pixel loss and a noise-reduced affinity loss are introduced to adaptively optimize the student.
Results: On the challenging COCO dataset, BoxTeacher achieves 35.0 mask AP and 36.5 mask AP (with ResNet-50 and ResNet-101, respectively) without bells and whistles, significantly surpassing previous state-of-the-art methods and bridging the gap between box-supervised and mask-supervised methods.

Labeling objects with pixel-wise segmentation requires a huge amount of human labor compared to bounding boxes. Most existing methods for weakly supervised instance segmentation focus on designing heuristic losses with priors from bounding boxes. However, we find that box-supervised methods can produce some fine segmentation masks and we wonder whether the detectors could learn from these fine masks while ignoring low-quality masks. To answer this question, we present BoxTeacher, an efficient and end-to-end training framework for high-performance weakly supervised instance segmentation, which leverages a sophisticated teacher to generate high-quality masks as pseudo labels. Considering the massive noisy masks hurt the training, we present a mask-aware confidence score to estimate the quality of pseudo masks and propose the noise-aware pixel loss and noise-reduced affinity loss to adaptively optimize the student with pseudo masks. Extensive experiments demonstrate the effectiveness of the proposed BoxTeacher. Without bells and whistles, BoxTeacher remarkably achieves 35.0 mask AP and 36.5 mask AP with ResNet-50 and ResNet-101 respectively on the challenging COCO dataset, which outperforms the previous state-of-the-art methods by a significant margin and bridges the gap between box-supervised and mask-supervised methods. The code and models will be available later.
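The noise-aware pixel loss, down-weighting per-pixel supervision by the estimated quality of each pseudo mask, can be sketched as a confidence-weighted binary cross-entropy. The scalar weighting scheme here is an assumption for illustration; BoxTeacher's mask-aware confidence score is derived from the teacher's own predictions:

```python
import numpy as np

def noise_aware_pixel_loss(pred, pseudo_mask, confidence, eps=1e-7):
    """Binary cross-entropy against a teacher pseudo mask, scaled by a
    scalar confidence in [0, 1] estimating the pseudo mask's quality.

    pred:        (H, W) predicted foreground probabilities (student)
    pseudo_mask: (H, W) binary pseudo labels from the teacher
    confidence:  scalar quality score of this pseudo mask
    """
    p = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    bce = -(pseudo_mask * np.log(p) + (1.0 - pseudo_mask) * np.log(1.0 - p))
    return float(confidence * bce.mean())
```

Low-confidence (likely noisy) pseudo masks thus contribute proportionally less gradient to the student, which is the intent of adaptive optimization with noisy pseudo labels.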

Devil Is in the Queries: Advancing Mask Transformers for Real-World Medical Image Segmentation and Out-of-Distribution Localization
Yuan, Mingze and Xia, Yingda and Dong, Hexin and Chen, Zifan and Yao, Jiawen and Qiu, Mingyan and Yan, Ke and Yin, Xiaoli and Shi, Yu and Chen, Xin and Liu, Zaiyi and Dong, Bin and Zhou, Jingren and Lu, Le and Zhang, Ling and Zhang, Li



Research question: How to make medical image segmentation algorithms effective on long-tailed complex objects (i.e., rare diseases) and avoid clinically dangerous damage in out-of-distribution (OOD) cases.
Motivation: Real-world medical image segmentation has tremendous long-tailed complexity, where tail conditions correlate with relatively rare but clinically significant diseases. A trustworthy medical AI algorithm should demonstrate its effectiveness on tail conditions to avoid clinically dangerous damage in these OOD cases.
Method: The object-query concept from Mask Transformers is adopted to formulate semantic segmentation as a soft cluster assignment. The queries fit the feature-level cluster centers of inliers during training, so at inference on real-world medical images the similarity between pixels and queries detects and localizes OOD regions; this OOD localization is termed MaxQuery. Moreover, the foregrounds of real-world medical images, whether OOD objects or inliers, are lesions; the difference between them is much smaller than that between foreground and background, so the object queries may focus redundantly on the background. A query-distribution (QD) loss is therefore proposed to enforce clear boundaries between segmentation targets and other regions at the query level, improving inlier segmentation and OOD indication.
Results: Tested on two real-world segmentation tasks, pancreatic and liver tumor segmentation, the framework outperforms previous leading algorithms by an average of 7.39% AUROC, 14.69% AUPR, and 13.79% FPR95 for OOD localization, and improves inlier segmentation by an average of 5.27% DSC compared with nnUNet.

Real-world medical image segmentation has tremendous long-tailed complexity of objects, among which tail conditions correlate with relatively rare diseases and are clinically significant. A trustworthy medical AI algorithm should demonstrate its effectiveness on tail conditions to avoid clinically dangerous damage in these out-of-distribution (OOD) cases. In this paper, we adopt the concept of object queries in Mask transformers to formulate semantic segmentation as a soft cluster assignment. The queries fit the feature-level cluster centers of inliers during training. Therefore, when performing inference on a medical image in real-world scenarios, the similarity between pixels and the queries detects and localizes OOD regions. We term this OOD localization as MaxQuery. Furthermore, the foregrounds of real-world medical images, whether OOD objects or inliers, are lesions. The difference between them is obviously less than that between the foreground and background, so the object queries may focus redundantly on the background. Thus, we propose a query-distribution (QD) loss to enforce clear boundaries between segmentation targets and other regions at the query level, improving the inlier segmentation and OOD indication. Our proposed framework is tested on two real-world segmentation tasks, i.e., segmentation of pancreatic and liver tumors, outperforming previous leading algorithms by an average of 7.39% on AUROC, 14.69% on AUPR, and 13.79% on FPR95 for OOD localization. On the other hand, our framework improves the performance of inlier segmentation by an average of 5.27% DSC compared with nnUNet.
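The MaxQuery idea, flagging a pixel as out-of-distribution when its feature is far from every inlier query (cluster center), can be sketched as the negated maximum pixel-query similarity. The feature shapes and the use of cosine similarity are assumptions of this sketch:

```python
import numpy as np

def maxquery_ood_score(pixel_feats, queries):
    """Per-pixel OOD score: the negated maximum similarity to any object
    query. Queries fit inlier cluster centers during training, so a pixel
    far from all of them (low max similarity) is flagged as OOD.

    pixel_feats: (P, D) pixel embeddings
    queries:     (Q, D) object-query embeddings
    returns:     (P,) scores; higher means more likely OOD
    """
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    sims = p @ q.T            # (P, Q) cosine similarities
    return -sims.max(axis=1)  # no query matches well => high OOD score
```

Thresholding these scores yields an OOD localization map alongside the ordinary inlier segmentation.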

Token Contrast for Weakly-Supervised Semantic Segmentation
Ru, Lixiang and Zheng, Heliang and Zhan, Yibing and Du, Bo



Research question: In weakly supervised semantic segmentation with image-level labels, Class Activation Maps (CAM) are typically used to generate the pseudo labels.
Motivation: Limited by the local structure perception of CNNs, CAM usually cannot identify integral object regions. Although the recent Vision Transformer (ViT) can remedy this flaw, the authors observe that it also brings an over-smoothing issue: the final patch tokens tend to become uniform.
Method: Token Contrast (ToCo) is proposed to address this issue and further explore the virtue of ViT for weakly supervised segmentation. First, motivated by the observation that intermediate ViT layers still retain semantic diversity, a Patch Token Contrast module (PTC) supervises the final patch tokens with pseudo token relations derived from intermediate layers, allowing them to align with semantic regions and thus yield more accurate CAMs. Second, to further differentiate low-confidence regions in the CAM, a Class Token Contrast module (CTC) is designed, inspired by the fact that class tokens in ViT capture high-level semantics; CTC promotes representation consistency between uncertain local regions and global objects by contrasting their class tokens.
Results: Experiments on the PASCAL VOC and MS COCO datasets show that ToCo remarkably surpasses other single-stage competitors and achieves performance comparable to state-of-the-art multi-stage methods. Code is available at https://github.com/rulixiang/ToCo.

Weakly-Supervised Semantic Segmentation (WSSS) using image-level labels typically utilizes Class Activation Map (CAM) to generate the pseudo labels. Limited by the local structure perception of CNN, CAM usually cannot identify the integral object regions. Though the recent Vision Transformer (ViT) can remedy this flaw, we observe it also brings the over-smoothing issue, i.e., the final patch tokens incline to be uniform. In this work, we propose Token Contrast (ToCo) to address this issue and further explore the virtue of ViT for WSSS. Firstly, motivated by the observation that intermediate layers in ViT can still retain semantic diversity, we design a Patch Token Contrast module (PTC). PTC supervises the final patch tokens with the pseudo token relations derived from intermediate layers, allowing them to align the semantic regions and thus yield more accurate CAM. Secondly, to further differentiate the low-confidence regions in CAM, we devise a Class Token Contrast module (CTC) inspired by the fact that class tokens in ViT can capture high-level semantics. CTC facilitates the representation consistency between uncertain local regions and global objects by contrasting their class tokens. Experiments on the PASCAL VOC and MS COCO datasets show the proposed ToCo can remarkably surpass other single-stage competitors and achieve comparable performance with state-of-the-art multi-stage methods. Code is available at https://github.com/rulixiang/ToCo.

MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation
Hoyer, Lukas and Dai, Dengxin and Wang, Haoran and Van Gool, Luc



Research question: In unsupervised domain adaptation (UDA), how can a model trained on source data (e.g., synthetic) adapt to target data (e.g., real-world) without access to target annotations, especially for classes with similar visual appearance?
Motivation: Most previous UDA methods struggle with classes that look visually similar on the target domain, because no ground truth is available to learn the slight appearance differences.
Method: A Masked Image Consistency (MIC) module is proposed that learns spatial context relations of the target domain as additional clues for robust visual recognition. MIC enforces consistency between predictions on masked target images, where random patches are withheld, and pseudo-labels generated from the complete image by an exponential-moving-average teacher; to minimize this loss, the network must infer the masked regions from their context. Thanks to its simple and universal concept, MIC can be integrated into various UDA methods across image classification, semantic segmentation, and object detection.
Results: MIC significantly improves state-of-the-art UDA performance for synthetic-to-real, day-to-nighttime, and clear-to-adverse-weather settings, e.g., 75.9 mIoU on GTA-to-Cityscapes and 92.8% on VisDA-2017, corresponding to improvements of +2.1 and +3.0 percentage points over the previous state of the art.

In unsupervised domain adaptation (UDA), a model trained on source data (e.g. synthetic) is adapted to target data (e.g. real-world) without access to target annotation. Most previous UDA methods struggle with classes that have a similar visual appearance on the target domain as no ground truth is available to learn the slight appearance differences. To address this problem, we propose a Masked Image Consistency (MIC) module to enhance UDA by learning spatial context relations of the target domain as additional clues for robust visual recognition. MIC enforces the consistency between predictions of masked target images, where random patches are withheld, and pseudo-labels that are generated based on the complete image by an exponential moving average teacher. To minimize the consistency loss, the network has to learn to infer the predictions of the masked regions from their context. Due to its simple and universal concept, MIC can be integrated into various UDA methods across different visual recognition tasks such as image classification, semantic segmentation, and object detection. MIC significantly improves the state-of-the-art performance across the different recognition tasks for synthetic-to-real, day-to-nighttime, and clear-to-adverse-weather UDA. For instance, MIC achieves an unprecedented UDA performance of 75.9 mIoU and 92.8% on GTA-to-Cityscapes and VisDA-2017, respectively, which corresponds to an improvement of +2.1 and +3.0 percent points over the previous state of the art. The implementation is available at https://github.com/lhoyer/MIC.
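The MIC module described above can be sketched in two pieces: random patch masking of the target image, and a consistency loss between the student's prediction on the masked image and the EMA teacher's pseudo-label from the complete image. Patch size, masking ratio, and the cross-entropy form are illustrative assumptions of this sketch:

```python
import numpy as np

def mask_patches(image, patch=4, ratio=0.5, rng=None):
    """Withhold a random subset of non-overlapping square patches
    (set them to zero), as in masked image consistency training."""
    rng = np.random.default_rng(0) if rng is None else rng
    masked = image.copy()
    H, W = masked.shape[-2:]
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            if rng.random() < ratio:  # drop this patch with prob. ratio
                masked[..., y:y + patch, x:x + patch] = 0.0
    return masked

def consistency_loss(student_probs, teacher_pseudo, eps=1e-7):
    """Cross-entropy between the student's class probabilities on the
    masked image and the teacher's (one-hot) pseudo-labels."""
    return float(-np.mean(
        teacher_pseudo * np.log(np.clip(student_probs, eps, 1.0))))
```

Since the masked pixels are gone, the student can only match the teacher by inferring them from the surrounding spatial context, which is exactly the capability MIC aims to instill.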

SkyEye: Self-Supervised Bird's-Eye-View Semantic Mapping Using Monocular Frontal View Images
Gosala, Nikhil and Petek, Kürsat and Drews-Jr, Paulo L. J. and Burgard, Wolfram and Valada, Abhinav



Research question: How to generate BEV semantic maps from monocular frontal-view images without relying on large amounts of annotated BEV data.
Motivation: Existing approaches for generating BEV semantic maps still follow a fully supervised training paradigm and require large amounts of annotated data.
Method: SkyEye is proposed, the first self-supervised approach for generating a BEV semantic map from a single monocular frontal-view (FV) image. The architecture is trained with two modes of self-supervision: implicit supervision enforces spatial consistency of the scene over time based on FV semantic sequences, while explicit supervision exploits BEV pseudo-labels generated from FV semantic annotations and self-supervised depth estimates.
Results: Extensive evaluation on the KITTI-360 dataset shows the approach performs on par with state-of-the-art fully supervised methods and achieves competitive results using only 1% of direct supervision in BEV.

Bird's-Eye-View (BEV) semantic maps have become an essential component of automated driving pipelines due to the rich representation they provide for decision-making tasks. However, existing approaches for generating these maps still follow a fully supervised training paradigm and hence rely on large amounts of annotated BEV data. In this work, we address this limitation by proposing the first self-supervised approach for generating a BEV semantic map using a single monocular image from the frontal view (FV). During training, we overcome the need for BEV ground truth annotations by leveraging the more easily available FV semantic annotations of video sequences. Thus, we propose the SkyEye architecture that learns based on two modes of self-supervision, namely, implicit supervision and explicit supervision. Implicit supervision trains the model by enforcing spatial consistency of the scene over time based on FV semantic sequences, while explicit supervision exploits BEV pseudolabels generated from FV semantic annotations and self-supervised depth estimates. Extensive evaluations on the KITTI-360 dataset demonstrate that our self-supervised approach performs on par with the state-of-the-art fully supervised methods and achieves competitive results using only 1% of direct supervision in BEV compared to fully supervised approaches. Finally, we publicly release both our code and the BEV datasets generated from the KITTI-360 and Waymo datasets.

Boosting Weakly-Supervised Temporal Action Localization With Text Information
Li, Guozhang and Cheng, De and Ding, Xinpeng and Wang, Nannan and Wang, Xiaoyu and Gao, Xinbo



Research question: Lacking temporal annotation, current weakly-supervised temporal action localization (WTAL) methods are generally stuck in over-complete or incomplete localization.
Motivation: This paper leverages text information to boost WTAL from two aspects: a discriminative objective that enlarges inter-class differences and a generative objective that enhances intra-class integrity.
Method: For the discriminative objective, a Text-Segment Mining (TSM) mechanism constructs a text description from the action class label and uses it as a query to mine all class-related segments. Without temporal annotations of actions, TSM compares the text query with entire videos across the dataset to mine the best-matching segments while ignoring irrelevant ones.
Results: State-of-the-art performance on THUMOS14 and ActivityNet1.3. Surprisingly, the method can also be seamlessly applied to existing methods and improves their performance by a clear margin. Code is available at https://github.com/lgzlIlIlI/Boosting-WTAL.

Due to the lack of temporal annotation, current Weakly-supervised Temporal Action Localization (WTAL) methods are generally stuck into over-complete or incomplete localization. In this paper, we aim to leverage the text information to boost WTAL from two aspects, i.e., (a) the discriminative objective to enlarge the inter-class difference, thus reducing the over-complete; (b) the generative objective to enhance the intra-class integrity, thus finding more complete temporal boundaries. For the discriminative objective, we propose a Text-Segment Mining (TSM) mechanism, which constructs a text description based on the action class label, and regards the text as the query to mine all class-related segments. Without the temporal annotation of actions, TSM compares the text query with the entire videos across the dataset to mine the best matching segments while ignoring irrelevant ones. Due to the shared sub-actions in different categories of videos, merely applying TSM is too strict to neglect the semantic-related segments, which results in incomplete localization. We further introduce a generative objective named Video-text Language Completion (VLC), which focuses on all semantic-related segments from videos to complete the text sentence. We achieve the state-of-the-art performance on THUMOS14 and ActivityNet1.3. Surprisingly, we also find our proposed method can be seamlessly applied to existing methods, and improve their performances with a clear margin. The code is available at https://github.com/lgzlIlIlI/Boosting-WTAL.

Weakly Supervised Class-Agnostic Motion Prediction for Autonomous Driving
Li, Ruibo and Shi, Hanyu and Fu, Ziang and Wang, Zhe and Lin, Guosheng



Research question: Understanding the motion behavior of dynamic environments is vital for autonomous driving, drawing increasing attention to class-agnostic motion prediction in LiDAR point clouds.
Motivation: Outdoor scenes can often be decomposed into mobile foregrounds and static backgrounds, which allows associating motion understanding with scene parsing.
Method: A novel weakly supervised motion prediction paradigm in which fully or partially (1%, 0.1%) annotated foreground/background binary masks, rather than expensive motion annotations, serve as supervision. A two-stage weakly supervised approach is designed: a segmentation model trained with the incomplete binary masks in Stage 1 facilitates the self-supervised learning of the motion prediction network in Stage 2 by estimating possible moving foregrounds in advance.
Results: Experiments show that with fully or partially annotated binary masks as supervision, the weakly supervised models surpass self-supervised models by a large margin and perform on par with some supervised ones, demonstrating a good compromise between annotation effort and performance.

Understanding the motion behavior of dynamic environments is vital for autonomous driving, leading to increasing attention in class-agnostic motion prediction in LiDAR point clouds. Outdoor scenes can often be decomposed into mobile foregrounds and static backgrounds, which enables us to associate motion understanding with scene parsing. Based on this observation, we study a novel weakly supervised motion prediction paradigm, where fully or partially (1%, 0.1%) annotated foreground/background binary masks rather than expensive motion annotations are used for supervision. To this end, we propose a two-stage weakly supervised approach, where the segmentation model trained with the incomplete binary masks in Stage1 will facilitate the self-supervised learning of the motion prediction network in Stage2 by estimating possible moving foregrounds in advance. Furthermore, for robust self-supervised motion learning, we design a Consistency-aware Chamfer Distance loss by exploiting multi-frame information and explicitly suppressing potential outliers. Comprehensive experiments show that, with fully or partially binary masks as supervision, our weakly supervised models surpass the self-supervised models by a large margin and perform on par with some supervised ones. This further demonstrates that our approach achieves a good compromise between annotation effort and performance.
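The self-supervised motion loss above builds on the Chamfer distance between the flow-warped point cloud and the observed next-frame cloud. A minimal NumPy sketch of the plain symmetric Chamfer distance follows; it omits the paper's consistency-aware multi-frame weighting and outlier suppression.

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point sets P (N, d) and Q (M, d):
    mean nearest-neighbor squared distance, summed over both directions."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```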

Mask DINO: Towards a Unified Transformer-Based Framework for Object Detection and Segmentation
Li, Feng and Zhang, Hao and Xu, Huaizhe and Liu, Shilong and Zhang, Lei and Ni, Lionel M. and Shum, Heung-Yeung



Research question: This paper presents Mask DINO, a unified framework for object detection and segmentation.
Motivation: Extend DINO (DETR with Improved Denoising Anchor Boxes) with a mask prediction branch that supports all image segmentation tasks (instance, panoptic, and semantic).
Method: Query embeddings from DINO dot-product a high-resolution pixel embedding map to predict a set of binary masks; key components of DINO are extended to segmentation through a shared architecture and training process.
Results: Experiments show Mask DINO significantly outperforms all existing specialized segmentation methods on both a ResNet-50 backbone and a pre-trained SwinL backbone. Notably, among models under one billion parameters, it establishes the best results to date on instance segmentation (54.5 AP on COCO), panoptic segmentation (59.4 PQ on COCO), and semantic segmentation (60.8 mIoU on ADE20K).

In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by adding a mask prediction branch which supports all image segmentation tasks (instance, panoptic, and semantic). It makes use of the query embeddings from DINO to dot-product a high-resolution pixel embedding map to predict a set of binary masks. Some key components in DINO are extended for segmentation through a shared architecture and training process. Mask DINO is simple, efficient, scalable, and benefits from joint large-scale detection and segmentation datasets. Our experiments show that Mask DINO significantly outperforms all existing specialized segmentation methods, both on a ResNet-50 backbone and a pre-trained model with SwinL backbone. Notably, Mask DINO establishes the best results to date on instance segmentation (54.5 AP on COCO), panoptic segmentation (59.4 PQ on COCO), and semantic segmentation (60.8 mIoU on ADE20K) among models under one billion parameters. We will release the code after the blind review.

Self-Supervised AutoFlow
Huang, Hsin-Ping and Herrmann, Charles and Hur, Junhwa and Lu, Erika and Sargent, Kyle and Stone, Austin and Yang, Ming-Hsuan and Sun, Deqing



Research question: How to learn a training set for optical flow when no ground-truth labels are available in the target domain.
Motivation: Observing a strong correlation between the ground-truth search metric and self-supervised losses, self-supervised AutoFlow is proposed to handle real-world videos without ground-truth labels.
Method: Replace the ground-truth search metric with a self-supervised loss when searching for the training set.
Results: Self-supervised AutoFlow performs on par with AutoFlow on Sintel and KITTI, where ground truth is available, and performs better on the real-world DAVIS dataset; in the (semi-)supervised setting it obtains competitive results against the state of the art.

Recently, AutoFlow has shown promising results on learning a training set for optical flow, but requires ground truth labels in the target domain to compute its search metric. Observing a strong correlation between the ground truth search metric and self-supervised losses, we introduce self-supervised AutoFlow to handle real-world videos without ground truth labels. Using self-supervised loss as the search metric, our self-supervised AutoFlow performs on par with AutoFlow on Sintel and KITTI where ground truth is available, and performs better on the real-world DAVIS dataset. We further explore using self-supervised AutoFlow in the (semi-)supervised setting and obtain competitive results against the state of the art.

MagicNet: Semi-Supervised Multi-Organ Segmentation via Magic-Cube Partition and Recovery
Chen, Duowen and Bai, Yunhao and Shen, Wei and Li, Qingli and Yu, Lequan and Wang, Yan



Research question: A novel teacher-student model for semi-supervised multi-organ segmentation.
Motivation: Treat the prior anatomy as a strong tool to guide data augmentation and reduce the mismatch between labeled and unlabeled images for semi-supervised learning.
Method: A data augmentation strategy based on partition-and-recovery of N^3 cubes, applied both across and within labeled and unlabeled images.
Results: Extensive experiments on two public CT multi-organ datasets demonstrate the effectiveness of MagicNet, which noticeably outperforms state-of-the-art semi-supervised medical image segmentation approaches, with a +7% DSC improvement on the MACT dataset with 10% labeled images.

We propose a novel teacher-student model for semi-supervised multi-organ segmentation. In the teacher-student model, data augmentation is usually adopted on unlabeled data to regularize the consistent training between teacher and student. We start from a key perspective that fixed relative locations and variable sizes of different organs can provide distribution information where a multi-organ CT scan is drawn. Thus, we treat the prior anatomy as a strong tool to guide the data augmentation and reduce the mismatch between labeled and unlabeled images for semi-supervised learning. More specifically, we propose a data augmentation strategy based on partition-and-recovery N^3 cubes cross- and within- labeled and unlabeled images. Our strategy encourages unlabeled images to learn organ semantics in relative locations from the labeled images (cross-branch) and enhances the learning ability for small organs (within-branch). For within-branch, we further propose to refine the quality of pseudo labels by blending the learned representations from small cubes to incorporate local attributes. Our method is termed as MagicNet, since it treats the CT volume as a magic-cube and N^3-cube partition-and-recovery process matches with the rule of playing a magic-cube. Extensive experiments on two public CT multi-organ datasets demonstrate the effectiveness of MagicNet, and noticeably outperforms state-of-the-art semi-supervised medical image segmentation approaches, with +7% DSC improvement on MACT dataset with 10% labeled images.
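The N^3 partition-and-recovery primitive at the core of the strategy can be sketched as below. The helper names are illustrative; the actual method additionally mixes the sub-cubes between a labeled and an unlabeled scan before recovery.

```python
import numpy as np

def partition_cubes(vol, n):
    """Split a cubic volume (S, S, S) into n^3 sub-cubes; S must be divisible by n."""
    s = vol.shape[0] // n
    return [vol[i*s:(i+1)*s, j*s:(j+1)*s, k*s:(k+1)*s]
            for i in range(n) for j in range(n) for k in range(n)]

def recover_cubes(cubes, n):
    """Inverse of partition_cubes: reassemble the n^3 sub-cubes into the full volume."""
    s = cubes[0].shape[0]
    vol = np.empty((n * s,) * 3, dtype=cubes[0].dtype)
    idx = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                vol[i*s:(i+1)*s, j*s:(j+1)*s, k*s:(k+1)*s] = cubes[idx]
                idx += 1
    return vol
```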

Few-Shot Geometry-Aware Keypoint Localization
He, Xingzhe and Bharaj, Gaurav and Ferman, David and Rhodin, Helge and Garrido, Pablo



Research question: Supervised keypoint localization methods rely on large manually labeled image datasets, but creating such large sets of keypoint labels is time-consuming, costly, and error-prone due to inconsistent labeling.
Motivation: We therefore desire an approach that can learn keypoint localization from fewer but consistently annotated images.
Method: A novel formulation that learns to localize semantically consistent keypoint definitions, even for occluded regions, by extending a few user-labeled 2D images via self-supervision on a larger unlabeled dataset.
Results: The method attains competitive or state-of-the-art accuracy on several datasets, including human faces, eyes, animals, cars, and a never-before-attempted mouth-interior (teeth) localization task.

Supervised keypoint localization methods rely on large manually labeled image datasets, where objects can deform, articulate, or occlude. However, creating such large keypoint labels is time-consuming and costly, and is often error-prone due to inconsistent labeling. Thus, we desire an approach that can learn keypoint localization with fewer yet consistently annotated images. To this end, we present a novel formulation that learns to localize semantically consistent keypoint definitions, even for occluded regions, for varying object categories. We use a few user-labeled 2D images as input examples, which are extended via self-supervision using a larger unlabeled dataset. Unlike unsupervised methods, the few-shot images act as semantic shape constraints for object localization. Furthermore, we introduce 3D geometry-aware constraints to uplift keypoints, achieving more accurate 2D localization. Our general-purpose formulation paves the way for semantically conditioned generative modeling and attains competitive or state-of-the-art accuracy on several datasets, including human faces, eyes, animals, cars, and never-before-seen mouth interior (teeth) localization tasks, not attempted by the previous few-shot methods. Project page: https://xingzhehe.github.io/FewShot3DKP/

PEFAT: Boosting Semi-Supervised Medical Image Classification via Pseudo-Loss Estimation and Feature Adversarial Training
Zeng, Qingjie and Xie, Yutong and Lu, Zilin and Xia, Yong



Research question: How to improve the performance of semi-supervised learning for computer vision and medical image classification tasks.
Motivation: Existing semi-supervised methods mainly seek high-confidence pseudo-labeled samples from the perspective of model predicted probability; this can admit incorrectly pseudo-labeled data and often neglects the potential of low-confidence samples.
Method: A novel Pseudo-loss Estimation and Feature Adversarial Training (PEFAT) semi-supervised framework that boosts multi-class and multi-label medical image classification via loss-distribution modeling and adversarial training. It develops a trustworthy data selection scheme to split off a high-quality pseudo-labeled set, and learns discriminative information from the remaining samples by injecting adversarial noise at the feature level to smooth the decision boundary.
Results: Experiments on three medical and two natural image benchmarks show PEFAT achieves promising performance and surpasses other state-of-the-art methods.

Pseudo-labeling approaches have been proven beneficial for semi-supervised learning (SSL) schemes in computer vision and medical imaging. Most works are dedicated to finding samples with high-confidence pseudo-labels from the perspective of model predicted probability. Whereas this way may lead to the inclusion of incorrectly pseudo-labeled data if the threshold is not carefully adjusted. In addition, low-confidence probability samples are frequently disregarded and not employed to their full potential. In this paper, we propose a novel Pseudo-loss Estimation and Feature Adversarial Training semi-supervised framework, termed as PEFAT, to boost the performance of multi-class and multi-label medical image classification from the point of loss distribution modeling and adversarial training. Specifically, we develop a trustworthy data selection scheme to split a high-quality pseudo-labeled set, inspired by the dividable pseudo-loss assumption that clean data tend to show lower loss while noise data is the opposite. Instead of directly discarding these samples with low-quality pseudo-labels, we present a novel regularization approach to learn discriminate information from them via injecting adversarial noises at the feature-level to smooth the decision boundary. Experimental results on three medical and two natural image benchmarks validate that our PEFAT can achieve a promising performance and surpass other state-of-the-art methods. The code is available at https://github.com/maxwell0027/PEFAT.

Learning To Segment Every Referring Object Point by Point
Qu, Mengxue and Wu, Yu and Wei, Yunchao and Liu, Wu and Liang, Xiaodan and Zhao, Yao



Research question: This paper addresses Referring Expression Segmentation (RES), i.e., achieving pixel-level semantic alignment between vision and language.
Motivation: Most existing RES methods require massive pixel-level annotations, which are expensive and exhaustive to obtain.
Method: A new partially supervised training paradigm that trains with abundant referring bounding boxes and only a few (e.g., 1%) pixel-level referring masks. To maximize transferability from the REC model, the model is built on a point-based sequence prediction model. A co-content teacher-forcing strategy makes the model explicitly associate point coordinates (scale values) with the referred spatial features, alleviating the exposure bias caused by the limited segmentation masks.
Results: Experiments show that using only 1% of the mask annotations, the model achieves 52.06% accuracy on RefCOCO+@testA (versus 58.93% in the fully supervised setting).

Referring Expression Segmentation (RES) can facilitate pixel-level semantic alignment between vision and language. Most of the existing RES approaches require massive pixel-level annotations, which are expensive and exhaustive. In this paper, we propose a new partially supervised training paradigm for RES, i.e., training using abundant referring bounding boxes and only a few (e.g., 1%) pixel-level referring masks. To maximize the transferability from the REC model, we construct our model based on the point-based sequence prediction model. We propose the co-content teacher-forcing to make the model explicitly associate the point coordinates (scale values) with the referred spatial features, which alleviates the exposure bias caused by the limited segmentation masks. To make the most of referring bounding box annotations, we further propose the resampling pseudo points strategy to select more accurate pseudo-points as supervision. Extensive experiments show that our model achieves 52.06% in terms of accuracy (versus 58.93% in fully supervised setting) on RefCOCO+@testA, when only using 1% of the mask annotations.

Bootstrapping Objectness From Videos by Relaxed Common Fate and Visual Grouping
Lian, Long and Wu, Zhirong and Yu, Stella X.



Research question: How to learn object segmentation from unlabeled videos.
Motivation: Humans can easily segment moving objects without knowing what they are. Motion-based segmentation has inspired unsupervised object discovery based on common fate; however, common fate is not a reliable indicator of objectness.
Method: First learn image features from relaxed common fate, then refine them based on visual appearance grouping within each image and statistically across images. Concretely, an image segmenter is first learned in the loop of approximating optical flow with constant segment flow plus a small within-segment residual flow, and then refined for more coherent appearance and statistical figure-ground relevance.
Results: On unsupervised video object segmentation, using only a ResNet and convolutional heads, the model surpasses the state of the art by absolute gains of 7/9/5% on DAVIS16 / STv2 / FBMS59 respectively, demonstrating the effectiveness of the ideas.

We study learning object segmentation from unlabeled videos. Humans can easily segment moving objects without knowing what they are. The Gestalt law of common fate, i.e., what move at the same speed belong together, has inspired unsupervised object discovery based on motion segmentation. However, common fate is not a reliable indicator of objectness: Parts of an articulated / deformable object may not move at the same speed, whereas shadows / reflections of an object always move with it but are not part of it. Our insight is to bootstrap objectness by first learning image features from relaxed common fate and then refining them based on visual appearance grouping within the image itself and across images statistically. Specifically, we learn an image segmenter first in the loop of approximating optical flow with constant segment flow plus small within-segment residual flow, and then by refining it for more coherent appearance and statistical figure-ground relevance. On unsupervised video object segmentation, using only ResNet and convolutional heads, our model surpasses the state-of-the-art by absolute gains of 7/9/5% on DAVIS16 / STv2 / FBMS59 respectively, demonstrating the effectiveness of our ideas. Our code is publicly available.
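The "constant segment flow plus small within-segment residual" approximation can be sketched in a few lines; `segment_flow_residual` is a hypothetical helper, and the paper fits this decomposition inside the segmenter's training loop rather than as a standalone function.

```python
import numpy as np

def segment_flow_residual(flow, seg):
    """Approximate a dense flow field (H, W, 2) by a constant flow per segment id
    in seg (H, W). Returns the piecewise-constant approximation and the
    within-segment residual flow (their sum reproduces the input flow)."""
    approx = np.zeros_like(flow)
    for s in np.unique(seg):
        m = seg == s
        approx[m] = flow[m].mean(axis=0)  # per-segment mean flow vector
    return approx, flow - approx
```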

Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation
Kang, Dahyun and Koniusz, Piotr and Cho, Minsu and Murray, Naila



Research question: This paper addresses weakly-supervised few-shot image classification and segmentation.
Motivation: Leverage a Vision Transformer (ViT) pretrained with self-supervision for weakly-supervised few-shot image classification and segmentation.
Method: Take token representations from the self-supervised ViT and leverage their correlations, via self-attention, to produce classification and segmentation predictions through separate task heads.
Results: Experiments on Pascal-5i and COCO-20i show significant performance gains in a variety of supervision settings, particularly when little to no pixel-level labels are available.

We address the task of weakly-supervised few-shot image classification and segmentation, by leveraging a Vision Transformer (ViT) pretrained with self-supervision. Our proposed method takes token representations from the self-supervised ViT and leverages their correlations, via self-attention, to produce classification and segmentation predictions through separate task heads. Our model is able to effectively learn to perform classification and segmentation in the absence of pixel-level labels during training, using only image-level labels. To do this it uses attention maps, created from tokens generated by the self-supervised ViT backbone, as pixel-level pseudo-labels. We also explore a practical setup with "mixed" supervision, where a small number of training images contains ground-truth pixel-level labels and the remaining images have only image-level labels. For this mixed setup, we propose to improve the pseudo-labels using a pseudo-label enhancer that was trained using the available ground-truth pixel-level labels. Experiments on Pascal-5i and COCO-20i demonstrate significant performance gains in a variety of supervision settings, and in particular when little-to-no pixel-level labels are available.
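The attention-map pseudo-labeling step can be sketched minimally: threshold a [0, 1] attention map and assign the image-level class to the retained pixels. `attention_to_pseudo_mask` and the threshold value are illustrative assumptions; the paper derives the maps from the self-supervised ViT backbone's tokens.

```python
import numpy as np

def attention_to_pseudo_mask(attn, class_id, thresh=0.6):
    """Turn a [0, 1] attention map into a pixel-level pseudo-label: pixels at or
    above `thresh` take the image-level class id, the rest become background (0)."""
    return np.where(attn >= thresh, class_id, 0)
```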

Collaborative Noisy Label Cleaner: Learning Scene-Aware Trailers for Multi-Modal Highlight Detection in Movies
Gan, Bei and Shu, Xiujun and Qiao, Ruizhi and Wu, Haoqian and Chen, Keyu and Li, Hanjun and Ren, Bo



Research question: How to effectively detect movie highlights from trailers while handling the uncertainty different annotators exhibit during labeling.
Motivation: Current movie highlight detection methods require extensive manual annotation, which is inaccurate and time-consuming. Moreover, existing video corpora can be used for training but are often noisy and incomplete.
Method: Reformulate highlight detection as "learning with noisy labels". First, scene segmentation is applied to movie trailers to obtain complete shots, which are regarded as noisy labels. Then a Collaborative noisy Label Cleaner (CLC) framework learns from the noisy highlight moments. CLC consists of two modules: augmented cross-propagation (ACP), which exploits the closely related audio-visual signals and fuses them into unified multi-modal representations, and multi-modality cleaning (MMC), which obtains cleaner highlight labels by observing loss changes across modalities.
Results: Comprehensive experiments on the MovieLights and YouTube Highlights datasets demonstrate the effectiveness of the approach.

Movie highlights stand out of the screenplay for efficient browsing and play a crucial role on social media platforms. Based on existing efforts, this work has two observations: (1) For different annotators, labeling highlight has uncertainty, which leads to inaccurate and time-consuming annotations. (2) Besides previous supervised or unsupervised settings, some existing video corpora can be useful, e.g., trailers, but they are often noisy and incomplete to cover the full highlights. In this work, we study a more practical and promising setting, i.e., reformulating highlight detection as "learning with noisy labels". This setting does not require time-consuming manual annotations and can fully utilize existing abundant video corpora. First, based on movie trailers, we leverage scene segmentation to obtain complete shots, which are regarded as noisy labels. Then, we propose a Collaborative noisy Label Cleaner (CLC) framework to learn from noisy highlight moments. CLC consists of two modules: augmented cross-propagation (ACP) and multi-modality cleaning (MMC). The former aims to exploit the closely related audio-visual signals and fuse them to learn unified multi-modal representations. The latter aims to achieve cleaner highlight labels by observing the changes in losses among different modalities. To verify the effectiveness of CLC, we further collect a large-scale highlight dataset named MovieLights. Comprehensive experiments on MovieLights and YouTube Highlights datasets demonstrate the effectiveness of our approach. Code has been made available at: https://github.com/TencentYoutuResearch/HighlightDetection-CLC

Contrastive Mean Teacher for Domain Adaptive Object Detectors
Cao, Shengcao and Joshi, Dhiraj and Gui, Liang-Yan and Wang, Yu-Xiong



Research question: Object detectors suffer from the domain gap between training (source domain) and real-world applications (target domain).
Motivation: Mean-teacher self-training is a powerful unsupervised domain adaptation paradigm for object detection, but it struggles with low-quality pseudo-labels.
Method: Contrastive Mean Teacher (CMT), a unified general-purpose framework that naturally integrates the two paradigms to maximize beneficial learning signals.
Results: Combined with recent mean-teacher self-training methods, CMT sets a new state of the art on the target domain, e.g., 51.9% mAP on Foggy Cityscapes, outperforming the previous best by 2.1% mAP.

Object detectors often suffer from the domain gap between training (source domain) and real-world applications (target domain). Mean-teacher self-training is a powerful paradigm in unsupervised domain adaptation for object detection, but it struggles with low-quality pseudo-labels. In this work, we identify the intriguing alignment and synergy between mean-teacher self-training and contrastive learning. Motivated by this, we propose Contrastive Mean Teacher (CMT) -- a unified, general-purpose framework with the two paradigms naturally integrated to maximize beneficial learning signals. Instead of using pseudo-labels solely for final predictions, our strategy extracts object-level features using pseudo-labels and optimizes them via contrastive learning, without requiring labels in the target domain. When combined with recent mean-teacher self-training methods, CMT leads to new state-of-the-art target-domain performance: 51.9% mAP on Foggy Cityscapes, outperforming the previously best by 2.1% mAP. Notably, CMT can stabilize performance and provide more significant gains as pseudo-label noise increases.
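The object-level contrastive optimization described above is InfoNCE-style: pull an anchor feature toward a positive of the same pseudo-class and away from negatives. A minimal single-anchor sketch, with `info_nce` as a hypothetical function name and unit-normalized feature vectors assumed:

```python
import numpy as np

def info_nce(q, k_pos, k_negs, tau=0.1):
    """InfoNCE loss for one anchor q: attract the positive k_pos, repel k_negs."""
    logits = np.array([q @ k_pos] + [q @ k for k in k_negs]) / tau
    logits -= logits.max()                     # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()  # softmax over (positive, negatives)
    return -np.log(p[0])                       # positive sits at index 0
```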

Primitive Generation and Semantic-Related Alignment for Universal Zero-Shot Segmentation
He, Shuting and Ding, Henghui and Jiang, Wei



Research question: This work studies universal zero-shot segmentation, aiming at panoptic, instance, and semantic segmentation for novel categories without any training samples.
Motivation: Zero-shot segmentation relies on inter-class relationships in semantic space to transfer visual knowledge learned from seen categories to unseen ones; the semantic and visual spaces must therefore be well bridged, and the semantic relationships applied to visual feature learning.
Method: A generative model synthesizes features for unseen categories, linking the semantic and visual spaces and addressing the lack of unseen training data. To mitigate the domain gap between the two spaces, the vanilla generator is first enhanced with learned primitives containing fine-grained category-related attributes, and unseen features are synthesized by selectively assembling these primitives. Second, visual features are disentangled into a semantic-related part and a semantic-unrelated part that carries useful classification clues but is less relevant to semantic representation; the inter-class relationships of the semantic-related visual features are then aligned with those in semantic space, transferring semantic knowledge to visual feature learning.
Results: The proposed approach achieves impressive state-of-the-art performance on zero-shot panoptic segmentation, instance segmentation, and semantic segmentation.

We study universal zero-shot segmentation in this work to achieve panoptic, instance, and semantic segmentation for novel categories without any training samples. Such zero-shot segmentation ability relies on inter-class relationships in semantic space to transfer the visual knowledge learned from seen categories to unseen ones. Thus, it is desired to well bridge semantic-visual spaces and apply the semantic relationships to visual feature learning. We introduce a generative model to synthesize features for unseen categories, which links semantic and visual spaces as well as address the issue of lack of unseen training data. Furthermore, to mitigate the domain gap between semantic and visual spaces, firstly, we enhance the vanilla generator with learned primitives, each of which contains fine-grained attributes related to categories, and synthesize unseen features by selectively assembling these primitives. Secondly, we propose to disentangle the visual feature into the semantic-related part and the semantic-unrelated part that contains useful visual classification clues but is less relevant to semantic representation. The inter-class relationships of semantic-related visual features are then required to be aligned with those in semantic space, thereby transferring semantic knowledge to visual feature learning. The proposed approach achieves impressively state-of-the-art performance on zero-shot panoptic segmentation, instance segmentation, and semantic segmentation.

HandsOff: Labeled Dataset Generation With No Additional Human Annotations
Xu, Austin and Vasileva, Mariya I. and Dave, Achal and Seshadri, Arjun



Research question: How to generate large labeled synthetic datasets effectively without relying on additional human annotation, while ensuring the quality of the generated labels.
Motivation: Existing GAN-based dataset generation methods require new annotations of synthetic images, which limits their applicability.
Method: The HandsOff framework, which, after training on fewer than 50 pre-existing labeled images, can produce an unlimited number of synthetic images with corresponding labels.
Results: The method generates datasets with rich pixel-wise labels in multiple challenging domains (faces, cars, full-body human poses, and urban driving scenes) and achieves performance superior to prior dataset generation approaches and transfer-learning baselines on semantic segmentation, keypoint detection, and depth estimation.

Recent work leverages the expressive power of generative adversarial networks (GANs) to generate labeled synthetic datasets. These dataset generation methods often require new annotations of synthetic images, which forces practitioners to seek out annotators, curate a set of synthetic images, and ensure the quality of generated labels. We introduce the HandsOff framework, a technique capable of producing an unlimited number of synthetic images and corresponding labels after being trained on less than 50 pre-existing labeled images. Our framework avoids the practical drawbacks of prior work by unifying the field of GAN inversion with dataset generation. We generate datasets with rich pixel-wise labels in multiple challenging domains such as faces, cars, full-body human poses, and urban driving scenes. Our method achieves state-of-the-art performance in semantic segmentation, keypoint detection, and depth estimation compared to prior dataset generation approaches and transfer learning baselines. We additionally showcase its ability to address broad challenges in model development which stem from fixed, hand-annotated datasets, such as the long-tail problem in semantic segmentation. Project page: austinxu87.github.io/handsoff.

Semi-Supervised 2D Human Pose Estimation Driven by Position Inconsistency Pseudo Label Correction Module
Huang, Linzhi and Li, Yulong and Tian, Hongbo and Yang, Yue and Li, Xiangang and Deng, Weihong and Ye, Jieping



Research question: This paper addresses two problems in semi-supervised 2D human pose estimation: when a large model and a lightweight model are trained interactively, the lightweight model's pseudo-labels end up guiding the large model; and noisy pseudo-labels negatively affect training.
Motivation: Current 2D human pose estimation methods overlook these two problems, and the labels involved (keypoint category and keypoint position) are relatively complex.
Method: A semi-supervised 2D human pose estimation framework driven by a position inconsistency pseudo-label correction module (SSPCM). An additional auxiliary teacher is introduced; the pseudo-labels generated by the two teachers in different periods are used to compute an inconsistency score and remove outliers. The two teacher models are then updated through interactive training, and the student model is updated with pseudo-labels generated by both teachers. A semi-supervised Cut-Occlude strategy based on pseudo keypoint perception further generates harder and more effective samples.
Results: Experiments show the method outperforms the previous best semi-supervised 2D human pose estimation method.

In this paper, we delve into semi-supervised 2D human pose estimation. The previous method ignored two problems: (i) When conducting interactive training between large model and lightweight model, the pseudo label of lightweight model will be used to guide large models. (ii) The negative impact of noise pseudo labels on training. Moreover, the labels used for 2D human pose estimation are relatively complex: keypoint category and keypoint position. To solve the problems mentioned above, we propose a semi-supervised 2D human pose estimation framework driven by a position inconsistency pseudo label correction module (SSPCM). We introduce an additional auxiliary teacher and use the pseudo labels generated by the two teacher model in different periods to calculate the inconsistency score and remove outliers. Then, the two teacher models are updated through interactive training, and the student model is updated using the pseudo labels generated by two teachers. To further improve the performance of the student model, we use the semi-supervised Cut-Occlude based on pseudo keypoint perception to generate more hard and effective samples. In addition, we also proposed a new indoor overhead fisheye human keypoint dataset WEPDTOF-Pose. Extensive experiments demonstrate that our method outperforms the previous best semi-supervised 2D human pose estimation method. We will release the code and dataset at https://github.com/hlz0606/SSPCM.
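The position-inconsistency filtering can be sketched as follows: score each sample by the mean distance between the two teachers' predicted keypoints and drop samples whose score exceeds a threshold. The helper name, array layout (batch, keypoints, xy), and threshold value are illustrative assumptions.

```python
import numpy as np

def inconsistency_filter(kpts_a, kpts_b, thresh=5.0):
    """Per-sample inconsistency score: mean Euclidean distance between the two
    teachers' keypoint predictions (B, K, 2); samples above `thresh` are outliers."""
    scores = np.linalg.norm(kpts_a - kpts_b, axis=-1).mean(axis=-1)  # (B,)
    keep = scores < thresh
    return scores, keep
```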

X3KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection
Klingner, Marvin and Borse, Shubhankar and Kumar, Varun Ravi and Rezaei, Behnaz and Narayanan, Venkatraman and Yogamani, Senthil and Porikli, Fatih



Research question: This paper tackles the ambiguity in surround-view 3D object detection (3DOD) from multi-camera images caused by missing depth information during the feature view transformation.
Motivation: Current multi-camera 3DOD models produce ambiguous results when transforming features from the perspective view into a 3D world representation, due to the missing depth information.
Method: X3KD, a comprehensive knowledge distillation framework across modalities, tasks, and stages, comprising cross-task distillation, cross-modal feature distillation, adversarial training, and cross-modal output distillation.
Results: Experiments show the X3KD model outperforms previous state-of-the-art approaches on the nuScenes and Waymo datasets and generalizes to RADAR-based 3DOD.

Recent advances in 3D object detection (3DOD) have obtained remarkably strong results for LiDAR-based models. In contrast, surround-view 3DOD models based on multiple camera images underperform due to the necessary view transformation of features from perspective view (PV) to a 3D world representation which is ambiguous due to missing depth information. This paper introduces X3KD, a comprehensive knowledge distillation framework across different modalities, tasks, and stages for multi-camera 3DOD. Specifically, we propose cross-task distillation from an instance segmentation teacher (X-IS) in the PV feature extraction stage providing supervision without ambiguous error backpropagation through the view transformation. After the transformation, we apply cross-modal feature distillation (X-FD) and adversarial training (X-AT) to improve the 3D world representation of multi-camera features through the information contained in a LiDAR-based 3DOD teacher. Finally, we also employ this teacher for cross-modal output distillation (X-OD), providing dense supervision at the prediction stage. We perform extensive ablations of knowledge distillation at different stages of multi-camera 3DOD. Our final X3KD model outperforms previous state-of-the-art approaches on the nuScenes and Waymo datasets and generalizes to RADAR-based 3DOD. Qualitative results video at https://youtu.be/1do9DPFmr38.

Optimal Transport Minimization: Crowd Localization on Density Maps for Semi-Supervised Counting
Lin, Wei and Chan, Antoni B.



Research question: How to improve the accuracy of crowd localization on crowd density maps.
Motivation: Although deep neural networks have greatly advanced crowd density map prediction, most methods do not further explore the ability to localize people in the density map.
Method: An optimal transport minimization (OT-M) algorithm for crowd localization on density maps. The objective is to find the target point map with the minimal Sinkhorn distance to the input density map, and an iterative algorithm is proposed to compute the solution.
Results: By applying OT-M to generate hard pseudo-labels (point maps) instead of the soft pseudo-labels (density maps) used in previous methods, the approach achieves outstanding performance on both crowd localization and semi-supervised counting.

The accuracy of crowd counting in images has improved greatly in recent years due to the development of deep neural networks for predicting crowd density maps. However, most methods do not further explore the ability to localize people in the density map, with those few works adopting simple methods, like finding the local peaks in the density map. In this paper, we propose the optimal transport minimization (OT-M) algorithm for crowd localization with density maps. The objective of OT-M is to find a target point map that has the minimal Sinkhorn distance with the input density map, and we propose an iterative algorithm to compute the solution. We then apply OT-M to generate hard pseudo-labels (point maps) for semi-supervised counting, rather than the soft pseudo-labels (density maps) used in previous methods. Our hard pseudo-labels provide stronger supervision, and also enable the use of recent density-to-point loss functions for training. We also propose a confidence weighting strategy to give higher weight to the more reliable unlabeled data. Extensive experiments show that our methods achieve outstanding performance on both crowd localization and semi-supervised counting. Code is available at https://github.com/Elin24/OT-M.

L-CoIns: Language-Based Colorization With Instance Awareness
Chang, Zheng and Weng, Shuchen and Zhang, Peixuan and Li, Yu and Li, Si and Shi, Boxin



Research question: How to automatically produce colors consistent with a language description of the image content while resolving color-object coupling and mismatch issues.
Motivation: Existing methods still have difficulty distinguishing instances corresponding to the same object words, calling for an instance-aware approach.
Method: A transformer-based framework that automatically aggregates similar image patches to achieve instance awareness, with luminance augmentation and a counter-color loss applied to break the statistical correlation between luminance and color words.
Results: A dataset with distinctive visual characteristics and detailed language descriptions is collected; experiments show the method produces visually pleasing, description-consistent instance-aware colorization results.

Language-based colorization produces plausible colors consistent with the language description provided by the user. Recent studies introduce additional annotation to prevent color-object coupling and mismatch issues, but they still have difficulty in distinguishing instances corresponding to the same object words. In this paper, we propose a transformer-based framework to automatically aggregate similar image patches and achieve instance awareness without any additional knowledge. By applying our presented luminance augmentation and counter-color loss to break down the statistical correlation between luminance and color words, our model is driven to synthesize colors with better descriptive consistency. We further collect a dataset to provide distinctive visual characteristics and detailed language descriptions for multiple instances in the same image. Extensive experiments demonstrate our advantages of synthesizing visually pleasing and description-consistent results of instance-aware colorization.

MixTeacher: Mining Promising Labels With Mixed Scale Teacher for Semi-Supervised Object Detection
Liu, Liang and Zhang, Boshen and Zhang, Jiangning and Zhang, Wuhao and Gan, Zhenye and Tian, Guanzhong and Zhu, Wenbing and Wang, Yabiao and Wang, Chengjie



Research question: This paper addresses scale variation across object instances, a key challenge in object detection, particularly in the semi-supervised case.
Motivation: Although modern detection models have made remarkable progress in handling scale variation, it still causes trouble in the semi-supervised setting. Most existing semi-supervised object detection methods rely on strict conditions to filter high-quality pseudo-labels from network predictions, yet objects with extreme scales tend to have low confidence, leaving them without positive supervision.
Method: A deep study of the scale variation problem and a novel framework that introduces a mixed scale teacher to improve pseudo-label generation and scale-invariant learning. In addition, benefiting from the better predictions of mixed-scale features, pseudo-labels are mined via score promotion of predictions across scales.
Results: Extensive experiments on the MS COCO and PASCAL VOC benchmarks under various semi-supervised settings show the method achieves new state-of-the-art performance. Code and models will be made publicly available.

Scale variation across object instances is one of the key challenges in object detection. Although modern detection models have achieved remarkable progress in dealing with the scale variation, it still brings trouble in the semi-supervised case. Most existing semi-supervised object detection methods rely on strict conditions to filter out high-quality pseudo labels from the network predictions. However, we observe that objects with extreme scale tend to have low confidence, which makes the positive supervision missing for these objects. In this paper, we delve into the scale variation problem, and propose a novel framework by introducing a mixed scale teacher to improve the pseudo labels generation and scale invariant learning. In addition, benefiting from the better predictions from mixed scale features, we propose to mine pseudo labels with the score promotion of predictions across scales. Extensive experiments on MS COCO and PASCAL VOC benchmarks under various semi-supervised settings demonstrate that our method achieves new state-of-the-art performance. The code and models will be made publicly available.

Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation
Bai, Yunhao and Chen, Duowen and Li, Qingli and Shen, Wei and Wang, Yan



Research question: Semi-supervised medical image segmentation suffers from an empirical mismatch between the labeled and unlabeled data distributions.
Motivation: If labeled and unlabeled data are treated separately, or trained in an inconsistent manner, much of the knowledge learned from the labeled data may be discarded.
Method: A straightforward remedy is proposed: bidirectionally copy-pasting labeled and unlabeled data within a simple Mean Teacher architecture, which encourages the unlabeled data to learn comprehensive common semantics from the labeled data.
Results: Experiments show solid gains over other state-of-the-art methods on various semi-supervised medical image segmentation datasets (e.g., over 21% Dice improvement on the ACDC dataset with only 5% labeled data).

In semi-supervised medical image segmentation, there exist empirical mismatch problems between labeled and unlabeled data distribution. The knowledge learned from the labeled data may be largely discarded if treating labeled and unlabeled data separately or training labeled and unlabeled data in an inconsistent manner. We propose a straightforward method for alleviating the problem -- copy-pasting labeled and unlabeled data bidirectionally, in a simple Mean Teacher architecture. The method encourages unlabeled data to learn comprehensive common semantics from the labeled data in both inward and outward directions. More importantly, the consistent learning procedure for labeled and unlabeled data can largely reduce the empirical distribution gap. In detail, we copy-paste a random crop from a labeled image (foreground) onto an unlabeled image (background) and an unlabeled image (foreground) onto a labeled image (background), respectively. The two mixed images are fed into a Student network. It is trained by the generated supervisory signal via bidirectional copy-pasting between the predictions of the unlabeled images from the Teacher and the label maps of the labeled images. We explore several design choices of how to copy-paste to make it more effective for minimizing empirical distribution gaps between labeled and unlabeled data. We reveal that the simple mechanism of copy-pasting bidirectionally between labeled and unlabeled data is good enough and the experiments show solid gains (e.g., over 21% Dice improvement on ACDC dataset with 5% labeled data) compared with other state-of-the-arts on various semi-supervised medical image segmentation datasets.
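The bidirectional copy-paste step described above is easy to picture as array mixing. A toy 2D NumPy sketch, where the helper name and the box convention are assumptions and the real method operates on 3D medical volumes inside the Mean Teacher training loop:

```python
import numpy as np

def bidirectional_copy_paste(labeled, unlabeled, label_map, pseudo_map, box):
    """Mix one labeled and one unlabeled image in both directions (sketch).

    box = (y0, y1, x0, x1) is the pasted crop region.
    Returns:
      in_img/in_sup   : labeled crop pasted onto the unlabeled background,
                        with the supervision maps mixed the same way;
      out_img/out_sup : unlabeled crop pasted onto the labeled background.
    """
    y0, y1, x0, x1 = box
    in_img, out_img = unlabeled.copy(), labeled.copy()
    in_sup, out_sup = pseudo_map.copy(), label_map.copy()
    in_img[y0:y1, x0:x1] = labeled[y0:y1, x0:x1]        # inward direction
    in_sup[y0:y1, x0:x1] = label_map[y0:y1, x0:x1]
    out_img[y0:y1, x0:x1] = unlabeled[y0:y1, x0:x1]     # outward direction
    out_sup[y0:y1, x0:x1] = pseudo_map[y0:y1, x0:x1]
    return in_img, in_sup, out_img, out_sup
```

Both mixed images are then fed to the Student network, and the mixed supervision maps (ground-truth labels plus Teacher pseudo labels) provide the training signal.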

Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation
He, Shuting and Ding, Henghui and Jiang, Wei



Research question: How to improve zero-shot instance segmentation, especially on unseen categories.
Motivation: Existing models are strongly biased toward seen categories during training and struggle to distinguish unseen objects from the background.
Method: The D^2Zero model combines semantic-promoted debiasing with background disambiguation: inter-class semantic relationships are used in visual feature training, an input-conditional classifier is learned for dynamic classification, and an image-adaptive background representation avoids mistaking novel objects for background.
Results: Experiments show the method significantly outperforms previous state-of-the-art methods, e.g., a 16.86% improvement on COCO.

Zero-shot instance segmentation aims to detect and precisely segment objects of unseen categories without any training samples. Since the model is trained on seen categories, there is a strong bias that the model tends to classify all the objects into seen categories. Besides, there is a natural confusion between background and novel objects that have never shown up in training. These two challenges make it hard for novel objects to emerge in the final instance segmentation results. It is desirable to rescue novel objects from the background and the dominant seen categories. To this end, we propose D^2Zero with Semantic-Promoted Debiasing and Background Disambiguation to enhance the performance of zero-shot instance segmentation. Semantic-promoted debiasing utilizes inter-class semantic relationships to involve unseen categories in visual feature training and learns an input-conditional classifier to conduct dynamic classification based on the input image. Background disambiguation produces an image-adaptive background representation to avoid mistaking novel objects for background. Extensive experiments show that we significantly outperform previous state-of-the-art methods by a large margin, e.g., a 16.86% improvement on COCO.

CoMFormer: Continual Learning in Semantic and Panoptic Segmentation
Cermelli, Fabio and Cord, Matthieu and Douillard, Arthur



Research question: Continual learning for both semantic and panoptic segmentation.
Motivation: Existing work focuses mainly on semantic segmentation and disregards panoptic segmentation, a task with real-world impact.
Method: The CoMFormer model exploits the properties of transformer architectures to learn new classes over time, with a novel adaptive distillation loss and a mask-based pseudo-labeling technique to effectively prevent forgetting.
Results: Experiments on ADE20K show that CoMFormer both forgets old classes less and learns new classes more effectively than existing methods, and it also significantly outperforms the state of the art in the large-scale continual semantic segmentation setting.

Continual learning for segmentation has recently seen increasing interest. However, all previous works focus narrowly on semantic segmentation and disregard panoptic segmentation, an important task with real-world impact. In this paper, we present the first continual learning model capable of operating on both semantic and panoptic segmentation. Inspired by recent transformer approaches that consider segmentation as a mask-classification problem, we design CoMFormer. Our method carefully exploits the properties of transformer architectures to learn new classes over time. Specifically, we propose a novel adaptive distillation loss along with a mask-based pseudo-labeling technique to effectively prevent forgetting. To evaluate our approach, we introduce a novel continual panoptic segmentation benchmark on the challenging ADE20K dataset. Our CoMFormer outperforms all existing baselines by forgetting old classes less while learning new classes more effectively. In addition, we report an extensive evaluation in the large-scale continual semantic segmentation scenario, showing that CoMFormer also significantly outperforms state-of-the-art methods.

Towards Effective Visual Representations for Partial-Label Learning
Xia, Shiyu and Lv, Jiaqi and Xu, Ning and Niu, Gang and Geng, Xin



Research question: How to improve performance on vision tasks under partial-label learning (PLL), where true labels are unavailable.
Motivation: In PLL only an ambiguous candidate set containing the unknown true label is accessible, so representations must be learned by contrasting entities of the same/different classes to boost performance.
Method: This paper proposes PaPi, a framework in which a linear classifier sharing the feature encoder guides the optimization of a prototypical classifier, explicitly encouraging the representation to reflect visual similarity between categories.
Results: Experiments show PaPi significantly outperforms other PLL methods on various image classification tasks.

Under partial-label learning (PLL) where, for each training instance, only a set of ambiguous candidate labels containing the unknown true label is accessible, contrastive learning has recently boosted the performance of PLL on vision tasks, attributed to representations learned by contrasting the same/different classes of entities. Without access to true labels, positive points are predicted using pseudolabels that are inherently noisy, and negative points often require large batches or momentum encoders, resulting in unreliable similarity information and a high computational overhead. In this paper, we rethink a state-of-the-art contrastive PLL method PiCO [24], inspiring the design of a simple framework termed PaPi (Partial-label learning with a guided Prototypical classifier), which demonstrates significant scope for improvement in representation learning, thus contributing to label disambiguation. PaPi guides the optimization of a prototypical classifier by a linear classifier with which they share the same feature encoder, thus explicitly encouraging the representation to reflect visual similarity between categories. It is also technically appealing, as PaPi requires only a few components in PiCO with the opposite direction of guidance, and directly eliminates the contrastive learning module that would introduce noise and consume computational resources. We empirically demonstrate that PaPi significantly outperforms other PLL methods on various image classification tasks.

A Loopback Network for Explainable Microvascular Invasion Classification
Zhang, Shengxuming and Shi, Tianqi and Jiang, Yang and Zhang, Xiuming and Lei, Jie and Feng, Zunlei and Song, Mingli



Research question: Developing an accurate, objective, and explainable diagnosis tool for microvascular invasion (MVI).
Motivation: MVI diagnosis currently relies on pathologists manually finding cancerous cells among hundreds of blood vessels, which is time-consuming, tedious, and subjective. Deep learning has achieved promising results in medical image analysis, but the opacity of black-box models and the demand for massive annotated samples limit the clinical application of deep-learning-based diagnosis.
Method: A Loopback Network (LoopNet) is proposed for classifying MVI efficiently. Using image-level category annotations from the collected Pathologic Vessel Image Dataset (PVID), LoopNet is composed of a binary classification branch and a cell locating branch; the latter locates cancerous cells, regular non-cancerous cells, and background. For healthy samples, pseudo masks of cells supervise the cell locating branch to separate regular non-cancerous cells from background; for each MVI sample, the branch predicts the cancerous-cell mask. The masked cancerous and non-cancerous areas of the same sample are then fed back into the binary classification branch separately, and this loopback lets the category label supervise the locating branch to learn where cancerous areas are.
Results: Experiments show LoopNet achieves 97.5% accuracy on MVI classification. Surprisingly, the loopback mechanism not only lets LoopNet predict cancerous areas but also helps the classification backbone achieve better classification performance.

Microvascular invasion (MVI) is a critical factor for prognosis evaluation and cancer treatment. The current diagnosis of MVI relies on pathologists to manually find out cancerous cells from hundreds of blood vessels, which is time-consuming, tedious, and subjective. Recently, deep learning has achieved promising results in medical image analysis tasks. However, the unexplainability of black box models and the requirement of massive annotated samples limit the clinical application of deep learning based diagnostic methods. In this paper, aiming to develop an accurate, objective, and explainable diagnosis tool for MVI, we propose a Loopback Network (LoopNet) for classifying MVI efficiently. With the image-level category annotations of the collected Pathologic Vessel Image Dataset (PVID), LoopNet is devised to be composed of a binary classification branch and a cell locating branch. The latter is devised to locate the area of cancerous cells, regular non-cancerous cells, and background. For healthy samples, the pseudo masks of cells supervise the cell locating branch to distinguish the area of regular non-cancerous cells and background. For each MVI sample, the cell locating branch predicts the mask of cancerous cells. Then the masked cancerous and non-cancerous areas of the same sample are fed back into the binary classification branch separately. The loopback between two branches enables the category label to supervise the cell locating branch to learn the locating ability for cancerous areas. Experimental results show that the proposed LoopNet achieves 97.5% accuracy on MVI classification. Surprisingly, the proposed loopback mechanism not only enables LoopNet to predict the cancerous area but also facilitates the classification backbone to achieve better classification performance.

Revisiting Weak-to-Strong Consistency in Semi-Supervised Semantic Segmentation
Yang, Lihe and Qi, Lei and Feng, Litong and Zhang, Wayne and Shi, Yinghuan



Research question: Improving the weak-to-strong consistency framework and applying it to semantic segmentation.
Motivation: The weak-to-strong consistency framework leaves room for improvement on segmentation, and its reliance on manually designed strong data augmentations limits exploration of a broader perturbation space.
Method: The authors propose an auxiliary feature perturbation stream as a supplement to expand the perturbation space, plus a dual-stream perturbation technique so that two strong views are simultaneously guided by a common weak view.
Results: The resulting Unified Dual-Stream Perturbations approach (UniMatch) significantly surpasses all existing methods across all evaluation protocols, and its superiority also holds in remote sensing interpretation and medical image analysis.

In this work, we revisit the weak-to-strong consistency framework, popularized by FixMatch from semi-supervised classification, where the prediction of a weakly perturbed image serves as supervision for its strongly perturbed version. Intriguingly, we observe that such a simple pipeline already achieves competitive results against recent advanced works, when transferred to our segmentation scenario. Its success heavily relies on the manual design of strong data augmentations, however, which may be limited and inadequate to explore a broader perturbation space. Motivated by this, we propose an auxiliary feature perturbation stream as a supplement, leading to an expanded perturbation space. On the other hand, to sufficiently probe original image-level augmentations, we present a dual-stream perturbation technique, enabling two strong views to be simultaneously guided by a common weak view. Consequently, our overall Unified Dual-Stream Perturbations approach (UniMatch) surpasses all existing methods significantly across all evaluation protocols on the Pascal, Cityscapes, and COCO benchmarks. Its superiority is also demonstrated in remote sensing interpretation and medical image analysis. We hope our reproduced FixMatch and our results can inspire more future works. Code and logs are available at https://github.com/LiheYoung/UniMatch.

MP-Former: Mask-Piloted Transformer for Image Segmentation
Zhang, Hao and Li, Feng and Xu, Huaizhe and Huang, Shijia and Liu, Shilong and Ni, Lionel M. and Zhang, Lei



Research question: Improving masked attention in Mask2Former for image segmentation by addressing inconsistent mask predictions between consecutive decoder layers.
Motivation: Mask2Former's inconsistent mask predictions across consecutive decoder layers lead to inconsistent optimization goals and low utilization of decoder queries.
Method: A mask-piloted training approach is proposed that additionally feeds noised ground-truth masks into masked attention and trains the model to reconstruct the original masks.
Results: Compared with the predicted masks used in masked attention, ground-truth masks serve as a pilot and effectively alleviate the negative impact of inaccurate mask predictions in Mask2Former. Built on this technique, MP-Former achieves remarkable gains on all three image segmentation tasks (instance, panoptic, and semantic), yielding +2.3 AP and +1.6 mIoU on Cityscapes instance and semantic segmentation with a ResNet-50 backbone. It also greatly speeds up training, surpassing Mask2Former with half the training epochs on ADE20K with both ResNet-50 and Swin-L backbones, while introducing only a little extra computation during training and none during inference. Code will be released at https://github.com/IDEA-Research/MP-Former.

We present a mask-piloted Transformer which improves masked-attention in Mask2Former for image segmentation. The improvement is based on our observation that Mask2Former suffers from inconsistent mask predictions between consecutive decoder layers, which leads to inconsistent optimization goals and low utilization of decoder queries. To address this problem, we propose a mask-piloted training approach, which additionally feeds noised ground-truth masks in masked-attention and trains the model to reconstruct the original ones. Compared with the predicted masks used in masked-attention, the ground-truth masks serve as a pilot and effectively alleviate the negative impact of inaccurate mask predictions in Mask2Former. Based on this technique, our MP-Former achieves a remarkable performance improvement on all three image segmentation tasks (instance, panoptic, and semantic), yielding +2.3 AP and +1.6 mIoU on the Cityscapes instance and semantic segmentation tasks with a ResNet-50 backbone. Our method also significantly speeds up the training, outperforming Mask2Former with half the number of training epochs on ADE20K with both ResNet-50 and Swin-L backbones. Moreover, our method introduces only a little extra computation during training and no extra computation during inference. Our code will be released at https://github.com/IDEA-Research/MP-Former.

Learning Orthogonal Prototypes for Generalized Few-Shot Semantic Segmentation
Liu, Sun-Ao and Zhang, Yiheng and Qiu, Zhaofan and Xie, Hongtao and Zhang, Yongdong and Yao, Ting



Research question: How to distinguish pixels of base and novel classes simultaneously without sacrificing performance on base classes.
Motivation: Current generalized few-shot semantic segmentation methods tend to compromise the learned features during updating, degrading base-class performance.
Method: A new Projection onto Orthogonal Prototypes (POP) method builds a set of orthogonal prototypes, one per semantic class, and predicts each class from features projected onto its prototype, updating features to recognize novel classes without harming base classes.
Results: Experiments show POP achieves superior performance on novel classes with little impact on base classes; in the 5-shot scenario on PASCAL-5i, POP outperforms the state-of-the-art fine-tuning method by 3.93% overall mIoU.

Generalized few-shot semantic segmentation (GFSS) distinguishes pixels of base and novel classes from the background simultaneously, conditioning on sufficient data of base classes and a few examples from novel classes. A typical GFSS approach has two training phases: base class learning and novel class updating. Nevertheless, such a stand-alone updating process often compromises the well-learnt features and results in performance drop on base classes. In this paper, we propose a new idea of leveraging Projection onto Orthogonal Prototypes (POP), which updates features to identify novel classes without compromising base classes. POP builds a set of orthogonal prototypes, each of which represents a semantic class, and makes the prediction for each class separately based on the features projected onto its prototype. Technically, POP first learns prototypes on base data, and then extends the prototype set to novel classes. The orthogonal constraint of POP encourages the orthogonality between the learnt prototypes and thus mitigates the influence on base class features when generalizing to novel prototypes. Moreover, we capitalize on the residual of feature projection as the background representation to dynamically fit semantic shifting (i.e., background no longer includes the pixels of novel classes in the updating phase). Extensive experiments on two benchmarks demonstrate that our POP achieves superior performance on novel classes without sacrificing much accuracy on base classes. Notably, POP outperforms the state-of-the-art fine-tuning by 3.93% overall mIoU on PASCAL-5i in the 5-shot scenario.
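The projection mechanism described above, per-class scores from features projected onto orthogonal prototypes and the projection residual reused as the background cue, can be sketched in a few lines. This is a toy illustration assuming the prototypes are orthonormal rows; in the actual method the prototypes are learned end-to-end with an orthogonality constraint:

```python
import numpy as np

def pop_predict(features, prototypes):
    """Projection onto Orthogonal Prototypes (illustrative sketch).

    features:   (N, D) feature vectors.
    prototypes: (C, D) rows assumed orthonormal, one per class.
    Returns per-class projection scores and the residual norm that
    serves as the background representation.
    """
    coeffs = features @ prototypes.T                     # (N, C) class scores
    recon = coeffs @ prototypes                          # part explained by class prototypes
    residual = np.linalg.norm(features - recon, axis=1)  # leftover = background cue
    return coeffs, residual
```

Because the prototypes are mutually orthogonal, extending the set with a new (orthogonal) prototype leaves the existing class scores unchanged, which is the intuition behind preserving base-class performance.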

Few-Shot Semantic Image Synthesis With Class Affinity Transfer
Careil, Marlène and Verbeek, Jakob and Lathuilière, Stéphane



Research question: Semantic image synthesis generates photorealistic images from semantic segmentation maps, but training requires large datasets of images with per-pixel label maps that are extremely tedious to obtain.
Motivation: To reduce the high annotation cost, a transfer learning approach is proposed that leverages a model trained on a large source dataset to improve learning on small target datasets via estimated pairwise relations between source and target classes.
Method: A class affinity matrix is introduced as the first layer of the source model to make it compatible with target label maps, after which the source model is fine-tuned on the target domain. To estimate class affinities, different sources of prior knowledge are considered: semantic segmentation on the source domain, textual label embeddings, and self-supervised vision features. The approach is applied to both GAN-based and diffusion-based semantic synthesis architectures.
Results: Experiments show the different affinity estimates can be effectively combined, and the approach significantly improves over existing state-of-the-art transfer methods for generative image models.

Semantic image synthesis aims to generate photo realistic images given a semantic segmentation map. Despite much recent progress, training such models still requires large datasets of images annotated with per-pixel label maps that are extremely tedious to obtain. To alleviate the high annotation cost, we propose a transfer method that leverages a model trained on a large source dataset to improve the learning ability on small target datasets via estimated pairwise relations between source and target classes. The class affinity matrix is introduced as a first layer to the source model to make it compatible with the target label maps, and the source model is then further fine-tuned for the target domain. To estimate the class affinities, we consider different approaches to leverage prior knowledge: semantic segmentation on the source domain, textual label embeddings, and self-supervised vision features. We apply our approach to GAN-based and diffusion-based architectures for semantic synthesis. Our experiments show that the different ways to estimate class affinity can be effectively combined, and that our approach significantly improves over existing state-of-the-art transfer approaches for generative image models.
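The "class affinity matrix as a first layer" amounts to a linear map taking target label maps into the source label space the pretrained model understands. A toy sketch under the assumption of one-hot target maps (the function name and shapes are illustrative):

```python
import numpy as np

def map_target_labels(onehot_target, affinity):
    """Translate target label maps into the source label space (sketch).

    onehot_target: (H, W, T) one-hot map over T target classes.
    affinity:      (T, S) each row is the estimated affinity of a target
                   class to the S source classes (e.g. from text
                   embeddings or source-domain segmentation statistics).
    Returns an (H, W, S) soft label map fed to the source model.
    """
    return onehot_target @ affinity
```

Since this is just a matrix product, the affinity matrix can be inserted as the source model's new first layer and optionally refined during fine-tuning on the target domain.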

One-to-Few Label Assignment for End-to-End Dense Detection
Li, Shuai and Li, Minghan and Li, Ruihuang and He, Chenhang and Zhang, Lei



Research question: One-to-one (o2o) label assignment in fully convolutional end-to-end dense detection can degrade feature learning due to the limited number of positive samples.
Motivation: Although extra positive samples can mitigate this issue, computing self- and cross-attention among anchors prevents practical application to dense, fully convolutional detectors.
Method: We propose a simple yet effective one-to-few (o2f) label assignment strategy: besides one positive anchor and many negative anchors per object, several soft anchors serve as positive and negative samples simultaneously. Their positive and negative weights are adjusted dynamically during training so that they contribute more to "representation learning" in the early stage and more to "duplicated prediction removal" later.
Results: Experiments on the COCO and CrowdHuman datasets demonstrate the effectiveness of the proposed o2f scheme.

One-to-one (o2o) label assignment plays a key role for transformer based end-to-end detection, and it has been recently introduced in fully convolutional detectors for lightweight end-to-end dense detection. However, o2o can largely degrade the feature learning performance due to the limited number of positive samples. Though extra positive samples can be introduced to mitigate this issue, the computation of self- and cross-attention among anchors prevents its practical application to dense and fully convolutional detectors. In this work, we propose a simple yet effective one-to-few (o2f) label assignment strategy for end-to-end dense detection. Apart from defining one positive and many negative anchors for each object, we define several soft anchors, which serve as positive and negative samples simultaneously. The positive and negative weights of these soft anchors are dynamically adjusted during training so that they can contribute more to 'representation learning' in the early training stage and contribute more to 'duplicated prediction removal' in the later stage. The detector trained in this way can not only learn a strong feature representation but also perform end-to-end detection. Experiments on COCO and CrowdHuman datasets demonstrate the effectiveness of the proposed o2f scheme.
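The time-varying role of a soft anchor can be pictured with a simple schedule. The interpolation below is a hypothetical illustration of the idea (fully positive early, increasingly negative late), not the paper's actual weighting rule:

```python
def soft_anchor_weights(t, ambiguity):
    """Illustrative positive/negative weights for a soft anchor.

    t:         training progress in [0, 1].
    ambiguity: in [0, 1]; 0 = a clear positive anchor, 1 = a fully
               ambiguous soft anchor.
    Ambiguous anchors start as positives (helping representation
    learning) and drift toward negatives (helping duplicate removal).
    """
    pos = (1.0 - t) * ambiguity + (1.0 - ambiguity)
    neg = 1.0 - pos
    return pos, neg
```

An unambiguous anchor (ambiguity 0) keeps full positive weight throughout, matching the one fixed positive anchor per object in the o2f scheme.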

Hierarchical Supervision and Shuffle Data Augmentation for 3D Semi-Supervised Object Detection
Liu, Chuandong and Gao, Chenqiang and Liu, Fangcen and Li, Pengcheng and Meng, Deyu and Gao, Xinbo



Research question: How to leverage limited labeled samples and abundant unlabeled samples via semi-supervised learning to improve 3D object detection.
Motivation: Existing 3D detectors usually require large amounts of high-quality 3D annotations for training, but such annotation is expensive and time-consuming, which is impractical for real applications.
Method: A novel Hierarchical Supervision and Shuffle Data Augmentation (HSSDA) method is proposed: a dynamic dual-threshold strategy lets the teacher network generate more reasonable supervision for the student network, while a shuffle data augmentation strategy strengthens the student's feature representation ability.
Results: Experiments show HSSDA consistently outperforms recent state-of-the-art methods on different datasets.

State-of-the-art 3D object detectors are usually trained on large-scale datasets with high-quality 3D annotations. However, such 3D annotations are often expensive and time-consuming, which may not be practical for real applications. A natural remedy is to adopt semi-supervised learning (SSL) by leveraging a limited amount of labeled samples and abundant unlabeled samples. Current pseudo-labeling-based SSL object detection methods mainly adopt a teacher-student framework, with a single fixed threshold strategy to generate supervision signals, which inevitably brings confused supervision when guiding the student network training. Besides, the data augmentation of the point cloud in the typical teacher-student framework is too weak, and only contains basic downsampling and flip-and-shift transforms (i.e., rotation and scaling), which hinders the effective learning of feature information. Hence, we address these issues by introducing a novel approach of Hierarchical Supervision and Shuffle Data Augmentation (HSSDA), which is a simple yet effective teacher-student framework. The teacher network generates more reasonable supervision for the student network by designing a dynamic dual-threshold strategy. Besides, the shuffle data augmentation strategy is designed to strengthen the feature representation ability of the student network. Extensive experiments show that HSSDA consistently outperforms the recent state-of-the-art methods on different datasets. The code will be released at https://github.com/azhuantou/HSSDA.
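The dual-threshold idea contrasted with a single fixed threshold can be sketched as a simple split of predicted box scores. The function name and the static thresholds are illustrative; in HSSDA the thresholds are determined dynamically during training:

```python
def dual_threshold(scores, t_high, t_low):
    """Hierarchical pseudo-label split (illustrative sketch).

    scores: detection confidences from the teacher network.
    Boxes above t_high become strong pseudo labels; boxes between the
    two thresholds are kept as ambiguous candidates (instead of being
    discarded, as a single-threshold rule would do); the rest are dropped.
    """
    strong = [s for s in scores if s >= t_high]
    ambiguous = [s for s in scores if t_low <= s < t_high]
    return strong, ambiguous
```

The ambiguous tier is what provides the "hierarchical" supervision: those boxes can still guide the student with a weaker training signal.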

ZegCLIP: Towards Adapting CLIP for Zero-Shot Semantic Segmentation
Zhou, Ziqin and Lei, Yinjie and Zhang, Bowen and Liu, Lingqiao and Liu, Yifan



Research question: How to extend CLIP's zero-shot ability from the image level to the pixel level while avoiding complexity and high computational cost.
Motivation: Existing two-stage methods require two image encoders, yielding a complicated pipeline and high computational cost.
Method: A simpler, more efficient one-stage solution directly extends CLIP's zero-shot prediction from images to pixels, generating semantic masks by comparing the similarity between text embeddings and CLIP-extracted patch embeddings.
Results: On three public benchmarks, ZegCLIP demonstrates superior performance, outperforming the best existing methods by a large margin under both "inductive" and "transductive" zero-shot settings, with inference about 5x faster than the two-stage approach.

Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme. The general idea is to first generate class-agnostic region proposals and then feed the cropped proposal regions to CLIP to utilize its image-level zero-shot classification capability. While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost. In this work, we pursue a simpler and more efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from image to pixel level. Our investigation starts with a straightforward extension as our baseline that generates semantic masks by comparing the similarity between text and patch embeddings extracted from CLIP. However, such a paradigm could heavily overfit the seen classes and fail to generalize to unseen classes. To handle this issue, we propose three simple-but-effective designs and figure out that they can significantly retain the inherent zero-shot capacity of CLIP and improve pixel-level generalization ability. Incorporating those modifications leads to an efficient zero-shot semantic segmentation system called ZegCLIP. Through extensive experiments on three public benchmarks, ZegCLIP demonstrates superior performance, outperforming the state-of-the-art methods by a large margin under both "inductive" and "transductive" zero-shot settings. In addition, compared with the two-stage method, our one-stage ZegCLIP achieves a speedup of about 5 times faster during inference. We release the code at https://github.com/ZiqinZhou66/ZegCLIP.git.
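The one-stage baseline the paper starts from, semantic masks obtained by comparing text and patch embeddings, is essentially a per-patch nearest-text assignment. A toy sketch with made-up shapes (real CLIP embeddings are high-dimensional and L2-normalized; nothing here is the paper's code):

```python
import numpy as np

def patch_text_masks(patch_emb, text_emb):
    """Baseline zero-shot segmentation from CLIP-style embeddings (sketch).

    patch_emb: (H*W, D) L2-normalized patch embeddings.
    text_emb:  (C, D)   L2-normalized class-name text embeddings.
    Each patch gets the class whose text embedding is most similar,
    producing a flat (H*W,) semantic mask.
    """
    sim = patch_emb @ text_emb.T   # cosine similarities (dot of unit vectors)
    return sim.argmax(axis=1)
```

ZegCLIP's contribution lies in the designs added on top of this baseline to keep it from overfitting to the seen classes; the sketch only shows the starting point.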

CLIP Is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation
Lin, Yuqi and Chen, Minghao and Wang, Wenxiao and Wu, Boxi and Li, Ke and Lin, Binbin and Liu, Haifeng and He, Xiaofei



Research question: How to exploit Contrastive Language-Image Pre-training (CLIP) for weakly supervised semantic segmentation (WSSS) using only image-level labels, without further training.
Motivation: Mainstream multi-stage WSSS frameworks suffer from high training costs; we explore localizing categories with CLIP without additional training.
Method: A new WSSS framework, CLIP-ES, introduces the softmax function into GradCAM and exploits CLIP's zero-shot ability to suppress confusion caused by non-target classes and backgrounds. Text inputs are also re-explored under the WSSS setting with two tailored text-driven strategies: sharpness-based prompt selection and synonym fusion.
Results: CLIP-ES achieves state-of-the-art performance on Pascal VOC 2012 and MS COCO 2014 while taking only 10% of the time of previous methods for pseudo-mask generation.

Weakly supervised semantic segmentation (WSSS) with image-level labels is a challenging task. Mainstream approaches follow a multi-stage framework and suffer from high training costs. In this paper, we explore the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels and without further training. To efficiently generate high-quality segmentation masks from CLIP, we propose a novel WSSS framework called CLIP-ES. Our framework improves all three stages of WSSS with special designs for CLIP: 1) We introduce the softmax function into GradCAM and exploit the zero-shot ability of CLIP to suppress the confusion caused by non-target classes and backgrounds. Meanwhile, to take full advantage of CLIP, we re-explore text inputs under the WSSS setting and customize two text-driven strategies: sharpness-based prompt selection and synonym fusion. 2) To simplify the stage of CAM refinement, we propose a real-time class-aware attention-based affinity (CAA) module based on the inherent multi-head self-attention (MHSA) in CLIP-ViTs. 3) When training the final segmentation model with the masks generated by CLIP, we introduce a confidence-guided loss (CGL) to focus on confident regions. Our CLIP-ES achieves SOTA performance on Pascal VOC 2012 and MS COCO 2014 while taking only 10% of the time of previous methods for the pseudo mask generation. Code is available at https://github.com/linyq2117/CLIP-ES.

Mask-Free Video Instance Segmentation
Ke, Lei and Danelljan, Martin and Ding, Henghui and Tai, Yu-Wing and Tang, Chi-Keung and Yu, Fisher



Research question: Progress in video instance segmentation (VIS) has been driven mainly by deep, data-hungry transformer models, but video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets.
Motivation: This work aims to remove the mask-annotation requirement: MaskFreeVIS achieves highly competitive VIS performance using only bounding-box annotations.
Method: A Temporal KNN-patch Loss (TK-Loss) leverages the rich temporal mask consistency constraints in videos, providing strong mask supervision without any labels.
Results: MaskFreeVIS is validated on the YouTube-VIS 2019/2021, OVIS, and BDD100K MOTS benchmarks; the results clearly demonstrate its efficacy, drastically narrowing the gap between fully and weakly supervised VIS performance.

The recent advancement in Video Instance Segmentation (VIS) has largely been driven by the use of deeper and increasingly data-hungry transformer-based models. However, video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets. In this work, we aim to remove the mask-annotation requirement. We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state. We leverage the rich temporal mask consistency constraints in videos by introducing the Temporal KNN-patch Loss (TK-Loss), providing strong mask supervision without any labels. Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection. A consistency loss is then enforced on the found matches. Our mask-free objective is simple to implement, has no trainable parameters, is computationally efficient, yet outperforms baselines employing, e.g., state-of-the-art optical flow to enforce temporal mask consistency. We validate MaskFreeVIS on the YouTube-VIS 2019/2021, OVIS and BDD100K MOTS benchmarks. The results clearly demonstrate the efficacy of our method by drastically narrowing the gap between fully and weakly-supervised VIS performance. Our code and trained models are available at http://vis.xyz/pub/maskfreevis.
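The TK-Loss recipe above, patch matching across frames, K-nearest-neighbor selection, then a consistency loss on the matches, can be sketched with brute-force L2 matching. The descriptor choice, function names, and squared-error consistency term are assumptions for illustration; the actual loss uses an efficient patch-matching step:

```python
import numpy as np

def tk_matches(feats_t, feats_t1, k=3):
    """One-to-many temporal matches (TK-Loss-style sketch).

    feats_t, feats_t1: (N, D) patch descriptors from frames t and t+1.
    For each patch in frame t, keep the indices of its k nearest
    patches in frame t+1.
    """
    d = np.linalg.norm(feats_t[:, None, :] - feats_t1[None, :, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k]          # (N, k) neighbor indices

def tk_consistency(mask_t, mask_t1, matches):
    """Mean squared disagreement of mask probabilities over the matches."""
    return float(np.mean((mask_t[:, None] - mask_t1[matches]) ** 2))
```

Note the objective has no trainable parameters of its own: gradients flow only through the mask predictions, pushing matched patches toward the same foreground/background decision.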

Continual Detection Transformer for Incremental Object Detection
Liu, Yaoyao and Schiele, Bernt and Vedaldi, Andrea and Rupprecht, Christian



Research question: How to train an incremental object detector that learns new object categories in phases while avoiding catastrophic forgetting.
Motivation: Techniques such as knowledge distillation and exemplar replay do not work well when applied directly to state-of-the-art transformer-based detectors like Deformable DETR and UP-DETR.
Method: A new transformer-based incremental detection method, ContinuaL DEtection TRansformer (CL-DETR), makes effective use of knowledge distillation and exemplar replay. A Detector Knowledge Distillation (DKD) loss focuses on the most informative and reliable predictions of the old model, ignores redundant background predictions, and stays compatible with the available ground-truth labels; exemplar replay is improved with a calibration strategy that preserves the training set's label distribution, better matching training and testing statistics.
Results: Extensive experiments on COCO 2017 show that CL-DETR achieves state-of-the-art results in the incremental object detection setting.

Incremental object detection (IOD) aims to train an object detector in phases, each with annotations for new object categories. As in other incremental settings, IOD is subject to catastrophic forgetting, which is often addressed by techniques such as knowledge distillation (KD) and exemplar replay (ER). However, KD and ER do not work well if applied directly to state-of-the-art transformer-based object detectors such as Deformable DETR and UP-DETR. In this paper, we solve these issues by proposing a ContinuaL DEtection TRansformer (CL-DETR), a new method for transformer-based IOD which enables effective usage of KD and ER in this context. First, we introduce a Detector Knowledge Distillation (DKD) loss, focusing on the most informative and reliable predictions from old versions of the model, ignoring redundant background predictions, and ensuring compatibility with the available ground-truth labels. We also improve ER by proposing a calibration strategy to preserve the label distribution of the training set, therefore better matching training and testing statistics. We conduct extensive experiments on COCO 2017 and demonstrate that CL-DETR achieves state-of-the-art results in the IOD setting.

HyperMatch: Noise-Tolerant Semi-Supervised Learning via Relaxed Contrastive Constraint
Zhou, Beitong and Lu, Jing and Liu, Kerui and Xu, Yunlu and Cheng, Zhanzhan and Niu, Yi



Research question: Existing semi-supervised learning (SSL) methods suffer from mismatched instance pairs caused by inaccurate pseudo labels, which exacerbates SSL's well-known confirmation bias.
Motivation: To address this, a new SSL method, HyperMatch, is proposed that tolerates noise and makes effective use of unlabeled data.
Method: Confidence predictions are combined with semantic similarities to generate a more objective class distribution, and a Gaussian Mixture Model divides pseudo labels into "confident" and "less confident" subsets. A Relaxed Contrastive Loss assigns "less confident" samples to a hyper-class (the union of the top-K nearest classes), effectively regularizing the interference of incorrect pseudo labels and even increasing the probability of pulling a "less confident" sample toward its true class.
Results: Experiments and in-depth studies show HyperMatch delivers remarkable state-of-the-art performance, outperforming FixMatch on CIFAR100 with 400 and 2500 labeled samples by 11.86% and 4.88%, respectively.

Recent developments of the application of Contrastive Learning in Semi-Supervised Learning (SSL) have demonstrated significant advancements, as a result of its exceptional ability to learn class-aware cluster representations and the full exploitation of massive unlabeled data. However, mismatched instance pairs caused by inaccurate pseudo labels would assign an unlabeled instance to the incorrect class in feature space, hence exacerbating SSL's renowned confirmation bias. To address this issue, we introduced a novel SSL approach, HyperMatch, which is a plug-in to several SSL designs enabling noise-tolerant utilization of unlabeled data. In particular, confidence predictions are combined with semantic similarities to generate a more objective class distribution, followed by a Gaussian Mixture Model to divide pseudo labels into a 'confident' and a 'less confident' subset. Then, we introduce Relaxed Contrastive Loss by assigning the 'less-confident' samples to a hyper-class, i.e. the union of top-K nearest classes, which effectively regularizes the interference of incorrect pseudo labels and even increases the probability of pulling a 'less confident' sample close to its true class. Experiments and in-depth studies demonstrate that HyperMatch delivers remarkable state-of-the-art performance, outperforming FixMatch on CIFAR100 with 400 and 2500 labeled samples by 11.86% and 4.88%, respectively.
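The hyper-class relaxation for a "less confident" sample, supervise it toward the union of its top-K nearest classes rather than one possibly wrong class, can be illustrated with a classification-style loss on probabilities. This is a deliberate simplification (names and the -log-sum form are assumptions); the actual HyperMatch loss is contrastive over embeddings:

```python
import numpy as np

def relaxed_loss(probs, class_dist, k=3):
    """Hyper-class relaxed supervision for one sample (illustrative sketch).

    probs:      (C,) model class probabilities for the sample.
    class_dist: (C,) estimated class distribution (confidence fused with
                semantic similarity); its top-k classes form the hyper-class.
    The sample is only asked to land somewhere inside the hyper-class:
    the loss is -log of the summed probability over those classes.
    """
    hyper = np.argsort(class_dist)[-k:]          # top-k nearest classes
    return float(-np.log(np.sum(probs[hyper]) + 1e-12))
```

Widening k relaxes the constraint: an incorrect argmax pseudo label stops dominating the gradient as long as the true class is inside the hyper-class.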

Mask-Free OVIS: Open-Vocabulary Instance Segmentation Without Manual Mask Annotations
VS, Vibashan and Yu, Ning and Xing, Chen and Qin, Can and Gao, Mingfei and Niebles, Juan Carlos and Patel, Vishal M. and Xu, Ran



Research question: Existing instance segmentation models learn task-specific information from manual mask annotations on base categories, which requires tremendous human effort and limits scalability to novel categories.
Motivation: Open-vocabulary (OV) methods leverage large-scale image-caption pairs and vision-language models to learn novel categories, but the gap between strong and weak supervision leads to overfitting on base categories and poor generalization to novel ones.
Method: The proposed Mask-free OVIS pipeline learns both base and novel categories from pseudo-mask annotations generated by a pre-trained vision-language model in a weakly supervised manner: pseudo-masks are produced automatically and then used to supervise an instance segmentation model, freeing the pipeline from labor-intensive instance-level annotations and from overfitting.
Results: Extensive experiments show that training with only pseudo-masks significantly improves mAP on the MS-COCO and OpenImages datasets compared with recent state-of-the-art methods trained with manual masks. Code and models are available at https://vibashan.github.io/ovis-web/.

Existing instance segmentation models learn task-specific information using manual mask annotations from base (training) categories. These mask annotations require tremendous human effort, limiting the scalability to annotate novel (new) categories. To alleviate this problem, Open-Vocabulary (OV) methods leverage large-scale image-caption pairs and vision-language models to learn novel categories. In summary, an OV method learns task-specific information using strong supervision from base annotations and novel category information using weak supervision from image-captions pairs. This difference between strong and weak supervision leads to overfitting on base categories, resulting in poor generalization towards novel categories. In this work, we overcome this issue by learning both base and novel categories from pseudo-mask annotations generated by the vision-language model in a weakly supervised manner using our proposed Mask-free OVIS pipeline. Our method automatically generates pseudo-mask annotations by leveraging the localization ability of a pre-trained vision-language model for objects present in image-caption pairs. The generated pseudo-mask annotations are then used to supervise an instance segmentation model, freeing the entire pipeline from any labour-expensive instance-level annotations and overfitting. Our extensive experiments show that our method trained with just pseudo-masks significantly improves the mAP scores on the MS-COCO dataset and OpenImages dataset compared to the recent state-of-the-art methods trained with manual masks. Codes and models are provided in https://vibashan.github.io/ovis-web/.

Complete-to-Partial 4D Distillation for Self-Supervised Point Cloud Sequence Representation Learning
Zhang, Zhuoyang and Dong, Yuhao and Liu, Yunze and Yi, Li



Research question: How to effectively exploit unlabeled raw data for learning on 4D point cloud sequences.
Motivation: Exhaustively labeled 4D datasets are expensive and laborious to obtain, so investigating how to utilize raw unlabeled data is essential.
Method: We propose a novel 4D self-supervised pre-training method, Complete-to-Partial 4D Distillation, which formulates 4D self-supervised representation learning as a teacher-student knowledge distillation framework in which the student learns useful 4D representations under the teacher's guidance.
Results: Experiments show that this approach significantly outperforms previous pre-training approaches on a wide range of 4D point cloud sequence understanding tasks.

Recent work on 4D point cloud sequences has attracted a lot of attention. However, obtaining exhaustively labeled 4D datasets is often very expensive and laborious, so it is especially important to investigate how to utilize raw unlabeled data. However, most existing self-supervised point cloud representation learning methods only consider geometry from a static snapshot omitting the fact that sequential observations of dynamic scenes could reveal more comprehensive geometric details. To overcome such issues, this paper proposes a new 4D self-supervised pre-training method called Complete-to-Partial 4D Distillation. Our key idea is to formulate 4D self-supervised representation learning as a teacher-student knowledge distillation framework and let the student learn useful 4D representations with the guidance of the teacher. Experiments show that this approach significantly outperforms previous pre-training approaches on a wide range of 4D point cloud sequence understanding tasks. Code is available at: https://github.com/dongyh20/C2P.

Texture-Guided Saliency Distilling for Unsupervised Salient Object Detection
Zhou, Huajun and Qiao, Bo and Yang, Lingxiao and Lai, Jianhuang and Xie, Xiaohua



Research question: How to perform unsupervised salient object detection (USOD) with deep learning, especially in the presence of noisy labels.
Motivation: Existing methods rely mainly on noisy saliency pseudo labels generated by traditional handcrafted methods or pre-trained networks, and ignore the valuable knowledge in hard samples.
Method: We propose a novel USOD method that mines rich and accurate saliency knowledge from both easy and hard samples. First, a Confidence-aware Saliency Distilling (CSD) strategy scores samples conditioned on their confidences, guiding the model to progressively distill saliency knowledge from easy samples to hard ones. Second, a Boundary-aware Texture Matching (BTM) strategy refines the boundaries of noisy labels by matching the textures around the predicted boundaries.
Results: Extensive experiments on RGB, RGB-D, RGB-T, and video SOD benchmarks show that the method achieves state-of-the-art USOD performance.

Deep Learning-based Unsupervised Salient Object Detection (USOD) mainly relies on the noisy saliency pseudo labels that have been generated from traditional handcraft methods or pre-trained networks. To cope with the noisy labels problem, a class of methods focus on only easy samples with reliable labels but ignore valuable knowledge in hard samples. In this paper, we propose a novel USOD method to mine rich and accurate saliency knowledge from both easy and hard samples. First, we propose a Confidence-aware Saliency Distilling (CSD) strategy that scores samples conditioned on samples' confidences, which guides the model to distill saliency knowledge from easy samples to hard samples progressively. Second, we propose a Boundary-aware Texture Matching (BTM) strategy to refine the boundaries of noisy labels by matching the textures around the predicted boundaries. Extensive experiments on RGB, RGB-D, RGB-T, and video SOD benchmarks prove that our method achieves state-of-the-art USOD performance. Code is available at www.github.com/moothes/A2S-v2.

Network-Free, Unsupervised Semantic Segmentation With Synthetic Images
Feng, Qianli and Gadde, Raghudeep and Liao, Wentong and Ramon, Eduard and Martinez, Aleix



Research question: How to produce highly accurate semantic segmentation maps without any additional neural network, layers, manually annotated training data, or supervised training.
Motivation: The observation that the correlation of a set of pixels belonging to the same semantic segment does not change when generating synthetic variants of an image with the style-mixing approach in GANs.
Method: Use GAN inversion to accurately semantically segment synthetic and real photos, and to generate large numbers of training image-segmentation mask pairs for downstream tasks.
Results: Experiments show the method segments accurately and can generate abundant training data for downstream tasks.

We derive a method that yields highly accurate semantic segmentation maps without the use of any additional neural network, layers, manually annotated training data, or supervised training. Our method is based on the observation that the correlation of a set of pixels belonging to the same semantic segment do not change when generating synthetic variants of an image using the style mixing approach in GANs. We show how we can use GAN inversion to accurately semantically segment synthetic and real photos as well as generate large training image-semantic segmentation mask pairs for downstream tasks.

Hierarchical Dense Correlation Distillation for Few-Shot Segmentation
Peng, Bohao and Tian, Zhuotao and Wu, Xiaoyang and Wang, Chengyao and Liu, Shu and Su, Jingyong and Jia, Jiaya



Research question: Few-shot semantic segmentation, i.e., segmenting unseen classes with only a handful of annotations.
Motivation: Existing methods limited to semantic features and prototype representations suffer from coarse segmentation granularity and train-set overfitting.
Method: We design a Hierarchically Decoupled Matching Network (HDMNet) based on the transformer architecture, using self-attention modules to build hierarchical dense features and accomplish cascade matching between query and support features. We further propose a matching module to reduce train-set overfitting, and introduce correlation distillation that leverages semantic correspondence from coarse resolution to boost fine-grained segmentation.
Results: The method performs well in experiments, achieving 50.0% mIoU on the COCO-5i dataset in the one-shot setting and 56.0% in the five-shot setting. Code is available on the project website.

Few-shot semantic segmentation (FSS) aims to form class-agnostic models segmenting unseen classes with only a handful of annotations. Previous methods limited to the semantic feature and prototype representation suffer from coarse segmentation granularity and train-set overfitting. In this work, we design Hierarchically Decoupled Matching Network (HDMNet) mining pixel-level support correlation based on the transformer architecture. The self-attention modules are used to assist in establishing hierarchical dense features, as a means to accomplish the cascade matching between query and support features. Moreover, we propose a matching module to reduce train-set overfitting and introduce correlation distillation leveraging semantic correspondence from coarse resolution to boost fine-grained segmentation. Our method performs decently in experiments. We achieve 50.0% mIoU on COCO-5i dataset one-shot setting and 56.0% on five-shot segmentation, respectively. The code is available on the project website.

PVO: Panoptic Visual Odometry
Ye, Weicai and Lan, Xinyue and Chen, Shuo and Ming, Yuhang and Yu, Xingyuan and Bao, Hujun and Cui, Zhaopeng and Zhang, Guofeng



Research question: This paper presents PVO, a novel panoptic visual odometry framework, for more comprehensive modeling of scene motion, geometry, and panoptic segmentation information.
Motivation: Current visual odometry and video panoptic segmentation methods cannot fully exploit each other's information, which limits their performance.
Method: PVO models visual odometry (VO) and video panoptic segmentation (VPS) in a unified view so the two tasks benefit each other. Specifically, guided by image panoptic segmentation, a panoptic update module is introduced into the VO module, using a panoptic-aware dynamic mask to alleviate the impact of dynamic objects on camera pose estimation. Meanwhile, the VO-enhanced VPS module improves segmentation accuracy by fusing the panoptic segmentation result of the current frame into adjacent frames, using geometric information such as camera pose, depth, and optical flow obtained from the VO module. The two modules contribute to each other through recurrent iterative optimization.
Results: Experiments show PVO outperforms state-of-the-art methods on both visual odometry and video panoptic segmentation tasks.

We present PVO, a novel panoptic visual odometry framework to achieve more comprehensive modeling of the scene motion, geometry, and panoptic segmentation information. Our PVO models visual odometry (VO) and video panoptic segmentation (VPS) in a unified view, which makes the two tasks mutually beneficial. Specifically, we introduce a panoptic update module into the VO Module with the guidance of image panoptic segmentation. This Panoptic-Enhanced VO Module can alleviate the impact of dynamic objects in the camera pose estimation with a panoptic-aware dynamic mask. On the other hand, the VO-Enhanced VPS Module also improves the segmentation accuracy by fusing the panoptic segmentation result of the current frame on the fly to the adjacent frames, using geometric information such as camera pose, depth, and optical flow obtained from the VO Module. These two modules contribute to each other through recurrent iterative optimization. Extensive experiments demonstrate that PVO outperforms state-of-the-art methods in both visual odometry and video panoptic segmentation tasks.

ISBNet: A 3D Point Cloud Instance Segmentation Network With Instance-Aware Sampling and Box-Aware Dynamic Convolution
Ngo, Tuan Duc and Hua, Binh-Son and Nguyen, Khoi



Research question: Existing 3D instance segmentation methods are dominated by bottom-up designs, i.e., manually fine-tuned algorithms that group points into clusters followed by a refinement network. Such methods yield unstable results when nearby objects of the same semantic class are packed together, or when large objects have loosely connected regions.
Motivation: To address these limitations, we propose ISBNet, a novel cluster-free method that represents instances as kernels and decodes instance masks via dynamic convolution.
Method: We propose a simple strategy, Instance-aware Farthest Point Sampling, to generate high-recall and discriminative kernels, and leverage a local aggregation layer inspired by PointNet++ to encode candidate features. We further show that predicting and leveraging 3D axis-aligned bounding boxes in the dynamic convolution boosts performance.
Results: ISBNet sets new state-of-the-art AP results on ScanNetV2 (55.9), S3DIS (60.8), and STPLS3D (49.2), while retaining fast inference (237 ms per scene on ScanNetV2).

Existing 3D instance segmentation methods are predominated by the bottom-up design -- manually fine-tuned algorithm to group points into clusters followed by a refinement network. However, by relying on the quality of the clusters, these methods generate susceptible results when (1) nearby objects with the same semantic class are packed together, or (2) large objects with loosely connected regions. To address these limitations, we introduce ISBNet, a novel cluster-free method that represents instances as kernels and decodes instance masks via dynamic convolution. To efficiently generate high-recall and discriminative kernels, we propose a simple strategy named Instance-aware Farthest Point Sampling to sample candidates and leverage the local aggregation layer inspired by PointNet++ to encode candidate features. Moreover, we show that predicting and leveraging the 3D axis-aligned bounding boxes in the dynamic convolution further boosts performance. Our method set new state-of-the-art results on ScanNetV2 (55.9), S3DIS (60.8), and STPLS3D (49.2) in terms of AP and retains fast inference time (237ms per scene on ScanNetV2). The source code and trained models are available at https://github.com/VinAIResearch/ISBNet.
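Plain farthest point sampling, the base of the paper's instance-aware variant, can be sketched as follows. This is the classical greedy algorithm, not the instance-aware version, which additionally biases candidate selection toward distinct instances.

```python
import numpy as np

def farthest_point_sampling(points, num_samples):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set.

    points: (N, 3) array of coordinates.
    num_samples: number of candidate points to keep.
    Returns the indices of the sampled points.
    """
    n = points.shape[0]
    chosen = [0]                         # start from an arbitrary point
    dist = np.full(n, np.inf)            # distance of each point to the chosen set
    for _ in range(num_samples - 1):
        # update with distances to the most recently chosen sample
        d = np.linalg.norm(points - points[chosen[-1]], axis=1)
        dist = np.minimum(dist, d)
        chosen.append(int(np.argmax(dist)))
    return chosen

# two pairs of near-duplicate points: FPS picks one from each pair
pts = np.array([[0.0, 0, 0], [0.1, 0, 0], [5.0, 0, 0], [5.1, 0, 0]])
idx = farthest_point_sampling(pts, 2)
```

Because FPS maximizes spatial spread, near-duplicate candidates are avoided, which is exactly why it is a natural starting point for sampling high-recall instance kernels.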

CLIP-S4: Language-Guided Self-Supervised Semantic Segmentation
He, Wenbin and Jamonnak, Suphanut and Gou, Liang and Ren, Liu



Research question: Existing semantic segmentation approaches are often limited by costly pixel-wise annotations and predefined classes; this paper proposes a new approach to address these problems.
Motivation: Leverage self-supervised pixel representation learning and vision-language models to enable various semantic segmentation tasks without any human annotations or unknown class information.
Method: First, we learn pixel embeddings with pixel-segment contrastive learning from different augmented views of images. Then, to further improve the pixel embeddings and enable language-driven semantic segmentation, we design two types of consistency guidance: embedding consistency, which aligns our pixel embeddings to the joint feature space of the pre-trained vision-language model CLIP; and semantic consistency, which forces our model to make the same predictions as CLIP over a set of carefully designed target classes with both known and unknown prototypes.
Results: Experiments show consistent and substantial performance improvements over state-of-the-art unsupervised and language-driven semantic segmentation methods on four popular benchmarks. More importantly, our method outperforms these methods on unknown class recognition by a large margin.

Existing semantic segmentation approaches are often limited by costly pixel-wise annotations and predefined classes. In this work, we present CLIP-S^4 that leverages self-supervised pixel representation learning and vision-language models to enable various semantic segmentation tasks (e.g., unsupervised, transfer learning, language-driven segmentation) without any human annotations and unknown class information. We first learn pixel embeddings with pixel-segment contrastive learning from different augmented views of images. To further improve the pixel embeddings and enable language-driven semantic segmentation, we design two types of consistency guided by vision-language models: 1) embedding consistency, aligning our pixel embeddings to the joint feature space of a pre-trained vision-language model, CLIP; and 2) semantic consistency, forcing our model to make the same predictions as CLIP over a set of carefully designed target classes with both known and unknown prototypes. Thus, CLIP-S^4 enables a new task of class-free semantic segmentation where no unknown class information is needed during training. As a result, our approach shows consistent and substantial performance improvement over four popular benchmarks compared with the state-of-the-art unsupervised and language-driven semantic segmentation methods. More importantly, our method outperforms these methods on unknown class recognition by a large margin.

Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation
Li, Liulei and Wang, Wenguan and Zhou, Tianfei and Li, Jianwu and Yang, Yi



Research question: Self-supervised learning of video object segmentation.
Motivation: Existing methods usually rely on pixel-wise correlations to cheaply "copy" labels, whereas our approach directly learns to perform mask-guided sequential segmentation from unlabeled videos.
Method: We develop a unified framework that simultaneously models cross-frame dense correspondence for locally discriminative feature learning and embeds object-level context for target-mask decoding. Concretely, the algorithm alternates between i) clustering video pixels to create pseudo segmentation labels, and ii) using the pseudo labels to learn mask encoding and decoding for VOS. Unsupervised correspondence learning is further incorporated into this self-taught mask-embedding scheme to ensure the generic nature of the learned representation and avoid cluster degeneracy.
Results: Experiments show our method sets the state of the art on two standard benchmarks (DAVIS17 and YouTube-VOS), narrowing the gap between self- and fully-supervised VOS in terms of both performance and network architecture design. Our full code will be released.

The objective of this paper is self-supervised learning of video object segmentation. We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning and embeds object-level context for target-mask decoding. As a result, it is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos, in contrast to previous efforts usually relying on an oblique solution --- cheaply "copying" labels according to pixel-wise correlations. Concretely, our algorithm alternates between i) clustering video pixels for creating pseudo segmentation labels ex nihilo; and ii) utilizing the pseudo labels to learn mask encoding and decoding for VOS. Unsupervised correspondence learning is further incorporated into this self-taught, mask embedding scheme, so as to ensure the generic nature of the learnt representation and avoid cluster degeneracy. Our algorithm sets state-of-the-arts on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS), narrowing the gap between self- and fully-supervised VOS, in terms of both performance and network architecture design. Our full code will be released.

Knowledge Combination To Learn Rotated Detection Without Rotated Annotation
Zhu, Tianyu and Ferenczi, Bryce and Purkait, Pulak and Drummond, Tom and Rezatofighi, Hamid and van den Hengel, Anton



Research question: How to predict precise rotated boxes from only cheaper axis-aligned annotations, so that rotated detectors see wider adoption.
Motivation: Rotated bounding boxes drastically reduce output ambiguity for elongated objects, yet they are not widely adopted because the annotation process is laborious.
Method: We propose a framework that exploits the neural network's rich representation of the target domain, combining task knowledge from a source dataset with stronger annotation and domain knowledge from the target dataset with weaker annotation. A novel assignment process and a projection loss enable co-training, allowing the model to solve the more detailed task in the target domain without additional computation overhead.
Results: Experiments show the method performs on par with fully supervised approaches on various target datasets, demonstrating its effectiveness.

Rotated bounding boxes drastically reduce output ambiguity of elongated objects, making it superior to axis-aligned bounding boxes. Despite the effectiveness, rotated detectors are not widely employed. Annotating rotated bounding boxes is such a laborious process that they are not provided in many detection datasets where axis-aligned annotations are used instead. In this paper, we propose a framework that allows the model to predict precise rotated boxes only requiring cheaper axis-aligned annotation of the target dataset. To achieve this, we leverage the fact that neural networks are capable of learning richer representation of the target domain than what is utilized by the task. The under-utilized representation can be exploited to address a more detailed task. Our framework combines task knowledge of an out-of-domain source dataset with stronger annotation and domain knowledge of the target dataset with weaker annotation. A novel assignment process and projection loss are used to enable the co-training on the source and target datasets. As a result, the model is able to solve the more detailed task in the target domain, without additional computation overhead during inference. We extensively evaluate the method on various target datasets including fresh-produce dataset, HRSC2016 and SSDD. Results show that the proposed method consistently performs on par with the fully supervised approach.
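The key geometric fact behind a projection-style loss is that a predicted rotated box can be projected onto the axes and compared against an axis-aligned annotation. A minimal sketch of that projection follows; the function names are illustrative and this is not the paper's loss implementation.

```python
import math

def rotated_box_corners(cx, cy, w, h, theta):
    """Corner coordinates of a rotated box (center, size, angle in radians)."""
    c, s = math.cos(theta), math.sin(theta)
    corners = []
    for dx, dy in [(-w/2, -h/2), (w/2, -h/2), (w/2, h/2), (-w/2, h/2)]:
        corners.append((cx + dx * c - dy * s, cy + dx * s + dy * c))
    return corners

def axis_aligned_hull(cx, cy, w, h, theta):
    """Axis-aligned box enclosing the rotated box -- the quantity that can
    be matched against an axis-aligned annotation in a projection loss."""
    xs, ys = zip(*rotated_box_corners(cx, cy, w, h, theta))
    return min(xs), min(ys), max(xs), max(ys)

# a 4x2 box rotated by 90 degrees projects to a 2x4 axis-aligned hull
hull = axis_aligned_hull(0.0, 0.0, 4.0, 2.0, math.pi / 2)
```

A training loss could then penalize the discrepancy between this hull and the axis-aligned ground-truth box, giving the rotated prediction a supervision signal from the weaker annotation.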

Contrastive Grouping With Transformer for Referring Image Segmentation
Tang, Jiajin and Zheng, Ge and Shi, Cheng and Yang, Sibei



Research question: Referring image segmentation, i.e., segmenting the target referent in an image conditioned on a natural language expression.
Motivation: Existing one-stage methods employ per-pixel classification frameworks that attempt to align vision and language directly at the pixel level, failing to capture critical object-level information.
Method: We propose a mask classification framework, CGFormer, which explicitly captures object-level information via a token-based querying and grouping strategy. Specifically, CGFormer first introduces learnable query tokens to represent objects, then alternately queries linguistic features and groups visual features into the query tokens for object-aware cross-modal reasoning. In addition, CGFormer achieves cross-level interaction by jointly updating the query tokens and decoding masks in every two consecutive layers. Finally, CGFormer combines contrastive learning with the grouping strategy to identify the token and mask corresponding to the referent.
Results: Experiments show CGFormer consistently and significantly outperforms state-of-the-art methods in both segmentation and generalization settings. Code is available at https://github.com/Toneyaya/CGFormer.

Referring image segmentation aims to segment the target referent in an image conditioning on a natural language expression. Existing one-stage methods employ per-pixel classification frameworks, which attempt straightforwardly to align vision and language at the pixel level, thus failing to capture critical object-level information. In this paper, we propose a mask classification framework, Contrastive Grouping with Transformer network (CGFormer), which explicitly captures object-level information via token-based querying and grouping strategy. Specifically, CGFormer first introduces learnable query tokens to represent objects and then alternately queries linguistic features and groups visual features into the query tokens for object-aware cross-modal reasoning. In addition, CGFormer achieves cross-level interaction by jointly updating the query tokens and decoding masks in every two consecutive layers. Finally, CGFormer cooperates contrastive learning to the grouping strategy to identify the token and its mask corresponding to the referent. Experimental results demonstrate that CGFormer outperforms state-of-the-art methods in both segmentation and generalization settings consistently and significantly. Code is available at https://github.com/Toneyaya/CGFormer.

Cascade Evidential Learning for Open-World Weakly-Supervised Temporal Action Localization
Chen, Mengyuan and Gao, Junyu and Xu, Changsheng



Research question: How to recognize and localize action instances using only video-level labels during training, particularly in a dynamically changing open world where unknown actions constantly emerge.
Motivation: Existing weakly-supervised temporal action localization (WTAL) methods rest on a closed-set assumption that is invalid in the open world. Moreover, unlike traditional open-set recognition tasks, annotations of unknown samples are unavailable, and fine-grained annotations of known action instances can only be inferred ambiguously from video category labels.
Method: We propose, for the first time, a Cascade Evidential Learning framework targeting open-world weakly-supervised temporal action localization (OWTAL). The method jointly leverages multi-scale temporal contexts and knowledge-guided prototype information to progressively collect cascade and enhanced evidence for separating known actions, unknown actions, and background.
Results: Extensive experiments on THUMOS-14 and ActivityNet-v1.3 verify the effectiveness of our method. Besides the classification metrics adopted by previous open-set recognition methods, we also evaluate on localization metrics, which are more appropriate for OWTAL.

Targeting at recognizing and localizing action instances with only video-level labels during training, Weakly-supervised Temporal Action Localization (WTAL) has achieved significant progress in recent years. However, living in the dynamically changing open world where unknown actions constantly spring up, the closed-set assumption of existing WTAL methods is invalid. Compared with traditional open-set recognition tasks, Open-world WTAL (OWTAL) is challenging since not only are the annotations of unknown samples unavailable, but also the fine-grained annotations of known action instances can only be inferred ambiguously from the video category labels. To address this problem, we propose a Cascade Evidential Learning framework at an evidence level, which targets at OWTAL for the first time. Our method jointly leverages multi-scale temporal contexts and knowledge-guided prototype information to progressively collect cascade and enhanced evidence for known action, unknown action, and background separation. Extensive experiments conducted on THUMOS-14 and ActivityNet-v1.3 verify the effectiveness of our method. Besides the classification metrics adopted by previous open-set recognition methods, we also evaluate our method on localization metrics which are more reasonable for OWTAL.

Reducing the Label Bias for Timestamp Supervised Temporal Action Segmentation
Liu, Kaiyuan and Li, Yunheng and Liu, Shenglan and Tan, Chenwei and Shao, Zihang



Research question: This paper addresses the severe label bias in timestamp-supervised temporal action segmentation.
Motivation: Over-reliance on sparse timestamp annotations causes severe label bias in existing methods, resulting in unsatisfactory performance.
Method: We propose the Debiasing-TSTAS (D-TSTAS) framework, which exploits unannotated frames to alleviate this bias in two phases. 1) Initialization: to reduce dependence on annotated frames, masked timestamp prediction (MTP) ensures the initialized model captures more contextual information. 2) Refinement: to overcome the limited expressiveness of sparsely annotated timestamps, a center-oriented timestamp expansion (CTE) approach progressively expands pseudo-timestamp groups containing semantic-rich motion representations of action segments. These pseudo-timestamp groups and the model output are then used to iteratively generate pseudo-labels that refine the model in a fully supervised setup. We further introduce a segmental confidence loss that encourages high-confidence predictions within the pseudo-timestamp groups and more accurate action boundaries.
Results: Our D-TSTAS outperforms the state-of-the-art TSTAS method and achieves competitive results against fully supervised approaches on three benchmark datasets.

Timestamp supervised temporal action segmentation (TSTAS) is more cost-effective than fully supervised counterparts. However, previous approaches suffer from severe label bias due to over-reliance on sparse timestamp annotations, resulting in unsatisfactory performance. In this paper, we propose the Debiasing-TSTAS (D-TSTAS) framework by exploiting unannotated frames to alleviate this bias from two phases: 1) Initialization. To reduce the dependencies on annotated frames, we propose masked timestamp predictions (MTP) to ensure that initialized model captures more contextual information. 2) Refinement. To overcome the limitation of the expressiveness from sparsely annotated timestamps, we propose a center-oriented timestamp expansion (CTE) approach to progressively expand pseudo-timestamp groups which contain semantic-rich motion representation of action segments. Then, these pseudo-timestamp groups and the model output are used to iteratively generate pseudo-labels for refining the model in a fully supervised setup. We further introduce segmental confidence loss to enable the model to have high confidence predictions within the pseudo-timestamp groups and more accurate action boundaries. Our D-TSTAS outperforms the state-of-the-art TSTAS method as well as achieves competitive results compared with fully supervised approaches on three benchmark datasets.

SimpSON: Simplifying Photo Cleanup With Single-Click Distracting Object Segmentation Network
Huynh, Chuong and Zhou, Yuqian and Lin, Zhe and Barnes, Connelly and Shechtman, Eli and Amirghodsi, Sohrab and Shrivastava, Abhinav



Research question: How to effectively select and remove visual distractors in an image with a single click, improving image quality and highlighting the primary subject.
Motivation: Manually selecting and removing small, dense distracting regions is laborious and time-consuming, so an interactive method is needed to simplify the process.
Method: We propose an interactive distractor selection method optimized to accomplish the task with just a single click. It surpasses the traditional approach of running panoptic segmentation and then selecting the segments containing the click. We also show how a transformer-based module can be used to identify more distracting regions similar to the user's click position.
Results: Experiments demonstrate that the model can effectively and accurately segment unknown distracting objects interactively and in groups. By significantly simplifying photo cleanup and retouching, the proposed model provides inspiration for exploring rare-object segmentation and group selection with a single click.

In photo editing, it is common practice to remove visual distractions to improve the overall image quality and highlight the primary subject. However, manually selecting and removing these small and dense distracting regions can be a laborious and time-consuming task. In this paper, we propose an interactive distractor selection method that is optimized to achieve the task with just a single click. Our method surpasses the precision and recall achieved by the traditional method of running panoptic segmentation and then selecting the segments containing the clicks. We also showcase how a transformer-based module can be used to identify more distracting regions similar to the user's click position. Our experiments demonstrate that the model can effectively and accurately segment unknown distracting objects interactively and in groups. By significantly simplifying the photo cleaning and retouching process, our proposed model provides inspiration for exploring rare object segmentation and group selection with a single click.

Discriminating Known From Unknown Objects via Structure-Enhanced Recurrent Variational AutoEncoder
Wu, Aming and Deng, Cheng



Research question: How to leverage known in-distribution data to improve a model's ability to discriminate unknown objects.
Motivation: To simulate the human ability to distinguish known from unknown objects, an unsupervised out-of-distribution object detection task is proposed, which helps promote the safe deployment of object detectors.
Method: We propose Structure-Enhanced Recurrent Variational AutoEncoder (SR-VAE), which mainly consists of two dedicated recurrent VAE branches. The classical Laplacian of Gaussian (LoG) operator is used to enhance the structure information in the extracted low-level features, boosting object localization, while a VAE branch that recurrently generates augmentations of the classification features strengthens the discrimination ability of the object classifier. To alleviate the impact of lacking unknown data, a cycle-consistent conditional VAE branch is further proposed to synthesize virtual out-of-distribution features that deviate from the in-distribution feature distribution, improving the capability of distinguishing OOD objects.
Results: In experiments on OOD object detection, open-vocabulary detection, and incremental object detection, the method significantly outperforms baselines, showing its superiority. The code will be released at https://github.com/AmingWu/SR-VAE.

Discriminating known from unknown objects is an important essential ability for human beings. To simulate this ability, a task of unsupervised out-of-distribution object detection (OOD-OD) is proposed to detect the objects that are never-seen-before during model training, which is beneficial for promoting the safe deployment of object detectors. Due to lacking unknown data for supervision, for this task, the main challenge lies in how to leverage the known in-distribution (ID) data to improve the detector's discrimination ability. In this paper, we first propose a method of Structure-Enhanced Recurrent Variational AutoEncoder (SR-VAE), which mainly consists of two dedicated recurrent VAE branches. Specifically, to boost the performance of object localization, we explore utilizing the classical Laplacian of Gaussian (LoG) operator to enhance the structure information in the extracted low-level features. Meanwhile, we design a VAE branch that recurrently generates the augmentation of the classification features to strengthen the discrimination ability of the object classifier. Finally, to alleviate the impact of lacking unknown data, another cycle-consistent conditional VAE branch is proposed to synthesize virtual OOD features that deviate from the distribution of ID features, which improves the capability of distinguishing OOD objects. In the experiments, our method is evaluated on OOD-OD, open-vocabulary detection, and incremental object detection. The significant performance gains over baselines show the superiorities of our method. The code will be released at https://github.com/AmingWu/SR-VAE.

Fuzzy Positive Learning for Semi-Supervised Semantic Segmentation
Qiao, Pengchong and Wei, Zhidan and Wang, Yu and Wang, Zhennan and Song, Guoli and Xu, Fan and Ji, Xiangyang and Liu, Chang and Chen, Jie



Research question: This paper addresses semi-supervised learning's dependence on human annotations and the interference caused by wrong pseudo labels.
Motivation: Fully exploit the informative semantics in multiple probably correct candidate labels, so as to reduce reliance on human annotations and mitigate the impact of wrong labels.
Method: We propose Fuzzy Positive Learning (FPL), consisting of fuzzy positive assignment (FPA), which provides an adaptive number of labels for each pixel, and fuzzy positive regularization (FPR), which restricts the predictions of fuzzy positive categories to be larger than the rest under different perturbations.
Results: Experiments on Cityscapes and VOC 2012 show the method remarkably alleviates interference from wrong pseudo labels and progressively achieves clear pixel-level semantic discrimination.

Semi-supervised learning (SSL) essentially pursues class boundary exploration with less dependence on human annotations. Although typical attempts focus on ameliorating the inevitable error-prone pseudo-labeling, we think differently and resort to exhausting informative semantics from multiple probably correct candidate labels. In this paper, we introduce Fuzzy Positive Learning (FPL) for accurate SSL semantic segmentation in a plug-and-play fashion, targeting adaptively encouraging fuzzy positive predictions and suppressing highly-probable negatives. Being conceptually simple yet practically effective, FPL can remarkably alleviate interference from wrong pseudo labels and progressively achieve clear pixel-level semantic discrimination. Concretely, our FPL approach consists of two main components, including fuzzy positive assignment (FPA) to provide an adaptive number of labels for each pixel and fuzzy positive regularization (FPR) to restrict the predictions of fuzzy positive categories to be larger than the rest under different perturbations. Theoretical analysis and extensive experiments on Cityscapes and VOC 2012 with consistent performance gain justify the superiority of our approach. Codes are available in https://github.com/qpc1611094/FPL.
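The "adaptive number of labels for each pixel" in FPA can be illustrated with a small sketch: keep the smallest set of top-ranked classes whose cumulative probability passes a threshold, so a confident pixel gets one fuzzy positive while an ambiguous pixel gets several. The cumulative-probability rule and threshold are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def fuzzy_positive_assignment(probs, tau=0.9):
    """Smallest set of top-ranked classes whose cumulative probability
    reaches tau (threshold rule is an illustrative assumption).

    probs: 1-D array of predicted class probabilities for one pixel.
    """
    order = np.argsort(probs)[::-1]          # classes, most likely first
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, tau)) + 1   # first prefix reaching tau
    return set(order[:k].tolist())

confident = fuzzy_positive_assignment(np.array([0.95, 0.03, 0.02]))
ambiguous = fuzzy_positive_assignment(np.array([0.6, 0.35, 0.05]))
```

A regularizer in the spirit of FPR would then push the predictions of every class in this fuzzy positive set above the remaining classes under different perturbations.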

Improving Weakly Supervised Temporal Action Localization by Bridging Train-Test Gap in Pseudo Labels
Zhou, Jingqiu and Huang, Linjiang and Wang, Liang and Liu, Si and Li, Hongsheng



Research question: This paper addresses weakly supervised temporal action localization, i.e., generating temporal boundaries for actions of interest while also classifying their categories.
Motivation: Existing pseudo-label methods use different pipelines or settings during training and testing, creating a gap between the two.
Method: We propose to generate high-quality pseudo labels from the predicted action boundaries. A Gaussian weighted fusion module preserves information of action instances and obtains high-quality action boundaries; pseudo-label generation is formulated as an optimization problem constrained by the confidence scores of action instances; and the idea of Delta pseudo labels endows the model with the ability of self-correction.
Results: The method outperforms existing methods on the THUMOS14 and ActivityNet1.3 benchmarks, with average mAP gains of 1.9% and 3.7%, respectively.

The task of weakly supervised temporal action localization targets at generating temporal boundaries for actions of interest, meanwhile the action category should also be classified. Pseudo-label-based methods, which serve as an effective solution, have been widely studied recently. However, existing methods generate pseudo labels during training and make predictions during testing under different pipelines or settings, resulting in a gap between training and testing. In this paper, we propose to generate high-quality pseudo labels from the predicted action boundaries. Nevertheless, we note that existing post-processing, like NMS, would lead to information loss, which is insufficient to generate high-quality action boundaries. More importantly, transforming action boundaries into pseudo labels is quite challenging, since the predicted action instances are generally overlapped and have different confidence scores. Besides, the generated pseudo-labels can be fluctuating and inaccurate at the early stage of training. It might repeatedly strengthen the false predictions if there is no mechanism to conduct self-correction. To tackle these issues, we come up with an effective pipeline for learning better pseudo labels. Firstly, we propose a Gaussian weighted fusion module to preserve information of action instances and obtain high-quality action boundaries. Second, we formulate the pseudo-label generation as an optimization problem under the constraints in terms of the confidence scores of action instances. Finally, we introduce the idea of Delta pseudo labels, which enables the model with the ability of self-correction. Our method achieves superior performance to existing methods on two benchmarks, THUMOS14 and ActivityNet1.3, achieving gains of 1.9% on THUMOS14 and 3.7% on ActivityNet1.3 in terms of average mAP.
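The contrast with NMS-style post-processing can be made concrete: instead of suppressing overlapping proposals, a Gaussian-weighted fusion combines them into one boundary, so no instance information is discarded. This minimal 1-D sketch weights each segment by its score times a Gaussian on its distance to the top-scoring segment's center; the exact weighting scheme is an illustrative assumption, not the paper's module.

```python
import math

def gaussian_weighted_fusion(segments, sigma=1.0):
    """Fuse overlapping action proposals into a single boundary.

    segments: list of (start, end, score) tuples assumed to overlap.
    Near-duplicate proposals contribute to the fused boundary instead of
    being suppressed as NMS would do.
    """
    top = max(segments, key=lambda s: s[2])
    top_center = (top[0] + top[1]) / 2
    wsum = ssum = esum = 0.0
    for start, end, score in segments:
        center = (start + end) / 2
        w = score * math.exp(-((center - top_center) ** 2) / (2 * sigma ** 2))
        wsum += w
        ssum += w * start
        esum += w * end
    return ssum / wsum, esum / wsum

fused = gaussian_weighted_fusion([(10.0, 20.0, 0.9), (11.0, 21.0, 0.8)])
```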

SOOD: Towards Semi-Supervised Oriented Object Detection
Hua, Wei and Liang, Dingkang and Li, Jingyu and Liu, Xiaolong and Zou, Zhikang and Ye, Xiaoqing and Bai, Xiang



Research question: Existing semi-supervised object detection methods mainly focus on horizontal objects, leaving the multi-oriented objects common in aerial images unexplored.
Motivation: To exploit unlabeled data for boosting object detectors, this paper proposes a novel semi-supervised oriented object detection model, SOOD.
Method: SOOD is built upon the mainstream pseudo-labeling framework, with two loss functions designed to provide better supervision. The first loss regularizes the consistency of each pseudo-label-prediction pair with adaptive weights based on their orientation gap; the second regularizes similarity by explicitly building a many-to-many relation between the sets of pseudo-labels and predictions, taking the image layout into account as a global consistency constraint.
Results: Experiments show that, trained with the two proposed losses, SOOD surpasses state-of-the-art semi-supervised object detection methods under various settings on the DOTA-v1.5 benchmark.

Semi-Supervised Object Detection (SSOD), aiming to explore unlabeled data for boosting object detectors, has become an active task in recent years. However, existing SSOD approaches mainly focus on horizontal objects, leaving multi-oriented objects that are common in aerial images unexplored. This paper proposes a novel Semi-supervised Oriented Object Detection model, termed SOOD, built upon the mainstream pseudo-labeling framework. Towards oriented objects in aerial scenes, we design two loss functions to provide better supervision. Focusing on the orientations of objects, the first loss regularizes the consistency between each pseudo-label-prediction pair (includes a prediction and its corresponding pseudo label) with adaptive weights based on their orientation gap. Focusing on the layout of an image, the second loss regularizes the similarity and explicitly builds the many-to-many relation between the sets of pseudo-labels and predictions. Such a global consistency constraint can further boost semi-supervised learning. Our experiments show that when trained with the two proposed losses, SOOD surpasses the state-of-the-art SSOD methods under various settings on the DOTA-v1.5 benchmark. The code will be available at https://github.com/HamPerdredes/SOOD.
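The first loss's "adaptive weights based on their orientation gap" can be sketched as a mapping from angle disagreement to a penalty weight: pairs that disagree more on orientation receive a larger consistency weight. The angle-folding convention and the linear mapping below are illustrative assumptions, not SOOD's actual formulation.

```python
import math

def orientation_gap(theta_pred, theta_pseudo):
    """Smallest absolute angular difference, folded into [0, pi/2] since a
    bounding box is symmetric under 180-degree rotation (convention
    assumed here)."""
    d = abs(theta_pred - theta_pseudo) % math.pi
    return min(d, math.pi - d)

def adaptive_weight(theta_pred, theta_pseudo):
    """Map the orientation gap to a weight in [1, 2]: a pair with a larger
    orientation disagreement gets a larger consistency penalty."""
    return 1.0 + orientation_gap(theta_pred, theta_pseudo) / (math.pi / 2)

w_aligned = adaptive_weight(0.3, 0.3)            # no gap -> weight 1
w_rotated = adaptive_weight(0.0, math.pi / 4)    # 45-degree gap
```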

Semi-DETR: Semi-Supervised Object Detection With Detection Transformers
Zhang, Jiacheng and Lin, Xiangru and Zhang, Wei and Wang, Kuo and Tan, Xiao and Han, Junyu and Ding, Errui and Wang, Jingdong and Li, Guanbin



Research question: Existing DETR-based semi-supervised object detection (SSOD) frameworks have two problems: the one-to-one assignment strategy generates incorrect matches when pseudo ground-truth bounding boxes are inaccurate, leading to training inefficiency; and DETR-based detectors lack deterministic correspondence between input queries and prediction outputs, hindering the consistency-based regularization widely used in current SSOD methods.
Motivation: To address these problems, we propose Semi-DETR, a novel transformer-based end-to-end semi-supervised object detector.
Method: We design a Stage-wise Hybrid Matching strategy that combines one-to-one and one-to-many assignment to improve training efficiency in the first stage and provide high-quality pseudo labels for training the second stage. We further introduce a Cross-view Query Consistency method to learn the semantic feature invariance of object queries from different views while avoiding the need to find deterministic query correspondence. Finally, we propose a Cost-based Pseudo Label Mining module that dynamically mines more pseudo boxes for consistency training based on the matching cost of pseudo ground-truth bounding boxes.
Results: Extensive experiments on all SSOD settings of the COCO and Pascal VOC benchmarks show that Semi-DETR outperforms all state-of-the-art methods by clear margins.

We analyze the DETR-based framework on semi-supervised object detection (SSOD) and observe that (1) the one-to-one assignment strategy generates incorrect matching when the pseudo ground-truth bounding box is inaccurate, leading to training inefficiency; (2) DETR-based detectors lack deterministic correspondence between the input query and its prediction output, which hinders the applicability of the consistency-based regularization widely used in current SSOD methods. We present Semi-DETR, the first transformer-based end-to-end semi-supervised object detector, to tackle these problems. Specifically, we propose a Stage-wise Hybrid Matching strategy that combines the one-to-many assignment and one-to-one assignment strategies to improve the training efficiency of the first stage and thus provide high-quality pseudo labels for the training of the second stage. Besides, we introduce a Cross-view Query Consistency method to learn the semantic feature invariance of object queries from different views while avoiding the need to find deterministic query correspondence. Furthermore, we propose a Cost-based Pseudo Label Mining module to dynamically mine more pseudo boxes based on the matching cost of pseudo ground truth bounding boxes for consistency training. Extensive experiments on all SSOD settings of both COCO and Pascal VOC benchmark datasets show that our Semi-DETR method outperforms all state-of-the-art methods by clear margins.

Hierarchical Semantic Contrast for Scene-Aware Video Anomaly Detection
Sun, Shengyang and Gong, Xiaojin



Research question: Increasing scene-awareness is a key challenge in video anomaly detection.
Motivation: Existing video anomaly detection models lack understanding and use of scene context; introducing pre-trained video parsing models and contrastive learning allows semantic information in videos to be captured better.
Method: We propose a hierarchical semantic contrast (HSC) method that builds on an autoencoder-based reconstruction framework, integrates foreground-object and background-scene features with high-level semantics, and applies contrastive learning at both the scene level and the object level to improve the model's discrimination ability.
Results: Experiments show that the method performs well on multiple public datasets and scene-dependent mixture datasets, effectively improving video anomaly detection performance.

Increasing scene-awareness is a key challenge in video anomaly detection (VAD). In this work, we propose a hierarchical semantic contrast (HSC) method to learn a scene-aware VAD model from normal videos. We first incorporate foreground object and background scene features with high-level semantics by taking advantage of pre-trained video parsing models. Then, building upon the autoencoder-based reconstruction framework, we introduce both scene-level and object-level contrastive learning to enforce the encoded latent features to be compact within the same semantic classes while being separable across different classes. This hierarchical semantic contrast strategy helps to deal with the diversity of normal patterns and also increases their discrimination ability. Moreover, for the sake of tackling rare normal activities, we design a skeleton-based motion augmentation to increase samples and refine the model further. Extensive experiments on three public datasets and scene-dependent mixture datasets validate the effectiveness of our proposed method.

Self-Guided Diffusion Models
Hu, Vincent Tao and Zhang, David W. and Asano, Yuki M. and Burghouts, Gertjan J. and Snoek, Cees G. M.



Research question: This paper aims to eliminate the large number of annotated image-label pairs that conventional guided diffusion models require for training.
Motivation: Diffusion models have made remarkable progress in image generation quality, but guidance requires large amounts of image-annotation pairs for training and thus depends on their availability and correctness.
Method: We propose a framework for self-guided diffusion models that leverages a feature extraction function and a self-annotation function to provide guidance signals at various image granularities: holistic images, object boxes, and even segmentation masks.
Results: Experiments show that self-labeled guidance always outperforms diffusion models without guidance and may even surpass guidance based on ground-truth labels. When equipped with self-supervised box or mask proposals, the method generates visually diverse yet semantically consistent images without any class, box, or segmentation annotation. Self-guided diffusion is simple, flexible, and expected to profit from deployment at scale.

Diffusion models have demonstrated remarkable progress in image generation quality, especially when guidance is used to control the generative process. However, guidance requires a large amount of image-annotation pairs for training and is thus dependent on their availability and correctness. In this paper, we eliminate the need for such annotation by instead exploiting the flexibility of self-supervision signals to design a framework for self-guided diffusion models. By leveraging a feature extraction function and a self-annotation function, our method provides guidance signals at various image granularities: from the level of holistic images to object boxes and even segmentation masks. Our experiments on single-label and multi-label image datasets demonstrate that self-labeled guidance always outperforms diffusion models without guidance and may even surpass guidance based on ground-truth labels. When equipped with self-supervised box or mask proposals, our method further generates visually diverse yet semantically consistent images, without the need for any class, box, or segment label annotation. Self-guided diffusion is simple, flexible and expected to profit from deployment at scale.

Dense Distinct Query for End-to-End Object Detection
Zhang, Shilong and Wang, Xinjiang and Wang, Jiaqi and Pang, Jiangmiao and Lyu, Chengqi and Zhang, Wenwei and Luo, Ping and Chen, Kai



Research question: How to perform end-to-end one-to-one label assignment in object detection while overcoming two problems: sparse queries cannot guarantee high recall, and dense queries introduce many similar queries and optimization difficulty.
Motivation: Both sparse and dense queries are problematic in object detection, so a new form of query is needed to improve detection performance.
Method: We propose Dense Distinct Queries (DDQ): first lay dense queries as in traditional detectors, then select distinct ones for one-to-one assignment. DDQ blends the advantages of traditional methods and recent end-to-end detectors and significantly improves the performance of various detectors.
Results: Experiments show that DDQ-DETR achieves 52.1 AP on MS-COCO within 12 epochs using a ResNet-50 backbone, outperforming all existing detectors in the same setting. DDQ also excels in crowded scenes, achieving 93.8 AP on CrowdHuman.

One-to-one label assignment in object detection has successfully obviated the need of non-maximum suppression (NMS) as a postprocessing and makes the pipeline end-to-end. However, it triggers a new dilemma as the widely used sparse queries cannot guarantee a high recall, while dense queries inevitably bring more similar queries and encounters optimization difficulty. As both sparse and dense queries are problematic, then what are the expected queries in end-to-end object detection? This paper shows that the solution should be Dense Distinct Queries (DDQ). Concretely, we first lay dense queries like traditional detectors and then select distinct ones for one-to-one assignments. DDQ blends the advantages of traditional and recent end-to-end detectors and significantly improves the performance of various detectors including FCN, R-CNN, and DETRs. Most impressively, DDQ-DETR achieves 52.1 AP on MS-COCO dataset within 12 epochs using a ResNet-50 backbone, outperforming all existing detectors in the same setting. DDQ also shares the benefit of end-to-end detectors in crowded scenes and achieves 93.8 AP on CrowdHuman. We hope DDQ can inspire researchers to consider the complementarity between traditional methods and end-to-end detectors. The source code can be found at https://github.com/jshilong/DDQ.
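The "dense then distinct" selection can be illustrated as a class-agnostic suppression over dense proposals: keep high-scoring queries that are sufficiently dissimilar (by box IoU) from already-kept ones. This sketch and its names are ours, under the assumption that distinctness is measured by box overlap:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter)

def select_distinct(boxes, scores, iou_thr=0.7):
    """Keep high-scoring queries that do not overlap any kept query above iou_thr."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep
```

The surviving distinct queries would then enter the usual one-to-one assignment, which is why recall stays high (dense start) while optimization stays easy (distinct end).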

DETR With Additional Global Aggregation for Cross-Domain Weakly Supervised Object Detection
Tang, Zongheng and Sun, Yifan and Liu, Si and Yang, Yi



Research question: This paper proposes a DETR-based method for cross-domain weakly supervised object detection (CDWSOD), aiming to adapt a detector from the source domain to the target domain through weak supervision.
Motivation: Since both the encoder and the decoder of DETR are based on the attention mechanism and can aggregate semantics across the entire image, DETR has strong potential for CDWSOD.
Method: We propose DETR with additional Global Aggregation (DETR-GA), a CDWSOD detector that simultaneously makes "instance-level + image-level" predictions and utilizes "strong + weak" supervision. The key idea is simple: for the encoder / decoder, we respectively add multiple class queries / a foreground query to aggregate semantics into image-level predictions.
Results: Extensive experiments on four popular cross-domain benchmarks show that DETR-GA significantly improves CDWSOD and surpasses the state of the art on several datasets; for example, mAP improves from 29.0% to 79.4% on PASCAL VOC --> Clipart_all.

This paper presents a DETR-based method for cross-domain weakly supervised object detection (CDWSOD), aiming at adapting the detector from source to target domain through weak supervision. We think DETR has strong potential for CDWSOD due to an insight: the encoder and the decoder in DETR are both based on the attention mechanism and are thus capable of aggregating semantics across the entire image. The aggregation results, i.e., image-level predictions, can naturally exploit the weak supervision for domain alignment. Thus motivated, we propose DETR with additional Global Aggregation (DETR-GA), a CDWSOD detector that simultaneously makes "instance-level + image-level" predictions and utilizes "strong + weak" supervisions. The key point of DETR-GA is very simple: for the encoder / decoder, we respectively add multiple class queries / a foreground query to aggregate the semantics into image-level predictions. Our query-based aggregation has two advantages. First, in the encoder, the weakly-supervised class queries are capable of roughly locating the corresponding positions and excluding the distraction from non-relevant regions. Second, through our design, the object queries and the foreground query in the decoder share consensus on the class semantics, therefore making the strong and weak supervision mutually benefit each other for domain alignment. Extensive experiments on four popular cross-domain benchmarks show that DETR-GA significantly improves CDWSOD and advances the state of the art (e.g., 29.0% --> 79.4% mAP on PASCAL VOC --> Clipart_all dataset).

Multiple Instance Learning via Iterative Self-Paced Supervised Contrastive Learning
Liu, Kangning and Zhu, Weicheng and Shen, Yiqiu and Liu, Sheng and Razavian, Narges and Geras, Krzysztof J. and Fernandez-Granda, Carlos



Research question: How to learn representations for individual instances when only bag-level labels are available, a fundamental challenge in multiple instance learning (MIL).
Motivation: In real-world applications such as medical image classification, class imbalance is common, so randomly selected instances mostly belong to the same majority class, which precludes contrastive self-supervised learning (CSSL) from learning inter-class differences.
Method: We propose a novel framework, Iterative Self-paced Supervised Contrastive Learning for MIL Representations (ItS2CLR), which improves the learned representation by exploiting instance-level pseudo labels derived from bag-level labels. The framework employs a novel self-paced sampling strategy to ensure the accuracy of the pseudo labels.
Results: Evaluated on three medical datasets, ItS2CLR improves the quality of instance-level pseudo labels and representations, and outperforms existing MIL methods in both bag-level and instance-level accuracy. Code is available at https://github.com/Kangningthu/ItS2CLR.

Learning representations for individual instances when only bag-level labels are available is a fundamental challenge in multiple instance learning (MIL). Recent works have shown promising results using contrastive self-supervised learning (CSSL), which learns to push apart representations corresponding to two different randomly-selected instances. Unfortunately, in real-world applications such as medical image classification, there is often class imbalance, so randomly-selected instances mostly belong to the same majority class, which precludes CSSL from learning inter-class differences. To address this issue, we propose a novel framework, Iterative Self-paced Supervised Contrastive Learning for MIL Representations (ItS2CLR), which improves the learned representation by exploiting instance-level pseudo labels derived from the bag-level labels. The framework employs a novel self-paced sampling strategy to ensure the accuracy of pseudo labels. We evaluate ItS2CLR on three medical datasets, showing that it improves the quality of instance-level pseudo labels and representations, and outperforms existing MIL methods in terms of both bag and instance level accuracy. Code is available at https://github.com/Kangningthu/ItS2CLR
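A self-paced sampling strategy of this flavor can be sketched as a curriculum that trusts only the top fraction of instances ranked by the current MIL confidence, with that fraction growing over training. The schedule, thresholds, and function names below are our assumptions, not the paper's exact recipe:

```python
def self_paced_select(scores, ratio):
    """Return indices of the top `ratio` fraction of instances by confidence,
    which receive instance-level pseudo labels for contrastive training."""
    n_keep = max(1, int(len(scores) * ratio))
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sorted(order[:n_keep])

def ratio_schedule(iteration, total_iters, r0=0.2, r1=0.8):
    """Linearly grow the trusted fraction from r0 to r1 over training,
    so early iterations rely only on the most confident pseudo labels."""
    t = iteration / max(1, total_iters)
    return r0 + (r1 - r0) * t
```

Starting conservative and widening the trusted set is what keeps early pseudo-label noise from contaminating the learned representation.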

Weak-Shot Object Detection Through Mutual Knowledge Transfer
Du, Xuanyi and Wan, Weitao and Sun, Chong and Li, Chen



Research question: This paper addresses weak-shot object detection, where the target dataset contains only image-level labels for novel categories.
Motivation: Bi-directionally transferring object knowledge between the source dataset and the target dataset can improve detection performance on the target dataset.
Method: We propose a novel knowledge transfer loss that simultaneously distills objectness and class-entropy knowledge from a proposal generator trained on the source dataset to optimize a multiple instance learning module on the target dataset.
Results: By jointly optimizing the classification loss and the proposed knowledge transfer loss, the multiple instance learning module effectively learns to classify proposals into novel categories in the target dataset using knowledge transferred from base categories in the source dataset. Experiments show that the method significantly improves detection performance on the target dataset without increasing model parameters or inference computational complexity.

Weak-shot Object Detection methods exploit a fully-annotated source dataset to facilitate the detection performance on the target dataset which only contains image-level labels for novel categories. To bridge the gap between these two datasets, we aim to transfer the object knowledge between the source (S) and target (T) datasets in a bi-directional manner. We propose a novel Knowledge Transfer (KT) loss which simultaneously distills the knowledge of objectness and class entropy from a proposal generator trained on the S dataset to optimize a multiple instance learning module on the T dataset. By jointly optimizing the classification loss and the proposed KT loss, the multiple instance learning module effectively learns to classify object proposals into novel categories in the T dataset with the transferred knowledge from base categories in the S dataset. Noticing the predicted boxes on the T dataset can be regarded as an extension for the original annotations on the S dataset to refine the proposal generator in return, we further propose a novel Consistency Filtering (CF) method to reliably remove inaccurate pseudo labels by evaluating the stability of the multiple instance learning module upon noise injections. Via mutually transferring knowledge between the S and T datasets in an iterative manner, the detection performance on the target dataset is significantly improved. Extensive experiments on public benchmarks validate that the proposed method performs favourably against the state-of-the-art methods without increasing the model parameters or inference computational complexity.

CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model
Liang, Dingkang and Xie, Jiahao and Zou, Zhikang and Ye, Xiaoqing and Xu, Wei and Bai, Xiang



Research question: This paper aims to eliminate the heavy manual annotation that supervised crowd counting requires.
Motivation: Supervised crowd counting relies on extensive manual labeling, which is difficult and expensive, especially in dense scenes.
Method: We propose CrowdCLIP, a novel unsupervised framework for crowd counting. It leverages the contrastive pre-trained vision-language model (CLIP) and the natural mapping between crowd patches and count text, constructing ranking text prompts to match size-sorted crowd patches and thereby guide the image encoder's learning.
Results: Experiments show that CrowdCLIP outperforms previous unsupervised state-of-the-art counting methods on five challenging datasets, and even surpasses some popular fully supervised methods under certain cross-dataset settings.

Supervised crowd counting relies heavily on costly manual labeling, which is difficult and expensive, especially in dense scenes. To alleviate the problem, we propose a novel unsupervised framework for crowd counting, named CrowdCLIP. The core idea is built on two observations: 1) the recent contrastive pre-trained vision-language model (CLIP) has presented impressive performance on various downstream tasks; 2) there is a natural mapping between crowd patches and count text. To the best of our knowledge, CrowdCLIP is the first to investigate the vision-language knowledge to solve the counting problem. Specifically, in the training stage, we exploit the multi-modal ranking loss by constructing ranking text prompts to match the size-sorted crowd patches to guide the image encoder learning. In the testing stage, to deal with the diversity of image patches, we propose a simple yet effective progressive filtering strategy to first select the highly potential crowd patches and then map them into the language space with various counting intervals. Extensive experiments on five challenging datasets demonstrate that the proposed CrowdCLIP achieves superior performance compared to previous unsupervised state-of-the-art counting methods. Notably, CrowdCLIP even surpasses some popular fully-supervised methods under the cross-dataset setting. The source code will be available at https://github.com/dk-liang/CrowdCLIP.
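The ranking idea, that patches sorted by crowd size should score in the same order against count prompts, can be reduced to a pairwise hinge loss on predicted counts. This is a toy sketch with hypothetical names, not the paper's exact multi-modal ranking loss:

```python
def pairwise_ranking_loss(preds, margin=0.0):
    """Penalize order inversions. `preds` come from patches sorted from smaller
    to larger crowds, so each later prediction should exceed each earlier one
    by at least `margin`."""
    loss = 0.0
    for i in range(len(preds)):
        for j in range(i + 1, len(preds)):
            loss += max(0.0, preds[i] - preds[j] + margin)
    return loss
```

A perfectly ordered set of predictions incurs zero loss, while every inverted pair contributes its violation amount.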

Geometric Visual Similarity Learning in 3D Medical Image Self-Supervised Pre-Training
He, Yuting and Yang, Guanyu and Ge, Rongjun and Chen, Yang and Coatrieux, Jean-Louis and Wang, Boyu and Li, Shuo



Research question: Learning inter-image similarity is crucial for self-supervised pre-training of 3D medical images, but the lack of semantic priors in metrics and the semantic-independent variation in 3D medical images make reliable inter-image similarity measurement difficult, hindering the learning of consistent representations for the same semantics.
Motivation: 3D medical images share numerous regions with the same semantics, so learning inter-image similarity is crucial for their self-supervised pre-training. However, existing metrics lack semantic priors, and semantic-independent variation makes reliable similarity measurement difficult.
Method: We propose a novel visual similarity learning paradigm, Geometric Visual Similarity Learning (GVSL), which embeds the prior of topological invariance into the measurement of inter-image similarity for consistent representation of semantic regions. To drive this paradigm, we further construct a novel geometric matching head, the Z-matching head, which collaboratively learns the global and local similarity of semantic regions, guiding efficient representation learning for inter-image semantic features at different scale levels.
Results: Experiments demonstrate that pre-training with our inter-image similarity learning yields stronger inner-scene, inter-scene, and global-local transfer ability on four challenging 3D medical image tasks. Code and pre-trained models will be publicly available at https://github.com/YutingHe-list/GVSL.

Learning inter-image similarity is crucial for 3D medical images self-supervised pre-training, due to their sharing of numerous same semantic regions. However, the lack of the semantic prior in metrics and the semantic-independent variation in 3D medical images make it challenging to get a reliable measurement for the inter-image similarity, hindering the learning of consistent representation for same semantics. We investigate the challenging problem of this task, i.e., learning a consistent representation between images for a clustering effect of same semantic features. We propose a novel visual similarity learning paradigm, Geometric Visual Similarity Learning, which embeds the prior of topological invariance into the measurement of the inter-image similarity for consistent representation of semantic regions. To drive this paradigm, we further construct a novel geometric matching head, the Z-matching head, to collaboratively learn the global and local similarity of semantic regions, guiding the efficient representation learning for different scale-level inter-image semantic features. Our experiments demonstrate that the pre-training with our learning of inter-image similarity yields more powerful inner-scene, inter-scene, and global-local transferring ability on four challenging 3D medical image tasks. Our codes and pre-trained models will be publicly available in https://github.com/YutingHe-list/GVSL.

Enhanced Training of Query-Based Object Detection via Selective Query Recollection
Chen, Fangyi and Zhang, Han and Hu, Kai and Huang, Yu-Kai and Zhu, Chenchen and Savvides, Marios



Research question: This paper investigates the phenomenon that query-based object detectors mispredict at the last decoding stage while predicting correctly at an intermediate stage.
Motivation: The authors attribute this phenomenon to two limitations of the training process: lack of training emphasis and cascading errors from the decoding sequence.
Method: The authors design Selective Query Recollection (SQR), a simple and effective training strategy. It cumulatively collects intermediate queries as decoding stages go deeper and selectively forwards them to downstream stages aside from the sequential structure. SQR thus places training emphasis on later stages and allows later stages to work directly with intermediate queries from earlier stages.
Results: Applied to Adamixer, DAB-DETR, and Deformable-DETR across various settings (backbone, number of queries, schedule), SQR consistently brings improvements of 1.4 to 2.8 AP.

This paper investigates a phenomenon where query-based object detectors mispredict at the last decoding stage while predicting correctly at an intermediate stage. We review the training process and attribute the overlooked phenomenon to two limitations: lack of training emphasis and cascading errors from decoding sequence. We design and present Selective Query Recollection (SQR), a simple and effective training strategy for query-based object detectors. It cumulatively collects intermediate queries as decoding stages go deeper and selectively forwards the queries to the downstream stages aside from the sequential structure. In this way, SQR places training emphasis on later stages and allows later stages to work with intermediate queries from earlier stages directly. SQR can be easily plugged into various query-based object detectors and significantly enhances their performance while leaving the inference pipeline unchanged. As a result, we apply SQR on Adamixer, DAB-DETR, and Deformable-DETR across various settings (backbone, number of queries, schedule) and consistently achieve 1.4 to 2.8 AP improvement.
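If each stage selectively recollects the queries emitted by its two preceding stages, the number of query groups reaching stage s grows in a Fibonacci-like fashion. A minimal sketch of that bookkeeping (the function name is ours, and the "previous two stages" rule is our reading of the selective scheme):

```python
def sqr_query_groups(num_stages):
    """Query groups fed into each decoding stage when stage s also receives
    the recollected outputs of stages s-1 and s-2 (selective recollection).
    Stage indices are 0-based; the first two stages get one group each."""
    groups = [1, 1]
    while len(groups) < num_stages:
        groups.append(groups[-1] + groups[-2])
    return groups[:num_stages]
```

The growth stays far below the exponential cost of forwarding every intermediate query to every later stage, which is what makes the strategy practical.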

Unbiased Multiple Instance Learning for Weakly Supervised Video Anomaly Detection
Lv, Hui and Yue, Zhongqi and Sun, Qianru and Luo, Bin and Cui, Zhen and Zhang, Hanwang



Research question: Weakly supervised video anomaly detection (WSVAD) is challenging because anomaly labels are given only at the video level, while the output requires snippet-level predictions.
Motivation: Multiple instance learning (MIL) is prevalent in WSVAD, but snippet-level detectors are easily biased towards abnormal snippets with simple context, confused by normal snippets with the same bias, and miss anomalies with different patterns, producing many false alarms.
Method: We propose a new MIL framework, Unbiased MIL (UMIL), to learn unbiased anomaly features that improve WSVAD. At each MIL training iteration, the current detector divides the samples into two groups with different context biases: the most confident abnormal/normal snippets and the remaining ambiguous ones. By seeking features that are invariant across the two groups, the variant context biases can be removed.
Results: Extensive experiments on the UCF-Crime and TAD benchmarks demonstrate the effectiveness of UMIL.

Weakly Supervised Video Anomaly Detection (WSVAD) is challenging because the binary anomaly label is only given on the video level, but the output requires snippet-level predictions. So, Multiple Instance Learning (MIL) is prevailing in WSVAD. However, MIL is notoriously known to suffer from many false alarms because the snippet-level detector is easily biased towards the abnormal snippets with simple context, confused by the normality with the same bias, and missing the anomaly with a different pattern. To this end, we propose a new MIL framework: Unbiased MIL (UMIL), to learn unbiased anomaly features that improve WSVAD. At each MIL training iteration, we use the current detector to divide the samples into two groups with different context biases: the most confident abnormal/normal snippets and the rest ambiguous ones. Then, by seeking the invariant features across the two sample groups, we can remove the variant context biases. Extensive experiments on benchmarks UCF-Crime and TAD demonstrate the effectiveness of our UMIL. Our code is provided at https://github.com/ktr-hubrt/UMIL.
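The per-iteration split into confident and ambiguous snippet groups can be sketched with two thresholds on the current detector's anomaly scores; the threshold values and function name are our assumptions:

```python
def split_by_confidence(scores, thr_low=0.2, thr_high=0.8):
    """Confident snippets: clearly abnormal (>= thr_high) or clearly normal
    (<= thr_low); the rest are ambiguous. Invariant features are then sought
    across the two groups to strip away context bias."""
    confident = [i for i, s in enumerate(scores) if s >= thr_high or s <= thr_low]
    ambiguous = [i for i, s in enumerate(scores) if thr_low < s < thr_high]
    return confident, ambiguous
```

Because the two groups share the same underlying anomaly semantics but differ in context bias, features that agree across them are, by construction, the unbiased ones.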

Rethinking the Correlation in Few-Shot Segmentation: A Buoys View
Wang, Yuan and Sun, Rui and Zhang, Tianzhu



Research question: How to reduce false matches caused by pixel-level correlation when segmenting novel objects in a query image with only a few annotated support images.
Motivation: Most previous best-performing methods, whether prototype learning or affinity learning approaches, neglect the false matches caused by their own pixel-level correlation.
Method: We rethink how to mitigate false matches from the perspective of representative reference features (referred to as buoys) and propose a novel adaptive buoys correlation (ABC) network to rectify direct pixel-level correlation, comprising a buoys mining module and an adaptive correlation module.
Results: Extensive experiments with two different backbones on two challenging benchmarks show that ABC, as a general plugin, achieves consistent improvements over several leading methods in both 1-shot and 5-shot settings.

Few-shot segmentation (FSS) aims to segment novel objects in a given query image with only a few annotated support images. However, most previous best-performing methods, whether prototypical learning methods or affinity learning methods, neglect to alleviate false matches caused by their own pixel-level correlation. In this work, we rethink how to mitigate the false matches from the perspective of representative reference features (referred to as buoys), and propose a novel adaptive buoys correlation (ABC) network to rectify direct pairwise pixel-level correlation, including a buoys mining module and an adaptive correlation module. The proposed ABC enjoys several merits. First, to learn the buoys well without any correspondence supervision, we customize the buoys mining module according to the three characteristics of representativeness, task awareness and resilience. Second, the proposed adaptive correlation module is responsible for further endowing buoy-correlation-based pixel matching with an adaptive ability. Extensive experimental results with two different backbones on two challenging benchmarks demonstrate that our ABC, as a general plugin, achieves consistent improvements over several leading methods on both 1-shot and 5-shot settings.

Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization
Ren, Huan and Yang, Wenfei and Zhang, Tianzhu and Zhang, Yongdong



Research question: Weakly supervised temporal action localization aims to localize and recognize actions in untrimmed videos using only video-level category labels during training.
Motivation: Most existing methods follow the segment-based multiple instance learning (S-MIL) framework, where segment predictions are supervised by video labels. However, the training objective of acquiring segment-level scores is inconsistent with the testing target of acquiring proposal-level scores, leading to suboptimal results.
Method: We propose a novel proposal-based multiple instance learning (P-MIL) framework that directly classifies candidate proposals in both the training and testing stages, with three key designs: 1) a surrounding contrastive feature extraction module that suppresses discriminative short proposals by considering surrounding contrastive information; 2) a proposal completeness evaluation module that inhibits low-quality proposals under the guidance of completeness pseudo labels; 3) an instance-level rank consistency loss that achieves robust detection by leveraging the complementarity of the RGB and FLOW modalities.
Results: Extensive experiments on two challenging benchmarks, THUMOS14 and ActivityNet, demonstrate the superior performance of our method. Code is available at github.com/OpenSpaceAI/CVPR2023_P-MIL.

Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training. Without instance-level annotations, most existing methods follow the Segment-based Multiple Instance Learning (S-MIL) framework, where the predictions of segments are supervised by the labels of videos. However, the objective for acquiring segment-level scores during training is not consistent with the target for acquiring proposal-level scores during testing, leading to suboptimal results. To deal with this problem, we propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages, which includes three key designs: 1) a surrounding contrastive feature extraction module to suppress the discriminative short proposals by considering the surrounding contrastive information, 2) a proposal completeness evaluation module to inhibit the low-quality proposals with the guidance of the completeness pseudo labels, and 3) an instance-level rank consistency loss to achieve robust detection by leveraging the complementarity of RGB and FLOW modalities. Extensive experimental results on two challenging benchmarks including THUMOS14 and ActivityNet demonstrate the superior performance of our method. Our code is available at github.com/OpenSpaceAI/CVPR2023_P-MIL.

Interventional Bag Multi-Instance Learning on Whole-Slide Pathological Images
Lin, Tiancheng and Yu, Zhimiao and Hu, Hongyu and Xu, Yi and Chen, Chang-Wen



Research question: How to improve existing multi-instance learning (MIL) methods for whole-slide pathological image (WSI) classification, which is difficult due to gigapixel resolution and slide-level labels.
Motivation: Prevailing MIL methods focus on improving the feature extractor and aggregator, but a key deficiency remains: the bag contextual prior may trick the model into capturing spurious correlations between bags and labels, limiting the performance of existing MIL methods.
Method: We propose a novel scheme, Interventional Bag Multi-Instance Learning (IBMIL), which achieves interventional training via backdoor adjustment, thereby suppressing the bias caused by the bag contextual prior.
Results: Experiments show that IBMIL brings consistent performance boosts to existing schemes, achieving new state-of-the-art performance.

Multi-instance learning (MIL) is an effective paradigm for whole-slide pathological images (WSIs) classification to handle the gigapixel resolution and slide-level label. Prevailing MIL methods primarily focus on improving the feature extractor and aggregator. However, one deficiency of these methods is that the bag contextual prior may trick the model into capturing spurious correlations between bags and labels. This deficiency is a confounder that limits the performance of existing MIL methods. In this paper, we propose a novel scheme, Interventional Bag Multi-Instance Learning (IBMIL), to achieve deconfounded bag-level prediction. Unlike traditional likelihood-based strategies, the proposed scheme is based on the backdoor adjustment to achieve the interventional training, thus is capable of suppressing the bias caused by the bag contextual prior. Note that the principle of IBMIL is orthogonal to existing bag MIL methods. Therefore, IBMIL is able to bring consistent performance boosting to existing schemes, achieving new state-of-the-art performance. Code is available at https://github.com/HHHedo/IBMIL.
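Backdoor adjustment over a discrete confounder (here, the bag context) replaces the observational P(Y|X) with an intervention that averages over confounder values. A minimal numeric sketch; the distributions are illustrative, not from the paper:

```python
def backdoor_adjust(p_y_given_x_c, p_c):
    """P(Y | do(X)) = sum_c P(Y | X, c) * P(c) for a discrete confounder c.

    p_y_given_x_c[i] -- P(Y | X, c=i), the prediction conditioned on context i
    p_c[i]           -- P(c=i), the prior over contexts (must sum to 1)
    """
    return sum(p * pc for p, pc in zip(p_y_given_x_c, p_c))
```

Weighting each context-conditioned prediction by the context prior, instead of by the context actually observed with the bag, is exactly what cuts the spurious bag-context-to-label path.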

SegLoc: Learning Segmentation-Based Representations for Privacy-Preserving Visual Localization
Pietrantoni, Maxime and Humenberger, Martin and Sattler, Torsten and Csurka, Gabriela



Research question: This paper investigates how robust image segmentation can be leveraged for privacy-preserving visual localization.
Motivation: Existing visual localization methods lack protection of personal privacy, whereas image segmentation can create robust, compact, and privacy-preserving scene representations.
Method: We propose a new localization framework, SegLoc, that leverages image segmentation to create robust, compact, and privacy-preserving scene representations, i.e., 3D maps. We make the correspondence-supervised, fine-grained segmentation approach of Larsson et al. (ICCV'19) more robust by learning a set of discriminative cluster labels with additional consistency regularization terms, and we jointly learn a global image representation along with a dense local representation.
Results: Experiments show that the proposed representation achieves (close-to) state-of-the-art pose estimation results while using only a compact 3D map that does not contain enough information for an attacker to reconstruct personal information.

Inspired by properties of semantic segmentation, in this paper we investigate how to leverage robust image segmentation in the context of privacy-preserving visual localization. We propose a new localization framework, SegLoc, that leverages image segmentation to create robust, compact, and privacy-preserving scene representations, i.e., 3D maps. We build upon the correspondence-supervised, fine-grained segmentation approach from Larsson et al (ICCV'19), making it more robust by learning a set of cluster labels with discriminative clustering, additional consistency regularization terms and we jointly learn a global image representation along with a dense local representation. In our localization pipeline, the former will be used for retrieving the most similar images, the latter to refine the retrieved poses by minimizing the label inconsistency between the 3D points of the map and their projection onto the query image. In various experiments, we show that our proposed representation allows to achieve (close-to) state-of-the-art pose estimation results while only using a compact 3D map that does not contain enough information about the original images for an attacker to reconstruct personal information.

RankMix: Data Augmentation for Weakly Supervised Learning of Classifying Whole Slide Images With Diverse Sizes and Imbalanced Categories
Chen, Yuan-Chih and Lu, Chun-Shien



Research question: How to classify large-scale, imbalanced whole slide images (WSIs), a weakly supervised learning problem.
Motivation: WSIs are usually gigapixel in size, lack pixel-level annotations, and have extremely imbalanced category distributions, which makes their classification challenging.
Method: We propose RankMix, a data augmentation method that mixes the ranked features of a pair of WSIs. RankMix introduces the concepts of pseudo labeling and ranking to extract key WSI regions that contribute to the classification task, and a two-stage training scheme further boosts training stability and model performance.
Results: RankMix performs well on WSI classification problems that suffer from a lack of training data and class imbalance, offering a new perspective on such problems.

Whole Slide Images (WSIs) are usually gigapixel in size and lack pixel-level annotations. The WSI datasets are also imbalanced in categories. These unique characteristics, significantly different from the ones in natural images, pose the challenge of classifying WSI images as a kind of weakly supervised learning problem. In this study, we propose RankMix, a data augmentation method of mixing ranked features in a pair of WSIs. RankMix introduces the concepts of pseudo labeling and ranking in order to extract key WSI regions in contributing to the WSI classification task. A two-stage training is further proposed to boost stable training and model performance. To our knowledge, the study of weakly supervised learning from the perspective of data augmentation to deal with the WSI classification problem that suffers from lack of training data and imbalance of categories is relatively unexplored.
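Mixing ranked features of a pair of WSIs can be sketched as: rank each bag's instance features by a score (e.g., attention or pseudo-label confidence), keep the top-ranked ones, and mix them position-wise with a coefficient lambda. A simplified illustration; the names and the fixed `keep` count are our assumptions:

```python
def top_ranked(feats, scores, keep):
    """Instance features sorted by descending score, truncated to `keep`."""
    order = sorted(range(len(feats)), key=lambda i: -scores[i])[:keep]
    return [feats[i] for i in order]

def rankmix(feats_a, scores_a, feats_b, scores_b, keep=2, lam=0.5):
    """Position-wise convex mix of the top-`keep` ranked features of two WSIs.
    Rank i of slide A is mixed with rank i of slide B, so key regions are
    combined with key regions rather than with background."""
    ta = top_ranked(feats_a, scores_a, keep)
    tb = top_ranked(feats_b, scores_b, keep)
    return [[lam * x + (1 - lam) * y for x, y in zip(fa, fb)]
            for fa, fb in zip(ta, tb)]
```

The corresponding mixed label would follow the usual mixup convention, lam times label A plus (1 - lam) times label B.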

Mask-Guided Matting in the Wild
Park, Kwanyong and Woo, Sanghyun and Oh, Seoung Wug and Kweon, In So and Lee, Joon-Young



Research question: How can mask-guided matting be extended to real-world scenes, covering a wide range of categories against complex backgrounds?
Motivation: Traditional trimap-based methods are impractical in real applications; mask-guided matting is practical, but must robustly handle a much wider range of categories and complex contexts.
Method: We propose a simple yet effective learning framework that 1) learns a generalized matting model able to better understand the given mask guidance, and 2) leverages weak supervision datasets (e.g., instance segmentation datasets) to alleviate the limited diversity and scale of existing matting datasets.
Results: Extensive experiments on multiple benchmarks, including a newly proposed synthetic benchmark (Composition-Wild) and existing natural datasets, demonstrate the superiority of the method. It also produces appealing results on new practical applications (e.g., panoptic matting and mask-guided video matting), showing the strong generality and potential of the model.

Mask-guided matting has shown great practicality compared to traditional trimap-based methods. The mask-guided approach takes an easily-obtainable coarse mask as guidance and produces an accurate alpha matte. To extend the success toward practical usage, we tackle mask-guided matting in the wild, which covers a wide range of categories in their complex context robustly. To this end, we propose a simple yet effective learning framework based on two core insights: 1) learning a generalized matting model that can better understand the given mask guidance and 2) leveraging weak supervision datasets (e.g., instance segmentation dataset) to alleviate the limited diversity and scale of existing matting datasets. Extensive experimental results on multiple benchmarks, consisting of a newly proposed synthetic benchmark (Composition-Wild) and existing natural datasets, demonstrate the superiority of the proposed method. Moreover, we provide appealing results on new practical applications (e.g., panoptic matting and mask-guided video matting), showing the great generality and potential of our model.

Dynamic Conceptional Contrastive Learning for Generalized Category Discovery
Pu, Nan and Zhong, Zhun and Sebe, Nicu



Research question: This paper addresses generalized category discovery (GCD): how to automatically cluster partially labeled data.
Motivation: Traditional novel category discovery (NCD) methods are incapacitated for GCD because they assume that unlabeled data come only from novel categories.
Method: We propose a Dynamic Conceptional Contrastive Learning (DCCL) framework that improves clustering accuracy by alternately estimating underlying visual conceptions and learning conceptional representations. We also design a dynamic conception generation and update mechanism that ensures consistent conception learning and thus further facilitates the optimization of DCCL.
Results: Experiments show that DCCL achieves new state-of-the-art performance on six generic and fine-grained visual recognition datasets, with particularly strong results on the fine-grained ones; for example, it surpasses the best competitor by 16.2% on the new classes of the CUB-200 dataset.

Generalized category discovery (GCD) is a recently proposed open-world problem, which aims to automatically cluster partially labeled data. The main challenge is that the unlabeled data contain instances that are not only from known categories of the labeled data but also from novel categories. This leaves traditional novel category discovery (NCD) methods incapacitated for GCD, due to their assumption that unlabeled data are only from novel categories. One effective way for GCD is applying self-supervised learning to learn discriminative representations for unlabeled data. However, this manner largely ignores underlying relationships between instances of the same concepts (e.g., class, super-class, and sub-class), which results in inferior representation learning. In this paper, we propose a Dynamic Conceptional Contrastive Learning (DCCL) framework, which can effectively improve clustering accuracy by alternately estimating underlying visual conceptions and learning conceptional representation. In addition, we design a dynamic conception generation and update mechanism, which is able to ensure consistent conception learning and thus further facilitate the optimization of DCCL. Extensive experiments show that DCCL achieves new state-of-the-art performances on six generic and fine-grained visual recognition datasets, especially on fine-grained ones. For example, our method significantly surpasses the best competitor by 16.2% on the new classes for the CUB-200 dataset. Code is available at https://github.com/TPCD/DCCL

Zero-Shot Referring Image Segmentation With Global-Local Context Features
Yu, Seonghoon and Seo, Paul Hongsuck and Son, Jeany



Research question: How to perform referring image segmentation, finding a segmentation mask for a referring expression grounded to a region of an input image, without costly labeled datasets.
Motivation: Collecting labeled datasets for referring image segmentation is notoriously costly and labor-intensive; the pre-trained cross-modal knowledge of CLIP can be leveraged instead.
Method: We propose a simple yet effective zero-shot referring image segmentation method. A mask-guided visual encoder captures global and local contextual information of the input image, using instance masks from off-the-shelf mask proposal techniques to segment fine-detailed instance-level groundings. A global-local text encoder combines sentence-level semantics of the entire expression with local features focused on the target noun phrase extracted by a dependency parser.
Results: Experiments show that the proposed method outperforms several zero-shot baselines and even a weakly supervised referring expression segmentation method by substantial margins.

Referring image segmentation (RIS) aims to find a segmentation mask given a referring expression grounded to a region of the input image. Collecting labelled datasets for this task, however, is notoriously costly and labor-intensive. To overcome this issue, we propose a simple yet effective zero-shot referring image segmentation method by leveraging the pre-trained cross-modal knowledge from CLIP. In order to obtain segmentation masks grounded to the input text, we propose a mask-guided visual encoder that captures global and local contextual information of an input image. By utilizing instance masks obtained from off-the-shelf mask proposal techniques, our method is able to segment fine-detailed instance-level groundings. We also introduce a global-local text encoder where the global feature captures complex sentence-level semantics of the entire input expression while the local feature focuses on the target noun phrase extracted by a dependency parser. In our experiments, the proposed method outperforms several zero-shot baselines of the task and even the weakly supervised referring expression segmentation method with substantial margins. Our code is available at https://github.com/Seonghoon-Yu/Zero-shot-RIS.

Weakly Supervised Monocular 3D Object Detection Using Multi-View Projection and Direction Consistency
Tao, Runzhou and Han, Wencheng and Qiu, Zhongying and Xu, Cheng-Zhong and Shen, Jianbing



Research question: This paper addresses the inconsistency between training and inference in monocular 3D object detection: training relies on 3D point cloud data to label ground truths, while inference does not need it.
Motivation: Most current monocular 3D detection methods rely on 3D point cloud data for ground-truth labeling during training, which increases data collection cost and is inconsistent with the practical requirements of the inference stage.
Method: We propose a new weakly supervised monocular 3D object detection method trained with only 2D labels marked on images. We explore three types of consistency, namely projection, multi-view, and direction consistency, and design a weakly supervised architecture based on them. We also propose a new 2D direction labeling method to guide the model towards accurate rotation direction prediction.
Results: Experiments show the method achieves performance comparable to some fully supervised methods. Used as a pre-training method, it significantly outperforms the corresponding fully supervised baseline with only 1/3 of the 3D labels.

Monocular 3D object detection has become a mainstream approach in automatic driving for its easy application. A prominent advantage is that it does not need LiDAR point clouds during the inference. However, most current methods still rely on 3D point cloud data for labeling the ground truths used in the training phase. This inconsistency between the training and inference makes it hard to utilize the large-scale feedback data and increases the data collection expenses. To bridge this gap, we propose a new weakly supervised monocular 3D objection detection method, which can train the model with only 2D labels marked on images. To be specific, we explore three types of consistency in this task, i.e. the projection, multi-view and direction consistency, and design a weakly-supervised architecture based on these consistencies. Moreover, we propose a new 2D direction labeling method in this task to guide the model for accurate rotation direction prediction. Experiments show that our weakly-supervised method achieves comparable performance with some fully supervised methods. When used as a pre-training method, our model can significantly outperform the corresponding fully-supervised baseline with only 1/3 3D labels.
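Of the three consistencies, projection consistency is the easiest to sketch: the center of a predicted 3D box, projected through the camera intrinsics, should fall inside the annotated 2D box. A toy pinhole example; the function names and the center-only check are our simplification of the idea:

```python
def project(point3d, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to pixel (u, v)."""
    x, y, z = point3d
    return (fx * x / z + cx, fy * y / z + cy)

def projection_consistent(center3d, box2d, intrinsics):
    """True if the projected 3D box center lies inside the 2D box (x1, y1, x2, y2).
    intrinsics = (fx, fy, cx, cy)."""
    u, v = project(center3d, *intrinsics)
    x1, y1, x2, y2 = box2d
    return x1 <= u <= x2 and y1 <= v <= y2
```

Turned into a differentiable loss (e.g., a penalty on the distance from the projected center to the 2D box), this check supervises 3D predictions using only 2D annotations.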

Towards Open-World Segmentation of Parts
Pan, Tai-Yu and Liu, Qing and Chao, Wei-Lun and Price, Brian



Research question: How to segment object parts effectively, especially on objects unseen in training?
Motivation: The largest existing dataset covers only 200 object categories, making it hard to scale part segmentation up to an unconstrained setting.
Method: A seemingly simplified but practical and scalable task, class-agnostic part segmentation, is proposed: part class labels are ignored during training and all parts are treated as a single part class. The model is further made object-aware, and the pixel-wise features it extracts are exploited to segment parts on unseen objects.
Results: Extensive experiments on PartImageNet and Pascal-Part demonstrate the effectiveness of the approach, a critical step towards open-world part segmentation.

Segmenting object parts such as cup handles and animal bodies is important in many real-world applications but requires more annotation effort. The largest dataset nowadays contains merely two hundred object categories, implying the difficulty to scale up part segmentation to an unconstrained setting. To address this, we propose to explore a seemingly simplified but empirically useful and scalable task, class-agnostic part segmentation. In this problem, we disregard the part class labels in training and instead treat all of them as a single part class. We argue and demonstrate that models trained without part classes can better localize parts and segment them on objects unseen in training. We then present two further improvements. First, we propose to make the model object-aware, leveraging the fact that parts are "compositions" whose extents are bounded by objects, whose appearances are by nature not independent but bundled. Second, we introduce a novel approach to improve part segmentation on unseen objects, inspired by an interesting finding --- for unseen objects, the pixel-wise features extracted by the model often reveal high-quality part segments. To this end, we propose a novel self-supervised procedure that iterates between pixel clustering and supervised contrastive learning that pulls pixels closer or pushes them away. Via extensive experiments on PartImageNet and Pascal-Part, we show notable and consistent gains by our approach, essentially a critical step towards open-world part segmentation.

DualRel: Semi-Supervised Mitochondria Segmentation From a Prototype Perspective
Mai, Huayu and Sun, Rui and Zhang, Tianzhu and Xiong, Zhiwei and Wu, Feng



Research question: How to perform semi-supervised mitochondria segmentation effectively and reduce manual annotation cost.
Motivation: Existing mitochondria segmentation methods rely heavily on labor-intensive manual annotation by experienced domain experts, and naively applying semi-supervised segmentation methods from the natural image domain to mitochondria images is undesirable.
Method: The gap between mitochondrial and natural images is analyzed, and effective semi-supervised mitochondria segmentation is rethought from the perspective of reliable prototype-level supervision. A novel end-to-end dual-reliable (DualRel) network is proposed, comprising a reliable pixel aggregation module and a reliable prototype selection module.
Results: On three challenging benchmarks, the method performs favorably against state-of-the-art semi-supervised segmentation methods. Importantly, with extremely few training samples, DualRel is also on par with current state-of-the-art fully supervised methods.

Automatic mitochondria segmentation enjoys great popularity with the development of deep learning. However, existing methods rely heavily on the labor-intensive manual gathering by experienced domain experts. And naively applying semi-supervised segmentation methods in the natural image field to mitigate the labeling cost is undesirable. In this work, we analyze the gap between mitochondrial images and natural images and rethink how to achieve effective semi-supervised mitochondria segmentation, from the perspective of reliable prototype-level supervision. We propose a novel end-to-end dual-reliable (DualRel) network, including a reliable pixel aggregation module and a reliable prototype selection module. The proposed DualRel enjoys several merits. First, to learn the prototypes well without any explicit supervision, we carefully design the referential correlation to rectify the direct pairwise correlation. Second, the reliable prototype selection module is responsible for further evaluating the reliability of prototypes in constructing prototype-level consistency regularization. Extensive experimental results on three challenging benchmarks demonstrate that our method performs favorably against state-of-the-art semi-supervised segmentation methods. Importantly, with extremely few samples used for training, DualRel is also on par with current state-of-the-art fully supervised methods.

Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot Learning
Liu, Man and Li, Feng and Zhang, Chunjie and Wei, Yunchao and Bai, Huihui and Zhao, Yao



Research question: How to recognize unseen categories via knowledge transferred from the seen domain, and how to resolve the semantic ambiguity that arises when diverse visual appearances correspond to the same attribute.
Motivation: Prior works mainly localize regions corresponding to the sharing attributes; when various visual appearances correspond to one attribute, the sharing attributes introduce semantic ambiguity that hampers the exploration of accurate semantic-visual interactions.
Method: A dual semantic-visual transformer module (DSVTM) progressively models the correspondences between attribute prototypes and visual features, constituting a progressive semantic-visual mutual adaption (PSVMA) network for semantic disambiguation and improved knowledge transferability.
Results: Experiments show that PSVMA consistently yields superior performance against other state-of-the-art GZSL methods.

Generalized Zero-Shot Learning (GZSL) identifies unseen categories by knowledge transferred from the seen domain, relying on the intrinsic interactions between visual and semantic information. Prior works mainly localize regions corresponding to the sharing attributes. When various visual appearances correspond to the same attribute, the sharing attributes inevitably introduce semantic ambiguity, hampering the exploration of accurate semantic-visual interactions. In this paper, we deploy the dual semantic-visual transformer module (DSVTM) to progressively model the correspondences between attribute prototypes and visual features, constituting a progressive semantic-visual mutual adaption (PSVMA) network for semantic disambiguation and knowledge transferability improvement. Specifically, DSVTM devises an instance-motivated semantic encoder that learns instance-centric prototypes to adapt to different images, enabling the recast of the unmatched semantic-visual pair into the matched one. Then, a semantic-motivated instance decoder strengthens accurate cross-domain interactions between the matched pair for semantic-related instance adaption, encouraging the generation of unambiguous visual representations. Moreover, to mitigate the bias towards seen classes in GZSL, a debiasing loss is proposed to pursue response consistency between seen and unseen predictions. The PSVMA consistently yields superior performances against other state-of-the-art methods. Code will be available at: https://github.com/ManLiuCoder/PSVMA.

Unknown Sniffer for Object Detection: Don't Turn a Blind Eye to Unknown Objects
Liang, Wenteng and Xue, Feng and Liu, Yihao and Zhong, Guofeng and Ming, Anlong



Research question: How to improve the ability of open-world object and open-set detection to find never-seen-before objects and distinguish them from known ones.
Motivation: Existing open-world and open-set detection methods study knowledge transfer from known to unknown classes insufficiently, leaving them with scanty capability for detecting unknowns hidden in the background.
Method: The unknown sniffer (UnSniffer) is proposed to find both known and unknown objects. A generalized object confidence (GOC) score is first introduced, supervised only with known samples, to avoid suppressing unknowns in the background. A negative energy suppression loss further suppresses non-object background samples. Finally, a graph-based determination scheme replaces hand-designed non-maximum suppression (NMS) post-processing, coping with the lack of semantic information about unknowns during training.
Results: Experiments show the method is far better than existing state-of-the-art techniques.

The recently proposed open-world object and open-set detection have achieved a breakthrough in finding never-seen-before objects and distinguishing them from known ones. However, their studies on knowledge transfer from known classes to unknown ones are not deep enough, resulting in the scanty capability for detecting unknowns hidden in the background. In this paper, we propose the unknown sniffer (UnSniffer) to find both unknown and known objects. Firstly, the generalized object confidence (GOC) score is introduced, which only uses known samples for supervision and avoids improper suppression of unknowns in the background. Significantly, such confidence score learned from known objects can be generalized to unknown ones. Additionally, we propose a negative energy suppression loss to further suppress the non-object samples in the background. Next, the best box of each unknown is hard to obtain during inference due to lacking their semantic information in training. To solve this issue, we introduce a graph-based determination scheme to replace hand-designed non-maximum suppression (NMS) post-processing. Finally, we present the Unknown Object Detection Benchmark, to our knowledge the first public benchmark that encompasses precision evaluation for unknown detection. Experiments show that our method is far better than the existing state-of-the-art methods. Code is available at: https://github.com/Went-Liang/UnSniffer.

Where Is My Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization
Xu, Mengmeng and Li, Yanghao and Fu, Cheng-Yang and Ghanem, Bernard and Xiang, Tao and Pérez-Rúa, Juan-Manuel



Research question: This paper deals with localizing objects in image and video datasets from visual exemplars, focusing on the challenging problem of egocentric visual query localization.
Motivation: Current query-conditioned model designs and visual query datasets contain grave implicit biases.
Method: These issues are tackled directly by expanding limited annotations and dynamically dropping object proposals during training. In addition, a novel transformer-based module is proposed that considers the object-proposal set context while incorporating query information.
Results: Experiments show the proposed adaptations improve egocentric query detection, yielding a better visual query localization system in both 2D and 3D configurations. Frame-level detection performance improves from 26.28% to 31.26% AP, which correspondingly raises the VQ2D and VQ3D localization scores by significant margins. The improved context-aware query object detector ranked first and second in the VQ2D and VQ3D tasks, respectively. The proposed model is also relevant to the Few-Shot Detection (FSD) task, where it likewise achieves state-of-the-art results.

This paper deals with the problem of localizing objects in image and video datasets from visual exemplars. In particular, we focus on the challenging problem of egocentric visual query localization. We first identify grave implicit biases in current query-conditioned model design and visual query datasets. Then, we directly tackle such biases at both frame and object set levels. Concretely, our method solves these issues by expanding limited annotations and dynamically dropping object proposals during training. Additionally, we propose a novel transformer-based module that allows for object-proposal set context to be considered while incorporating query information. We name our module Conditioned Contextual Transformer or CocoFormer. Our experiments show that the proposed adaptations improve egocentric query detection, leading to a better visual query localization system in both 2D and 3D configurations. Thus, we are able to improve frame-level detection performance from 26.28% to 31.26% in AP, which correspondingly improves the VQ2D and VQ3D localization scores by significant margins. Our improved context-aware query object detector ranked first and second in the VQ2D and VQ3D tasks in the 2nd Ego4D challenge. In addition, we showcase the relevance of our proposed model in the Few-Shot Detection (FSD) task, where we also achieve SOTA results.

Boosting Low-Data Instance Segmentation by Unsupervised Pre-Training With Saliency Prompt
Li, Hao and Zhang, Dingwen and Liu, Nian and Cheng, Lechao and Dai, Yalun and Zhang, Chao and Wang, Xinggang and Han, Junwei



Research question: How to boost query-based end-to-end instance segmentation (QEIS) models in low-data regimes.
Motivation: With little training data, existing QEIS methods degrade because the crucial queries/kernels struggle to learn localization and shape priors.
Method: A new pre-training method is proposed that boosts QEIS models by giving Saliency Prompts to queries/kernels. It has three parts: 1) Saliency Masks Proposal generates pseudo masks from unlabeled images based on a saliency mechanism; 2) Prompt-Kernel Matching converts pseudo masks into prompts and injects the corresponding localization and shape priors into the best-matched kernels; 3) Kernel Supervision supplies supervision at the kernel level for robust learning.
Results: Experiments show the method significantly boosts several QEIS models in low-data regimes, letting them reach convergence speed and performance comparable to CNN-based models.

Recently, inspired by DETR variants, query-based end-to-end instance segmentation (QEIS) methods have outperformed CNN-based models on large-scale datasets. Yet they would lose efficacy when only a small amount of training data is available since it's hard for the crucial queries/kernels to learn localization and shape priors. To this end, this work offers a novel unsupervised pre-training solution for low-data regimes. Inspired by the recent success of the Prompting technique, we introduce a new pre-training method that boosts QEIS models by giving Saliency Prompt for queries/kernels. Our method contains three parts: 1) Saliency Masks Proposal is responsible for generating pseudo masks from unlabeled images based on the saliency mechanism. 2) Prompt-Kernel Matching transfers pseudo masks into prompts and injects the corresponding localization and shape priors to the best-matched kernels. 3) Kernel Supervision is applied to supply supervision at the kernel level for robust learning. From a practical perspective, our pre-training method helps QEIS models achieve a similar convergence speed and comparable performance with CNN-based models in low-data regimes. Experimental results show that our method significantly boosts several QEIS models on three datasets.

Exploring Intra-Class Variation Factors With Learnable Cluster Prompts for Semi-Supervised Image Synthesis
Zhang, Yunfei and Huo, Xiaoyang and Chen, Tianyi and Wu, Si and Wong, Hau-San



Research question: Existing semi-supervised conditional image synthesis methods infer and inject class labels into a conditional Generative Adversarial Network (GAN), but this form of supervision may be inadequate to model classes with diverse visual appearances.
Motivation: To address this, a Learnable Cluster Prompt-based GAN (LCP-GAN) is proposed to capture class-wise characteristics and intra-class variation factors with a broader source of supervision.
Method: Soft partitioning is first performed on each class, and the possibility of associating intra-class clusters with learnable visual concepts in the feature space of a pre-trained language-vision model (e.g., CLIP) is explored. For conditional image generation, a cluster-conditional generator injects combinations of intra-class cluster label embeddings, and a real-fake classification head on top of CLIP further distinguishes real instances from synthesized ones.
Results: Experiments show LCP-GAN not only possesses superior generation capability but also matches the fully supervised versions of the base models, BigGAN and StyleGAN2-ADA, on multiple standard benchmarks.

Semi-supervised class-conditional image synthesis is typically performed by inferring and injecting class labels into a conditional Generative Adversarial Network (GAN). The supervision in the form of class identity may be inadequate to model classes with diverse visual appearances. In this paper, we propose a Learnable Cluster Prompt-based GAN (LCP-GAN) to capture class-wise characteristics and intra-class variation factors with a broader source of supervision. To exploit partially labeled data, we perform soft partitioning on each class, and explore the possibility of associating intra-class clusters with learnable visual concepts in the feature space of a pre-trained language-vision model, e.g., CLIP. For class-conditional image generation, we design a cluster-conditional generator by injecting a combination of intra-class cluster label embeddings, and further incorporate a real-fake classification head on top of CLIP to distinguish real instances from the synthesized ones, conditioned on the learnable cluster prompts. This significantly strengthens the generator with more semantic language supervision. LCP-GAN not only possesses superior generation capability but also matches the performance of the fully supervised version of the base models: BigGAN and StyleGAN2-ADA, on multiple standard benchmarks.

Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding
Meng, Lingchen and Dai, Xiyang and Chen, Yinpeng and Zhang, Pengchuan and Chen, Dongdong and Liu, Mengchen and Wang, Jianfeng and Wu, Zuxuan and Yuan, Lu and Jiang, Yu-Gang



Research question: This paper addresses the taxonomy differences and domain gaps that arise when combining multiple datasets for object detection.
Motivation: Because of these taxonomy differences and domain gaps among detection datasets, combining multiple datasets has not yielded significant gains in object detection.
Method: A new design named Detection Hub is proposed that is dataset-aware and category-aligned. It learns a dataset embedding to adapt object queries as well as the convolutional kernels of detection heads, mitigating dataset inconsistency. Meanwhile, one-hot category representations are replaced with word embeddings, and the semantic coherence of language embeddings is leveraged to align categories across datasets in a unified space.
Results: Experiments demonstrate that joint training on multiple datasets brings significant performance gains over training on each dataset alone. Detection Hub achieves state-of-the-art performance on the UODB benchmark with its wide variety of datasets.

Combining multiple datasets enables performance boost on many computer vision tasks. But a similar trend has not been witnessed in object detection when combining multiple datasets due to two inconsistencies among detection datasets: taxonomy difference and domain gap. In this paper, we address these challenges by a new design (named Detection Hub) that is dataset-aware and category-aligned. It not only mitigates the dataset inconsistency but also provides coherent guidance for the detector to learn across multiple datasets. In particular, the dataset-aware design is achieved by learning a dataset embedding that is used to adapt object queries as well as convolutional kernels in detection heads. The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding and leveraging the semantic coherence of language embedding. Detection Hub fulfills the benefits of large data on object detection. Experiments demonstrate that joint training on multiple datasets achieves significant performance gains over training on each dataset alone. Detection Hub further achieves SoTA performance on UODB benchmark with wide variety of datasets.

Referring Multi-Object Tracking
Wu, Dongming and Han, Wencheng and Wang, Tiancai and Dong, Xingping and Zhang, Xiangyu and Shen, Jianbing



Research question: This paper proposes a new and general referring understanding task, termed referring multi-object tracking (RMOT).
Motivation: Existing referring understanding tasks typically involve detecting a single text-referred object, whereas RMOT employs a language expression as a semantic cue to guide the prediction of multi-object tracking.
Method: A benchmark with scalable expressions, Refer-KITTI, is constructed on top of KITTI, and a transformer-based architecture, TransRMOT, is developed to tackle the new task in an online manner.
Results: Experiments show TransRMOT achieves impressive detection performance on Refer-KITTI and outperforms other counterparts.

Existing referring understanding tasks tend to involve the detection of a single text-referred object. In this paper, we propose a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking. To the best of our knowledge, it is the first work to achieve an arbitrary number of referent object predictions in videos. To push forward RMOT, we construct one benchmark with scalable expressions based on KITTI, named Refer-KITTI. Specifically, it provides 18 videos with 818 expressions, and each expression in a video is annotated with an average of 10.7 objects. Further, we develop a transformer-based architecture TransRMOT to tackle the new task in an online manner, which achieves impressive detection performance and outperforms other counterparts. The Refer-KITTI dataset and the code are released at https://referringmot.github.io.

Weakly Supervised Temporal Sentence Grounding With Uncertainty-Guided Self-Training
Huang, Yifei and Yang, Lijin and Sato, Yoichi



Research question: Weakly supervised temporal sentence grounding aims to find the temporal moments in a video corresponding to a language description, given video-language correspondence only at the video level.
Motivation: Due to the complex temporal structure of videos, proposals distinct from the negative samples may correspond to several video segments that are not necessarily the correct ground truth.
Method: An uncertainty-guided self-training technique provides extra self-supervision signals to guide the weakly supervised learning, based on teacher-student mutual learning with weak-strong augmentation. Two techniques are designed: (1) a Bayesian teacher network whose uncertainty serves as a weight to suppress noisy teacher supervisory signals; (2) mutual learning between the two networks via the cycle consistency brought by temporal data augmentation.
Results: Experiments demonstrate the method's superiority on the Charades-STA and ActivityNet Captions datasets, and the self-training method can also be applied to improve multiple backbone methods.

The task of weakly supervised temporal sentence grounding aims at finding the corresponding temporal moments of a language description in the video, given video-language correspondence only at video-level. Most existing works select mismatched video-language pairs as negative samples and train the model to generate better positive proposals that are distinct from the negative ones. However, due to the complex temporal structure of videos, proposals distinct from the negative ones may correspond to several video segments but not necessarily the correct ground truth. To alleviate this problem, we propose an uncertainty-guided self-training technique to provide extra self-supervision signals to guide the weakly-supervised learning. The self-training process is based on teacher-student mutual learning with weak-strong augmentation, which enables the teacher network to generate relatively more reliable outputs compared to the student network, so that the student network can learn from the teacher's output. Since directly applying existing self-training methods in this task easily causes error accumulation, we specifically design two techniques in our self-training method: (1) we construct a Bayesian teacher network, leveraging its uncertainty as a weight to suppress the noisy teacher supervisory signals; (2) we leverage the cycle consistency brought by temporal data augmentation to perform mutual learning between the two networks. Experiments demonstrate our method's superiority on Charades-STA and ActivityNet Captions datasets. We also show in the experiment that our self-training method can be applied to improve the performance of multiple backbone methods.
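The Bayesian-teacher weighting in (1) can be illustrated by scaling each supervisory target by a confidence derived from the variance of stochastic teacher forward passes (e.g. MC-dropout). The exponential weighting and function names below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def uncertainty_weighted_loss(student_pred, teacher_samples):
    """Down-weight noisy teacher targets by their predictive uncertainty.

    teacher_samples: (T, N) array of T stochastic forward passes of a
    Bayesian teacher; their per-element variance serves as the uncertainty.
    """
    target = teacher_samples.mean(axis=0)        # consensus teacher target
    uncertainty = teacher_samples.var(axis=0)    # disagreement across passes
    weight = np.exp(-uncertainty)                # confident targets count more
    return float(np.mean(weight * (student_pred - target) ** 2))
```

When the teacher's passes agree, the weight approaches 1 and the loss reduces to a plain mean-squared error; disagreement shrinks the supervision signal.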

AutoRecon: Automated 3D Object Discovery and Reconstruction
Wang, Yuang and He, Xingyi and Peng, Sida and Lin, Haotong and Bao, Hujun and Zhou, Xiaowei



Research question: How to automatically discover and reconstruct objects from multi-view images?
Motivation: Although the area of 3D reconstruction has witnessed profound developments, removing the background to obtain a clean object model still requires manual labor such as bounding box labeling, mask annotation, and mesh manipulation.
Method: A novel framework named AutoRecon robustly locates and segments foreground objects from SfM point clouds by leveraging self-supervised 2D vision transformer features, then reconstructs decomposed neural scene representations with dense supervision provided by the decomposed point clouds, yielding accurate object reconstruction and segmentation.
Results: Experiments on the DTU, BlendedMVS, and CO3D-V2 datasets demonstrate the effectiveness and robustness of AutoRecon.

A fully automated object reconstruction pipeline is crucial for digital content creation. While the area of 3D reconstruction has witnessed profound developments, the removal of background to obtain a clean object model still relies on different forms of manual labor, such as bounding box labeling, mask annotations, and mesh manipulations. In this paper, we propose a novel framework named AutoRecon for the automated discovery and reconstruction of an object from multi-view images. We demonstrate that foreground objects can be robustly located and segmented from SfM point clouds by leveraging self-supervised 2D vision transformer features. Then, we reconstruct decomposed neural scene representations with dense supervision provided by the decomposed point clouds, resulting in accurate object reconstruction and segmentation. Experiments on the DTU, BlendedMVS and CO3D-V2 datasets demonstrate the effectiveness and robustness of AutoRecon. The code and supplementary material are available on the project page: https://zju3dv.github.io/autorecon/.

Two-Shot Video Object Segmentation
Yan, Kun and Li, Xiao and Wei, Fangyun and Wang, Jinglu and Zhang, Chenbin and Wang, Ping and Lu, Yan



Research question: This paper tackles the need for dense annotation in video object segmentation (VOS), since acquiring pixel-level annotations is expensive and time-consuming.
Motivation: Existing VOS models are trained on densely annotated videos, but pixel-level annotation is costly and slow to obtain.
Method: A new training paradigm, two-shot VOS, is proposed that requires only two labeled frames per training video. Pseudo labels are generated for unlabeled frames during training, and the model is optimized on the combination of labeled and pseudo-labeled data.
Results: Experiments show that a satisfactory VOS model can be trained on sparsely annotated videos, with performance comparable to models trained on fully labeled sets.

Previous works on video object segmentation (VOS) are trained on densely annotated videos. Nevertheless, acquiring annotations in pixel level is expensive and time-consuming. In this work, we demonstrate the feasibility of training a satisfactory VOS model on sparsely annotated videos--we merely require two labeled frames per training video while the performance is sustained. We term this novel training paradigm as two-shot video object segmentation, or two-shot VOS for short. The underlying idea is to generate pseudo labels for unlabeled frames during training and to optimize the model on the combination of labeled and pseudo-labeled data. Our approach is extremely simple and can be applied to a majority of existing frameworks. We first pre-train a VOS model on sparsely annotated videos in a semi-supervised manner, with the first frame always being a labeled one. Then, we adopt the pre-trained VOS model to generate pseudo labels for all unlabeled frames, which are subsequently stored in a pseudo-label bank. Finally, we retrain a VOS model on both labeled and pseudo-labeled data without any restrictions on the first frame. For the first time, we present a general way to train VOS models on two-shot VOS datasets. By using 7.3% and 2.9% labeled data of YouTube-VOS and DAVIS benchmarks, our approach achieves comparable results in contrast to the counterparts trained on fully labeled set. Code and models are available at https://github.com/yk-pku/Two-shot-Video-Object-Segmentation.
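The three phases above (semi-supervised pre-training, filling a pseudo-label bank, unrestricted retraining) can be sketched as a simple data flow. `pretrain`, `retrain`, and the `model.predict` interface are hypothetical stand-ins for the actual training routines:

```python
def two_shot_training(videos, model, pretrain, retrain):
    """Sketch of the two-shot VOS pipeline: each video carries exactly two
    labeled frames; unlabeled frames receive pseudo labels from the
    pre-trained model before a final unrestricted retraining pass."""
    model = pretrain(model, videos)                   # phase 1: first frame labeled
    bank = {}                                         # phase 2: pseudo-label bank
    for v in videos:
        for f in v["unlabeled_frames"]:
            bank[(v["id"], f)] = model.predict(v, f)  # store pseudo mask
    return retrain(model, videos, bank)               # phase 3: labeled + pseudo
```

The bank decouples pseudo-label generation from retraining, so the final pass no longer needs the first frame to be a labeled one.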

Enhanced Multimodal Representation Learning With Cross-Modal KD
Chen, Mengxi and Xing, Linyu and Wang, Yu and Zhang, Ya



Research question: This paper explores leveraging auxiliary modalities that are available only at training time to enhance multimodal representation learning via cross-modal knowledge distillation (KD).
Motivation: The widely adopted mutual-information-maximization objective admits a weak-teacher shortcut: maximum mutual information is achieved simply by making the teacher model as weak as the student. To prevent this weak solution, an additional objective term is introduced, namely the mutual information between the teacher and the auxiliary-modality model; to further narrow the information gap between student and teacher, the conditional entropy of the teacher given the student is also minimized.
Method: Novel training schemes based on contrastive learning and adversarial learning are designed to optimize the mutual information and the conditional entropy, respectively.
Results: Experiments on three popular multimodal benchmark datasets show the method outperforms a range of state-of-the-art approaches for video recognition, video retrieval, and emotion classification.

This paper explores the tasks of leveraging auxiliary modalities which are only available at training to enhance multimodal representation learning through cross-modal Knowledge Distillation (KD). The widely adopted mutual information maximization-based objective leads to a short-cut solution of the weak teacher, i.e., achieving the maximum mutual information by simply making the teacher model as weak as the student model. To prevent such a weak solution, we introduce an additional objective term, i.e., the mutual information between the teacher and the auxiliary modality model. Besides, to narrow down the information gap between the student and teacher, we further propose to minimize the conditional entropy of the teacher given the student. Novel training schemes based on contrastive learning and adversarial learning are designed to optimize the mutual information and the conditional entropy, respectively. Experimental results on three popular multimodal benchmark datasets have shown that the proposed method outperforms a range of state-of-the-art approaches for video recognition, video retrieval and emotion classification.
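Contrastive optimization of a mutual-information term typically uses an InfoNCE-style lower bound, where matched feature pairs (e.g. teacher and auxiliary-modality embeddings of the same sample) sit on the diagonal of a similarity matrix. A generic sketch of that bound, not the paper's exact objective:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE lower bound on the mutual information between two feature
    sets; row i of `anchors` is matched with row i of `positives`."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(np.mean(np.diag(log_prob)))     # matched pairs on diagonal
```

Maximizing this bound pulls matched pairs together relative to mismatched ones, which is how a contrastive scheme can tighten the teacher/auxiliary mutual-information term described above.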

Pseudo-Label Guided Contrastive Learning for Semi-Supervised Medical Image Segmentation
Basak, Hritam and Yin, Zhaozheng



Research question: How to learn discriminative feature representations from limited annotations for medical image segmentation.
Motivation: Although semi-supervised learning has achieved notable success in natural image segmentation, learning discriminative representations from limited annotations remains an open problem for medical images.
Method: A novel semi-supervised patch-based contrastive learning framework is proposed for medical image segmentation, combining the strengths of semi-supervised learning and contrastive learning: pseudo labels generated by semi-supervised learning provide additional guidance for contrastive learning, while the discriminative class information learned by contrastive learning enables accurate multi-class segmentation.
Results: Experimental analysis on three publicly available multi-modality datasets shows the method surpasses existing state-of-the-art approaches.

Although recent works in semi-supervised learning (SemiSL) have accomplished significant success in natural image segmentation, the task of learning discriminative representations from limited annotations has been an open problem in medical images. Contrastive Learning (CL) frameworks use the notion of similarity measure which is useful for classification problems, however, they fail to transfer these quality representations for accurate pixel-level segmentation. To this end, we propose a novel semi-supervised patch-based CL framework for medical image segmentation without using any explicit pretext task. We harness the power of both CL and SemiSL, where the pseudo-labels generated from SemiSL aid CL by providing additional guidance, whereas discriminative class information learned in CL leads to accurate multi-class segmentation. Additionally, we formulate a novel loss that synergistically encourages inter-class separability and intra-class compactness among the learned representations. A new inter-patch semantic disparity mapping using average patch entropy is employed for a guided sampling of positives and negatives in the proposed CL framework. Experimental analysis on three publicly available datasets of multiple modalities reveals the superiority of our proposed method as compared to the state-of-the-art methods. Code is available at: https://github.com/hritam-98/PatchCL-MedSeg.
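The inter-patch semantic disparity mapping relies on average patch entropy; one plausible reading is to average the pixel-wise prediction entropy within each patch and use the resulting map to rank patches for positive/negative sampling. A sketch under that assumption (the function name and patching scheme are illustrative):

```python
import numpy as np

def patch_entropy(prob_map, patch=4):
    """Average per-patch entropy of a (C, H, W) softmax probability map;
    H and W are assumed divisible by `patch`. Low-entropy patches are
    confidently classified, high-entropy patches are ambiguous."""
    c, h, w = prob_map.shape
    ent = -(prob_map * np.log(prob_map + 1e-8)).sum(axis=0)   # (H, W)
    ent = ent.reshape(h // patch, patch, w // patch, patch)
    return ent.mean(axis=(1, 3))                              # per-patch mean
```

Such a map lets the sampler prefer confident patches as anchors/positives and treat ambiguous ones more cautiously.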

CrOC: Cross-View Online Clustering for Dense Visual Representation Learning
Stegmüller, Thomas and Lebailly, Tim and Bozorgtabar, Behzad and Tuytelaars, Tinne and Thiran, Jean-Philippe



Research question: How to learn dense visual representations without labels from scene-centric data.
Motivation: This is a challenging problem, all the more so in the absence of hand-crafted priors.
Method: A cross-view consistency objective with an online clustering mechanism (CrOC) is proposed to discover and segment the semantics of the views. The method requires no cumbersome pre-processing step and generalizes better.
Results: The method performs excellently on linear and unsupervised segmentation transfer tasks on various datasets, as well as on video object segmentation.

Learning dense visual representations without labels is an arduous task and more so from scene-centric data. We propose to tackle this challenging problem by proposing a Cross-view consistency objective with an Online Clustering mechanism (CrOC) to discover and segment the semantics of the views. In the absence of hand-crafted priors, the resulting method is more generalizable and does not require a cumbersome pre-processing step. More importantly, the clustering algorithm conjointly operates on the features of both views, thereby elegantly bypassing the issue of content not represented in both views and the ambiguous matching of objects from one crop to the other. We demonstrate excellent performance on linear and unsupervised segmentation transfer tasks on various datasets and similarly for video object segmentation. Our code and pre-trained models are publicly available at https://github.com/stegmuel/CrOC.

Consistent-Teacher: Towards Reducing Inconsistent Pseudo-Targets in Semi-Supervised Object Detection
Wang, Xinjiang and Yang, Xingyi and Zhang, Shilong and Li, Yijiang and Feng, Litong and Fang, Shijie and Lyu, Chengqi and Chen, Kai and Zhang, Wayne



Research question: This study focuses on the inconsistency of pseudo targets in semi-supervised object detection.
Motivation: We observe that oscillating pseudo targets undermine the training of an accurate detector, injecting noise into training and causing severe overfitting.
Method: A systematic solution, Consistent-Teacher, is proposed to reduce this inconsistency. First, adaptive anchor assignment (ASA) replaces the static IoU-based strategy, making the student network resistant to noisy pseudo bounding boxes. Then a 3D feature alignment module (FAM-3D) calibrates the subtask predictions, letting each classification feature adaptively query the optimal feature vector for the regression task at arbitrary scales and locations. Finally, a Gaussian Mixture Model (GMM) dynamically revises the score threshold of pseudo bounding boxes, stabilizing the number of ground truths at an early stage and remedying unreliable supervision signals during training.
Results: Consistent-Teacher provides strong results across a wide range of semi-supervised detection evaluations. With only 10% of annotated MS-COCO data and a ResNet-50 backbone, it achieves 40.0 mAP, about 3 mAP higher than previous pseudo-labeling baselines. When trained on fully annotated MS-COCO with additional unlabeled data, performance further rises to 47.7 mAP. Code is available at https://github.com/Adamdad/ConsistentTeacher.

In this study, we dive deep into the inconsistency of pseudo targets in semi-supervised object detection (SSOD). Our core observation is that the oscillating pseudo-targets undermine the training of an accurate detector. It injects noise into the student's training, leading to severe overfitting problems. Therefore, we propose a systematic solution, termed Consistent-Teacher, to reduce the inconsistency. First, adaptive anchor assignment (ASA) substitutes the static IoU-based strategy, which enables the student network to be resistant to noisy pseudo-bounding boxes. Then we calibrate the subtask predictions by designing a 3D feature alignment module (FAM-3D). It allows each classification feature to adaptively query the optimal feature vector for the regression task at arbitrary scales and locations. Lastly, a Gaussian Mixture Model (GMM) dynamically revises the score threshold of pseudo-bboxes, which stabilizes the number of ground truths at an early stage and remedies the unreliable supervision signal during training. Consistent-Teacher provides strong results on a large range of SSOD evaluations. It achieves 40.0 mAP with ResNet-50 backbone given only 10% of annotated MS-COCO data, which surpasses previous baselines using pseudo labels by around 3 mAP. When trained on fully annotated MS-COCO with additional unlabeled data, the performance further increases to 47.7 mAP. Our code is available at https://github.com/Adamdad/ConsistentTeacher.
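The dynamic GMM-based score threshold described above can be sketched with a minimal two-component 1-D EM fit: the low-score component absorbs background-like pseudo boxes and the high-score component's mean serves as the threshold. This is a simplified stand-in, not the authors' implementation:

```python
import numpy as np

def gmm_threshold(scores, iters=50):
    """Fit a two-component 1-D Gaussian mixture to pseudo-box scores with EM
    and return the mean of the higher component as a dynamic threshold."""
    scores = np.asarray(scores, dtype=float)
    mu = np.array([scores.min(), scores.max()])       # spread-out init
    var = np.array([scores.var() + 1e-6] * 2)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each score
        lik = pi * np.exp(-(scores[:, None] - mu) ** 2 / (2 * var)) \
              / np.sqrt(2 * np.pi * var)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights, means, variances
        n = resp.sum(axis=0)
        mu = (resp * scores[:, None]).sum(axis=0) / n
        var = (resp * (scores[:, None] - mu) ** 2).sum(axis=0) / n + 1e-6
        pi = n / len(scores)
    return float(mu.max())   # threshold near the confident (high-score) mode
```

Because the threshold tracks the high-score mode rather than a fixed constant, the number of accepted pseudo boxes stays stable as the detector's score distribution shifts during training.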

RefTeacher: A Strong Baseline for Semi-Supervised Referring Expression Comprehension
Sun, Jiamu and Luo, Gen and Zhou, Yiyi and Sun, Xiaoshuai and Jiang, Guannan and Wang, Zhiyu and Ji, Rongrong



Research question: This paper addresses the need for large numbers of instance-level annotations in referring expression comprehension (REC), which are laborious and expensive to obtain.
Motivation: Inspired by recent progress in computer vision, the authors adopt a teacher-student learning paradigm for semi-supervised learning to exploit massive unlabeled data.
Method: A strong baseline called RefTeacher is proposed, in which a teacher network predicts pseudo labels to optimize the student network, making full use of a small fraction of labeled data. To address sparse supervision signals and worse pseudo-label noise, two novel designs are introduced: attention-based imitation learning and adaptive pseudo-label weighting.
Results: Extensive experiments on three REC benchmark datasets show that RefTeacher obtains clear gains over the fully supervised methods. More importantly, with only 10% labeled data, the approach reaches near 100% of fully supervised performance.

Referring expression comprehension (REC) often requires a large number of instance-level annotations for fully supervised learning, which are laborious and expensive. In this paper, we present the first attempt of semi-supervised learning for REC and propose a strong baseline method called RefTeacher. Inspired by the recent progress in computer vision, RefTeacher adopts a teacher-student learning paradigm, where the teacher REC network predicts pseudo-labels for optimizing the student one. This paradigm allows REC models to exploit massive unlabeled data based on a small fraction of labeled. In particular, we also identify two key challenges in semi-supervised REC, namely, sparse supervision signals and worse pseudo-label noise. To address these issues, we equip RefTeacher with two novel designs called Attention-based Imitation Learning (AIL) and Adaptive Pseudo-label Weighting (APW). AIL can help the student network imitate the recognition behaviors of the teacher, thereby obtaining sufficient supervision signals. APW can help the model adaptively adjust the contributions of pseudo-labels with varying qualities, thus avoiding confirmation bias. To validate RefTeacher, we conduct extensive experiments on three REC benchmark datasets. Experimental results show that RefTeacher obtains obvious gains over the fully supervised methods. More importantly, using only 10% labeled data, our approach allows the model to achieve near 100% fully supervised performance, e.g., only -2.78% on RefCOCO.

Continuous Pseudo-Label Rectified Domain Adaptive Semantic Segmentation With Implicit Neural Representations
Gong, Rui and Wang, Qin and Danelljan, Martin and Dai, Dengxin and Van Gool, Luc



Research question: How to leverage a labeled source domain to improve semantic segmentation performance on an unlabeled target domain.
Motivation: Existing unsupervised domain adaptation methods have made impressive progress using pseudo labels on unlabeled target-domain images, but the low-quality pseudo labels arising from the domain discrepancy inevitably hinder adaptation.
Method: The rectification values of the predicted pseudo labels are estimated with implicit neural representations. The rectification value is viewed as a signal defined over the continuous spatial domain: given an image coordinate and nearby deep features as inputs, the rectification value at that coordinate is predicted as output.
Results: The approach performs excellently on different unsupervised domain adaptation benchmarks, including synthetic-to-real and day-to-night, achieving superior results compared to state-of-the-art methods.

Unsupervised domain adaptation (UDA) for semantic segmentation aims at improving the model performance on the unlabeled target domain by leveraging a labeled source domain. Existing approaches have achieved impressive progress by utilizing pseudo-labels on the unlabeled target-domain images. Yet the low-quality pseudo-labels, arising from the domain discrepancy, inevitably hinder the adaptation. This calls for effective and accurate approaches to estimating the reliability of the pseudo-labels, in order to rectify them. In this paper, we propose to estimate the rectification values of the predicted pseudo-labels with implicit neural representations. We view the rectification value as a signal defined over the continuous spatial domain. Taking an image coordinate and the nearby deep features as inputs, the rectification value at a given coordinate is predicted as an output. This allows us to achieve high-resolution and detailed rectification values estimation, important for accurate pseudo-label generation at mask boundaries in particular. The rectified pseudo-labels are then leveraged in our rectification-aware mixture model (RMM) to be learned end-to-end and help the adaptation. We demonstrate the effectiveness of our approach on different UDA benchmarks, including synthetic-to-real and day-to-night. Our approach achieves superior results compared to state-of-the-art. The implementation is available at https://github.com/ETHRuiGong/IR2F.
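The implicit representation above amounts to a small network queried at continuous coordinates: it maps an image coordinate plus nearby deep features to the rectification value at that location, so values can be predicted at arbitrary (including sub-pixel) resolution. A toy MLP sketch with placeholder weights; all names and shapes here are illustrative:

```python
import numpy as np

def rectification_mlp(coord, feat, w1, b1, w2, b2):
    """Map a 2-D image coordinate plus a nearby deep-feature vector to a
    scalar rectification value via one ReLU hidden layer."""
    x = np.concatenate([coord, feat])     # continuous query: coord + features
    h = np.maximum(w1 @ x + b1, 0.0)      # ReLU hidden layer
    return float(w2 @ h + b2)             # scalar rectification value
```

Because the input is a continuous coordinate rather than a pixel index, the same network yields high-resolution, detailed rectification estimates, which matters most at mask boundaries.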

UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer via Hierarchical Mask Calibration
Zhang, Jingyi and Huang, Jiaxing and Zhang, Xiaoqin and Lu, Shijian



Research question: Existing domain adaptive panoptic segmentation methods require two separate networks for instance segmentation and semantic segmentation, leading to excessive parameters and complicated, computationally intensive training and inference.
Motivation: Mitigate the data annotation challenge by leveraging off-the-shelf annotated data in one or multiple related source domains.
Method: Design UniDAformer, a unified domain adaptive panoptic segmentation transformer that achieves domain adaptive instance segmentation and semantic segmentation simultaneously within a single network. UniDAformer introduces Hierarchical Mask Calibration (HMC), which rectifies inaccurate predictions at the region, superpixel, and pixel levels via online self-training on the fly.
Results: Experiments show that UniDAformer achieves superior domain adaptive panoptic segmentation over the state of the art on multiple public benchmarks.

Domain adaptive panoptic segmentation aims to mitigate data annotation challenge by leveraging off-the-shelf annotated data in one or multiple related source domains. However, existing studies employ two separate networks for instance segmentation and semantic segmentation which lead to excessive network parameters as well as complicated and computationally intensive training and inference processes. We design UniDAformer, a unified domain adaptive panoptic segmentation transformer that is simple but can achieve domain adaptive instance segmentation and semantic segmentation simultaneously within a single network. UniDAformer introduces Hierarchical Mask Calibration (HMC) that rectifies inaccurate predictions at the level of regions, superpixels and pixels via online self-training on the fly. It has three unique features: 1) it enables unified domain adaptive panoptic adaptation; 2) it mitigates false predictions and improves domain adaptive panoptic segmentation effectively; 3) it is end-to-end trainable with a much simpler training and inference pipeline. Extensive experiments over multiple public benchmarks show that UniDAformer achieves superior domain adaptive panoptic segmentation as compared with the state-of-the-art.

JacobiNeRF: NeRF Shaping With Mutual Information Gradients
Xu, Xiaomeng and Yang, Yanchao and Mo, Kaichun and Pan, Boxiao and Yi, Li and Guibas, Leonidas



Research question: How to train a neural radiance field (NeRF) so that it encodes not only scene appearance but also semantic correlations between scene points, regions, or entities.
Motivation: The traditional first-order photometric reconstruction objective cannot capture the mutual co-variation patterns between highly correlated entities. We therefore propose a new approach that optimizes the learning dynamics by maximizing the mutual information between entities under random scene perturbations.
Method: Our method explicitly regularizes the learning dynamics to align the Jacobians of highly correlated entities. By attending to this second-order information, we can shape a NeRF that expresses semantically meaningful synergies when the network weights are changed along the gradient of a single entity, region, or even a point.
Results: Experiments show that, compared with NeRFs without mutual-information shaping, JacobiNeRF propagates annotations between 2D pixels and 3D points more efficiently, especially in extremely sparse label regimes, thus reducing the annotation burden. The same machinery can also be used for entity selection or scene modification.

We propose a method that trains a neural radiance field (NeRF) to encode not only the appearance of the scene but also semantic correlations between scene points, regions, or entities -- aiming to capture their mutual co-variation patterns. In contrast to the traditional first-order photometric reconstruction objective, our method explicitly regularizes the learning dynamics to align the Jacobians of highly-correlated entities, which proves to maximize the mutual information between them under random scene perturbations. By paying attention to this second-order information, we can shape a NeRF to express semantically meaningful synergies when the network weights are changed by a delta along the gradient of a single entity, region, or even a point. To demonstrate the merit of this mutual information modeling, we leverage the coordinated behavior of scene entities that emerges from our shaping to perform label propagation for semantic and instance segmentation. Our experiments show that a JacobiNeRF is more efficient in propagating annotations among 2D pixels and 3D points compared to NeRFs without mutual information shaping, especially in extremely sparse label regimes -- thus reducing annotation burden. The same machinery can further be used for entity selection or scene modifications. Our code is available at https://github.com/xxm19/jacobinerf.

Interactive Segmentation of Radiance Fields
Goel, Rahul and Sirikonda, Dhawal and Saini, Saurabh and Narayanan, P.J.



Research question: How to effectively segment objects in Radiance Fields (RFs) to enable scene understanding and manipulation in mixed-reality personal spaces.
Motivation: Existing segmentation methods cannot handle complex objects with diverse appearance; a method is needed that can precisely segment objects with fine structure and appearance.
Method: ISRF identifies high-confidence seed regions via nearest-neighbor feature matching on distilled semantic features, then performs a bilateral search in a joint spatio-semantic space to recover an accurate segmentation.
Results: ISRF achieves state-of-the-art results in segmenting objects from RFs and compositing them into another scene, changing their appearance, etc., and provides an interactive segmentation tool that others can use.

Radiance Fields (RF) are popular to represent casually-captured scenes for new view synthesis and several applications beyond it. Mixed reality on personal spaces needs understanding and manipulating scenes represented as RFs, with semantic segmentation of objects as an important step. Prior segmentation efforts show promise but don't scale to complex objects with diverse appearance. We present the ISRF method to interactively segment objects with fine structure and appearance. Nearest neighbor feature matching using distilled semantic features identifies high-confidence seed regions. Bilateral search in a joint spatio-semantic space grows the region to recover accurate segmentation. We show state-of-the-art results of segmenting objects from RFs and compositing them to another scene, changing appearance, etc., and an interactive segmentation tool that others can use.
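The seed-identification step of ISRF, nearest-neighbor matching on distilled semantic features, can be illustrated with plain cosine similarity. The feature map, the user-click feature, and the threshold `tau` below are toy stand-ins, not the paper's actual distilled features or tuning.

```python
import numpy as np

def seed_regions(feat_map, click_feat, tau=0.8):
    """Mark high-confidence seed locations by cosine similarity to a
    user-clicked feature.

    feat_map  : (H, W, d) per-location semantic features
    click_feat: (d,) feature at the user's click
    tau       : assumed similarity threshold
    Returns a boolean (H, W) seed mask.
    """
    f = feat_map / np.linalg.norm(feat_map, axis=-1, keepdims=True)
    c = click_feat / np.linalg.norm(click_feat)
    sim = f @ c                          # cosine similarity per location
    return sim >= tau

# Toy feature map: every location matches the click except one outlier.
H, W, d = 4, 4, 3
feat_map = np.zeros((H, W, d))
feat_map[..., 0] = 1.0
feat_map[0, 0] = [0.0, 1.0, 0.0]         # one dissimilar location
mask = seed_regions(feat_map, np.array([1.0, 0.0, 0.0]))
```

In the full method these seeds are then grown by a bilateral search in the joint spatio-semantic space, which this sketch does not cover.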

topic-10

Topic words :  object,  dataset,  human,  detection,  objects,  lidar,  large,  scene

Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving
Agro, Ben and Sykora, Quinlan and Casas, Sergio and Urtasun, Raquel



Research question: Existing perception-and-prediction approaches for self-driving either perform object detection followed by trajectory forecasting, or predict dense occupancy and flow grids for the whole scene; both approaches have drawbacks.
Motivation: The former sacrifices object recall because the number of detections must be kept low for efficiency, posing a safety risk; the latter suffers from the limited receptive field inherent to fully convolutional networks, and the high dimensionality of the output grid makes it computationally expensive.
Method: Propose a unified approach to perception and future prediction that implicitly represents occupancy and flow over time with a single neural network. The method avoids unnecessary computation because it can be queried directly by the motion planner at continuous spatio-temporal locations. In addition, an efficient yet effective global attention mechanism is added, yielding an architecture that overcomes the limited receptive field of previous explicit occupancy prediction methods.
Results: Extensive experiments in both urban and highway settings show that this implicit model outperforms the current state of the art.

A self-driving vehicle (SDV) must be able to perceive its surroundings and predict the future behavior of other traffic participants. Existing works either perform object detection followed by trajectory forecasting of the detected objects, or predict dense occupancy and flow grids for the whole scene. The former poses a safety concern as the number of detections needs to be kept low for efficiency reasons, sacrificing object recall. The latter is computationally expensive due to the high-dimensionality of the output grid, and suffers from the limited receptive field inherent to fully convolutional networks. Furthermore, both approaches employ many computational resources predicting areas or objects that might never be queried by the motion planner. This motivates our unified approach to perception and future prediction that implicitly represents occupancy and flow over time with a single neural network. Our method avoids unnecessary computation, as it can be directly queried by the motion planner at continuous spatio-temporal locations. Moreover, we design an architecture that overcomes the limited receptive field of previous explicit occupancy prediction methods by adding an efficient yet effective global attention mechanism. Through extensive experiments in both urban and highway settings, we demonstrate that our implicit model outperforms the current state-of-the-art. For more information, visit the project website: https://waabi.ai/research/implicito.
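The queryable interface the abstract describes, occupancy and flow evaluated only at the continuous (x, y, t) locations the planner actually asks about, can be sketched with a tiny random-weight network. The architecture and sizes here are illustrative assumptions, not the paper's model.

```python
import numpy as np

def occupancy_flow(query_xyt, w1, w2):
    """Query occupancy and 2D flow at one continuous (x, y, t) location.

    A single implicit network returns (occupancy, flow_x, flow_y) per query,
    so the planner only pays for the locations it asks about, instead of a
    dense grid over the whole scene.  Weights are random stand-ins.
    """
    h = np.tanh(w1 @ query_xyt)
    out = w2 @ h
    occ = 1.0 / (1.0 + np.exp(-out[0]))  # occupancy probability in (0, 1)
    return occ, out[1:]                  # flow components are unbounded

rng = np.random.default_rng(0)
w1 = rng.standard_normal((16, 3))
w2 = rng.standard_normal((3, 16))

# One planner query at a continuous spatio-temporal location.
occ, flow = occupancy_flow(np.array([0.2, -0.5, 0.7]), w1, w2)
```

A planner would batch many such queries along candidate trajectories; a dense-grid method would instead have to predict every cell up front.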

3D-Aware Object Goal Navigation via Simultaneous Exploration and Identification
Zhang, Jiazhao and Dai, Liu and Meng, Fanpeng and Fan, Qingnan and Chen, Xuelin and Xu, Kai and Wang, He



Research question: How to improve object goal navigation (ObjectNav) in unseen environments.
Motivation: Existing ObjectNav methods are mainly based on 2D maps, scene graphs, or image sequences; since the task takes place in 3D space, exploiting fine-grained spatial information can improve navigation. However, using 3D scene representations for policy learning in this floor-level task can be impractical due to low sample efficiency and high computational cost.
Method: Propose a 3D-aware ObjectNav framework based on two straightforward sub-policies. The two sub-policies, a corner-guided exploration policy and a category-aware identification policy, operate simultaneously using online-fused 3D points as observations.
Results: Extensive experiments show that the framework dramatically improves ObjectNav performance by learning from 3D scene representations. It achieves the best performance among all modular-based methods on the Matterport3D and Gibson datasets, while requiring up to 30x less computational cost for training. The code will be released to benefit the community.

Object goal navigation (ObjectNav) in unseen environments is a fundamental task for Embodied AI. Agents in existing works learn ObjectNav policies based on 2D maps, scene graphs, or image sequences. Considering this task happens in 3D space, a 3D-aware agent can advance its ObjectNav capability via learning from fine-grained spatial information. However, leveraging 3D scene representation can be prohibitively impractical for policy learning in this floor-level task, due to low sample efficiency and expensive computational cost. In this work, we propose a framework for the challenging 3D-aware ObjectNav based on two straightforward sub-policies. The two sub-policies, namely corner-guided exploration policy and category-aware identification policy, perform simultaneously by utilizing online fused 3D points as observation. Through extensive experiments, we show that this framework can dramatically improve the performance in ObjectNav through learning from 3D scene representation. Our framework achieves the best performance among all modular-based methods on the Matterport3D and Gibson datasets while requiring up to 30x less computational cost for training. The code will be released to benefit the community.

Analyzing Physical Impacts Using Transient Surface Wave Imaging
Zhang, Tianyuan and Sheinin, Mark and Chan, Dorian and Rau, Mark and O'



Research question: How to recover an object's physical properties and its interactions with the environment from the vibrations of its surface.
Motivation: Prior methods either discard the transient vibrations that propagate immediately after the object is disturbed, or focus only on recovering localized signals, neglecting the spatiotemporal relationship between vibrations at different object points.
Method: Use a dual-shutter camera to simultaneously extract transient surface vibrations at a sparse set of object points, model the elastic wave generated shortly after the object's surface is disturbed, and use the model to localize the disturbance source for various materials.
Results: Experiments show that transient object vibrations contain additional cues about the impact force and the impacting object's material properties. The method performs well in practical applications such as localizing the strikes of a ping-pong ball on a table mid-play and recovering footstep locations by imaging the floor vibrations they create.

The subtle vibrations on an object's surface contain information about the object's physical properties and its interaction with the environment. Prior works imaged surface vibration to recover the object's material properties via modal analysis, which discards the transient vibrations propagating immediately after the object is disturbed. Conversely, prior works that captured transient vibrations focused on recovering localized signals (e.g., recording nearby sound sources), neglecting the spatiotemporal relationship between vibrations at different object points. In this paper, we extract information from the transient surface vibrations simultaneously measured at a sparse set of object points using the dual-shutter camera described by Sheinin[31]. We model the geometry of an elastic wave generated shortly after an object's surface is disturbed (e.g., a knock or a footstep), and use the model to localize the disturbance source for various materials (e.g., wood, plastic, tile). We also show that transient object vibrations contain additional cues about the impact force and the impacting object's material properties. We demonstrate our approach in applications like localizing the strikes of a ping-pong ball on a table mid-play and recovering the footsteps' locations by imaging the floor vibrations they create.
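The source-localization idea, fitting a wave model to arrival times measured at sparse surface points, can be sketched as a grid search under a simplifying constant-wave-speed assumption. The paper instead fits a richer elastic-wave model per material; the sensor layout, wave speed, and grid below are toy values.

```python
import numpy as np

def localize_impact(sensor_xy, arrival_times, wave_speed):
    """Localize an impact from wave arrival times at sparse sensors.

    Uses arrival-time differences relative to the first sensor so the
    unknown impact time cancels out.  A coarse grid search keeps the
    sketch readable (real solvers would use least squares).
    """
    xs = ys = np.linspace(0.0, 1.0, 101)
    best, best_err = None, np.inf
    dt = arrival_times - arrival_times[0]
    for x in xs:
        for y in ys:
            dist = np.hypot(sensor_xy[:, 0] - x, sensor_xy[:, 1] - y)
            pred_dt = (dist - dist[0]) / wave_speed
            err = np.sum((pred_dt - dt) ** 2)
            if err < best_err:
                best, best_err = (x, y), err
    return best

# Four corner sensors and a synthetic impact at (0.3, 0.6).
sensors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
true_src = np.array([0.3, 0.6])
times = np.hypot(sensors[:, 0] - true_src[0],
                 sensors[:, 1] - true_src[1]) / 2.0   # wave speed = 2.0
src = localize_impact(sensors, times, wave_speed=2.0)
```

With noise-free arrival times the grid point nearest the true source minimizes the residual exactly.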

UniSim: A Neural Closed-Loop Sensor Simulator
Yang, Ze and Chen, Yun and Wang, Jingkang and Manivasagam, Sivabalan and Ma, Wei-Chiu and Yang, Anqi Joyce and Urtasun, Raquel



Research question: How to rigorously test autonomy systems toward the goal of safe self-driving.
Motivation: To ensure the safety of self-driving vehicles, we need to test them in closed loop, simulating rare but safety-critical driving scenarios.
Method: We propose UniSim, a neural sensor simulator that converts recorded driving logs into realistic closed-loop multi-sensor simulations. UniSim builds neural feature grids to reconstruct both the static background and dynamic actors, and composites them together to simulate LiDAR and camera data at new viewpoints.
Results: Experiments show that UniSim generates realistic sensor data with a small domain gap on downstream tasks. With UniSim, we demonstrate for the first time closed-loop evaluation of an autonomy system on safety-critical scenarios as if it were in the real world.

Rigorously testing autonomy systems is essential for making safe self-driving vehicles (SDV) a reality. It requires one to generate safety critical scenarios beyond what can be collected safely in the world, as many scenarios happen rarely on our roads. To accurately evaluate performance, we need to test the SDV on these scenarios in closed-loop, where the SDV and other actors interact with each other at each timestep. Previously recorded driving logs provide a rich resource to build these new scenarios from, but for closed loop evaluation, we need to modify the sensor data based on the new scene configuration and the SDV's decisions, as actors might be added or removed and the trajectories of existing actors and the SDV will differ from the original log. In this paper, we present UniSim, a neural sensor simulator that takes a single recorded log captured by a sensor-equipped vehicle and converts it into a realistic closed-loop multi-sensor simulation. UniSim builds neural feature grids to reconstruct both the static background and dynamic actors in the scene, and composites them together to simulate LiDAR and camera data at new viewpoints, with actors added or removed and at new placements. To better handle extrapolated views, we incorporate learnable priors for dynamic objects, and leverage a convolutional network to complete unseen regions. Our experiments show UniSim can simulate realistic sensor data with small domain gap on downstream tasks. With UniSim, we demonstrate, for the first time, closed-loop evaluation of an autonomy system on safety-critical scenarios as if it were in the real world.

DexArt: Benchmarking Generalizable Dexterous Manipulation With Articulated Objects
Bao, Chen and Xu, Helin and Qin, Yuzhe and Wang, Xiaolong



Research question: How to enable robots to manipulate everyday articulated objects as humans do.
Motivation: Current robot manipulation relies heavily on parallel grippers, which restricts robots to a limited set of objects. A multi-finger robot hand better approximates human behavior and enables robots to manipulate diverse articulated objects.
Method: We propose a new benchmark, DexArt, involving dexterous manipulation with articulated objects in a physical simulator. The benchmark defines multiple complex manipulation tasks in which the robot hand must manipulate diverse articulated objects within each task. Our main focus is evaluating the generalizability of learned policies to unseen articulated objects.
Results: Through extensive studies, we provide new insights into how 3D representation learning affects RL decision making with 3D point cloud inputs.

To enable general-purpose robots, we will require the robot to operate daily articulated objects as humans do. Current robot manipulation has heavily relied on using a parallel gripper, which restricts the robot to a limited set of objects. On the other hand, operating with a multi-finger robot hand will allow better approximation to human behavior and enable the robot to operate on diverse articulated objects. To this end, we propose a new benchmark called DexArt, which involves Dexterous manipulation with Articulated objects in a physical simulator. In our benchmark, we define multiple complex manipulation tasks, and the robot hand will need to manipulate diverse articulated objects within each task. Our main focus is to evaluate the generalizability of the learned policy on unseen articulated objects. This is very challenging given the high degrees of freedom of both hands and objects. We use Reinforcement Learning with 3D representation learning to achieve generalization. Through extensive studies, we provide new insights into how 3D representation learning affects decision making in RL with 3D point cloud inputs. More details can be found at https://www.chenbao.tech/dexart/.

Object Pop-Up: Can We Infer 3D Objects and Their Poses From Human Interactions Alone?
Petrov, Ilya A. and Marin, Riccardo and Chibane, Julian and Pons-Moll, Gerard



Research question: This paper explores whether 3D objects and their poses can be inferred from human interactions alone.
Motivation: While the computer vision community has developed several object-centric approaches, inferring 3D objects and their poses from human interactions remains largely unexplored.
Method: By analyzing a generic 3D human point cloud, an unobserved object can be inferred even when the user is merely imitating a functionality (e.g., looking through a binocular) without a tangible counterpart.
Results: Validated with synthetic data and sequences acquired for the task, the method shows applicability for XR/VR.

The intimate entanglement between objects affordances and human poses is of large interest, among others, for behavioural sciences, cognitive psychology, and Computer Vision communities. In recent years, the latter has developed several object-centric approaches: starting from items, learning pipelines synthesizing human poses and dynamics in a realistic way, satisfying both geometrical and functional expectations. However, the inverse perspective is significantly less explored: Can we infer 3D objects and their poses from human interactions alone? Our investigation follows this direction, showing that a generic 3D human point cloud is enough to pop up an unobserved object, even when the user is just imitating a functionality (e.g., looking through a binocular) without involving a tangible counterpart. We validate our method qualitatively and quantitatively, with synthetic data and sequences acquired for the task, showing applicability for XR/VR.

Leapfrog Diffusion Model for Stochastic Trajectory Prediction
Mao, Weibo and Xu, Chenxin and Zhu, Qi and Chen, Siheng and Wang, Yanfeng



Research question: How to effectively model the indeterminacy of human behavior for stochastic trajectory prediction.
Motivation: Although diffusion models have shown strong representation capacity in generation tasks, the large number of denoising steps they require prevents real-time prediction.
Method: Propose the LEapfrog Diffusion model (LED), a novel diffusion-based trajectory prediction model. LED trains a learnable leapfrog initializer to directly learn a multi-modal distribution of future trajectories, skipping a large number of denoising steps and significantly accelerating inference. The initializer is also trained to appropriately allocate correlated samples, providing diverse predicted future trajectories and significantly improving prediction performance.
Results: Experiments on four real-world datasets show that LED consistently improves performance, achieving a 23.7%/21.9% ADE/FDE improvement on the NFL dataset. Compared with the standard diffusion model, LED speeds up inference by 19.3/30.8/24.3/25.1 times on NBA/NFL/SDD/ETH-UCY, satisfying real-time inference needs.

To model the indeterminacy of human behaviors, stochastic trajectory prediction requires a sophisticated multi-modal distribution of future trajectories. Emerging diffusion models have revealed their tremendous representation capacities in numerous generation tasks, showing potential for stochastic trajectory prediction. However, expensive time consumption prevents diffusion models from real-time prediction, since a large number of denoising steps are required to assure sufficient representation ability. To resolve the dilemma, we present LEapfrog Diffusion model (LED), a novel diffusion-based trajectory prediction model, which provides real-time, precise, and diverse predictions. The core of the proposed LED is to leverage a trainable leapfrog initializer to directly learn an expressive multi-modal distribution of future trajectories, which skips a large number of denoising steps, significantly accelerating inference speed. Moreover, the leapfrog initializer is trained to appropriately allocate correlated samples to provide a diversity of predicted future trajectories, significantly improving prediction performances. Extensive experiments on four real-world datasets, including NBA/NFL/SDD/ETH-UCY, show that LED consistently improves performance and achieves 23.7%/21.9% ADE/FDE improvement on NFL. The proposed LED also speeds up the inference 19.3/30.8/24.3/25.1 times compared to the standard diffusion model on NBA/NFL/SDD/ETH-UCY, satisfying real-time inference needs. Code is available at https://github.com/MediaBrain-SJTU/LED.
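The leapfrog idea, replacing most denoising steps with a learned initializer followed by only a few refinement steps, can be caricatured with a toy denoiser. The Gaussian initializer and the fixed 0.5 update rule below are illustrative assumptions, not LED's trained modules.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x, target):
    """Toy denoising step pulling samples toward a target mean."""
    return x + 0.5 * (target - x)

def leapfrog_predict(initializer_mean, target, k_steps=3, n_samples=5):
    """Skip most denoising: start from a learned initializer, run only k steps.

    initializer_mean stands in for the trained leapfrog initializer's output;
    in LED this is a learned multi-modal distribution, not a single Gaussian.
    """
    x = initializer_mean + 0.1 * rng.standard_normal((n_samples, 2))
    for _ in range(k_steps):
        x = denoise_step(x, target)
    return x

# A standard diffusion model would instead start from pure noise and need
# many more denoising steps to reach the same neighborhood.
preds = leapfrog_predict(np.array([1.0, 1.0]), np.array([2.0, 2.0]))
```

Because the initializer already lands near the data manifold, the few remaining steps act as refinement rather than full generation, which is where the inference speedup comes from.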

Resource-Efficient RGBD Aerial Tracking
Yang, Jinyu and Gao, Shang and Li, Zhe and Zheng, Feng and Leonardis, Aleš



Research question: This paper addresses visual perception for drones in complex environments, particularly the challenges of RGBD tracking.
Motivation: Existing research mainly focuses on tracking a limited set of target categories, such as pedestrians or vehicles in urban environments, while more complex scenes and the use of depth information remain largely unexplored.
Method: This paper proposes a large-scale RGBD aerial tracking benchmark containing 1,000 drone-captured RGBD videos with dense annotations. To meet the limited computational resources and real-time processing demands of drone applications, the authors also propose an efficient RGBD tracker named EMT.
Results: Experiments show that EMT runs at over 100 fps on a GPU and 25 fps on the Nvidia Jetson NX Xavier edge platform, achieving effective multimodal fusion and feature matching and promising tracking performance.

Aerial robots are now able to fly in complex environments, and drone-captured data gains lots of attention in object tracking. However, current research on aerial perception has mainly focused on limited categories, such as pedestrian or vehicle, and most scenes are captured in urban environments from a birds-eye view. Recently, UAVs equipped with depth cameras have been also deployed for more complex applications, while RGBD aerial tracking is still unexplored. Compared with traditional RGB object tracking, adding depth information can more effectively deal with more challenging scenes such as target and background interference. To this end, in this paper, we explore RGBD aerial tracking in an overhead space, which can greatly enlarge the development of drone-based visual perception. To boost the research, we first propose a large-scale benchmark for RGBD aerial tracking, containing 1,000 drone-captured RGBD videos with dense annotations. Then, as drone-based applications require for real-time processing with limited computational resources, we also propose an efficient RGBD tracker named EMT. Our tracker runs at over 100 fps on GPU, and 25 fps on the edge platform of NVidia Jetson NX Xavier, benefiting from its efficient multimodal fusion and feature matching. Extensive experiments show that our EMT achieves promising tracking performance. All resources are available at https://github.com/yjybuaa/RGBDAerialTracking.

PACO: Parts and Attributes of Common Objects
Ramanathan, Vignesh and Kalia, Anmol and Petrovic, Vladan and Wen, Yi and Zheng, Baixue and Guo, Baishan and Wang, Rui and Marquez, Aaron and Kovvuri, Rama and Kadian, Abhishek and Mousavi, Amir and Song, Yiwen and Dubey, Abhimanyu and Mahajan, Dhruv



Research question: This paper addresses the progression of object models from predicting category labels to providing detailed descriptions of object instances.
Motivation: As object models advance, richer datasets with annotations such as part masks and attributes are needed to provide more detailed descriptions of object instances.
Method: Introduce PACO: Parts and Attributes of Common Objects, a dataset spanning 75 object categories, 456 object-part categories, and 55 attributes across image (LVIS) and video (Ego4D) datasets. It provides 641K part masks annotated across 260K object boxes, with roughly half exhaustively annotated with attributes.
Results: Evaluation metrics are designed and benchmark results provided for three tasks on the dataset: part mask segmentation, object and part attribute prediction, and zero-shot instance detection. The dataset, models, and code are open-sourced at https://github.com/facebookresearch/paco.

Object models are gradually progressing from predicting just category labels to providing detailed descriptions of object instances. This motivates the need for large datasets which go beyond traditional object masks and provide richer annotations such as part masks and attributes. Hence, we introduce PACO: Parts and Attributes of Common Objects. It spans 75 object categories, 456 object-part categories and 55 attributes across image (LVIS) and video (Ego4D) datasets. We provide 641K part masks annotated across 260K object boxes, with roughly half of them exhaustively annotated with attributes as well. We design evaluation metrics and provide benchmark results for three tasks on the dataset: part mask segmentation, object and part attribute prediction and zero-shot instance detection. Dataset, models, and code are open-sourced at https://github.com/facebookresearch/paco.

MoDAR: Using Motion Forecasting for 3D Object Detection in Point Cloud Sequences
Li, Yingwei and Qi, Charles R. and Zhou, Yin and Liu, Chenxi and Anguelov, Dragomir



Research question: How to improve 3D object detection for occluded and long-range objects.
Motivation: Point cloud sequence data provide a unique opportunity to improve such cases, since an occluded or distant object may be observed from different viewpoints or gain better visibility over time.
Method: Propose MoDAR, which uses motion forecasting outputs as a virtual modality to augment LiDAR point clouds. The MoDAR modality propagates object information from temporal contexts to the target frame, represented as a set of virtual points, one per object taken from a waypoint on its forecasted trajectory. The fused point cloud of raw sensor points and virtual points is then fed to any off-the-shelf point-cloud-based 3D object detector.
Results: Evaluated on the Waymo Open Dataset, the method significantly improves prior-art detectors by using motion forecasting from extra-long sequences (e.g., 18 seconds), achieving a new state of the art without adding much computational overhead.

Occluded and long-range objects are ubiquitous and challenging for 3D object detection. Point cloud sequence data provide unique opportunities to improve such cases, as an occluded or distant object can be observed from different viewpoints or gets better visibility over time. However, the efficiency and effectiveness in encoding long-term sequence data can still be improved. In this work, we propose MoDAR, using motion forecasting outputs as a type of virtual modality, to augment LiDAR point clouds. The MoDAR modality propagates object information from temporal contexts to a target frame, represented as a set of virtual points, one for each object from a waypoint on a forecasted trajectory. A fused point cloud of both raw sensor points and the virtual points can then be fed to any off-the-shelf point-cloud based 3D object detector. Evaluated on the Waymo Open Dataset, our method significantly improves prior art detectors by using motion forecasting from extra-long sequences (e.g. 18 seconds), achieving new state of the arts, while not adding much computation overhead.
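The virtual-point construction can be sketched directly: take one waypoint per forecasted object trajectory at the target frame and append it to the raw LiDAR points. Real MoDAR virtual points also carry semantic and motion attributes, which this toy sketch omits.

```python
import numpy as np

def add_virtual_points(sensor_points, forecasts, t_target):
    """Append one virtual point per object from its forecasted trajectory.

    sensor_points: (N, 3) raw LiDAR points at the target frame
    forecasts    : dict object_id -> (T, 3) forecasted waypoints
    t_target     : index of the waypoint matching the target frame
    Returns the fused point cloud a downstream detector would consume.
    """
    virtual = np.stack([traj[t_target] for traj in forecasts.values()])
    return np.concatenate([sensor_points, virtual], axis=0)

# Toy frame: 10 raw points plus forecasts for two tracked objects.
pts = np.zeros((10, 3))
forecasts = {1: np.ones((5, 3)), 2: 2.0 * np.ones((5, 3))}
fused = add_virtual_points(pts, forecasts, t_target=3)
```

Because the fused cloud is still just points, an occluded object's virtual point can survive even when no raw returns hit the object in the target frame.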

Connecting Vision and Language With Video Localized Narratives
Voigtlaender, Paul and Changpinyo, Soravit and Pont-Tuset, Jordi and Soricut, Radu and Ferrari, Vittorio



Research question: Propose a new form of multimodal video annotation that connects vision and language.
Motivation: The original Localized Narratives require annotators to speak while moving the mouse over an image, grounding each word with a mouse-trace segment, but doing this on video is challenging.
Method: The new protocol lets annotators tell the story of a video with Localized Narratives, capturing even complex events involving multiple actors interacting with each other and with several passive objects.
Results: We annotated 20k videos from the OVIS, UVO, and Oops datasets, totalling 1.7M words. Based on this data, we also construct new benchmarks for the video narrative grounding and video question answering tasks, and provide reference results from strong baseline models.

We propose Video Localized Narratives, a new form of multimodal video annotations connecting vision and language. In the original Localized Narratives, annotators speak and move their mouse simultaneously on an image, thus grounding each word with a mouse trace segment. However, this is challenging on a video. Our new protocol empowers annotators to tell the story of a video with Localized Narratives, capturing even complex events involving multiple actors interacting with each other and with several passive objects. We annotated 20k videos of the OVIS, UVO, and Oops datasets, totalling 1.7M words. Based on this data, we also construct new benchmarks for the video narrative grounding and video question answering tasks, and provide reference results from strong baseline models. Our annotations are available at https://google.github.io/video-localized-narratives/.

OmniCity: Omnipotent City Understanding With Multi-Level and Multi-View Images
Li, Weijia and Lai, Yawen and Xu, Linning and Xiangli, Yuanbo and Yu, Jinhua and He, Conghui and Xia, Gui-Song and Lin, Dahua



Research question: This paper presents OmniCity, a new multi-level, multi-view dataset for city understanding.
Motivation: Tackling complex city-understanding problems requires a large-scale dataset spanning multiple views and levels.
Method: OmniCity comprises satellite images, street-level panoramas, and mono-view images, built from over 100K pixel-wise annotated images collected at 25K geo-locations. The authors also propose an efficient street-view image annotation pipeline that leverages existing satellite-view label maps and the transformation relations between different views.
Results: Compared with existing multi-level, multi-view benchmarks, OmniCity contains more images with richer annotation types, provides more benchmark results for state-of-the-art models, and introduces a new fine-grained building instance segmentation task. Moreover, OmniCity offers new settings for existing tasks such as cross-view image matching, synthesis, segmentation, and detection, facilitating the development of new methods for large-scale city understanding, reconstruction, and simulation.

This paper presents OmniCity, a new dataset for omnipotent city understanding from multi-level and multi-view images. More precisely, OmniCity contains multi-view satellite images as well as street-level panorama and mono-view images, constituting over 100K pixel-wise annotated images that are well-aligned and collected from 25K geo-locations in New York City. To alleviate the substantial pixel-wise annotation efforts, we propose an efficient street-view image annotation pipeline that leverages the existing label maps of satellite view and the transformation relations between different views (satellite, panorama, and mono-view). With the new OmniCity dataset, we provide benchmarks for a variety of tasks including building footprint extraction, height estimation, and building plane/instance/fine-grained segmentation. Compared with existing multi-level and multi-view benchmarks, OmniCity contains a larger number of images with richer annotation types and more views, provides more benchmark results of state-of-the-art models, and introduces a new task for fine-grained building instance segmentation on street-level panorama images. Moreover, OmniCity provides new problem settings for existing tasks, such as cross-view image matching, synthesis, segmentation, detection, etc., and facilitates the developing of new methods for large-scale city understanding, reconstruction, and simulation. The OmniCity dataset as well as the benchmarks will be released at https://city-super.github.io/omnicity/.

NeuralDome: A Neural Modeling Pipeline on Multi-View Human-Object Interactions
Zhang, Juze and Luo, Haimin and Yang, Hongdi and Xu, Xinru and Wu, Qianyang and Shi, Ye and Yu, Jingyi and Xu, Lan and Wang, Jingya



Research question: How to address occlusion, shape and texture ambiguity, and motion in visual inference by building a dataset of free-viewpoint human-object interactions.
Motivation: Real-world human-object interactions suffer from various occlusions and ambiguities, calling for a dataset that captures free-viewpoint interactions.
Method: The researchers build HODome, a dense multi-view dome, to acquire a complex human-object interaction dataset. They also develop NeuralDome, a neural processing pipeline tailored to multi-view video inputs, for accurate tracking, geometry reconstruction, and free-view rendering of both humans and objects.
Results: Extensive experiments on the HODome dataset show that NeuralDome performs well on a variety of inference, modeling, and rendering tasks. Both the dataset and the NeuralDome tools will be shared with the community for further development.

Humans constantly interact with objects in daily life tasks. Capturing such processes and subsequently conducting visual inferences from a fixed viewpoint suffers from occlusions, shape and texture ambiguities, motions, etc. To mitigate the problem, it is essential to build a training dataset that captures free-viewpoint interactions. We construct a dense multi-view dome to acquire a complex human object interaction dataset, named HODome, that consists of 71M frames on 10 subjects interacting with 23 objects. To process the HODome dataset, we develop NeuralDome, a layer-wise neural processing pipeline tailored for multi-view video inputs to conduct accurate tracking, geometry reconstruction and free-view rendering, for both human subjects and objects. Extensive experiments on the HODome dataset demonstrate the effectiveness of NeuralDome on a variety of inference, modeling, and rendering tasks. Both the dataset and the NeuralDome tools will be disseminated to the community for further development, which can be found at https://juzezhang.github.io/NeuralDome

Target-Referenced Reactive Grasping for Dynamic Objects
Liu, Jirong and Zhang, Ruo and Fang, Hao-Shu and Gou, Minghao and Fang, Hongjie and Wang, Chenxi and Xu, Sheng and Yan, Hengxu and Lu, Cewu



Research question: How to enable robots to successfully grasp dynamically moving objects.
Motivation: Current methods mainly focus on the temporal smoothness of predicted grasp poses but rarely consider their semantic consistency, so in cluttered scenes the predicted grasps may not fall on the same part of the same object.
Method: This paper proposes solving reactive grasping in a target-referenced setting by tracking through generated grasp spaces. Given a target grasp pose on an object and grasp poses detected in a new observation, the method consists of two stages: 1) discovering grasp-pose correspondences through an attentional graph neural network and selecting the one most similar to the target pose; 2) refining the selected grasp pose based on target and historical information.
Results: Evaluated on the large-scale GraspNet-1Billion benchmark, with 30 additional dynamic-object scenes collected for testing. Results show the method outperforms other representative methods, and real-robot experiments achieve an average success rate of over 80%.

Reactive grasping, which enables the robot to successfully grasp dynamic moving objects, is of great interest in robotics. Current methods mainly focus on the temporal smoothness of the predicted grasp poses but few consider their semantic consistency. Consequently, the predicted grasps are not guaranteed to fall on the same part of the same object, especially in cluttered scenes. In this paper, we propose to solve reactive grasping in a target-referenced setting by tracking through generated grasp spaces. Given a targeted grasp pose on an object and detected grasp poses in a new observation, our method is composed of two stages: 1) discovering grasp pose correspondences through an attentional graph neural network and selecting the one with the highest similarity with respect to the target pose; 2) refining the selected grasp poses based on target and historical information. We evaluate our method on a large-scale benchmark GraspNet-1Billion. We also collect 30 scenes of dynamic objects for testing. The results suggest that our method outperforms other representative methods. Furthermore, our real robot experiments achieve an average success rate of over 80 percent.
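Stage 1 (correspondence discovery and selection) can be caricatured by scoring the detected grasps against the target grasp and picking the best match. Plain cosine similarity on hypothetical per-grasp feature vectors stands in here for the paper's attentional graph neural network.

```python
import numpy as np

def select_grasp(target_feat, detected_feats):
    """Pick the detected grasp most similar to the target grasp.

    target_feat   : (d,) feature of the target grasp pose (hypothetical)
    detected_feats: (M, d) features of grasps found in the new observation
    Returns the index of the best match; a learned attentional GNN would
    replace this similarity in the actual method.
    """
    t = target_feat / np.linalg.norm(target_feat)
    d = detected_feats / np.linalg.norm(detected_feats, axis=1, keepdims=True)
    return int(np.argmax(d @ t))

target = np.array([1.0, 0.0])
detected = np.array([[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
idx = select_grasp(target, detected)
```

Stage 2 would then refine the selected pose using target and historical information, which this sketch does not model.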

Stimulus Verification Is a Universal and Effective Sampler in Multi-Modal Human Trajectory Prediction
Sun, Jianhua and Li, Yuxuan and Chai, Liang and Lu, Cewu



Research question: How to effectively sample final predictions from candidate future trajectories to improve the accuracy of multi-modal human trajectory prediction.
Motivation: While existing work has developed strong models to predict candidate trajectories, how to effectively sample the final predictions has received little attention.
Method: This paper proposes stimulus verification, a universal and effective sampling process for improving multi-modal prediction. It introduces a probabilistic model, the stimulus verifier, to verify the coherence between a predicted future trajectory and its corresponding stimulus. By highlighting prediction samples with better stimulus coherence, stimulus verification ensures that sampled trajectories are plausible from the stimulus's point of view, thereby improving multi-modal prediction performance.
Results: We implement stimulus verification on five representative prediction frameworks and conduct exhaustive experiments on three widely used benchmarks. The superior results demonstrate the effectiveness of our approach.

To comprehensively cover the uncertainty of the future, the common practice of multi-modal human trajectory prediction is to first generate a set/distribution of candidate future trajectories and then sample required numbers of trajectories from them as final predictions. Even though a large number of previous researches develop various strong models to predict candidate trajectories, how to effectively sample the final ones has not received much attention yet. In this paper, we propose stimulus verification, serving as a universal and effective sampling process to improve the multi-modal prediction capability, where stimulus refers to the factor in the observation that may affect the future movements such as social interaction and scene context. Stimulus verification introduces a probabilistic model, denoted as stimulus verifier, to verify the coherence between a predicted future trajectory and its corresponding stimulus. By highlighting prediction samples with better stimulus-coherence, stimulus verification ensures sampled trajectories plausible from the stimulus' point of view and therefore aids in better multi-modal prediction performance. We implement stimulus verification on five representative prediction frameworks and conduct exhaustive experiments on three widely-used benchmarks. Superior results demonstrate the effectiveness of our approach.
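The sampling step itself can be sketched simply: given verifier scores for the candidate trajectories (assumed here to come from an already trained stimulus verifier), keep the most stimulus-coherent ones as the final predictions.

```python
import numpy as np

def sample_by_stimulus(candidates, verifier_scores, k=2):
    """Keep the k candidate trajectories most coherent with the stimulus.

    candidates      : (M, T, 2) candidate future trajectories
    verifier_scores : (M,) coherence probabilities from a stimulus verifier
                      (assumed given; training the verifier is not shown)
    """
    order = np.argsort(-verifier_scores)   # highest coherence first
    return candidates[order[:k]]

cands = np.arange(8.0).reshape(4, 1, 2)    # 4 toy candidates, 1 timestep each
scores = np.array([0.1, 0.9, 0.4, 0.7])
picked = sample_by_stimulus(cands, scores, k=2)
```

Because the verifier only re-ranks an existing candidate set, the same procedure plugs into any upstream prediction framework, which is why the paper can apply it to five of them.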

MethaneMapper: Spectral Absorption Aware Hyperspectral Transformer for Methane Detection
Kumar, Satish and Arevalo, Ivan and Iftekhar, ASM and Manjunath, BS



Research question: How to accurately detect and quantify methane emissions, overcoming existing methods that are sensitive to local terrain conditions, require manual expert inspection, are error-prone, and do not scale.
Motivation: Existing methods for analyzing the data have many shortcomings, motivating the development of a new approach.
Method: Propose MethaneMapper, a novel end-to-end spectral-absorption-wavelength-aware transformer network for detecting and quantifying methane emissions. MethaneMapper introduces two novel modules that help locate the most relevant methane plume regions in the spectral domain and use them for accurate localization.
Results: Experiments show that MethaneMapper achieves 0.63 mAP in detection with a 5x smaller model than the current state of the art. A large-scale methane plume segmentation mask dataset containing over 4,000 plume sites is also introduced, giving researchers the opportunity to develop and advance new methods for this challenging greenhouse-gas detection problem with significant social impact.

Methane (CH4) is the chief contributor to global climate change. The recent Airborne Visible-Infrared Imaging Spectrometer-Next Generation (AVIRIS-NG) has been very useful in quantitative mapping of methane emissions. Existing methods for analyzing this data are sensitive to local terrain conditions, often require manual inspection from domain experts, are prone to significant error and hence are not scalable. To address these challenges, we propose a novel end-to-end spectral absorption wavelength aware transformer network, MethaneMapper, to detect and quantify the emissions. MethaneMapper introduces two novel modules that help to locate the most relevant methane plume regions in the spectral domain and uses them to localize these accurately. Thorough evaluation shows that MethaneMapper achieves 0.63 mAP in detection and reduces the model size (by 5x) compared to the current state of the art. In addition, we also introduce a large-scale dataset of methane plume segmentation masks for over 1200 AVIRIS-NG flightlines from 2015-2022. It contains over 4000 methane plume sites. Our dataset will provide researchers the opportunity to develop and advance new methods for tackling this challenging greenhouse-gas detection problem with significant broader social impact. Dataset and source code link.

Autonomous Manipulation Learning for Similar Deformable Objects via Only One Demonstration
Ren, Yu and Chen, Ronghan and Cong, Yang



Research question: Most existing methods focus on recognizing and manipulating rigid objects, while deformable objects, though more common in real life, receive far less attention.
Motivation: Most existing deformable-object manipulation methods suffer two issues: 1) massive demonstrations: training for one specific instance requires repeating thousands of manipulation demonstrations; 2) poor generalization: transferring the learned skill to a new instance of the same category usually requires re-training.
Method: We propose a category-level deformable 3D object manipulation framework that needs only one demonstration to manipulate deformable 3D objects and generalizes the learned skill to new similar instances without re-training. The framework consists of two modules: the Nocs State Transform (NST) module transfers the observed point cloud of the target into a pre-defined unified pose state (i.e., the Nocs state), which is the foundation for category-level manipulation learning; the Neural Spatial Encoding (NSE) module generalizes the learned skill to novel instances by encoding category-level spatial information to reach the expected grasping point without re-training. A relative motion path is then planned to achieve autonomous manipulation.
Results: Simulated results on our Cap40 dataset and real robotic experiments demonstrate the effectiveness of our framework.

In comparison with most methods focusing on 3D rigid object recognition and manipulation, deformable objects are more common in our real life but attract less attention. Generally, most existing methods for deformable object manipulation suffer two issues, 1) Massive demonstration: repeating thousands of robot-object demonstrations for model training of one specific instance; 2) Poor generalization: inevitably re-training for transferring the learned skill to a similar/new instance from the same category. Therefore, we propose a category-level deformable 3D object manipulation framework, which could manipulate deformable 3D objects with only one demonstration and generalize the learned skills to new similar instances without re-training. Specifically, our proposed framework consists of two modules. The Nocs State Transform (NST) module transfers the observed point clouds of the target to a pre-defined unified pose state (i.e., Nocs state), which is the foundation for the category-level manipulation learning; the Neural Spatial Encoding (NSE) module generalizes the learned skill to novel instances by encoding the category-level spatial information to pursue the expected grasping point without re-training. The relative motion path is then planned to achieve autonomous manipulation. Both the simulated results via our Cap40 dataset and real robotic experiments justify the effectiveness of our framework.

Trace and Pace: Controllable Pedestrian Animation via Guided Trajectory Diffusion
Rempe, Davis and Luo, Zhengyi and Peng, Xue Bin and Yuan, Ye and Kitani, Kris and Kreis, Karsten and Fidler, Sanja and Litany, Or



Research question: How to generate realistic pedestrian trajectories and full-body animations that meet user-defined goals.
Motivation: Leverage recent advances in guided diffusion modeling to achieve test-time controllability of trajectories, which is normally only associated with rule-based systems.
Method: Constrain trajectories through target waypoints, speed, and specified social groups while accounting for the surrounding environment context, and integrate this trajectory diffusion model with a novel physics-based humanoid controller to form a closed-loop, full-body pedestrian animation system capable of placing large crowds in simulated environments with varying terrains.
Results: The value function learned during RL training of the animation controller is used to guide diffusion, producing trajectories better suited to particular scenarios such as collision avoidance and traversing uneven terrain.

We introduce a method for generating realistic pedestrian trajectories and full-body animations that can be controlled to meet user-defined goals. We draw on recent advances in guided diffusion modeling to achieve test-time controllability of trajectories, which is normally only associated with rule-based systems. Our guided diffusion model allows users to constrain trajectories through target waypoints, speed, and specified social groups while accounting for the surrounding environment context. This trajectory diffusion model is integrated with a novel physics-based humanoid controller to form a closed-loop, full-body pedestrian animation system capable of placing large crowds in a simulated environment with varying terrains. We further propose utilizing the value function learned during RL training of the animation controller to guide diffusion to produce trajectories better suited for particular scenarios such as collision avoidance and traversing uneven terrain.
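The value guidance described above can be sketched as a gradient nudge applied after each denoising step. The following is a minimal illustration, assuming a toy obstacle-avoidance objective; `value_grad`, `guided_step`, and the margin penalty are hypothetical stand-ins, not the paper's learned RL value function or controller:

```python
import numpy as np

def value_grad(traj, obstacle, margin=1.0):
    """Gradient of a toy value function that penalizes waypoints closer
    than `margin` to an obstacle (stand-in for the learned value function)."""
    d = traj - obstacle                              # (T, 2) displacements
    dist = np.linalg.norm(d, axis=1, keepdims=True)  # (T, 1) distances
    # push waypoints inside the margin radially away from the obstacle
    return np.where(dist < margin, d / (dist + 1e-8), 0.0)

def guided_step(traj, denoise_fn, scale=0.1):
    """One denoising step followed by a value-ascent nudge,
    in the spirit of guided trajectory diffusion."""
    traj = denoise_fn(traj)
    return traj + scale * value_grad(traj, obstacle=np.zeros(2))

# A waypoint near the obstacle is pushed away; distant waypoints are untouched.
traj = np.array([[0.1, 0.0], [5.0, 5.0]])
out = guided_step(traj, denoise_fn=lambda t: t)
```

In the actual system the analogous gradient comes from the value function trained alongside the physics-based controller, so the guidance reflects what the humanoid can physically execute.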

Progressive Transformation Learning for Leveraging Virtual Images in Training
Shen, Yi-Ting and Lee, Hyungtae and Kwon, Heesung and Bhattacharyya, Shuvra S.



Research question: How to effectively detect objects of interest, such as humans, in UAV-based images.
Motivation: This requires large-scale UAV datasets containing humans in various poses captured from widely varying viewing angles, which are laborious and costly to curate.
Method: We introduce Progressive Transformation Learning (PTL), which gradually augments the training dataset by adding transformed virtual images with enhanced realism.
Results: Experiments show that PTL yields a substantial performance increase over the baseline, especially in the small-data and cross-domain regimes.

To effectively interrogate UAV-based images for detecting objects of interest, such as humans, it is essential to acquire large-scale UAV-based datasets that include human instances with various poses captured from widely varying viewing angles. As a viable alternative to laborious and costly data curation, we introduce Progressive Transformation Learning (PTL), which gradually augments a training dataset by adding transformed virtual images with enhanced realism. Generally, a virtual2real transformation generator in the conditional GAN framework suffers from quality degradation when a large domain gap exists between real and virtual images. To deal with the domain gap, PTL takes a novel approach that progressively iterates the following three steps: 1) select a subset from a pool of virtual images according to the domain gap, 2) transform the selected virtual images to enhance realism, and 3) add the transformed virtual images to the training set while removing them from the pool. In PTL, accurately quantifying the domain gap is critical. To do that, we theoretically demonstrate that the feature representation space of a given object detector can be modeled as a multivariate Gaussian distribution from which the Mahalanobis distance between a virtual object and the Gaussian distribution of each object category in the representation space can be readily computed. Experiments show that PTL results in a substantial performance increase over the baseline, especially in the small data and the cross-domain regime.
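The domain-gap quantification above reduces to a Mahalanobis distance in the detector's feature space under the abstract's Gaussian assumption. A minimal sketch follows; the feature dimensionality, the random stand-in features, and the regularization constant are illustrative, not from the paper:

```python
import numpy as np

def fit_gaussian(features):
    """Model a category's feature representation as a multivariate
    Gaussian; returns the mean and a regularized inverse covariance."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
    return mu, cov_inv

def mahalanobis_sq(x, mu, cov_inv):
    """Squared Mahalanobis distance of a virtual-object feature to the
    category Gaussian; a larger value indicates a larger domain gap."""
    d = x - mu
    return float(d @ cov_inv @ d)

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 4))   # hypothetical real-image features
mu, cov_inv = fit_gaussian(feats)
```

PTL would rank the pool of virtual images by this distance and select the closest subset for transformation first, so each iteration transforms the images with the smallest domain gap.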

Localized Semantic Feature Mixers for Efficient Pedestrian Detection in Autonomous Driving
Khan, Abdul Hannan and Nawaz, Mohammed Shariq and Dengel, Andreas



Research question: How to improve the efficiency and accuracy of pedestrian detection in autonomous driving systems.
Motivation: Current pedestrian detectors have long inference times and perform poorly on small and heavily occluded pedestrians.
Method: We propose Localized Semantic Feature Mixers (LSFM), a novel anchor-free pedestrian detection architecture that uses a Super Pixel Pyramid Pooling module for feature encoding and an MLPMixer-based Dense Focal Detection Network as a light detection head.
Results: LSFM achieves state-of-the-art performance on the Caltech, City Persons, Euro City Persons, and TJU-Traffic-Pedestrian datasets while reducing inference time by 55% on average. Moreover, LSFM beats the human baseline for the first time in pedestrian detection, and cross-dataset evaluation shows that it generalizes well to unseen data.

Autonomous driving systems rely heavily on the underlying perception module which needs to be both performant and efficient to allow precise decisions in real-time. Avoiding collisions with pedestrians is of topmost priority in any autonomous driving system. Therefore, pedestrian detection is one of the core parts of such systems' perception modules. Current state-of-the-art pedestrian detectors have two major issues. Firstly, they have long inference times which affect the efficiency of the whole perception module, and secondly, their performance in the case of small and heavily occluded pedestrians is poor. We propose Localized Semantic Feature Mixers (LSFM), a novel, anchor-free pedestrian detection architecture. It uses our novel Super Pixel Pyramid Pooling module instead of the, computationally costly, Feature Pyramid Networks for feature encoding. Moreover, our MLPMixer-based Dense Focal Detection Network is used as a light detection head, reducing computational effort and inference time compared to existing approaches. To boost the performance of the proposed architecture, we adapt and use mixup augmentation which improves the performance, especially in small and heavily occluded cases. We benchmark LSFM against the state-of-the-art on well-established traffic scene pedestrian datasets. The proposed LSFM achieves state-of-the-art performance in Caltech, City Persons, Euro City Persons, and TJU-Traffic-Pedestrian datasets while reducing the inference time on average by 55%. Further, LSFM beats the human baseline for the first time in the history of pedestrian detection. Finally, we conducted a cross-dataset evaluation which proved that our proposed LSFM generalizes well to unseen data.

Coaching a Teachable Student
Zhang, Jimuyang and Huang, Zanming and Ohn-Bar, Eshed



Research question: How to effectively teach a sensorimotor student agent to drive under the supervision of a privileged teacher agent.
Motivation: Current distillation methods for sensorimotor agents tend to produce suboptimal learned driving behavior in the student, which we hypothesize is due to inherent differences between the inputs, modeling capacity, and optimization processes of the two agents.
Method: We develop a novel distillation scheme that addresses these limitations and closes the gap between the sensorimotor agent and its privileged teacher. Our key insight is to design a student that learns to align its input features with the teacher's privileged Bird's Eye View (BEV) space, so the student can benefit from the teacher's direct supervision over internal representation learning. To scaffold the difficult sensorimotor learning task, the student model is optimized via a student-paced coaching mechanism with various auxiliary supervision. We also propose a high-capacity imitation-learned privileged agent that surpasses prior privileged agents in CARLA, ensuring the student learns safe driving behavior.
Results: The proposed sensorimotor agent yields a robust image-based behavior cloning agent in CARLA, improving over current models by 20.6% in driving score without requiring LiDAR, historical observations, model ensembles, on-policy data aggregation, or reinforcement learning.

We propose a novel knowledge distillation framework for effectively teaching a sensorimotor student agent to drive from the supervision of a privileged teacher agent. Current distillation for sensorimotor agents methods tend to result in suboptimal learned driving behavior by the student, which we hypothesize is due to inherent differences between the input, modeling capacity, and optimization processes of the two agents. We develop a novel distillation scheme that can address these limitations and close the gap between the sensorimotor agent and its privileged teacher. Our key insight is to design a student which learns to align their input features with the teacher's privileged Bird's Eye View (BEV) space. The student then can benefit from direct supervision by the teacher over the internal representation learning. To scaffold the difficult sensorimotor learning task, the student model is optimized via a student-paced coaching mechanism with various auxiliary supervision. We further propose a high-capacity imitation learned privileged agent that surpasses prior privileged agents in CARLA and ensures the student learns safe driving behavior. Our proposed sensorimotor agent results in a robust image-based behavior cloning agent in CARLA, improving over current models by over 20.6% in driving score without requiring LiDAR, historical observations, ensemble of models, on-policy data aggregation or reinforcement learning.

Collaboration Helps Camera Overtake LiDAR in 3D Detection
Hu, Yue and Lu, Yifan and Xu, Runsheng and Xie, Weidi and Chen, Siheng and Wang, Yanfeng



Research question: How to improve precise depth estimation without direct 3D measurements in the input, beyond what network-design changes alone can achieve.
Motivation: Camera-only 3D detection offers an economical alternative to LiDAR-based detection systems for localizing objects in 3D space.
Method: We propose collaborative camera-only 3D detection (CoCa3D), a multi-agent framework in which agents share complementary information through communication, with communication efficiency optimized by selecting the most informative cues.
Results: Evaluated on one real-world dataset and two new simulation datasets, CoCa3D improves AP@70 by 44.21% on DAIR-V2X, 30.60% on OPV2V+, and 12.59% on CoPerception-UAVs+. Preliminary results suggest that with sufficient collaboration, cameras might overtake LiDAR in some practical scenarios.

Camera-only 3D detection provides an economical solution with a simple configuration for localizing objects in 3D space compared to LiDAR-based detection systems. However, a major challenge lies in precise depth estimation due to the lack of direct 3D measurements in the input. Many previous methods attempt to improve depth estimation through network designs, e.g., deformable layers and larger receptive fields. This work proposes an orthogonal direction, improving the camera-only 3D detection by introducing multi-agent collaborations. Our proposed collaborative camera-only 3D detection (CoCa3D) enables agents to share complementary information with each other through communication. Meanwhile, we optimize communication efficiency by selecting the most informative cues. The shared messages from multiple viewpoints disambiguate the single-agent estimated depth and complement the occluded and long-range regions in the single-agent view. We evaluate CoCa3D in one real-world dataset and two new simulation datasets. Results show that CoCa3D improves previous SOTA performances by 44.21% on DAIR-V2X, 30.60% on OPV2V+, 12.59% on CoPerception-UAVs+ for AP@70. Our preliminary results show a potential that with sufficient collaboration, the camera might overtake LiDAR in some practical scenarios. We released the dataset and code at https://siheng-chen.github.io/dataset/CoPerception+ and https://github.com/MediaBrain-SJTU/CoCa3D.

RealImpact: A Dataset of Impact Sound Fields for Real Objects
Clarke, Samuel and Gao, Ruohan and Wang, Mason and Rau, Mark and Xu, Julia and Wang, Jui-Hsien and James, Doug L. and Wu, Jiajun



Research question: There is currently no standard dataset of impact sound fields of real objects for audio-visual learning and for calibrating the sim-to-real gap.
Motivation: We aim to fill this gap with a large-scale dataset of real-object impact sounds that helps improve audio-visual learning and calibrate simulation methods against reality.
Method: We present RealImpact, a large-scale dataset of 150,000 impact-sound recordings of everyday objects captured under controlled conditions, with detailed annotations including impact locations, microphone locations, contact force profiles, material labels, and RGBD images.
Results: Preliminary experiments show that the dataset can serve as a reference for evaluating how well current simulation methods estimate real-world object impact sounds. We further demonstrate its usefulness for acoustic and audio-visual learning via two benchmark tasks: listener location classification and visual acoustic matching.

Objects make unique sounds under different perturbations, environment conditions, and poses relative to the listener. While prior works have modeled impact sounds and sound propagation in simulation, we lack a standard dataset of impact sound fields of real objects for audio-visual learning and calibration of the sim-to-real gap. We present RealImpact, a large-scale dataset of real object impact sounds recorded under controlled conditions. RealImpact contains 150,000 recordings of impact sounds of 50 everyday objects with detailed annotations, including their impact locations, microphone locations, contact force profiles, material labels, and RGBD images. We make preliminary attempts to use our dataset as a reference to current simulation methods for estimating object impact sounds that match the real world. Moreover, we demonstrate the usefulness of our dataset as a testbed for acoustic and audio-visual learning via the evaluation of two benchmark tasks, including listener location classification and visual acoustic matching.

Affection: Learning Affective Explanations for Real-World Visual Data
Achlioptas, Panos and Ovsjanikov, Maks and Guibas, Leonidas and Tulyakov, Sergey



Research question: Explore the space of emotional reactions induced by real-world images.
Motivation: Using a large-scale dataset, analyze categorical emotional reactions and free-form textual explanations to understand how and why people feel the way they do about particular images.
Method: Develop neural networks that provide plausible affective responses, explained with language, to real-world visual data.
Results: Pave the way for more human-centric and emotionally aware image analysis systems; the code and dataset are publicly available.

In this work, we explore the space of emotional reactions induced by real-world images. For this, we first introduce a large-scale dataset that contains both categorical emotional reactions and free-form textual explanations for 85,007 publicly available images, analyzed by 6,283 annotators who were asked to indicate and explain how and why they felt when observing a particular image, with a total of 526,749 responses. Although emotional reactions are subjective and sensitive to context (personal mood, social status, past experiences) -- we show that there is significant common ground to capture emotional responses with a large support in the subject population. In light of this observation, we ask the following questions: i) Can we develop neural networks that provide plausible affective responses to real-world visual data explained with language? ii) Can we steer such methods towards producing explanations with varying degrees of pragmatic language, justifying different emotional reactions by grounding them in the visual stimulus? Finally, iii) How to evaluate the performance of such methods for this novel task? In this work, we take the first steps in addressing all of these questions, paving the way for more human-centric and emotionally-aware image analysis systems. Our code and data are publicly available at https://affective-explanations.org.

PIRLNav: Pretraining With Imitation and RL Finetuning for ObjectNav
Ramrakhya, Ram and Batra, Dhruv and Wijmans, Erik and Das, Abhishek



Research question: How to enable a virtual robot to navigate to a target object in a new environment.
Motivation: Although imitation learning (IL) using behavior cloning (BC) on human demonstration datasets achieves promising results, it generalizes poorly and collecting demonstrations is expensive.
Method: We present PIRLNav, a two-stage learning scheme: BC pretraining on human demonstrations followed by RL finetuning.
Results: This BC->RL recipe achieves a 65.0% success rate on ObjectNav (+5.0% absolute over the previous state of the art). Through rigorous empirical analysis, we find that BC->RL on human demonstrations outperforms BC->RL on automatically generated demonstration sources such as shortest paths or task-agnostic frontier-exploration trajectories; that the gains from RL finetuning diminish as the BC pretraining dataset grows; and, finally, we analyze the failure modes of ObjectNav policies and present guidelines for further improving them.

We study ObjectGoal Navigation -- where a virtual robot situated in a new environment is asked to navigate to an object. Prior work has shown that imitation learning (IL) using behavior cloning (BC) on a dataset of human demonstrations achieves promising results. However, this has limitations -- 1) BC policies generalize poorly to new states, since the training mimics actions not their consequences, and 2) collecting demonstrations is expensive. On the other hand, reinforcement learning (RL) is trivially scalable, but requires careful reward engineering to achieve desirable behavior. We present PIRLNav, a two-stage learning scheme for BC pretraining on human demonstrations followed by RL-finetuning. This leads to a policy that achieves a success rate of 65.0% on ObjectNav (+5.0% absolute over previous state-of-the-art). Using this BC->RL training recipe, we present a rigorous empirical analysis of design choices. First, we investigate whether human demonstrations can be replaced with 'free' (automatically generated) sources of demonstrations, e.g. shortest paths (SP) or task-agnostic frontier exploration (FE) trajectories. We find that BC->RL on human demonstrations outperforms BC->RL on SP and FE trajectories, even when controlled for the same BC-pretraining success on train, and even on a subset of val episodes where BC-pretraining success favors the SP or FE policies. Next, we study how RL-finetuning performance scales with the size of the BC pretraining dataset. We find that as we increase the size of the BC-pretraining dataset and get to high BC accuracies, the improvements from RL-finetuning are smaller, and that 90% of the performance of our best BC->RL policy can be achieved with less than half the number of BC demonstrations. Finally, we analyze failure modes of our ObjectNav policies, and present guidelines for further improving them.

EXCALIBUR: Encouraging and Evaluating Embodied Exploration
Zhu, Hao and Kapoor, Raghav and Min, So Yeon and Han, Winson and Li, Jiatai and Geng, Kaiwen and Neubig, Graham and Bisk, Yonatan and Kembhavi, Aniruddha and Weihs, Luca



Research question: This paper presents EXCALIBUR, a benchmark that encourages agents to explore their environment over long durations and then queries their understanding of the physical world.
Motivation: Current machine-learning models either learn passively from static, fixed datasets or are taught to complete specific goal-conditioned tasks. EXCALIBUR is designed to encourage the development of exploratory interactive agents.
Method: EXCALIBUR allows agents to explore their environment for long durations and then query their understanding of the physical world through questions such as "is the small heavy red bowl made from glass?" or "is there a silver spoon heavier than the egg?". Once agents have answered a series of questions, they can re-enter the scene to refine their knowledge, update their beliefs, and improve their performance on the questions.
Results: Experiments demonstrate the challenges this benchmark poses for present-day state-of-the-art embodied systems and the headroom available for developing new methods. We also present a virtual reality interface that lets humans seamlessly interact within the simulated world, and we use it to gather human performance measures.

Experience precedes understanding. Humans constantly explore and learn about their environment out of curiosity, gather information, and update their models of the world. On the other hand, machines are either trained to learn passively from static and fixed datasets, or taught to complete specific goal-conditioned tasks. To encourage the development of exploratory interactive agents, we present the EXCALIBUR benchmark. EXCALIBUR allows agents to explore their environment for long durations and then query their understanding of the physical world via inquiries like: "is the small heavy red bowl made from glass?" or "is there a silver spoon heavier than the egg?". This design encourages agents to perform free-form home exploration without myopia induced by goal conditioning. Once the agents have answered a series of questions, they can re-enter the scene to refine their knowledge, update their beliefs, and improve their performance on the questions. Our experiments demonstrate the challenges posed by this dataset for the present-day state-of-the-art embodied systems and the headroom afforded to develop new innovative methods. Finally, we present a virtual reality interface that enables humans to seamlessly interact within the simulated world and use it to gather human performance measures. EXCALIBUR affords unique challenges in comparison to present-day benchmarks and represents the next frontier for embodied AI research.

A Bag-of-Prototypes Representation for Dataset-Level Applications
Tu, Weijie and Deng, Weijian and Gedeon, Tom and Zheng, Liang



Research question: This work addresses two dataset-level tasks: assessing training-set suitability and test-set difficulty.
Motivation: Both tasks hinge on measuring the underlying relationship between datasets, which requires a dataset vectorization scheme that preserves as much discriminative dataset information as possible, so that distances between the resulting dataset vectors reflect dataset-to-dataset similarity.
Method: We propose a bag-of-prototypes (BoP) dataset representation: a codebook of K prototypes is clustered from a reference dataset, each image feature of the dataset to be encoded is quantized to its nearest prototype, and the resulting K-dimensional histogram characterizes the dataset's semantic distribution without requiring labels. BoP pairs well with Jensen-Shannon divergence for measuring dataset-to-dataset similarity.
Results: Despite its simplicity, BoP consistently outperforms existing representations on a series of benchmarks for the two dataset-level tasks.

This work investigates dataset vectorization for two dataset-level tasks: assessing training set suitability and test set difficulty. The former measures how suitable a training set is for a target domain, while the latter studies how challenging a test set is for a learned model. Central to the two tasks is measuring the underlying relationship between datasets. This needs a desirable dataset vectorization scheme, which should preserve as much discriminative dataset information as possible so that the distance between the resulting dataset vectors can reflect dataset-to-dataset similarity. To this end, we propose a bag-of-prototypes (BoP) dataset representation that extends the image-level bag consisting of patch descriptors to a dataset-level bag consisting of semantic prototypes. Specifically, we develop a codebook consisting of K prototypes clustered from a reference dataset. Given a dataset to be encoded, we quantize each of its image features to a certain prototype in the codebook and obtain a K-dimensional histogram feature. Without assuming access to dataset labels, the BoP representation provides rich characterization of dataset semantic distribution. Further, the BoP representation cooperates well with Jensen-Shannon divergence for measuring dataset-to-dataset similarity. Albeit very simple, BoP consistently shows its advantage over existing representations on a series of benchmarks for two dataset-level tasks.
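The encoding pipeline in the abstract (codebook of K prototypes, nearest-prototype quantization, K-dimensional histogram, Jensen-Shannon comparison) can be sketched as follows. For brevity the codebook is subsampled rather than clustered with K-means as in the paper, and all features are random stand-ins for real image embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
ref = rng.normal(size=(500, 8))                          # reference features
codebook = ref[rng.choice(500, size=16, replace=False)]  # K=16 prototypes

def bop_encode(features, codebook):
    """Quantize each feature to its nearest prototype and return the
    normalized K-dimensional histogram (the BoP vector)."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    hist = np.bincount(idx, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two BoP histograms."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_a = bop_encode(rng.normal(size=(300, 8)), codebook)        # dataset A
p_b = bop_encode(rng.normal(size=(300, 8)) + 2.0, codebook)  # shifted B
```

A shifted dataset concentrates on different prototypes, so its BoP histogram diverges from the in-distribution one, which is exactly the signal the two dataset-level tasks rely on.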

Leverage Interactive Affinity for Affordance Learning
Luo, Hongchen and Zhai, Wei and Zhang, Jing and Cao, Yang and Tao, Dacheng



Research question: How to perceive potential "action possibilities" (i.e., affordance) regions in images and learn objects' interactive functionalities, which is challenging due to the diversity of human-object interactions.
Motivation: Prevailing affordance learning methods often adopt a label-assignment paradigm and presume a unique relationship between functional regions and affordance labels, performing poorly when adapting to unseen environments with large appearance variations.
Method: We propose to leverage interactive affinity for affordance learning, i.e., extracting interactive affinity from human-object interactions and transferring it to non-interactive objects. Interactive affinity, which represents the contacts between different parts of the human body and local regions of the target object, provides inherent cues of the interconnectivity between humans and objects, thereby reducing the ambiguity of perceived action possibilities. Specifically, we propose a pose-aided interactive affinity learning framework that exploits human pose to guide the network to learn interactive affinity from human-object interactions. In particular, a keypoint heuristic perception (KHP) scheme is devised to exploit the keypoint associations of the human pose to alleviate uncertainties caused by interaction diversity and contact occlusion. In addition, we construct a contact-driven affordance learning (CAL) dataset by collecting and labeling over 5,000 images.
Results: Experimental results show that our method outperforms representative models in both objective metrics and visual quality.

Perceiving potential "action possibilities" (i.e., affordance) regions of images and learning interactive functionalities of objects from human demonstration is a challenging task due to the diversity of human-object interactions. Prevailing affordance learning algorithms often adopt the label assignment paradigm and presume that there is a unique relationship between functional region and affordance label, yielding poor performance when adapting to unseen environments with large appearance variations. In this paper, we propose to leverage interactive affinity for affordance learning, i.e., extracting interactive affinity from human-object interaction and transferring it to non-interactive objects. Interactive affinity, which represents the contacts between different parts of the human body and local regions of the target object, can provide inherent cues of interconnectivity between humans and objects, thereby reducing the ambiguity of the perceived action possibilities. Specifically, we propose a pose-aided interactive affinity learning framework that exploits human pose to guide the network to learn the interactive affinity from human-object interactions. Particularly, a keypoint heuristic perception (KHP) scheme is devised to exploit the keypoint association of human pose to alleviate the uncertainties due to interaction diversities and contact occlusions. Besides, a contact-driven affordance learning (CAL) dataset is constructed by collecting and labeling over 5,000 images. Experimental results demonstrate that our method outperforms the representative models regarding objective metrics and visual quality. Code and dataset: github.com/lhc1224/PIAL-Net.

Objaverse: A Universe of Annotated 3D Objects
Deitke, Matt and Schwenk, Dustin and Salvador, Jordi and Weihs, Luca and Michel, Oscar and VanderBilt, Eli and Schmidt, Ludwig and Ehsani, Kiana and Kembhavi, Aniruddha and Farhadi, Ali



Research question: Large-scale datasets such as WebText, Wikipedia, and LAION have propelled recent progress in AI, but 3D data remains a notable omission: existing datasets of high-fidelity 3D models are mid-sized with limited diversity of object categories.
Motivation: Despite considerable interest and potential applications in 3D vision, this gap limits research and new applications across the field of AI.
Method: We present Objaverse 1.0, a large dataset of 800K+ (and growing) 3D models with descriptive captions, tags, and animations, improving on present-day 3D repositories in scale, number of categories, and the visual diversity of instances within a category.
Results: We demonstrate Objaverse's potential via four diverse applications: training generative 3D models, improving tail-category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models. Objaverse opens new directions for research and enables new applications across the field of AI.

Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and LAION have propelled recent dramatic progress in AI. Large neural models trained on such datasets produce impressive results and top many of today's benchmarks. A notable omission within this family of large-scale datasets is 3D data. Despite considerable interest and potential applications in 3D vision, datasets of high-fidelity 3D models continue to be mid-sized with limited diversity of object categories. Addressing this gap, we present Objaverse 1.0, a large dataset of objects with 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse improves upon present day 3D repositories in terms of scale, number of categories, and in the visual diversity of instances within a category. We demonstrate the large potential of Objaverse via four diverse applications: training generative 3D models, improving tail category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models. Objaverse can open new directions for research and enable new applications across the field of AI.

BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision
Yang, Chenyu and Chen, Yuntao and Tian, Hao and Tao, Chenxin and Zhu, Xizhou and Zhang, Zhaoxiang and Huang, Gao and Li, Hongyang and Qiao, Yu and Lu, Lewei and Zhou, Jie and Dai, Jifeng



Research question: Existing bird's-eye-view (BEV) detectors are often tied to certain depth-pretrained backbones such as VoVNet, hindering the synergy between booming image backbones and BEV detectors.
Motivation: To address this limitation, we introduce perspective-space supervision to ease the optimization of BEV detectors.
Method: We propose a two-stage BEV detector in which proposals from the perspective head are fed into the bird's-eye-view head for final predictions.
Results: Through extensive ablation studies, particularly on the form of supervision and the generality of the proposed detector, we verify the method on a wide spectrum of traditional and modern image backbones and achieve new state-of-the-art results on the large-scale nuScenes dataset.

We present a novel bird's-eye-view (BEV) detector with perspective supervision, which converges faster and better suits modern image backbones. Existing state-of-the-art BEV detectors are often tied to certain depth pre-trained backbones like VoVNet, hindering the synergy between booming image backbones and BEV detectors. To address this limitation, we prioritize easing the optimization of BEV detectors by introducing perspective space supervision. To this end, we propose a two-stage BEV detector, where proposals from the perspective head are fed into the bird's-eye-view head for final predictions. To evaluate the effectiveness of our model, we conduct extensive ablation studies focusing on the form of supervision and the generality of the proposed detector. The proposed method is verified with a wide spectrum of traditional and modern image backbones and achieves new SoTA results on the large-scale nuScenes dataset. The code shall be released soon.

SlowLiDAR: Increasing the Latency of LiDAR-Based Detection Using Adversarial Examples
Liu, Han and Wu, Yuhao and Yu, Zhiyuan and Vorobeychik, Yevgeniy and Zhang, Ning



Research question: The availability (latency) of existing LiDAR perception systems under adversarial perturbations.
Motivation: Most prior work has focused on the impact of adversarial perturbations on predictions (integrity), yet latency (availability) is a critical concern for real-time cyber-physical systems.
Method: We propose the SlowLiDAR attack, which overcomes the technical challenges posed by the non-differentiable parts of LiDAR detection pipelines using differentiable proxies and a novel loss function.
Results: Experimental results show that SlowLiDAR can significantly increase the latency of the six most popular LiDAR detection pipelines while remaining imperceptible.

LiDAR-based perception is a central component of autonomous driving, playing a key role in tasks such as vehicle localization and obstacle detection. Since the safety of LiDAR-based perceptual pipelines is critical to safe autonomous driving, a number of past efforts have investigated its vulnerability under adversarial perturbations of raw point cloud inputs. However, most such efforts have focused on investigating the impact of such perturbations on predictions (integrity), and little has been done to understand the impact on latency (availability), a critical concern for real-time cyber-physical systems. We present the first systematic investigation of the availability of LiDAR detection pipelines, and SlowLiDAR, an adversarial perturbation attack that maximizes LiDAR detection runtime. The attack overcomes the technical challenges posed by the non-differentiable parts of the LiDAR detection pipelines by using differentiable proxies and uses a novel loss function that effectively captures the impact of adversarial perturbations on the execution time of the pipeline. Extensive experimental results show that SlowLiDAR can significantly increase the latency of the six most popular LiDAR detection pipelines while maintaining imperceptibility.

AeDet: Azimuth-Invariant Multi-View 3D Object Detection
Feng, Chengjian and Jie, Zequn and Zhong, Yujie and Chu, Xiangxiang and Ma, Lin



Research question: How to improve existing LSS-based multi-view 3D object detection so that it better handles BEV features and eases detector optimization.
Motivation: Typical convolution ignores the radial symmetry of BEV features, increasing the difficulty of detector optimization.
Method: We propose an azimuth-equivariant convolution (AeConv) and an azimuth-equivariant anchor to preserve the inherent properties of BEV features and simplify optimization, and introduce a camera-decoupled virtual depth to unify depth prediction across images with different camera intrinsics.
Results: Extensive experiments on nuScenes show that the resulting detector (AeDet) achieves 62.0% NDS, surpassing recent multi-view 3D object detectors such as PETRv2 and BEVDepth by a large margin.

Recent LSS-based multi-view 3D object detection has made tremendous progress, by processing the features in Bird-Eye-View (BEV) via the convolutional detector. However, the typical convolution ignores the radial symmetry of the BEV features and increases the difficulty of the detector optimization. To preserve the inherent property of the BEV features and ease the optimization, we propose an azimuth-equivariant convolution (AeConv) and an azimuth-equivariant anchor. The sampling grid of AeConv is always in the radial direction, thus it can learn azimuth-invariant BEV features. The proposed anchor enables the detection head to learn predicting azimuth-irrelevant targets. In addition, we introduce a camera-decoupled virtual depth to unify the depth prediction for the images with different camera intrinsic parameters. The resultant detector is dubbed Azimuth-equivariant Detector (AeDet). Extensive experiments are conducted on nuScenes, and AeDet achieves a 62.0% NDS, surpassing the recent multi-view 3D object detectors such as PETRv2 and BEVDepth by a large margin.
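The core idea behind AeConv's radial sampling grid can be illustrated by rotating a standard 3x3 offset grid by each BEV cell's azimuth, so that one grid axis always points radially outward from the ego origin. This toy function is only a geometric sketch of that idea; the actual operator samples feature maps (e.g., via deformable-style sampling), which this snippet does not attempt:

```python
import math

def azimuth_rotated_offsets(cx, cy):
    """Rotate the standard 3x3 sampling offsets by the azimuth of BEV
    cell (cx, cy), keeping the grid's x-axis aligned with the radial
    direction from the ego origin (a sketch of AeConv's sampling grid)."""
    az = math.atan2(cy, cx)
    cos_a, sin_a = math.cos(az), math.sin(az)
    return [(dx * cos_a - dy * sin_a, dx * sin_a + dy * cos_a)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

# For a cell on the +x axis (azimuth 0) the grid is the ordinary 3x3 grid;
# for a cell on the +y axis it is the same grid rotated by 90 degrees.
grid_x = azimuth_rotated_offsets(5.0, 0.0)
grid_y = azimuth_rotated_offsets(0.0, 5.0)
```

Because every cell sees the same grid relative to its own radial direction, a feature pattern rotated about the ego vehicle produces the same local responses, which is the azimuth invariance the abstract describes.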

GFIE: A Dataset and Baseline for Gaze-Following From 2D to 3D in Indoor Environments
Hu, Zhengxi and Yang, Yuxue and Zhai, Xiaolin and Yang, Dingye and Zhou, Bohan and Liu, Jingtai



Research question: How to automatically and accurately locate where a person is looking, in order to understand human intention.
Motivation: Existing gaze-following datasets have flaws in how gaze labels are collected: manual labeling may introduce subjective bias and is labor-intensive, while automatic labeling with an eye-tracking device alters the person's appearance.
Method: We develop a novel gaze data collection system consisting of an Azure Kinect and a laser rangefinder, which generates a laser spot to steer the subject's attention. We also develop an algorithm that locates the laser spot in images to annotate 2D/3D gaze targets and removes the spot introduced into the ground-truth frames. This procedure lets us obtain unbiased labels semi-automatically in unconstrained environments.
Results: On the resulting GFIE dataset, we propose a baseline method with stereo field-of-view (FoV) perception, establishing a 2D/3D gaze-following benchmark.

Gaze-following is a kind of research that requires locating where the person in the scene is looking automatically under the topic of gaze estimation. It is an important clue for understanding human intention, such as identifying objects or regions of interest to humans. However, a survey of datasets used for gaze-following tasks reveals defects in the way they collect gaze point labels. Manual labeling may introduce subjective bias and is labor-intensive, while automatic labeling with an eye-tracking device would alter the person's appearance. In this work, we introduce GFIE, a novel dataset recorded by a gaze data collection system we developed. The system is constructed with two devices, an Azure Kinect and a laser rangefinder, which generate the laser spot to steer the subject's attention as they perform in front of the camera. And an algorithm is developed to locate laser spots in images for annotating 2D/3D gaze targets and removing ground truth introduced by the spots. The whole procedure of collecting gaze behavior allows us to obtain unbiased labels in unconstrained environments semi-automatically. We also propose a baseline method with stereo field-of-view (FoV) perception for establishing a 2D/3D gaze-following benchmark on the GFIE dataset. Project page: https://sites.google.com/view/gfie.

Iterative Vision-and-Language Navigation
Krantz, Jacob and Banerjee, Shurjo and Zhu, Wang and Corso, Jason and Anderson, Peter and Lee, Stefan and Thomason, Jesse



Research question: This paper proposes a new paradigm for evaluating language-guided agents that navigate iteratively in a persistent environment.
Motivation: Existing Vision-and-Language Navigation (VLN) benchmarks erase the agent's memory at the beginning of every episode, testing cold-start navigation with no prior information, whereas deployed robots occupy the same environment for long periods of time.
Method: We propose the Iterative Vision-and-Language Navigation (IVLN) paradigm, which trains and evaluates VLN agents that maintain memory across tours of scenes consisting of up to 100 ordered language instructions and target paths.
Results: We find that extending the implicit memory of high-performing transformer VLN agents is not sufficient for IVLN, but agents that build maps can benefit from environment persistence, motivating a renewed focus on map-building agents in VLN.

We present Iterative Vision-and-Language Navigation (IVLN), a paradigm for evaluating language-guided agents navigating in a persistent environment over time. Existing Vision-and-Language Navigation (VLN) benchmarks erase the agent's memory at the beginning of every episode, testing the ability to perform cold-start navigation with no prior information. However, deployed robots occupy the same environment for long periods of time. The IVLN paradigm addresses this disparity by training and evaluating VLN agents that maintain memory across tours of scenes that consist of up to 100 ordered instruction-following Room-to-Room (R2R) episodes, each defined by an individual language instruction and a target path. We present discrete and continuous Iterative Room-to-Room (IR2R) benchmarks comprising about 400 tours each in 80 indoor scenes. We find that extending the implicit memory of high-performing transformer VLN agents is not sufficient for IVLN, but agents that build maps can benefit from environment persistence, motivating a renewed focus on map-building agents in VLN.

MaLP: Manipulation Localization Using a Proactive Scheme
Asnani, Vishal and Yin, Xi and Hassner, Tal and Liu, Xiaoming



Research question: How to effectively detect and localize manipulation in images.
Motivation: Existing passive manipulation-localization methods generalize poorly to unseen generative models and manipulated attributes.
Method: A proactive manipulation-localization scheme, MaLP, is proposed. Real images are encrypted by adding a learned template; if an image is manipulated by any generative model, this protection not only aids binary detection but also helps identify the pixels modified by the model. The template is learned by leveraging local- and global-level features estimated by a two-branch architecture.
Results: Experiments show that MaLP outperforms prior passive methods. Testing on 22 different generative models demonstrates its generalizability and provides a benchmark for future research on manipulation localization. Finally, MaLP can serve as a discriminator for improving the generation quality of generative models.

Advancements in the generation quality of various Generative Models (GMs) has made it necessary to not only perform binary manipulation detection but also localize the modified pixels in an image. However, prior works termed as passive for manipulation localization exhibit poor generalization performance over unseen GMs and attribute modifications. To combat this issue, we propose a proactive scheme for manipulation localization, termed MaLP. We encrypt the real images by adding a learned template. If the image is manipulated by any GM, this added protection from the template not only aids binary detection but also helps in identifying the pixels modified by the GM. The template is learned by leveraging local and global-level features estimated by a two-branch architecture. We show that MaLP performs better than prior passive works. We also show the generalizability of MaLP by testing on 22 different GMs, providing a benchmark for future research on manipulation localization. Finally, we show that MaLP can be used as a discriminator for improving the generation quality of GMs. Our models/codes are available at www.github.com/vishal3477/pro_loc.

Phase-Shifting Coder: Predicting Accurate Orientation in Oriented Object Detection
Yu, Yi and Da, Feipeng



Research question: How to accurately predict object orientation and resolve the periodic ambiguity problems caused by rotational symmetry.
Motivation: With the development of computer vision, oriented object detection has gradually gained importance; however, rotational symmetry introduces various periodic ambiguity problems.
Method: A novel differentiable angle coder, the phase-shifting coder (PSC), is proposed, along with its dual-frequency version (PSCD). By mapping the rotational periodicity of different cycles into the phase of different frequencies, it provides a unified framework for the periodic ambiguity problems caused by rotational symmetry in oriented object detection.
Results: Visual analysis and experiments on three datasets demonstrate the effectiveness and potential of the method. In scenarios requiring high-quality bounding boxes, the proposed method is expected to deliver competitive performance.

With the vigorous development of computer vision, oriented object detection has gradually been featured. In this paper, a novel differentiable angle coder named phase-shifting coder (PSC) is proposed to accurately predict the orientation of objects, along with a dual-frequency version (PSCD). By mapping the rotational periodicity of different cycles into the phase of different frequencies, we provide a unified framework for various periodic fuzzy problems caused by rotational symmetry in oriented object detection. Upon such a framework, common problems in oriented object detection such as boundary discontinuity and square-like problems are elegantly solved in a unified form. Visual analysis and experiments on three datasets prove the effectiveness and the potentiality of our approach. When facing scenarios requiring high-quality bounding boxes, the proposed methods are expected to give a competitive performance. The codes are publicly available at https://github.com/open-mmlab/mmrotate.
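The underlying phase-shifting idea can be sketched in a few lines: an angle with period 2π is encoded as N phase-shifted cosine samples, and decoded by correlating the samples against the reference phases. The sketch below is our own minimal illustration of that principle (function names are ours, not from the released code); it ignores the dual-frequency variant and the integration into a detector.

```python
import math

def psc_encode(theta, n_steps=3):
    """Encode an angle into N phase-shifted cosine values (one per step)."""
    return [math.cos(theta + 2 * math.pi * k / n_steps) for k in range(n_steps)]

def psc_decode(values):
    """Recover the angle from phase-shifted samples via complex correlation
    with the reference phases (exact for noise-free samples when N >= 3)."""
    n = len(values)
    re = sum(v * math.cos(2 * math.pi * k / n) for k, v in enumerate(values))
    im = sum(-v * math.sin(2 * math.pi * k / n) for k, v in enumerate(values))
    return math.atan2(im, re)
```

Because the encoding is built from smooth cosines, it is differentiable in θ, which is the property the paper exploits; a second frequency (as in PSCD) would be needed to disambiguate orientations that differ by half a period.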

Look, Radiate, and Learn: Self-Supervised Localisation via Radio-Visual Correspondence
Alloulah, Mohammed and Arnold, Maximilian



Research question: How to realize radio sensing in next-generation cellular networks and extend sensing coverage worldwide outdoors.
Motivation: Deep learning has revolutionized computer vision, but its application to radio sensing tasks has been limited, largely due to the lack of systematic datasets and benchmarks dedicated to studying the performance and promise of radio sensing.
Method: We present MaxRay, a synthetic radio-visual dataset and benchmark for precise target localization in radio. We further propose an unsupervised approach that localizes targets by extracting self-coordinates from radio-visual correspondence.
Results: Our results show that accurate radio target localization can be learned automatically from paired radio-visual data without labels, which is important for empirical data. This opens the door to vast data scalability and may prove key to realizing a unified communication-perception cellular infrastructure.

Next generation cellular networks will implement radio sensing functions alongside customary communications, thereby enabling unprecedented worldwide sensing coverage outdoors. Deep learning has revolutionised computer vision but has had limited application to radio perception tasks, in part due to lack of systematic datasets and benchmarks dedicated to the study of the performance and promise of radio sensing. To address this gap, we present MaxRay: a synthetic radio-visual dataset and benchmark that facilitate precise target localisation in radio. We further propose to learn to localise targets in radio without supervision by extracting self-coordinates from radio-visual correspondence. We use such self-supervised coordinates to train a radio localiser network. We characterise our performance against a number of state-of-the-art baselines. Our results indicate that accurate radio target localisation can be automatically learned from paired radio-visual data without labels, which is important for empirical data. This opens the door for vast data scalability and may prove key to realising the promise of robust radio sensing atop a unified communication-perception cellular infrastructure. Dataset will be hosted on IEEE DataPort.

Indiscernible Object Counting in Underwater Scenes
Sun, Guolei and An, Zhaochong and Liu, Yun and Liu, Ce and Sakaridis, Christos and Fan, Deng-Ping and Van Gool, Luc



Research question: This paper addresses object counting in indiscernible scenes, i.e., indiscernible object counting (IOC).
Motivation: Due to the lack of appropriate IOC datasets, we present the large-scale IOCfish5K dataset to advance research in this area.
Method: We build IOCfish5K, a large-scale dataset containing 5,637 high-resolution images and 659,024 annotated center points. We also design IOCFormer, a new strong baseline that combines density and regression branches and can effectively handle object counting in concealed scenes.
Results: Experiments show that IOCFormer achieves state-of-the-art scores on IOCfish5K, demonstrating the effectiveness of our method.

Recently, indiscernible scene understanding has attracted a lot of attention in the vision community. We further advance the frontier of this field by systematically studying a new challenge named indiscernible object counting (IOC), the goal of which is to count objects that are blended with respect to their surroundings. Due to a lack of appropriate IOC datasets, we present a large-scale dataset IOCfish5K which contains a total of 5,637 high-resolution images and 659,024 annotated center points. Our dataset consists of a large number of indiscernible objects (mainly fish) in underwater scenes, making the annotation process all the more challenging. IOCfish5K is superior to existing datasets with indiscernible scenes because of its larger scale, higher image resolutions, more annotations, and denser scenes. All these aspects make it the most challenging dataset for IOC so far, supporting progress in this area. For benchmarking purposes, we select 14 mainstream methods for object counting and carefully evaluate them on IOCfish5K. Furthermore, we propose IOCFormer, a new strong baseline that combines density and regression branches in a unified framework and can effectively tackle object counting under concealed scenes. Experiments show that IOCFormer achieves state-of-the-art scores on IOCfish5K.

Relational Context Learning for Human-Object Interaction Detection
Kim, Sanghyun and Jung, Deunsol and Cho, Minsu



Research question: How to improve relational reasoning, which is critical for discovering HOI instances.
Motivation: Existing state-of-the-art methods build transformer architectures with two decoder branches, which may lack contextual information for relational reasoning due to insufficient context exchange between the branches.
Method: The multiplex relation network (MUREN) is proposed, which performs rich context exchange among three decoder branches using unary, pairwise, and ternary relations of human, object, and interaction tokens.
Results: The method learns comprehensive relational contexts for discovering HOI instances and achieves state-of-the-art performance on two standard benchmarks, HICO-DET and V-COCO.

Recent state-of-the-art methods for HOI detection typically build on transformer architectures with two decoder branches, one for human-object pair detection and the other for interaction classification. Such disentangled transformers, however, may suffer from insufficient context exchange between the branches and lead to a lack of context information for relational reasoning, which is critical in discovering HOI instances. In this work, we propose the multiplex relation network (MUREN) that performs rich context exchange between three decoder branches using unary, pairwise, and ternary relations of human, object, and interaction tokens. The proposed method learns comprehensive relational contexts for discovering HOI instances, achieving state-of-the-art performance on two standard benchmarks for HOI detection, HICO-DET and V-COCO.

FLAG3D: A 3D Fitness Activity Dataset With Language Instruction
Tang, Yansong and Liu, Jinpeng and Liu, Aoyang and Yang, Bin and Dai, Wenxun and Rao, Yongming and Lu, Jiwen and Zhou, Jie and Li, Xiu



Research question: This paper addresses the need for high-quality data, fine-grained labels, and diverse environments in fitness activity analysis in computer vision.
Motivation: With the worldwide popularity of fitness activities, fitness activity analysis has become an emerging research topic in computer vision; however, existing tasks and algorithms demand large amounts of high-quality data resources.
Method: This paper presents FLAG3D, a large-scale 3D fitness activity dataset containing 180K sequences of 60 categories. The dataset has three key features: 1) accurate and dense 3D human poses captured by an advanced motion-capture system to handle complex activities and large movements; 2) detailed and professional language instructions describing how to perform each activity; 3) versatile video resources from a high-tech MoCap system, rendering software, and affordable smartphones, recorded in natural environments.
Results: Extensive experiments and in-depth analysis show that FLAG3D offers significant research value for challenges such as cross-domain human action recognition, dynamic human mesh recovery, and language-guided human action generation. The dataset and source code are publicly available.

With its continuously thriving popularity around the world, fitness activity analytics has become an emerging research topic in computer vision. While a variety of new tasks and algorithms have been proposed recently, there is a growing hunger for data resources with high-quality data, fine-grained labels, and diverse environments. In this paper, we present FLAG3D, a large-scale 3D fitness activity dataset with language instruction containing 180K sequences of 60 categories. FLAG3D features the following three aspects: 1) accurate and dense 3D human pose captured from an advanced MoCap system to handle complex activities and large movements, 2) detailed and professional language instruction to describe how to perform a specific activity, 3) versatile video resources from a high-tech MoCap system, rendering software, and cost-effective smartphones in natural environments. Extensive experiments and in-depth analysis show that FLAG3D contributes great research value for various challenges, such as cross-domain human action recognition, dynamic human mesh recovery, and language-guided human action generation. Our dataset and source code are publicly available at https://andytang15.github.io/FLAG3D.

PoseExaminer: Automated Testing of Out-of-Distribution Robustness in Human Pose and Shape Estimation
Liu, Qihao and Kortylewski, Adam and Yuille, Alan L.



Research question: Current human pose and shape (HPS) estimation methods may face critical out-of-distribution (OOD) situations in real-world applications when the observed data differ significantly from the training data.
Motivation: To address this fundamental problem, a simulator is developed that can be controlled in a fine-grained manner via interpretable parameters to explore the manifold of human-pose images, e.g., by varying pose, shape, and clothing.
Method: A learning-based testing method, PoseExaminer, is introduced that automatically diagnoses failure modes of HPS algorithms by searching the parameter space of human-pose images.
Results: Experiments show that PoseExaminer discovers a variety of limitations of current state-of-the-art models in real-world scenarios that are missed by current benchmarks. Moreover, fine-tuning HPS methods on the failure modes found by PoseExaminer improves their robustness and even their performance on standard benchmarks by a significant margin.

Human pose and shape (HPS) estimation methods achieve remarkable results. However, current HPS benchmarks are mostly designed to test models in scenarios that are similar to the training data. This can lead to critical situations in real-world applications when the observed data differs significantly from the training data and hence is out-of-distribution (OOD). It is therefore important to test and improve the OOD robustness of HPS methods. To address this fundamental problem, we develop a simulator that can be controlled in a fine-grained manner using interpretable parameters to explore the manifold of images of human pose, e.g. by varying poses, shapes, and clothes. We introduce a learning-based testing method, termed PoseExaminer, that automatically diagnoses HPS algorithms by searching over the parameter space of human pose images to find the failure modes. Our strategy for exploring this high-dimensional parameter space is a multi-agent reinforcement learning system, in which the agents collaborate to explore different parts of the parameter space. We show that our PoseExaminer discovers a variety of limitations in current state-of-the-art models that are relevant in real-world scenarios but are missed by current benchmarks. For example, it finds large regions of realistic human poses that are not predicted correctly, as well as reduced performance for humans with skinny and corpulent body shapes. In addition, we show that fine-tuning HPS methods by exploiting the failure modes found by PoseExaminer improves their robustness and even their performance on standard benchmarks by a significant margin. The code is available for research purposes.

Shepherding Slots to Objects: Towards Stable and Robust Object-Centric Learning
Kim, Jinwoo and Choi, Janghyuk and Choi, Ho-Jin and Kim, Seon Joo



Research question: How to perform object-centric learning on single-view images for general and compositional scene understanding.
Motivation: Single-view images carry less information for disentangling a given scene, so object-centric learning on them remains challenging.
Method: A new object-centric learning framework, SLASH, is proposed, which adds two simple yet effective modules on top of Slot Attention: the Attention Refining Kernel (ARK) and the Intermediate Point Predictor and Encoder (IPPE). These modules respectively prevent slots from being distracted by background noise and indicate where slots should focus, facilitating the learning of object-centric representations.
Results: Experiments show that the method enables consistent learning of object-centric representations and achieves strong performance across four datasets.

Object-centric learning (OCL) aspires to general and compositional understanding of scenes by representing a scene as a collection of object-centric representations. OCL has also been extended to multi-view image and video datasets to apply various data-driven inductive biases by utilizing geometric or temporal information in the multi-image data. Single-view images carry less information about how to disentangle a given scene than videos or multi-view images do. Hence, owing to the difficulty of applying inductive biases, OCL for single-view images still remains challenging, resulting in inconsistent learning of object-centric representation. To this end, we introduce a novel OCL framework for single-view images, SLot Attention via SHepherding (SLASH), which consists of two simple-yet-effective modules on top of Slot Attention. The new modules, Attention Refining Kernel (ARK) and Intermediate Point Predictor and Encoder (IPPE), respectively, prevent slots from being distracted by the background noise and indicate locations for slots to focus on to facilitate learning of object-centric representation. We also propose a weak- and semi-supervision approach for OCL, whilst our proposed framework can be used without any assistant annotation during the inference. Experiments show that our proposed method enables consistent learning of object-centric representation and achieves strong performance across four datasets. Code is available at https://github.com/object-understanding/SLASH.

IPCC-TP: Utilizing Incremental Pearson Correlation Coefficient for Joint Multi-Agent Trajectory Prediction
Zhu, Dekai and Zhai, Guangyao and Di, Yan and Manhardt, Fabian and Berkemeyer, Hendrik and Tran, Tuan and Navab, Nassir and Tombari, Federico and Busam, Benjamin



Research question: How to improve the reliability of multi-agent trajectory prediction for safe planning and control of autonomous systems.
Motivation: Compared with the single-agent case, the major challenge in simultaneously processing multiple agents lies in modeling the complex social interactions caused by various driving intentions and road conditions.
Method: This paper proposes IPCC-TP, a novel relevance-aware module based on the Incremental Pearson Correlation Coefficient (IPCC), to improve multi-agent interaction modeling. IPCC-TP learns pairwise joint Gaussian distributions through tightly coupled estimation of means and covariances according to interactive incremental movements.
Results: Extensive experiments on the nuScenes and Argoverse 2 datasets show that IPCC-TP improves baseline performance by a large margin.

Reliable multi-agent trajectory prediction is crucial for the safe planning and control of autonomous systems. Compared with single-agent cases, the major challenge in simultaneously processing multiple agents lies in modeling complex social interactions caused by various driving intentions and road conditions. Previous methods typically leverage graph-based message propagation or attention mechanism to encapsulate such interactions in the format of marginal probabilistic distributions. However, it is inherently sub-optimal. In this paper, we propose IPCC-TP, a novel relevance-aware module based on Incremental Pearson Correlation Coefficient to improve multi-agent interaction modeling. IPCC-TP learns pairwise joint Gaussian Distributions through the tightly-coupled estimation of the means and covariances according to interactive incremental movements. Our module can be conveniently embedded into existing multi-agent prediction methods to extend original motion distribution decoders. Extensive experiments on nuScenes and Argoverse 2 datasets demonstrate that IPCC-TP improves the performance of baselines by a large margin.
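To make concrete what a "pairwise joint Gaussian" adds over two marginals: for one coordinate of two agents' future positions, a single Pearson correlation coefficient ρ turns the marginal means and standard deviations into a full joint 2D Gaussian whose off-diagonal covariance encodes interaction. A minimal numpy sketch of this construction (our illustration of the statistics involved; the paper estimates these quantities from interactive incremental movements inside a learned module):

```python
import numpy as np

def joint_gaussian(mu_a, sigma_a, mu_b, sigma_b, rho):
    """Combine two 1D marginal Gaussians and a Pearson correlation rho
    into a joint 2D Gaussian (mean vector and covariance matrix)."""
    mean = np.array([mu_a, mu_b])
    cov = np.array([
        [sigma_a**2,              rho * sigma_a * sigma_b],
        [rho * sigma_a * sigma_b, sigma_b**2],
    ])
    return mean, cov

# Off-diagonal terms encode how the two agents' motions co-vary;
# rho near +1 means they tend to deviate in the same direction.
mean, cov = joint_gaussian(0.0, 1.0, 2.0, 0.5, rho=0.8)
```

For |ρ| < 1 the covariance matrix is positive definite, so the joint distribution is valid; marginal-only predictors correspond to fixing ρ = 0, which is the sub-optimality the abstract points out.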

BEV-Guided Multi-Modality Fusion for Driving Perception
Man, Yunze and Gui, Liang-Yan and Wang, Yu-Xiong



Research question: How to integrate multiple sensors and address diverse autonomous-driving tasks within a single end-to-end algorithm.
Motivation: Unifying various sensors under end-to-end Bird's-Eye-View (BEV) guidance is a challenging yet critical topic in autonomous driving.
Method: We introduce BEVGuide, a new BEV representation-learning framework and the first attempt to unify a wide range of sensors directly under BEV guidance in an end-to-end fashion. The architecture accepts input from a diverse sensor pool, including but not limited to camera, LiDAR, and radar sensors, and extracts BEV feature embeddings with a general transformer backbone. A BEV-guided multi-sensor attention block takes queries from the BEV embeddings and learns the BEV representation from sensor-specific features.
Results: Thanks to its lightweight backbone and high flexibility, BEVGuide is efficient and supports almost any input sensor configuration. Extensive experiments show that the framework performs excellently on BEV perception tasks with diverse sensor sets.

Integrating multiple sensors and addressing diverse tasks in an end-to-end algorithm are challenging yet critical topics for autonomous driving. To this end, we introduce BEVGuide, a novel Bird's Eye-View (BEV) representation learning framework, representing the first attempt to unify a wide range of sensors under direct BEV guidance in an end-to-end fashion. Our architecture accepts input from a diverse sensor pool, including but not limited to Camera, Lidar and Radar sensors, and extracts BEV feature embeddings using a versatile and general transformer backbone. We design a BEV-guided multi-sensor attention block to take queries from BEV embeddings and learn the BEV representation from sensor-specific features. BEVGuide is efficient due to its lightweight backbone design and highly flexible as it supports almost any input sensor configurations. Extensive experiments demonstrate that our framework achieves exceptional performance in BEV perception tasks with a diverse sensor set. Project page is at https://yunzeman.github.io/BEVGuide.

Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding
Hwang, Minyoung and Jeong, Jaeyeon and Kim, Minsoo and Oh, Yoonseon and Oh, Songhwai



Research question: The main challenge in vision-and-language navigation (VLN) is understanding natural-language instructions in unseen environments.
Motivation: The limitation of conventional VLN algorithms is that when an action is mistaken, the agent fails to follow the instructions or explores unnecessary regions, leading it down an irrecoverable path.
Method: We propose Meta-Explore, a hierarchical navigation method that deploys an exploitation policy to correct misled recent actions. We show that an exploitation policy that moves the agent toward a well-chosen local goal among unvisited but observable states outperforms one that moves the agent to a previously visited state. We also highlight the need to imagine regretful explorations with semantically meaningful clues.
Results: We evaluate the method on three VLN benchmarks: R2R, SOON, and REVERIE. Meta-Explore outperforms other baselines and shows significant generalization performance. Moreover, local goal search with the proposed spectral-domain SOS features substantially improves the success rate by 17.1% and SPL by 20.6% on the SOON benchmark.

The main challenge in vision-and-language navigation (VLN) is how to understand natural-language instructions in an unseen environment. The main limitation of conventional VLN algorithms is that if an action is mistaken, the agent fails to follow the instructions or explores unnecessary regions, leading the agent to an irrecoverable path. To tackle this problem, we propose Meta-Explore, a hierarchical navigation method deploying an exploitation policy to correct misled recent actions. We show that an exploitation policy, which moves the agent toward a well-chosen local goal among unvisited but observable states, outperforms a method which moves the agent to a previously visited state. We also highlight the demand for imagining regretful explorations with semantically meaningful clues. The key to our approach is understanding the object placements around the agent in spectral-domain. Specifically, we present a novel visual representation, called scene object spectrum (SOS), which performs category-wise 2D Fourier transform of detected objects. Combining exploitation policy and SOS features, the agent can correct its path by choosing a promising local goal. We evaluate our method in three VLN benchmarks: R2R, SOON, and REVERIE. Meta-Explore outperforms other baselines and shows significant generalization performance. In addition, local goal search using the proposed spectral-domain SOS features significantly improves the success rate by 17.1% and SPL by 20.6% for the SOON benchmark.
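The scene object spectrum itself is straightforward to sketch: rasterize detections into per-category binary masks and take a category-wise 2D Fourier transform. The snippet below is a hedged sketch of just that transform (how detections are rasterized into masks and how spectra score candidate local goals follow the paper and are omitted here):

```python
import numpy as np

def scene_object_spectrum(masks):
    """Given per-category binary object masks (C x H x W), compute a
    category-wise 2D Fourier magnitude spectrum, one per category."""
    masks = np.asarray(masks, dtype=np.float64)
    # FFT over the spatial dimensions; shift the zero frequency to the center.
    spectra = np.abs(np.fft.fftshift(np.fft.fft2(masks, axes=(-2, -1)),
                                     axes=(-2, -1)))
    return spectra
```

A single detected point yields a flat magnitude spectrum, while repeated object layouts concentrate energy at the corresponding spatial frequencies, which is what makes a spectral representation useful for comparing object placements across candidate goals.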

Query-Centric Trajectory Prediction
Zhou, Zikang and Wang, Jianping and Li, Yung-Hui and Huang, Yu-Kai



Research question: How to predict the future trajectories of surrounding agents for the safe operation of autonomous vehicles.
Motivation: Existing methods suffer from redundant computation and fail to capture multimodal behavior when predicting future trajectories.
Method: The QCNet framework adopts a query-centric paradigm for scene encoding, enabling reuse of past computations, and introduces anchor-free queries to generate trajectory proposals, which are then further refined with anchor-based queries.
Results: On the Argoverse 1 and Argoverse 2 motion forecasting benchmarks, the method ranks first, outperforming all other methods on all main metrics. Thanks to its query-centric design, it also supports streaming scene encoding and parallel multi-agent decoding.

Predicting the future trajectories of surrounding agents is essential for autonomous vehicles to operate safely. This paper presents QCNet, a modeling framework toward pushing the boundaries of trajectory prediction. First, we identify that the agent-centric modeling scheme used by existing approaches requires re-normalizing and re-encoding the input whenever the observation window slides forward, leading to redundant computations during online prediction. To overcome this limitation and achieve faster inference, we introduce a query-centric paradigm for scene encoding, which enables the reuse of past computations by learning representations independent of the global spacetime coordinate system. Sharing the invariant scene features among all target agents further allows the parallelism of multi-agent trajectory decoding. Second, even given rich encodings of the scene, existing decoding strategies struggle to capture the multimodality inherent in agents' future behavior, especially when the prediction horizon is long. To tackle this challenge, we first employ anchor-free queries to generate trajectory proposals in a recurrent fashion, which allows the model to utilize different scene contexts when decoding waypoints at different horizons. A refinement module then takes the trajectory proposals as anchors and leverages anchor-based queries to refine the trajectories further. By supplying adaptive and high-quality anchors to the refinement module, our query-based decoder can better deal with the multimodality in the output of trajectory prediction. Our approach ranks 1st on Argoverse 1 and Argoverse 2 motion forecasting benchmarks, outperforming all methods on all main metrics by a large margin. Meanwhile, our model can achieve streaming scene encoding and parallel multi-agent decoding thanks to the query-centric design ethos.

Phone2Proc: Bringing Robust Robots Into Our Chaotic World
Deitke, Matt and Hendrix, Rose and Farhadi, Ali and Ehsani, Kiana and Kembhavi, Aniruddha



Research question: Embodied agents trained in simulation often fail to adapt to real-world environments.
Motivation: This paper proposes a new method to address embodied agents' poor performance in the real world.
Method: Phone2Proc uses a 10-minute phone scan and conditional procedural generation to create a distribution of training scenes that are semantically similar to the target environment.
Results: Training with Phone2Proc raises the sim-to-real ObjectNav success rate from 34.7% to 70.7%, and across more than 200 trials in diverse real environments, including homes, offices, and RoboTHOR, agents show remarkable robustness to real-world changes.

Training embodied agents in simulation has become mainstream for the embodied AI community. However, these agents often struggle when deployed in the physical world due to their inability to generalize to real-world environments. In this paper, we present Phone2Proc, a method that uses a 10-minute phone scan and conditional procedural generation to create a distribution of training scenes that are semantically similar to the target environment. The generated scenes are conditioned on the wall layout and arrangement of large objects from the scan, while also sampling lighting, clutter, surface textures, and instances of smaller objects with randomized placement and materials. Leveraging just a simple RGB camera, training with Phone2Proc shows massive improvements from 34.7% to 70.7% success rate in sim-to-real ObjectNav performance across a test suite of over 200 trials in diverse real-world environments, including homes, offices, and RoboTHOR. Furthermore, Phone2Proc's diverse distribution of generated scenes makes agents remarkably robust to changes in the real world, such as human movement, object rearrangement, lighting changes, or clutter.

Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes
Ju, Xuan and Zeng, Ailing and Wang, Jianan and Xu, Qiang and Zhang, Lei



Research question: Current human-centric computer vision tasks focus mainly on natural images in the real world, paying insufficient attention to humans in artificial scenes such as sculptures, paintings, and cartoons.
Motivation: As an abstraction of life, art incorporates humans in both natural and artificial scenes; we aim to bridge tasks across natural and artificial scenarios through art.
Method: We introduce the Human-Art dataset, which contains 50k high-quality images with over 123k person instances from 5 natural and 15 artificial scenarios, annotated with bounding boxes, keypoints, self-contact points, and text information for humans represented in both 2D and 3D.
Results: We hope Human-Art can provide insights for relevant research and open up new research questions.

Humans have long been recorded in a variety of forms since antiquity. For example, sculptures and paintings were the primary media for depicting human beings before the invention of cameras. However, most current human-centric computer vision tasks like human pose estimation and human image generation focus exclusively on natural images in the real world. Artificial humans, such as those in sculptures, paintings, and cartoons, are commonly neglected, making existing models fail in these scenarios. As an abstraction of life, art incorporates humans in both natural and artificial scenes. We take advantage of it and introduce the Human-Art dataset to bridge related tasks in natural and artificial scenarios. Specifically, Human-Art contains 50k high-quality images with over 123k person instances from 5 natural and 15 artificial scenarios, which are annotated with bounding boxes, keypoints, self-contact points, and text information for humans represented in both 2D and 3D. It is, therefore, comprehensive and versatile for various downstream tasks. We also provide a rich set of baseline results and detailed analyses for related tasks, including human detection, 2D and 3D human pose estimation, image generation, and motion transfer. As a challenging dataset, we hope Human-Art can provide insights for relevant research and open up new research questions.

Are We Ready for Vision-Centric Driving Streaming Perception? The ASAP Benchmark
Wang, Xiaofeng and Zhu, Zheng and Zhang, Yunpeng and Huang, Guan and Ye, Yun and Xu, Wenbo and Chen, Ziwei and Wang, Xingang



Research question: How to quantify the trade-off between performance and efficiency for visual perception in autonomous driving, and how to address the fact that traditional evaluation ignores inference latency.
Motivation: Vision-centric perception in autonomous driving has improved in performance, but its latency is too high for practical deployment.
Method: The Autonomous-driving StreAming Perception (ASAP) benchmark is proposed, the first to evaluate the online performance of vision-centric perception in autonomous driving. Based on the 2Hz-annotated nuScenes dataset, an annotation-extending pipeline generates high-frame-rate labels for the 12Hz raw images. The Streaming Perception Under constRained-computation (SPUR) protocol then performs streaming evaluation with the 12Hz inputs under various computational resource constraints.
Results: Experiments on the ASAP benchmark show that model rankings change under different constraints, indicating that model latency and computation budget should be considered when optimizing for practical deployment. Baselines for camera-based streaming 3D detection are also established, consistently improving streaming performance across various hardware.

In recent years, vision-centric perception has flourished in various autonomous driving tasks, including 3D detection, semantic map construction, motion forecasting, and depth estimation. Nevertheless, the latency of vision-centric approaches is too high for practical deployment (e.g., most camera-based 3D detectors have a runtime greater than 300ms). To bridge the gap between idealized research and real-world applications, it is necessary to quantify the trade-off between performance and efficiency. Traditionally, autonomous-driving perception benchmarks perform offline evaluation, neglecting the inference time delay. To mitigate the problem, we propose the Autonomous-driving StreAming Perception (ASAP) benchmark, which is the first benchmark to evaluate the online performance of vision-centric perception in autonomous driving. On the basis of the 2Hz annotated nuScenes dataset, we first propose an annotation-extending pipeline to generate high-frame-rate labels for the 12Hz raw images. Referring to the practical deployment, the Streaming Perception Under constRained-computation (SPUR) evaluation protocol is further constructed, where the 12Hz inputs are utilized for streaming evaluation under the constraints of different computational resources. In the ASAP benchmark, comprehensive experiment results reveal that the model rank alters under different constraints, suggesting that the model latency and computation budget should be considered as design choices to optimize the practical deployment. To facilitate further research, we establish baselines for camera-based streaming 3D detection, which consistently enhance the streaming performance across various hardware. The ASAP benchmark will be made publicly available.

Azimuth Super-Resolution for FMCW Radar in Autonomous Driving
Li, Yu-Jhe and Hunt, Shawn and Park, Jinhyung and O'



Research question: How to improve the azimuth resolution of frequency-modulated continuous-wave (FMCW) MIMO radar.
Motivation: Due to hardware size restrictions, FMCW MIMO radar usually has low resolution, yet high azimuth resolution is essential for object localization and velocity estimation in autonomous driving.
Method: A light yet efficient analog-to-digital super-resolution model (ADC-SR) is proposed that predicts, or hallucinates, additional radar signals from the signals of only a few receivers to improve the azimuth resolution of MIMO radar.
Results: Experiments show that, compared with baseline models applied to processed Range-Azimuth-Doppler (RAD) maps, ADC-SR, which processes raw ADC signals, achieves comparable performance with 98% (50 times) fewer parameters. A hybrid super-resolution model (Hybrid-SR), combining ADC-SR with a standard RAD super-resolution model, further improves performance by a large margin. Experiments on the City-Radar and RADIal datasets validate the importance of leveraging raw radar ADC signals. Object detection on the super-resolution outputs shows an improvement of around 4% mAP.

We tackle the task of Azimuth (angular dimension) super-resolution for Frequency Modulated Continuous Wave (FMCW) multiple-input multiple-output (MIMO) radar. FMCW MIMO radar is widely used in autonomous driving alongside Lidar and RGB cameras. However, compared to Lidar, MIMO radar is usually of low resolution due to hardware size restrictions. For example, achieving 1-degree azimuth resolution requires at least 100 receivers, but a single MIMO device usually supports at most 12 receivers. Having limitations on the number of receivers is problematic since a high-resolution measurement of azimuth angle is essential for estimating the location and velocity of objects. To improve the azimuth resolution of MIMO radar, we propose a light, yet efficient, Analog-to-Digital super-resolution model (ADC-SR) that predicts or hallucinates additional radar signals using signals from only a few receivers. Compared with the baseline models that are applied to processed radar Range-Azimuth-Doppler (RAD) maps, we show that our ADC-SR method that processes raw ADC signals achieves comparable performance with 98% (50 times) fewer parameters. We also propose a hybrid super-resolution model (Hybrid-SR) combining our ADC-SR with a standard RAD super-resolution model, and show that performance can be improved by a large margin. Experiments on our City-Radar dataset and the RADIal dataset validate the importance of leveraging raw radar ADC signals. To assess the value of our super-resolution model for autonomous driving, we also perform object detection on the results of our super-resolution model and find that our super-resolution model improves detection performance by around 4% in mAP.
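The hardware constraint the paper targets can be illustrated with textbook angle processing for a uniform linear array with half-wavelength spacing: azimuth is estimated with an FFT across receivers, so the main-lobe width, and hence the angular resolution, is set by the number of (real or hallucinated) receiver channels. The sketch below simulates a single far-field target and recovers its angle; it is our illustration of the resolution argument, not the ADC-SR model:

```python
import numpy as np

def angle_fft_spectrum(rx_snapshot, n_fft=256):
    """Angle spectrum of one ADC snapshot across a uniform linear array with
    half-wavelength spacing; for a half-wavelength array, sin(theta) = 2f
    where f is the spatial frequency in cycles per element."""
    spec = np.abs(np.fft.fftshift(np.fft.fft(rx_snapshot, n=n_fft)))
    bins = np.fft.fftshift(np.fft.fftfreq(n_fft))          # f in [-0.5, 0.5)
    angles = np.degrees(np.arcsin(np.clip(2 * bins, -1, 1)))
    return angles, spec

# Simulated target at 20 degrees seen by a 12-element array: phase advances
# by pi * sin(theta) per element at half-wavelength spacing.
n_rx = 12
theta = np.radians(20.0)
snapshot = np.exp(1j * np.pi * np.arange(n_rx) * np.sin(theta))
angles, spec = angle_fft_spectrum(snapshot)
est = angles[np.argmax(spec)]
```

Zero-padding the FFT refines the peak location but not the ability to separate two close targets; only more receiver channels narrow the main lobe, which is why hallucinating extra ADC signals can pay off.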

UniHCP: A Unified Model for Human-Centric Perceptions
Ci, Yuanzheng and Wang, Yizhou and Chen, Meilin and Tang, Shixiang and Bai, Lei and Zhu, Feng and Zhao, Rui and Yu, Fengwei and Qi, Donglian and Ouyang, Wanli



Research question: How to design a general human-centric perception model that handles diverse human-centric vision tasks.
Motivation: While specific human-centric tasks have their own relevant semantic focus, they share the same underlying semantic structure of the human body; however, few works have attempted to exploit this homogeneity to design a general model.
Method: We revisit a broad range of human-centric tasks and unify them in a minimalist manner. We propose UniHCP, a unified human-centric perception model that unifies a wide range of tasks in a simple end-to-end manner with a plain vision transformer architecture. With large-scale joint training on 33 human-centric datasets, UniHCP outperforms strong baselines on several in-domain and downstream tasks by direct evaluation.
Results: When adapted to specific tasks, UniHCP sets new state-of-the-art results across a wide range of human-centric tasks, e.g., 69.8 mIoU on CIHP for human parsing, 86.18 mA on PA-100K for attribute prediction, 90.3 mAP on Market1501 for ReID, and 85.8 JI on CrowdHuman for pedestrian detection, outperforming specialized models tailored for each task.

Human-centric perceptions (e.g., pose estimation, human parsing, pedestrian detection, person re-identification, etc.) play a key role in industrial applications of visual models. While specific human-centric tasks have their own relevant semantic aspect to focus on, they also share the same underlying semantic structure of the human body. However, few works have attempted to exploit such homogeneity and design a general-purpose model for human-centric tasks. In this work, we revisit a broad range of human-centric tasks and unify them in a minimalist manner. We propose UniHCP, a Unified Model for Human-Centric Perceptions, which unifies a wide range of human-centric tasks in a simplified end-to-end manner with the plain vision transformer architecture. With large-scale joint training on 33 human-centric datasets, UniHCP can outperform strong baselines on several in-domain and downstream tasks by direct evaluation. When adapted to a specific task, UniHCP achieves new SOTAs on a wide range of human-centric tasks, e.g., 69.8 mIoU on CIHP for human parsing, 86.18 mA on PA-100K for attribute prediction, 90.3 mAP on Market1501 for ReID, and 85.8 JI on CrowdHuman for pedestrian detection, performing better than specialized models tailored for each task. The code and pretrained model are available at https://github.com/OpenGVLab/UniHCP.

Behavioral Analysis of Vision-and-Language Navigation Agents
Yang, Zijiao and Majumdar, Arjun and Lee, Stefan



Research question: How vision-and-language navigation (VLN) agents ground instructions to actions based on their surroundings.
Motivation: To be successful, VLN agents must be able to ground instructions to actions based on their environment.
Method: We develop a methodology to study agent behavior on a skill-specific basis by generating skill-specific interventions and measuring the resulting changes in agent predictions.
Results: Our analysis suggests that biases from training have lasting effects on agent behavior and that existing models can ground simple referring expressions. Comparisons across multiple models show that skill-specific scores correlate with improvements in overall VLN task performance.

To be successful, Vision-and-Language Navigation (VLN) agents must be able to ground instructions to actions based on their surroundings. In this work, we develop a methodology to study agent behavior on a skill-specific basis -- examining how well existing agents ground instructions about stopping, turning, and moving towards specified objects or rooms. Our approach is based on generating skill-specific interventions and measuring changes in agent predictions. We present a detailed case study analyzing the behavior of a recent agent and then compare multiple agents in terms of skill-specific competency scores. This analysis suggests that biases from training have lasting effects on agent behavior and that existing models are able to ground simple referring expressions. Our comparisons between models show that skill-specific scores correlate with improvements in overall VLN task performance.

Distilling Focal Knowledge From Imperfect Expert for 3D Object Detection
Zeng, Jia and Chen, Li and Deng, Hanming and Lu, Lewei and Yan, Junchi and Qiao, Yu and Li, Hongyang



Research question: How to distill knowledge from an imperfect expert for model compression.
Motivation: Although existing 3D object detection methods achieve remarkable performance, they suffer from low efficiency.
Method: Propose FD3D, a focal distiller for 3D object detection. A set of queries locates instance-level areas for masked feature generation, intensifying the feature representation ability in those areas; the same queries also search out representative fine-grained positions for refined distillation.
Results: Applied to two popular detection models, BEVFormer and DETR3D, the method improves the NDS metric on the nuScenes benchmark by 4.07 and 3.17 points, respectively.

Multi-camera 3D object detection blossoms in recent years and most state-of-the-art methods are built up on the bird's-eye-view (BEV) representations. Albeit remarkable performance, these works suffer from low efficiency. Typically, knowledge distillation can be used for model compression. However, due to unclear 3D geometry reasoning, expert features usually contain some noisy and confusing areas. In this work, we investigate how to distill the knowledge from an imperfect expert. We propose FD3D, a Focal Distiller for 3D object detection. Specifically, a set of queries are leveraged to locate the instance-level areas for masked feature generation, to intensify feature representation ability in these areas. Moreover, these queries search out the representative fine-grained positions for refined distillation. We verify the effectiveness of our method by applying it to two popular detection models, BEVFormer and DETR3D. The results demonstrate that our method achieves improvements of 4.07 and 3.17 points respectively in terms of NDS metric on nuScenes benchmark. Code is hosted at https://github.com/OpenPerceptionX/BEVPerception-Survey-Recipe.
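The core idea, distilling only at query-selected focal areas rather than over the whole (partly noisy) expert feature map, can be sketched as a masked feature loss. This is an illustrative stand-in under assumed conventions (1-D toy features, a boolean mask), not FD3D's actual loss.

```python
def focal_distill_loss(student, expert, mask):
    """Masked feature distillation: mean squared error between student and
    expert features, computed only at positions selected by the mask
    (standing in for the instance-level areas located by queries)."""
    assert len(student) == len(expert) == len(mask)
    num, total = 0, 0.0
    for s, e, m in zip(student, expert, mask):
        if m:  # position selected for distillation
            total += (s - e) ** 2
            num += 1
    return total / max(num, 1)

# Toy 1-D "feature maps": the last position is a noisy expert area
# that the mask excludes from distillation.
student = [0.2, 0.9, 0.5, 0.1]
expert  = [0.0, 1.0, 0.5, 0.8]
mask    = [True, True, True, False]
loss = focal_distill_loss(student, expert, mask)
```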

Point Cloud Forecasting as a Proxy for 4D Occupancy Forecasting
Khurana, Tarasha and Hu, Peiyun and Held, David and Ramanan, Deva



Research question: How to predict how the world will evolve, to enable motion planning in autonomous systems.
Motivation: Classical motion-planning methods rely on costly human annotations (e.g., semantic class labels, bounding boxes, tracks, or city HD maps) and are therefore difficult to scale to large unlabeled datasets.
Method: Propose the self-supervised task of 3D point cloud forecasting from unannotated LiDAR sequences, recast it as spacetime (4D) occupancy forecasting, and "render" point clouds from the 4D occupancy predictions given sensor extrinsics and intrinsics, so that occupancy algorithms can be trained and tested on unannotated LiDAR sequences.
Results: The approach lets autonomous systems make predictions about the world rather than about their sensors, and enables evaluating and comparing point cloud forecasting algorithms across diverse datasets, sensors, and vehicles.

Predicting how the world can evolve in the future is crucial for motion planning in autonomous systems. Classical methods are limited because they rely on costly human annotations in the form of semantic class labels, bounding boxes, and tracks or HD maps of cities to plan their motion -- and thus are difficult to scale to large unlabeled datasets. One promising self-supervised task is 3D point cloud forecasting from unannotated LiDAR sequences. We show that this task requires algorithms to implicitly capture (1) sensor extrinsics (i.e., the egomotion of the autonomous vehicle), (2) sensor intrinsics (i.e., the sampling pattern specific to the particular LiDAR sensor), and (3) the shape and motion of other objects in the scene. But autonomous systems should make predictions about the world and not their sensors! To this end, we factor out (1) and (2) by recasting the task as one of spacetime (4D) occupancy forecasting. But because it is expensive to obtain ground-truth 4D occupancy, we "render" point cloud data from 4D occupancy predictions given sensor extrinsics and intrinsics, allowing one to train and test occupancy algorithms with unannotated LiDAR sequences. This also allows one to evaluate and compare point cloud forecasting algorithms across diverse datasets, sensors, and vehicles.
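The "rendering" step, turning predicted occupancy back into a point cloud by casting rays from the sensor pose, can be pictured as a naive fixed-step ray march. This is a simplified illustration under assumed conventions (unit voxels, non-negative coordinates, a dict-backed grid), not the paper's differentiable renderer.

```python
def render_ray(occupancy, origin, direction, step=0.1, max_dist=10.0):
    """March a ray through a binary occupancy grid (dict mapping voxel
    -> bool, unit voxels, non-negative coords) and return the first
    occupied point hit, or None. A crude stand-in for "rendering" a
    LiDAR return from predicted occupancy, given the sensor pose
    (origin, from extrinsics) and a beam direction (from intrinsics)."""
    t = 0.0
    while t <= max_dist:
        x = origin[0] + t * direction[0]
        y = origin[1] + t * direction[1]
        z = origin[2] + t * direction[2]
        if occupancy.get((int(x), int(y), int(z)), False):
            return (x, y, z)  # simulated LiDAR return
        t += step
    return None  # ray escapes: no return, like an empty LiDAR beam

# A single occupied voxel at (3, 0, 0); a ray along +x hits it.
occ = {(3, 0, 0): True}
hit = render_ray(occ, origin=(0.0, 0.5, 0.5), direction=(1.0, 0.0, 0.0))
```

Comparing such rendered returns against real LiDAR sweeps is what lets occupancy predictions be supervised without labels.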

TopNet: Transformer-Based Object Placement Network for Image Compositing
Zhu, Sijie and Lin, Zhe and Cohen, Scott and Kuen, Jason and Zhang, Zhifei and Chen, Chen



Research question: Automatically placing an object into a background image for image compositing.
Motivation: Existing methods fail to fully exploit local information in background images, limiting the quality of composite images.
Method: Propose learning the correlation between object features and all local background features with a transformer module, so that detailed information is available for all possible location/scale configurations; a sparse contrastive loss is further proposed to train the model.
Results: The new formulation generates a 3D heatmap of the plausibility of all location/scale combinations in a single forward pass, more than 10x faster than the previous sliding-window method, and it supports interactive search when users provide a predefined location or scale. The method generalizes well to real-world images with broad applicability.

We investigate the problem of automatically placing an object into a background image for image compositing. Given a background image and a segmented object, the goal is to train a model to predict plausible placements (location and scale) of the object for compositing. The quality of the composite image highly depends on the predicted location/scale. Existing works either generate candidate bounding boxes or apply sliding-window search using global representations from background and object images, which fail to model local information in background images. However, local clues in background images are important to determine the compatibility of placing the objects with certain locations/scales. In this paper, we propose to learn the correlation between object features and all local background features with a transformer module so that detailed information can be provided on all possible location/scale configurations. A sparse contrastive loss is further proposed to train our model with sparse supervision. Our new formulation generates a 3D heatmap indicating the plausibility of all location/scale combinations in one network forward pass, which is >10x faster than the previous sliding-window method. It also supports interactive search when users provide a pre-defined location or scale. The proposed method can be trained with explicit annotation or in a self-supervised manner using an off-the-shelf inpainting model, and it outperforms state-of-the-art methods significantly. User study shows that the trained model generalizes well to real-world images with diverse challenging scenes and object categories.
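The one-pass scoring idea, correlating the object feature against every local background feature to fill a 3D location/scale heatmap and then taking the best cell, can be sketched as below. The dot-product score and the per-scale weighting are placeholder assumptions; the paper's transformer module is far richer.

```python
def placement_heatmap(obj_feat, bg_feats, scales):
    """Score every (row, col, scale) placement by correlating the object
    feature with the local background feature at that cell (a plain dot
    product here), filling a 3D heatmap in one pass instead of running a
    sliding-window search."""
    heatmap = {}
    for (r, c), feat in bg_feats.items():
        for s in scales:
            # Hypothetical scale weighting, only to differentiate scales.
            score = sum(a * b for a, b in zip(obj_feat, feat)) * s
            heatmap[(r, c, s)] = score
    return heatmap

obj = [1.0, 0.0]
bg = {(0, 0): [0.2, 0.9], (0, 1): [0.8, 0.1], (1, 0): [0.5, 0.5]}
heat = placement_heatmap(obj, bg, scales=[0.5, 1.0])
best = max(heat, key=heat.get)  # most plausible (row, col, scale)
```

Interactive search drops out for free: fixing a user-chosen location or scale just restricts which heatmap cells are compared.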

Robot Structure Prior Guided Temporal Attention for Camera-to-Robot Pose Estimation From Image Sequence
Tian, Yang and Zhang, Jiyao and Yin, Zekai and Dong, Hao



Research question: Online camera-to-robot pose estimation from single-view successive frames of an image sequence, a crucial task for robots to interact with the world.
Motivation: The primary obstacles of this task are the robot's self-occlusions and the ambiguity of single-view images.
Method: Our method demonstrates, for the first time, the effectiveness of temporal information and the robot structure prior in addressing these challenges. Given successive frames and the robot joint configuration, it learns to accurately regress the 2D coordinates of predefined robot keypoints (e.g., joints). With the camera intrinsics and robot joint states known, a Perspective-n-Point (PnP) solver yields the camera-to-robot pose, which is further refined iteratively using the robot structure prior. To train the whole pipeline, we build a large-scale synthetic dataset generated with domain randomization to bridge the sim-to-real gap.
Results: Extensive experiments on synthetic and real-world datasets and a downstream robotic grasping task show that our method achieves new state-of-the-art performance and outperforms traditional hand-eye calibration algorithms in real time (36 FPS). Code and data are available at the project page: https://sites.google.com/view/sgtapose.

In this work, we tackle the problem of online camera-to-robot pose estimation from single-view successive frames of an image sequence, a crucial task for robots to interact with the world. The primary obstacles of this task are the robot's self-occlusions and the ambiguity of single-view images. This work demonstrates, for the first time, the effectiveness of temporal information and the robot structure prior in addressing these challenges. Given the successive frames and the robot joint configuration, our method learns to accurately regress the 2D coordinates of the predefined robot's keypoints (e.g., joints). With the camera intrinsic and robotic joints status known, we get the camera-to-robot pose using a Perspective-n-point (PnP) solver. We further improve the camera-to-robot pose iteratively using the robot structure prior. To train the whole pipeline, we build a large-scale synthetic dataset generated with domain randomisation to bridge the sim-to-real gap. The extensive experiments on synthetic and real-world datasets and the downstream robotic grasping task demonstrate that our method achieves new state-of-the-art performances and outperforms traditional hand-eye calibration algorithms in real-time (36 FPS). Code and data are available at the project page: https://sites.google.com/view/sgtapose.

Learning Human-to-Robot Handovers From Point Clouds
Christen, Sammy and Yang, Wei and Pérez-D'



Research question: The first framework to learn control policies for vision-based human-to-robot handovers.
Motivation: Although Embodied AI has made significant progress in training robot agents in simulated environments, interacting with humans remains challenging due to the difficulty of simulating humans.
Method: Train with a human-in-the-loop via a two-stage teacher-student framework that uses motion and grasp planning, reinforcement learning, and self-supervision.
Results: Significant performance gains over baselines on a simulation benchmark, sim-to-sim transfer, and sim-to-real transfer.

We propose the first framework to learn control policies for vision-based human-to-robot handovers, a critical task for human-robot interaction. While research in Embodied AI has made significant progress in training robot agents in simulated environments, interacting with humans remains challenging due to the difficulties of simulating humans. Fortunately, recent research has developed realistic simulated environments for human-to-robot handovers. Leveraging this result, we introduce a method that is trained with a human-in-the-loop via a two-stage teacher-student framework that uses motion and grasp planning, reinforcement learning, and self-supervision. We show significant performance gains over baselines on a simulation benchmark, sim-to-sim transfer and sim-to-real transfer.

ProphNet: Efficient Agent-Centric Motion Forecasting With Anchor-Informed Proposals
Wang, Xishun and Su, Tong and Da, Fang and Yang, Xiaodong



Research question: Motion forecasting is a key module in an autonomous driving system, and the task is highly challenging due to the heterogeneity of multi-sourced input, the multimodality of agent behavior, and the low latency required by onboard deployment.
Motivation: To cope with these difficulties, this paper proposes a novel agent-centric model with anchor-informed proposals for efficient multimodal motion forecasting.
Method: We design a modality-agnostic strategy to concisely encode the complex input in a unified manner, and generate diverse proposals, fused with anchors bearing goal-oriented context, to induce multimodal predictions that cover a wide range of future trajectories. The network architecture is highly uniform and succinct, making the model easy to deploy in the real world.
Results: Experiments show that our agent-centric network outperforms state-of-the-art methods in prediction accuracy while achieving scene-centric-level inference latency.

Motion forecasting is a key module in an autonomous driving system. Due to the heterogeneous nature of multi-sourced input, multimodality in agent behavior, and low latency required by onboard deployment, this task is notoriously challenging. To cope with these difficulties, this paper proposes a novel agent-centric model with anchor-informed proposals for efficient multimodal motion forecasting. We design a modality-agnostic strategy to concisely encode the complex input in a unified manner. We generate diverse proposals, fused with anchors bearing goal-oriented context, to induce multimodal prediction that covers a wide range of future trajectories. The network architecture is highly uniform and succinct, leading to an efficient model amenable for real-world deployment. Experiments reveal that our agent-centric network compares favorably with the state-of-the-art methods in prediction accuracy, while achieving scene-centric level inference latency.

Learning and Aggregating Lane Graphs for Urban Automated Driving
Büchner, Martin and Zürn, Jannik and Todoran, Ion-George and Valada, Abhinav and Burgard, Wolfram



Research question: This paper addresses lane graph estimation, an essential and highly challenging task in automated driving and HD map learning.
Motivation: Existing methods using onboard or aerial imagery struggle with complex lane topologies, out-of-distribution scenarios, or significant occlusions in the image space; moreover, merging overlapping lane graphs into consistent large-scale graphs remains difficult.
Method: We propose a novel bottom-up approach to lane graph estimation from aerial imagery that aggregates multiple overlapping graphs into a single consistent graph. Thanks to its modular design, the method addresses two complementary tasks: predicting ego-respective successor lane graphs from arbitrary vehicle positions with a graph neural network, and aggregating these predictions into a consistent global lane graph.
Results: Extensive experiments on a large-scale lane graph dataset demonstrate that the approach yields highly accurate lane graphs, even in severely occluded regions. The proposed graph aggregation eliminates inconsistent predictions while improving overall graph quality.

Lane graph estimation is an essential and highly challenging task in automated driving and HD map learning. Existing methods using either onboard or aerial imagery struggle with complex lane topologies, out-of-distribution scenarios, or significant occlusions in the image space. Moreover, merging overlapping lane graphs to obtain consistent largescale graphs remains difficult. To overcome these challenges, we propose a novel bottom-up approach to lane graph estimation from aerial imagery that aggregates multiple overlapping graphs into a single consistent graph. Due to its modular design, our method allows us to address two complementary tasks: predicting ego-respective successor lane graphs from arbitrary vehicle positions using a graph neural network and aggregating these predictions into a consistent global lane graph. Extensive experiments on a large-scale lane graph dataset demonstrate that our approach yields highly accurate lane graphs, even in regions with severe occlusions. The presented approach to graph aggregation proves to eliminate inconsistent predictions while increasing the overall graph quality. We make our large-scale urban lane graph dataset and code publicly available at http://urbanlanegraph.cs.uni-freiburg.de.
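The aggregation step, fusing overlapping per-vehicle graph predictions into one consistent global lane graph, can be sketched as nearest-node merging: nodes from different predictions that land close together are fused, and edges are re-expressed over the fused nodes. The greedy fuse-within-radius rule is a simplifying assumption, not the paper's aggregation scheme.

```python
import math

def aggregate_graphs(graphs, merge_radius=1.0):
    """Aggregate overlapping lane-graph predictions, each given as
    (node_positions, edge_index_pairs), into a single consistent graph by
    fusing nodes closer than merge_radius."""
    canon = []  # representative positions of fused nodes

    def fuse(p):
        for i, q in enumerate(canon):
            if math.dist(p, q) <= merge_radius:
                return i  # snap to an existing node
        canon.append(p)
        return len(canon) - 1

    edges = set()
    for nodes, graph_edges in graphs:
        idx = [fuse(p) for p in nodes]
        for a, b in graph_edges:
            if idx[a] != idx[b]:  # drop self-loops created by fusion
                edges.add((idx[a], idx[b]))
    return canon, sorted(edges)

# Two overlapping predictions of the same lane: endpoints within 1 m fuse.
g1 = ([(0.0, 0.0), (5.0, 0.0)], [(0, 1)])
g2 = ([(5.2, 0.1), (10.0, 0.0)], [(0, 1)])
nodes, edges = aggregate_graphs([g1, g2])
```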

Habitat-Matterport 3D Semantics Dataset
Yadav, Karmesh and Ramrakhya, Ram and Ramakrishnan, Santhosh Kumar and Gervet, Theo and Turner, John and Gokaslan, Aaron and Maestre, Noah and Chang, Angel Xuan and Batra, Dhruv and Savva, Manolis and Clegg, Alexander William and Chaplot, Devendra Singh



Research question: Providing a large-scale, densely annotated dataset of real-world 3D spaces for embodied AI research.
Motivation: The scale, quality, and diversity of object annotations in existing 3D datasets available to the academic community are limited.
Method: Present the Habitat-Matterport 3D Semantics (HM3DSEM) dataset, with 142,646 object instance annotations across 216 3D spaces and 3,100 rooms, using texture information to annotate pixel-accurate object boundaries.
Results: Policies trained on HM3DSEM for the Object Goal Navigation task outperform those trained on prior datasets, and the dataset's introduction in the Habitat ObjectNav Challenge increased participation from 400 submissions in 2021 to 1,022 in 2022.

We present the Habitat-Matterport 3D Semantics (HM3DSEM) dataset. HM3DSEM is the largest dataset of 3D real-world spaces with densely annotated semantics that is currently available to the academic community. It consists of 142,646 object instance annotations across 216 3D spaces and 3,100 rooms within those spaces. The scale, quality, and diversity of object annotations far exceed those of prior datasets. A key difference setting apart HM3DSEM from other datasets is the use of texture information to annotate pixel-accurate object boundaries. We demonstrate the effectiveness of the HM3DSEM dataset for the Object Goal Navigation task using different methods. Policies trained using HM3DSEM outperform those trained on prior datasets. Introduction of HM3DSEM in the Habitat ObjectNav Challenge led to an increase in participation from 400 submissions in 2021 to 1,022 submissions in 2022. Project page: https://aihabitat.org/datasets/hm3d-semantics/

Adaptive Zone-Aware Hierarchical Planner for Vision-Language Navigation
Gao, Chen and Peng, Xingyu and Yan, Mi and Wang, He and Yang, Lirong and Ren, Haibing and Li, Hongsheng and Liu, Si



Research question: In Vision-Language Navigation (VLN), existing single-step planning schemes are unsuitable for the inherently hierarchical navigation process.
Motivation: Navigation requires adaptively setting and achieving a series of sub-goals, which is naturally a hierarchical process; previous methods instead directly perform a navigation action at each step.
Method: We propose an Adaptive Zone-aware Hierarchical Planner (AZHP) that explicitly divides navigation into two heterogeneous phases: sub-goal setting via zone partition/selection (high-level action) and sub-goal execution (low-level action). AZHP performs the two levels of action asynchronously via a designed State-Switcher Module (SSM). For high-level action, a Scene-aware adaptive Zone Partition (SZP) method divides the whole navigation area into different zones on the fly, and a Goal-oriented Zone Selection (GZS) method selects a proper zone for the current sub-goal. For low-level action, the agent makes multi-step navigation decisions within the selected zone. We further design a hierarchical reinforcement learning (HRL) strategy and auxiliary losses with curriculum learning to train the AZHP framework, providing effective supervision signals for each stage.
Results: Extensive experiments demonstrate the superiority of the proposed method, which achieves state-of-the-art performance on three VLN benchmarks (REVERIE, SOON, R2R).

The task of Vision-Language Navigation (VLN) is for an embodied agent to reach the global goal according to the instruction. Essentially, during navigation, a series of sub-goals need to be adaptively set and achieved, which is naturally a hierarchical navigation process. However, previous methods leverage a single-step planning scheme, i.e., directly performing navigation action at each step, which is unsuitable for such a hierarchical navigation process. In this paper, we propose an Adaptive Zone-aware Hierarchical Planner (AZHP) to explicitly divide the navigation process into two heterogeneous phases, i.e., sub-goal setting via zone partition/selection (high-level action) and sub-goal executing (low-level action), for hierarchical planning. Specifically, AZHP asynchronously performs two levels of action via the designed State-Switcher Module (SSM). For high-level action, we devise a Scene-aware adaptive Zone Partition (SZP) method to adaptively divide the whole navigation area into different zones on-the-fly. Then the Goal-oriented Zone Selection (GZS) method is proposed to select a proper zone for the current sub-goal. For low-level action, the agent makes multi-step navigation decisions in the selected zone. Moreover, we design a Hierarchical RL (HRL) strategy and auxiliary losses with curriculum learning to train the AZHP framework, which provides effective supervision signals for each stage. Extensive experiments demonstrate the superiority of our proposed method, which achieves state-of-the-art performance on three VLN benchmarks (REVERIE, SOON, R2R).

GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts
Geng, Haoran and Xu, Helin and Zhao, Chengyang and Xu, Chao and Yi, Li and Huang, Siyuan and Wang, He



Research question: How can learning cross-category skills improve the generalizability of object perception and manipulation?
Motivation: Generalizable object perception and manipulation are still in an early stage, and cross-category generalizability is highly desired yet underexplored.
Method: Propose to learn such cross-category skills via Generalizable and Actionable Parts (GAParts). We identify and define 9 GAPart classes (lids, handles, etc.) in 27 object categories and construct GAPartNet, a large-scale part-centric interactive dataset with rich part-level annotations (semantics, poses) for 8,489 part instances.
Results: Based on GAPartNet, we study three cross-category tasks: part segmentation, part pose estimation, and part-based object manipulation. Given the significant domain gap between seen and unseen object categories, we propose a robust 3D segmentation method from the domain-generalization perspective that integrates adversarial learning techniques and outperforms all existing methods by a large margin on both seen and unseen categories. Furthermore, using the part segmentation and pose estimation results, we leverage the GAPart pose definition to design part-based manipulation heuristics that generalize well to unseen object categories in both the simulator and the real world.

For years, researchers have been devoted to generalizable object perception and manipulation, where cross-category generalizability is highly desired yet underexplored. In this work, we propose to learn such cross-category skills via Generalizable and Actionable Parts (GAParts). By identifying and defining 9 GAPart classes (lids, handles, etc.) in 27 object categories, we construct a large-scale part-centric interactive dataset, GAPartNet, where we provide rich, part-level annotations (semantics, poses) for 8,489 part instances on 1,166 objects. Based on GAPartNet, we investigate three cross-category tasks: part segmentation, part pose estimation, and part-based object manipulation. Given the significant domain gaps between seen and unseen object categories, we propose a robust 3D segmentation method from the perspective of domain generalization by integrating adversarial learning techniques. Our method outperforms all existing methods by a large margin on both seen and unseen categories. Furthermore, with part segmentation and pose estimation results, we leverage the GAPart pose definition to design part-based manipulation heuristics that can generalize well to unseen object categories in both the simulator and the real world.

OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation
Wu, Tong and Zhang, Jiarui and Fu, Xiao and Wang, Yuxin and Ren, Jiawei and Pan, Liang and Wu, Wayne and Yang, Lei and Wang, Jiaqi and Qian, Chen and Lin, Dahua and Liu, Ziwei



Research question: This paper addresses the fact that 3D object modeling currently relies mainly on synthetic datasets, aiming to advance real-world 3D perception, reconstruction, and generation.
Motivation: Due to the lack of large-scale real-scanned 3D databases, existing 3D object modeling methods mostly depend on synthetic data. To address this, we propose OmniObject3D, a large, high-quality dataset of real-scanned 3D objects.
Method: We scan 6,000 objects with professional scanners, providing each object with textured meshes, point clouds, multi-view rendered images, and multiple real-captured videos. We also set up four evaluation tracks: a) robust 3D perception, b) novel-view synthesis, c) neural surface reconstruction, and d) 3D object generation.
Results: Extensive studies on these benchmarks reveal new observations, challenges, and opportunities for future research in realistic 3D vision.

Recent advances in modeling 3D objects mostly rely on synthetic datasets due to the lack of large-scale real-scanned 3D databases. To facilitate the development of 3D perception, reconstruction, and generation in the real world, we propose OmniObject3D, a large vocabulary 3D object dataset with massive high-quality real-scanned 3D objects. OmniObject3D has several appealing properties: 1) Large Vocabulary: It comprises 6,000 scanned objects in 190 daily categories, sharing common classes with popular 2D datasets (e.g., ImageNet and LVIS), benefiting the pursuit of generalizable 3D representations. 2) Rich Annotations: Each 3D object is captured with both 2D and 3D sensors, providing textured meshes, point clouds, multiview rendered images, and multiple real-captured videos. 3) Realistic Scans: The professional scanners support high-quality object scans with precise shapes and realistic appearances. With the vast exploration space offered by OmniObject3D, we carefully set up four evaluation tracks: a) robust 3D perception, b) novel-view synthesis, c) neural surface reconstruction, and d) 3D object generation. Extensive studies are performed on these four benchmarks, revealing new observations, challenges, and opportunities for future research in realistic 3D vision.

Standing Between Past and Future: Spatio-Temporal Modeling for Multi-Camera 3D Multi-Object Tracking
Pang, Ziqi and Li, Jie and Tokmakov, Pavel and Chen, Dian and Zagoruyko, Sergey and Wang, Yu-Xiong



Research question: Propose an end-to-end multi-camera 3D multi-object tracking (MOT) framework.
Motivation: Emphasize spatio-temporal continuity and integrate both past and future reasoning for tracked objects.
Method: Adapt the "tracking by attention" framework, representing tracked instances coherently over time with object queries. A "Past Reasoning" module refines tracks and enhances object features by cross-attending to queries from previous frames and other objects; a "Future Reasoning" module digests historical information and predicts robust future trajectories.
Results: On the nuScenes dataset, the method improves AMOTA by a large margin and reduces ID-Switches by 90% compared to prior approaches, an order-of-magnitude difference.

This work proposes an end-to-end multi-camera 3D multi-object tracking (MOT) framework. It emphasizes spatio-temporal continuity and integrates both past and future reasoning for tracked objects. Thus, we name it "Past-and-Future reasoning for Tracking" (PF-Track). Specifically, our method adapts the "tracking by attention" framework and represents tracked instances coherently over time with object queries. To explicitly use historical cues, our "Past Reasoning" module learns to refine the tracks and enhance the object features by cross-attending to queries from previous frames and other objects. The "Future Reasoning" module digests historical information and predicts robust future trajectories. In the case of long-term occlusions, our method maintains the object positions and enables re-association by integrating motion predictions. On the nuScenes dataset, our method improves AMOTA by a large margin and remarkably reduces ID-Switches by 90% compared to prior approaches, which is an order of magnitude less. The code and models are made available at https://github.com/TRI-ML/PF-Track.
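The occlusion-handling behaviour, keeping an occluded track alive with a motion prediction and re-associating it when the object reappears, can be sketched with a constant-velocity model and nearest-neighbour gating. Both are simplifying assumptions standing in for PF-Track's learned "Future Reasoning" module.

```python
def extrapolate(track, dt=1.0):
    """Constant-velocity motion prediction from a track's last two
    positions; used to carry an occluded object's position forward."""
    (x0, y0), (x1, y1) = track[-2], track[-1]
    return (x1 + (x1 - x0) * dt, y1 + (y1 - y0) * dt)

def reassociate(track, detections, gate=1.0):
    """After an occlusion, re-associate the track with the detection
    nearest to its predicted position, if one lies inside the gating
    radius; return that detection's index, else None."""
    px, py = extrapolate(track)
    best, best_d = None, gate
    for i, (dx, dy) in enumerate(detections):
        d = ((dx - px) ** 2 + (dy - py) ** 2) ** 0.5
        if d <= best_d:
            best, best_d = i, d
    return best

track = [(0.0, 0.0), (1.0, 0.0)]       # moving +1 per frame in x
detections = [(2.1, 0.1), (5.0, 5.0)]  # first one matches the prediction
match = reassociate(track, detections)
```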

Tracking Through Containers and Occluders in the Wild
VanHoorick, Basile and Tokmakov, Pavel and Stent, Simon and Li, Jie and Vondrick, Carl



Research question: Visual tracking through heavy occlusion and containment in cluttered and dynamic environments remains a difficult challenge for computer vision systems.
Motivation: To address this, we introduce TCOW, a new benchmark and model for visual tracking through heavy occlusion and containment.
Method: We set up a task where the goal is, given a video sequence, to segment both the projected extent of the target object and any surrounding container or occluder. To study it, we create a mixture of synthetic and annotated real datasets that supports supervised learning and structured evaluation of model performance under various forms of task variation.
Results: Evaluating two recent transformer-based video models, we find that while they can track targets surprisingly well under certain task variations, a considerable performance gap remains before a tracking model can be said to have acquired a true notion of object permanence.

Tracking objects with persistence in cluttered and dynamic environments remains a difficult challenge for computer vision systems. In this paper, we introduce TCOW, a new benchmark and model for visual tracking through heavy occlusion and containment. We set up a task where the goal is to, given a video sequence, segment both the projected extent of the target object, as well as the surrounding container or occluder whenever one exists. To study this task, we create a mixture of synthetic and annotated real datasets to support both supervised learning and structured evaluation of model performance under various forms of task variation, such as moving or nested containment. We evaluate two recent transformer-based video models and find that while they can be surprisingly capable of tracking targets under certain settings of task variation, there remains a considerable performance gap before we can claim a tracking model to have acquired a true notion of object permanence.

LANA: A Language-Capable Navigator for Instruction Following and Generation
Wang, Xiaohan and Wang, Wenguan and Shao, Jiayi and Yang, Yi



Research question: How to make robot agents not only execute human-written navigation instructions but also provide route descriptions to humans.
Motivation: Existing VLN research focuses mainly on interpreting instructions into actions, delivering only "dumb" wayfinding agents.
Method: Design LANA, a language-capable navigation agent that learns instruction following and generation simultaneously with a single model. Specifically, two shared encoders (for route and language encoding) and two decoders (for action prediction and instruction generation) are built to exploit cross-task knowledge and capture task-specific characteristics. Both instruction following and generation are set as optimization objectives throughout pretraining and fine-tuning.
Results: Experiments show that, compared with recent task-specific solutions, LANA attains better performance on both instruction following and route description with nearly half the complexity. Endowed with language generation capability, LANA can also explain its behavior to humans and assist their wayfinding. This work is expected to foster future efforts toward more trustworthy, socially intelligent navigation robots.

Recently, visual-language navigation (VLN) -- entailing robot agents to follow navigation instructions -- has shown great advance. However, existing literature puts most emphasis on interpreting instructions into actions, only delivering "dumb" wayfinding agents. In this article, we devise LANA, a language-capable navigation agent which is able to not only execute human-written navigation commands, but also provide route descriptions to humans. This is achieved by simultaneously learning instruction following and generation with only one single model. More specifically, two encoders, for route and language encoding respectively, are built and shared by two decoders, for action prediction and instruction generation respectively, so as to exploit cross-task knowledge and capture task-specific characteristics. Throughout pretraining and fine-tuning, both instruction following and generation are set as optimization objectives. We empirically verify that, compared with recent advanced task-specific solutions, LANA attains better performances on both instruction following and route description, with nearly half the complexity. In addition, endowed with language generation capability, LANA can explain to humans its behaviors and assist humans' wayfinding. This work is expected to foster future efforts towards building more trustworthy and socially-intelligent navigation robots. Our code will be released.

StarCraftImage: A Dataset for Prototyping Spatial Reasoning Methods for Multi-Agent Environments
Kulinski, Sean and Waytowich, Nicholas R. and Hare, James Z. and Inouye, David I.



Research question: How to perform spatial reasoning tasks in multi-agent environments, such as event prediction, agent type identification, or missing data imputation.
Motivation: Spatial reasoning in multi-agent environments (e.g., StarCraft II) is important for many applications (e.g., autonomous surveillance over sensor networks and reinforcement learning subtasks), but extracting simple, standardized representations for prototyping these tasks is laborious and hinders reproducibility.
Method: The researchers carefully summarize windows of 255 consecutive game states from 60,000 replays to create 3.6 million summary images, including all relevant metadata such as game outcome and player races. They develop three formats of decreasing complexity: hyperspectral images akin to multispectral geospatial imagery, RGB images mimicking CIFAR10, and grayscale images mimicking MNIST.
Results: The dataset can be used to prototype spatial reasoning methods, providing an easy-to-use yet challenging benchmark for spatial reasoning tasks in multi-agent environments.

Spatial reasoning tasks in multi-agent environments such as event prediction, agent type identification, or missing data imputation are important for multiple applications (e.g., autonomous surveillance over sensor networks and subtasks for reinforcement learning (RL)). StarCraft II game replays encode intelligent (and adversarial) multi-agent behavior and could provide a testbed for these tasks; however, extracting simple and standardized representations for prototyping these tasks is laborious and hinders reproducibility. In contrast, MNIST and CIFAR10, despite their extreme simplicity, have enabled rapid prototyping and reproducibility of ML methods. Following the simplicity of these datasets, we construct a benchmark spatial reasoning dataset based on StarCraft II replays that exhibit complex multi-agent behaviors, while still being as easy to use as MNIST and CIFAR10. Specifically, we carefully summarize a window of 255 consecutive game states to create 3.6 million summary images from 60,000 replays, including all relevant metadata such as game outcome and player races. We develop three formats of decreasing complexity: Hyperspectral images that include one channel for every unit type (similar to multispectral geospatial images), RGB images that mimic CIFAR10, and grayscale images that mimic MNIST. We show how this dataset can be used for prototyping spatial reasoning methods. All datasets, code for extraction, and code for dataset loading can be found at https://starcraftdata.davidinouye.com/.
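The summarization format is easy to picture: one spatial channel per unit type, accumulated over the window of game states, which can then be collapsed toward the RGB or grayscale variants. This toy version with a dict of 2-D lists mirrors the described channel layout but invents its own event representation (unit type, row, col), so treat it as an illustration rather than the dataset's extraction code.

```python
def summarize(events, size, unit_types):
    """Build a 'hyperspectral' summary image: one size x size channel per
    unit type, counting how often a unit of that type occupied each cell
    over the summarized window of game states."""
    channels = {u: [[0] * size for _ in range(size)] for u in unit_types}
    for unit_type, row, col in events:
        channels[unit_type][row][col] += 1
    return channels

def to_grayscale(channels, size):
    """Collapse all unit-type channels into one MNIST-style image."""
    return [[sum(ch[r][c] for ch in channels.values()) for c in range(size)]
            for r in range(size)]

events = [("marine", 0, 1), ("marine", 0, 1), ("zergling", 2, 2)]
hyper = summarize(events, size=4, unit_types=["marine", "zergling"])
gray = to_grayscale(hyper, size=4)
```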

Bi-LRFusion: Bi-Directional LiDAR-Radar Fusion for 3D Dynamic Object Detection
Wang, Yingjie and Deng, Jiajun and Li, Yao and Hu, Jinshui and Liu, Cong and Zhang, Yu and Ji, Jianmin and Ouyang, Wanli and Zhang, Yanyong



Research question: How to effectively combine LiDAR and Radar sensing to improve feature representation.
Motivation: Although LiDAR and Radar are two complementary sensing approaches, how to combine them effectively for improved feature representation remains unclear; the main challenge is that Radar data are extremely sparse and lack height information.
Method: This paper proposes a bi-directional LiDAR-Radar fusion framework (Bi-LRFusion) that improves 3D detection of dynamic objects in two steps: first, it enriches Radar's local features by learning important details from the LiDAR branch, alleviating the problems caused by the absence of height information and extreme sparsity; second, it combines LiDAR features with the enhanced Radar features in a unified bird's-eye-view representation.
Results: Extensive experiments on the nuScenes and ORR datasets show that Bi-LRFusion achieves state-of-the-art performance for detecting dynamic objects. Notably, the Radar data in these two datasets have different formats, demonstrating the generalizability of the method.

LiDAR and Radar are two complementary sensing approaches in that LiDAR specializes in capturing an object's 3D shape while Radar provides longer detection ranges as well as velocity hints. Though seemingly natural, how to efficiently combine them for improved feature representation is still unclear. The main challenge arises from the fact that Radar data are extremely sparse and lack height information. Therefore, directly integrating Radar features into LiDAR-centric detection networks is not optimal. In this work, we introduce a bi-directional LiDAR-Radar fusion framework, termed Bi-LRFusion, to tackle the challenges and improve 3D detection for dynamic objects. Technically, Bi-LRFusion involves two steps: first, it enriches Radar's local features by learning important details from the LiDAR branch to alleviate the problems caused by the absence of height information and extreme sparsity; second, it combines LiDAR features with the enhanced Radar features in a unified bird's-eye-view representation. We conduct extensive experiments on nuScenes and ORR datasets, and show that our Bi-LRFusion achieves state-of-the-art performance for detecting dynamic objects. Notably, Radar data in these two datasets have different formats, which demonstrates the generalizability of our method. Code will be published.

BioNet: A Biologically-Inspired Network for Face Recognition
Li, Pengyu



Research question: How to leverage cutting-edge neuroscience findings to improve face recognition performance.
Motivation: Although some computer vision works have tried to improve face recognition with attribute enhancement, none of them are inspired by the human face-recognizing mechanism, nor do they boost performance significantly.
Method: We design a biologically inspired network named BioNet, consisting of two cascaded sub-networks: a Visual Cortex Network (VCN) and an Inferotemporal Cortex Network (ICN). The VCN uses a classical convolutional neural network as its backbone, while the ICN comprises three biologically inspired modules: Cortex Functional Compartmentalization, Compartment Response Transform, and Response Intensity Modulation.
Results: Experiments show that 1) recent findings about the human face-recognizing system can further advance CNN-based face recognition networks; 2) with the biological mechanism, both identity-related attributes (e.g., gender) and identity-unrelated attributes (e.g., expression) benefit deep face recognition models, with identity-unrelated attributes contributing even more; and 3) the proposed BioNet significantly improves the state of the art on standard face recognition benchmark datasets.

Recently, whether and how cutting-edge Neuroscience findings can inspire Artificial Intelligence (AI) has confused both communities and drawn much discussion. As one of the most critical fields in AI, Computer Vision (CV) also pays much attention to this discussion. To contribute our ideas and experimental evidence, we focus on one of the most broadly researched topics in both the Neuroscience and CV fields, i.e., Face Recognition (FR). Neuroscience studies show that face attributes are essential to the human face-recognizing system, and how the attributes contribute is also explained by the Neuroscience community. Even though a few CV works improved FR performance with attribute enhancement, none of them is inspired by the human face-recognizing mechanism, nor do they boost performance significantly. To test our idea experimentally, we purposely model the biological characteristics of the human face-recognizing system with classical Convolutional Neural Network Operators (CNN Ops). We name the proposed Biologically-inspired Network BioNet. Our BioNet consists of two cascade sub-networks, i.e., the Visual Cortex Network (VCN) and the Inferotemporal Cortex Network (ICN). The VCN is modeled with a classical CNN backbone. The proposed ICN comprises three biologically-inspired modules, i.e., the Cortex Functional Compartmentalization, the Compartment Response Transform, and the Response Intensity Modulation. The experiments prove that: 1) The cutting-edge findings about the human face-recognizing system can further boost the CNN-based FR network. 2) With the biological mechanism, both identity-related attributes (e.g., gender) and identity-unrelated attributes (e.g., expression) can benefit the deep FR models. Surprisingly, the identity-unrelated ones contribute even more than the identity-related ones. 3) The proposed BioNet significantly boosts state-of-the-art on standard FR benchmark datasets.
For example, BioNet boosts IJB-B@1e-6 from 52.12% to 68.28% and MegaFace from 98.74% to 99.19%. The source code will be released.

Visual-Tactile Sensing for In-Hand Object Reconstruction
Xu, Wenqiang and Yu, Zhenjun and Xue, Han and Ye, Ruolin and Yao, Siqiong and Lu, Cewu



Research question: How to use tactile sensors for visual-tactile learning to reconstruct in-hand objects and hands.
Motivation: Tactile sensing is one of the important ways humans perceive the world; combined with vision, it refines local geometry structure, measures deformation at the contact area, and indicates hand-object contact state. The availability of open-source tactile sensors such as DIGIT makes research on visual-tactile learning more accessible and reproducible.
Method: We propose VTacO, a novel visual-tactile in-hand object reconstruction framework, and extend it to VTacOH for hand-object reconstruction. Since our method supports both rigid and deformable object reconstruction and no existing benchmark fits this goal, we propose VT-Sim, a simulation environment that generates hand-object interactions for both rigid and deformable objects.
Results: Extensive experiments show that the proposed method outperforms previous baselines both qualitatively and quantitatively. Finally, we directly apply the model trained in simulation to various real-world test cases and show qualitative results. Code, models, the simulation environment, and datasets will be publicly released.

Tactile sensing is one of the modalities humans rely on heavily to perceive the world. Working with vision, this modality refines local geometry structure, measures deformation at the contact area, and indicates hand-object contact state. With the availability of open-source tactile sensors such as DIGIT, research on visual-tactile learning is becoming more accessible and reproducible. Leveraging this tactile sensor, we propose a novel visual-tactile in-hand object reconstruction framework, VTacO, and extend it to VTacOH for hand-object reconstruction. Since our method can support both rigid and deformable object reconstruction, and no existing benchmark is suitable for this goal, we propose a simulation environment, VT-Sim, which supports generating hand-object interaction for both rigid and deformable objects. With VT-Sim, we generate a large-scale training dataset and evaluate our method on it. Extensive experiments demonstrate that our proposed method outperforms the previous baseline methods qualitatively and quantitatively. Finally, we directly apply our model trained in simulation to various real-world test cases and show qualitative results. Code, models, the simulation environment, and datasets will be publicly available.

FJMP: Factorized Joint Multi-Agent Motion Prediction Over Learned Directed Acyclic Interaction Graphs
Rowe, Luke and Ethier, Martin and Dykhne, Eli-Henry and Czarnecki, Krzysztof



Research question: Predicting the future motion of road agents in multi-agent driving scenes is a critical task.
Motivation: To generate a set of scene-level joint future trajectory predictions in multi-agent interactive driving scenarios.
Method: Propose FJMP, a factorized joint motion prediction framework that models future scene interaction dynamics as a sparse directed interaction graph, prunes it into a directed acyclic graph, and decomposes the joint prediction task into a sequence of marginal and conditional predictions following the partial ordering of the DAG.
Results: Experiments on the INTERACTION and Argoverse 2 datasets show that FJMP produces more accurate and scene-consistent joint trajectory predictions than non-factorized approaches, especially on the most interactive and kinematically interesting agents. FJMP ranks 1st on the multi-agent test leaderboard of the INTERACTION dataset.

Predicting the future motion of road agents is a critical task in an autonomous driving pipeline. In this work, we address the problem of generating a set of scene-level, or joint, future trajectory predictions in multi-agent driving scenarios. To this end, we propose FJMP, a Factorized Joint Motion Prediction framework for multi-agent interactive driving scenarios. FJMP models the future scene interaction dynamics as a sparse directed interaction graph, where edges denote explicit interactions between agents. We then prune the graph into a directed acyclic graph (DAG) and decompose the joint prediction task into a sequence of marginal and conditional predictions according to the partial ordering of the DAG, where joint future trajectories are decoded using a directed acyclic graph neural network (DAGNN). We conduct experiments on the INTERACTION and Argoverse 2 datasets and demonstrate that FJMP produces more accurate and scene-consistent joint trajectory predictions than non-factorized approaches, especially on the most interactive and kinematically interesting agents. FJMP ranks 1st on the multi-agent test leaderboard of the INTERACTION dataset.
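
The factorized decoding above follows the DAG's partial order: agents with no influencers receive marginal predictions, and every other agent is predicted conditioned on the already-decoded futures of its influencers. A minimal sketch of that decoding order, where the `marginal` and `conditional` callables are hypothetical stand-ins for FJMP's learned decoders:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def predict_joint(dag, marginal, conditional):
    """Decode joint trajectories over a DAG of agent interactions.

    dag: {agent: set of influencer agents} (edges point influencer -> influenced)
    marginal(agent): prediction for an agent with no influencers
    conditional(agent, context): prediction given influencers' decoded futures
    """
    decoded = {}
    # static_order() yields each agent only after all of its influencers
    for agent in TopologicalSorter(dag).static_order():
        parents = dag.get(agent, set())
        if not parents:
            decoded[agent] = marginal(agent)            # source node: marginal prediction
        else:
            context = {p: decoded[p] for p in parents}  # condition on decoded parents
            decoded[agent] = conditional(agent, context)
    return decoded
```

The key property is that each conditional prediction only ever sees futures that are already fixed, which is what makes the joint factorization consistent.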

Probing Neural Representations of Scene Perception in a Hippocampally Dependent Task Using Artificial Neural Networks
Frey, Markus and Doeller, Christian F. and Barry, Caswell



Research question: Existing deep neural networks are relatively weak at explaining representations in higher cortical areas, especially the egocentric-to-allocentric transformation.
Motivation: To address this, a novel scene perception benchmark is designed to probe the ability of deep neural networks to transform scenes viewed from different egocentric perspectives.
Method: The authors use a network architecture inspired by the connectivity between the hippocampus and temporal lobe structures, train it with a triplet loss, and, by enforcing a factorized latent space, split information propagation into "what" and "where" pathways used to reconstruct the input.
Results: Experiments show that this approach beats the state of the art for unsupervised object segmentation, with significant improvements on the CATER and MOVi-A,B,C benchmarks.

Deep artificial neural networks (DNNs) trained through backpropagation provide effective models of the mammalian visual system, accurately capturing the hierarchy of neural responses through primary visual cortex to inferior temporal cortex (IT). However, the ability of these networks to explain representations in higher cortical areas is relatively lacking and considerably less well researched. For example, DNNs have been less successful as a model of the egocentric to allocentric transformation embodied by circuits in retrosplenial and posterior parietal cortex. We describe a novel scene perception benchmark inspired by a hippocampally dependent task, designed to probe the ability of DNNs to transform scenes viewed from different egocentric perspectives. Using a network architecture inspired by the connectivity between temporal lobe structures and the hippocampus, we demonstrate that DNNs trained using a triplet loss can learn this task. Moreover, by enforcing a factorized latent space, we can split information propagation into "what" and "where" pathways, which we use to reconstruct the input. This allows us to beat the state-of-the-art for unsupervised object segmentation on the CATER and MOVi-A,B,C benchmarks.
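
The triplet loss used for training is the standard formulation: pull an anchor embedding toward a positive (here, another view of the same scene) and push it away from a negative by at least a margin. A minimal sketch on plain-list embeddings; the distance and margin choices are illustrative, not the paper's exact configuration:

```python
import math

def euclidean(u, v):
    # L2 distance between two embedding vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Loss is zero once the negative is at least `margin` farther
    # from the anchor than the positive is.
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```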

V2X-Seq: A Large-Scale Sequential Dataset for Vehicle-Infrastructure Cooperative Perception and Forecasting
Yu, Haibao and Yang, Wenxian and Ruan, Hongzhi and Yang, Zhenwei and Tang, Yingjuan and Gao, Xu and Hao, Xin and Shi, Yifeng and Pan, Yifeng and Sun, Ning and Song, Juan and Yuan, Jirui and Luo, Ping and Nie, Zaiqing



Research question: How to use infrastructure- and vehicle-side information to track and forecast the behavior of surrounding traffic participants, improving decision-making and safety in autonomous driving.
Motivation: The lack of real-world sequential datasets limits research in this area.
Method: Introduce V2X-Seq, the first large-scale sequential V2X dataset, including data frames, trajectories, vector maps, and traffic lights captured from natural scenery. V2X-Seq comprises two parts: a sequential perception dataset (more than 15,000 frames captured from 95 scenarios) and a trajectory forecasting dataset (about 80,000 infrastructure-view, 80,000 vehicle-view, and 50,000 cooperative-view scenarios captured from 28 intersection areas, covering 672 hours of data). Based on V2X-Seq, three new vehicle-infrastructure cooperative (VIC) autonomous driving tasks are proposed: VIC3D Tracking, Online-VIC Forecasting, and Offline-VIC Forecasting, together with benchmarks for these tasks.
Results: Experimental results indicate that the newly proposed tasks are effective for improving decision-making and safety in autonomous driving.

Utilizing infrastructure and vehicle-side information to track and forecast the behaviors of surrounding traffic participants can significantly improve decision-making and safety in autonomous driving. However, the lack of real-world sequential datasets limits research in this area. To address this issue, we introduce V2X-Seq, the first large-scale sequential V2X dataset, which includes data frames, trajectories, vector maps, and traffic lights captured from natural scenery. V2X-Seq comprises two parts: the sequential perception dataset, which includes more than 15,000 frames captured from 95 scenarios, and the trajectory forecasting dataset, which contains about 80,000 infrastructure-view scenarios, 80,000 vehicle-view scenarios, and 50,000 cooperative-view scenarios captured from 28 intersections' areas, covering 672 hours of data. Based on V2X-Seq, we introduce three new tasks for vehicle-infrastructure cooperative (VIC) autonomous driving: VIC3D Tracking, Online-VIC Forecasting, and Offline-VIC Forecasting. We also provide benchmarks for the introduced tasks. Find data, code, and more up-to-date information at https://github.com/AIR-THU/DAIR-V2X-Seq.

3D Video Object Detection With Learnable Object-Centric Global Optimization
He, Jiawei and Chen, Yuntao and Wang, Naiyan and Zhang, Zhaoxiang



Research question: This paper explores long-term temporal visual correspondence optimization for 3D video object detection.
Motivation: Existing 3D video object detection methods mainly rely on 2D image information and ignore the temporal and spatial continuity of objects, while visual correspondence establishes pixel-level one-to-one mappings across multiple images and is the cornerstone of 3D scene reconstruction.
Method: This paper proposes BA-Det, an end-to-end optimizable object detector with object-centric temporal correspondence learning and featuremetric object bundle adjustment.
Results: Experiments show that BA-Det is effective and efficient for multiple baseline 3D detectors under various setups and achieves state-of-the-art performance on the large-scale Waymo Open Dataset (WOD).

We explore long-term temporal visual correspondence-based optimization for 3D video object detection in this work. Visual correspondence refers to one-to-one mappings for pixels across multiple images. Correspondence-based optimization is the cornerstone for 3D scene reconstruction but is less studied in 3D video object detection, because moving objects violate multi-view geometry constraints and are treated as outliers during scene reconstruction. We address this issue by treating objects as first-class citizens during correspondence-based optimization. In this work, we propose BA-Det, an end-to-end optimizable object detector with object-centric temporal correspondence learning and featuremetric object bundle adjustment. Empirically, we verify the effectiveness and efficiency of BA-Det for multiple baseline 3D detectors under various setups. Our BA-Det achieves SOTA performance on the large-scale Waymo Open Dataset (WOD) with only marginal computation cost. Our code is available at https://github.com/jiaweihe1996/BA-Det.

Imitation Learning As State Matching via Differentiable Physics
Chen, Siwei and Ma, Xiao and Xu, Zhongwen



Research question: Existing imitation learning (IL) methods such as inverse reinforcement learning (IRL) usually have a double-loop training process that alternates between learning a reward function and a policy, often leading to long training times and high variance.
Motivation: This paper proposes a new imitation learning method, Imitation Learning via Differentiable Physics (ILD), which removes the double-loop design and achieves significant improvements in final performance, convergence speed, and stability.
Method: ILD incorporates a differentiable physics simulator as a physics prior into its computational graph for policy learning. It unrolls the dynamics by sampling actions from a parameterized policy, simply minimizes the distance between the expert and agent trajectories, and back-propagates gradients into the policy through temporal physics operators. With the physics prior, ILD policies not only transfer to unseen environment specifications but also achieve higher final performance on a variety of tasks. Moreover, ILD naturally forms a single-loop structure, which significantly improves stability and training speed. To simplify the complex optimization landscape induced by temporal physics operators, ILD dynamically selects the learning objective for each state during optimization.
Results: Experiments show that ILD outperforms state-of-the-art methods on a range of continuous control tasks with only one expert demonstration. In addition, ILD can be applied to challenging deformable object manipulation tasks and generalizes to unseen configurations.

Existing imitation learning (IL) methods such as inverse reinforcement learning (IRL) usually have a double-loop training process, alternating between learning a reward function and a policy and tend to suffer long training time and high variance. In this work, we identify the benefits of differentiable physics simulators and propose a new IL method, i.e., Imitation Learning via Differentiable Physics (ILD), which gets rid of the double-loop design and achieves significant improvements in final performance, convergence speed, and stability. The proposed ILD incorporates the differentiable physics simulator as a physics prior into its computational graph for policy learning. It unrolls the dynamics by sampling actions from a parameterized policy, simply minimizing the distance between the expert trajectory and the agent trajectory, and back-propagating the gradient into the policy via temporal physics operators. With the physics prior, ILD policies can not only be transferable to unseen environment specifications but also yield higher final performance on a variety of tasks. In addition, ILD naturally forms a single-loop structure, which significantly improves the stability and training speed. To simplify the complex optimization landscape induced by temporal physics operations, ILD dynamically selects the learning objectives for each state during optimization. In our experiments, we show that ILD outperforms state-of-the-art methods in a variety of continuous control tasks with Brax, requiring only one expert demonstration. In addition, ILD can be applied to challenging deformable object manipulation tasks and can be generalized to unseen configurations.
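
The single-loop objective can be illustrated on a toy one-dimensional system: roll out a linear policy through differentiable dynamics, measure the squared distance to the expert trajectory, and push the gradient back through the physics. This is a hand-derived forward-mode sketch under assumed linear dynamics s_{t+1} = s_t + a_t*dt with policy a_t = theta*s_t, standing in for a full differentiable physics simulator; it is not the paper's implementation:

```python
def ild_step(theta, expert, dt=0.1, lr=0.01):
    """One gradient step of a toy ILD objective.

    Rolls out s_{t+1} = s_t + a_t*dt with linear policy a_t = theta*s_t and
    minimizes the squared distance to `expert` (a list of expert states).
    The gradient flows through the dynamics analytically via forward-mode
    differentiation: ds_{t+1}/dtheta = ds_t/dtheta*(1 + theta*dt) + s_t*dt.
    """
    s, ds = expert[0], 0.0          # ds tracks d s_t / d theta
    loss, grad = 0.0, 0.0
    for e in expert[1:]:
        ds = ds * (1.0 + theta * dt) + s * dt  # chain rule through the dynamics
        s = s + theta * s * dt                 # differentiable "physics" step
        loss += (s - e) ** 2
        grad += 2.0 * (s - e) * ds
    return theta - lr * grad, loss
```

Repeating this step shrinks the trajectory distance and recovers the expert's dynamics parameter, with no inner reward-learning loop.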

Critical Learning Periods for Multisensory Integration in Deep Networks
Kleinman, Michael and Achille, Alessandro and Soatto, Stefano



Research question: Whether a neural network's ability to integrate information from diverse sources depends on exposure to properly correlated signals during the early phases of training.
Motivation: Interfering with the initial stage of learning can permanently impair the development of a skill, a phenomenon known as a critical learning period in both artificial and biological systems.
Method: Challenging the view engendered by analyses of wide and shallow networks, the authors study critical learning periods for multi-source integration in deep linear networks and compare the behavior of deep and shallow networks.
Results: The study finds that architectures trained with cross-sensor reconstruction objectives are remarkably more resilient to critical periods, which may partly explain the recent success of self-supervised multi-modal training over previous supervised efforts.

We show that the ability of a neural network to integrate information from diverse sources hinges critically on being exposed to properly correlated signals during the early phases of training. Interfering with the learning process during this initial stage can permanently impair the development of a skill, both in artificial and biological systems where the phenomenon is known as a critical learning period. We show that critical periods arise from the complex and unstable early transient dynamics, which are decisive for the final performance of the trained system and its learned representations. This evidence challenges the view, engendered by analysis of wide and shallow networks, that early learning dynamics of neural networks are simple, akin to those of a linear model. Indeed, we show that even deep linear networks exhibit critical learning periods for multi-source integration, while shallow networks do not. To better understand how the internal representations change according to disturbances or sensory deficits, we introduce a new measure of source sensitivity, which allows us to track the inhibition and integration of sources during training. Our analysis of inhibition suggests cross-source reconstruction as a natural auxiliary training objective, and indeed we show that architectures trained with cross-sensor reconstruction objectives are remarkably more resilient to critical periods. Our findings suggest that the recent success in self-supervised multi-modal training compared to previous supervised efforts may be in part due to more robust learning dynamics and not solely due to better architectures and/or more data.

GarmentTracking: Category-Level Garment Pose Tracking
Xue, Han and Xu, Wenqiang and Zhang, Jieyi and Tang, Tutian and Li, Yutong and Du, Wenxin and Ye, Ruolin and Lu, Cewu



Research question: Develop a visual system that can estimate and track the complete garment pose, serving various downstream tasks and real-world applications.
Motivation: Given the importance of garments to humans, a system that accurately estimates and tracks garment pose has broad application prospects.
Method: We present a complete solution consisting of a recording system, VR-Garment; a large-scale dataset, VR-Folding; and an end-to-end online tracking framework, GarmentTracking.
Results: Experiments show that the proposed GarmentTracking predicts garment pose well, maintaining high speed and accuracy even when the garment undergoes large non-rigid deformation, and outperforms the baseline approach.

Garments are important to humans. A visual system that can estimate and track the complete garment pose can be useful for many downstream tasks and real-world applications. In this work, we present a complete package to address the category-level garment pose tracking task: (1) A recording system VR-Garment, with which users can manipulate virtual garment models in simulation through a VR interface. (2) A large-scale dataset VR-Folding, with complex garment pose configurations in manipulation like flattening and folding. (3) An end-to-end online tracking framework GarmentTracking, which predicts complete garment pose both in canonical space and task space given a point cloud sequence. Extensive experiments demonstrate that the proposed GarmentTracking achieves great performance even when the garment has large non-rigid deformation. It outperforms the baseline approach in both speed and accuracy. We hope our proposed solution can serve as a platform for future research. Codes and datasets are available at https://garment-tracking.robotflow.ai.

TBP-Former: Learning Temporal Bird's-Eye-View Pyramid for Joint Perception and Prediction in Vision-Centric Autonomous Driving
Fang, Shaoheng and Wang, Zi and Zhong, Yiqi and Ge, Junhao and Chen, Siheng



Research question: How to synchronize features across multiple camera views and timestamps, and further exploit these spatial-temporal features for vision-centric joint perception and prediction.
Motivation: Due to geometric distortions, synchronizing features across views and timestamps and further exploiting the resulting spatial-temporal features is a key challenge in autonomous driving research.
Method: Propose TBP-Former, a temporal bird's-eye-view pyramid transformer with two novel designs. First, a pose-synchronized BEV encoder maps raw image inputs with arbitrary camera poses at arbitrary times into a shared, synchronized BEV space for better spatial-temporal synchronization. Second, a spatial-temporal pyramid transformer comprehensively extracts multi-scale BEV features and predicts future BEV states with the support of spatial priors.
Results: Extensive experiments on the nuScenes dataset show that the proposed framework overall outperforms all state-of-the-art vision-based prediction methods.

Vision-centric joint perception and prediction (PnP) has become an emerging trend in autonomous driving research. It predicts the future states of the traffic participants in the surrounding environment from raw RGB images. However, it is still a critical challenge to synchronize features obtained at multiple camera views and timestamps due to inevitable geometric distortions and further exploit those spatial-temporal features. To address this issue, we propose a temporal bird's-eye-view pyramid transformer (TBP-Former) for vision-centric PnP, which includes two novel designs. First, a pose-synchronized BEV encoder is proposed to map raw image inputs with any camera pose at any time to a shared and synchronized BEV space for better spatial-temporal synchronization. Second, a spatial-temporal pyramid transformer is introduced to comprehensively extract multi-scale BEV features and predict future BEV states with the support of spatial priors. Extensive experiments on the nuScenes dataset show that our proposed framework overall outperforms all state-of-the-art vision-based prediction methods.

Seeing With Sound: Long-range Acoustic Beamforming for Multimodal Scene Understanding
Chakravarthula, Praneeth and D'



Research question: Existing autonomous vehicles mainly rely on electromagnetic-wave sensors, which can suffer in adverse environmental conditions and only detect objects in direct line of sight.
Motivation: To address these issues, the authors propose acoustic pressure-wave beamforming as a complement to traditional optical sensors for detecting objects in dynamic traffic environments.
Method: The authors introduce long-range acoustic beamforming of pressure waves from the noise directly produced by vehicles in natural environments, create the first multimodal long-range acoustic beamforming dataset, and propose a neural aperture expansion method for beamforming.
Results: Experiments show that, in challenging automotive scenarios, this approach effectively complements existing RGB cameras and improves the accuracy and speed of object detection.

Existing autonomous vehicles primarily use sensors that rely on electromagnetic waves which are undisturbed in good environmental conditions but can suffer in adverse scenarios, such as low light or for objects with low reflectance. Moreover, only objects in direct line-of-sight are typically detected by these existing methods. Acoustic pressure waves emanating from road users do not share these limitations. However, such signals are typically ignored in automotive perception because they suffer from low spatial resolution and lack directional information. In this work, we introduce long-range acoustic beamforming of pressure waves from noise directly produced by automotive vehicles in-the-wild as a complementary sensing modality to traditional optical sensor approaches for detection of objects in dynamic traffic environments. To this end, we introduce the first multimodal long-range acoustic beamforming dataset. We propose a neural aperture expansion method for beamforming and we validate its utility for multimodal automotive object detection. We validate the benefit of adding sound detections to existing RGB cameras in challenging automotive scenarios, where camera-only approaches fail or do not deliver the ultra-fast rates of pressure sensors.

Neural Map Prior for Autonomous Driving
Xiong, Xuan and Liu, Yicheng and Yuan, Tianyuan and Wang, Yue and Wang, Yilun and Zhao, Hang



Research question: How to represent a global map with a neural network, enabling automatic global map updates and improving local map inference performance.
Motivation: Traditional offline HD map creation is labor-intensive and costly and cannot be updated in time, while maps inferred from online sensor observations have limited range and are easily affected by occlusions.
Method: Propose Neural Map Prior (NMP), a neural representation of global maps that automatically updates the global map and enhances local map inference. Cross-attention dynamically captures correlations between current and prior features, integrating a strong map prior into local map inference; a learning-based fusion module guides the network in fusing features from previous traversals to update the global neural map prior.
Results: Experiments on the nuScenes dataset show that the framework is compatible with most map segmentation/detection methods and improves map prediction performance under challenging weather conditions and over an extended horizon. To our knowledge, this is the first learning-based system for constructing a global map prior.

High-definition (HD) semantic maps are a crucial component for autonomous driving on urban streets. Traditional offline HD maps are created through labor-intensive manual annotation processes, which are costly and do not accommodate timely updates. Recently, researchers have proposed to infer local maps based on online sensor observations. However, the range of online map inference is constrained by sensor perception range and is easily affected by occlusions. In this work, we propose Neural Map Prior (NMP), a neural representation of global maps that enables automatic global map updates and enhances local map inference performance. To incorporate the strong map prior into local map inference, we leverage cross-attention to dynamically capture the correlations between current features and prior features. For updating the global neural map prior, we use a learning-based fusion module to guide the network in fusing features from previous traversals. This design allows the network to capture a global neural map prior while making sequential online map predictions. Experimental results on the nuScenes dataset demonstrate that our framework is compatible with most map segmentation/detection methods, improving map prediction performance in challenging weather conditions and over an extended horizon. To the best of our knowledge, this represents the first learning-based system for constructing a global map prior.
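
The cross-attention step can be sketched as standard scaled dot-product attention with current features as queries over prior features. A toy, single-head version on plain lists; the real module operates on learned BEV feature maps, so treat this only as an illustration of the attention mechanics:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(current, prior):
    """current: list of query vectors; prior: list of prior feature vectors.
    Each current feature attends over all prior features and returns a
    weighted combination of them (keys and values are both `prior` here)."""
    d = len(current[0])
    fused = []
    for q in current:
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                          for k in prior])
        fused.append([sum(w * v[j] for w, v in zip(scores, prior))
                      for j in range(d)])
    return fused
```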

PartManip: Learning Cross-Category Generalizable Part Manipulation Policy From Point Cloud Observations
Geng, Haoran and Li, Ziming and Geng, Yiran and Chen, Jiayi and Dong, Hao and Wang, He



Research question: How can an embodied agent learn generalizable object manipulation policies for complex real-world scenes.
Motivation: Parts, as components shared across object categories, have the potential to improve the generalization of manipulation policies and enable cross-category object manipulation.
Method: Build PartManip, the first large-scale part-based cross-category object manipulation benchmark, with 11 object categories, 494 objects, and 1,432 tasks in 6 task classes. To tackle the difficulty of vision-based policy learning, a state-based expert is trained with the proposed part-based canonicalization and part-aware rewards, and its knowledge is distilled to a vision-based student. Domain adversarial learning is introduced for domain-invariant feature extraction to improve cross-category generalization.
Results: Experiments show that the learned policy outperforms other methods in simulation, especially on unseen categories, and can successfully manipulate novel objects in the real world.

Learning a generalizable object manipulation policy is vital for an embodied agent to work in complex real-world scenes. Parts, as the shared components in different object categories, have the potential to increase the generalization ability of the manipulation policy and achieve cross-category object manipulation. In this work, we build the first large-scale, part-based cross-category object manipulation benchmark, PartManip, which is composed of 11 object categories, 494 objects, and 1432 tasks in 6 task classes. Compared to previous work, our benchmark is also more diverse and realistic, i.e., having more objects and using sparse-view point cloud as input without oracle information like part segmentation. To tackle the difficulties of vision-based policy learning, we first train a state-based expert with our proposed part-based canonicalization and part-aware rewards, and then distill the knowledge to a vision-based student. We also find an expressive backbone is essential to overcome the large diversity of different objects. For cross-category generalization, we introduce domain adversarial learning for domain-invariant feature extraction. Extensive experiments in simulation show that our learned policy can outperform other methods by a large margin, especially on unseen object categories. We also demonstrate our method can successfully manipulate novel objects in the real world.

Towards Unsupervised Object Detection From LiDAR Point Clouds
Zhang, Lunjun and Yang, Anqi Joyce and Xiong, Yuwen and Casas, Sergio and Yang, Bin and Ren, Mengye and Urtasun, Raquel



Research question: This paper studies unsupervised object detection from 3D point clouds in self-driving scenes.
Motivation: Current unsupervised object detection methods have limitations, such as requiring repeated traversals of the same location and failing at zero-shot detection in sparse, distant regions.
Method: This paper proposes OYSTER, which exploits (i) point clustering in near-range regions where the point clouds are dense, (ii) temporal consistency to filter out noisy unsupervised detections, (iii) the translation equivariance of CNNs to extend auto-labels to long range, and (iv) self-supervision for self-improvement.
Results: Experiments show that OYSTER significantly outperforms unsupervised baselines on the PandaSet and Argoverse 2 Sensor datasets, suggesting that self-supervision combined with object priors can enable object discovery in the wild.

In this paper, we study the problem of unsupervised object detection from 3D point clouds in self-driving scenes. We present a simple yet effective method that exploits (i) point clustering in near-range areas where the point clouds are dense, (ii) temporal consistency to filter out noisy unsupervised detections, (iii) translation equivariance of CNNs to extend the auto-labels to long range, and (iv) self-supervision for improving on its own. Our approach, OYSTER (Object Discovery via Spatio-Temporal Refinement), does not impose constraints on data collection (such as repeated traversals of the same location), is able to detect objects in a zero-shot manner without supervised finetuning (even in sparse, distant regions), and continues to self-improve given more rounds of iterative self-training. To better measure model performance in self-driving scenarios, we propose a new planning-centric perception metric based on distance-to-collision. We demonstrate that our unsupervised object detector significantly outperforms unsupervised baselines on PandaSet and Argoverse 2 Sensor dataset, showing promise that self-supervision combined with object priors can enable object discovery in the wild. For more information, visit the project website: https://waabi.ai/research/oyster.
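
The temporal-consistency filter (ii) can be approximated by keeping only detections that can be followed through nearby detections in subsequent frames. A toy sketch on 2D box centers; the distance threshold, matching rule, and frame count are illustrative assumptions, not the paper's actual filter:

```python
def temporally_consistent(detections_per_frame, max_dist=1.0, min_frames=3):
    """detections_per_frame: list (over time) of lists of (x, y) box centers.
    Keep a frame-0 detection only if a nearby detection exists in each of the
    next min_frames - 1 frames, following the nearest match frame to frame."""
    kept = []
    for c in detections_per_frame[0]:
        cur, ok = c, True
        for frame in detections_per_frame[1:min_frames]:
            matches = [(x, y) for x, y in frame
                       if (cur[0] - x) ** 2 + (cur[1] - y) ** 2 <= max_dist ** 2]
            if not matches:
                ok = False  # track lost: likely a noisy, spurious detection
                break
            cur = min(matches,
                      key=lambda m: (cur[0] - m[0]) ** 2 + (cur[1] - m[1]) ** 2)
        if ok:
            kept.append(c)
    return kept
```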

M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis
Cheng, Hiuyi and Zhang, Peirong and Wu, Sihang and Zhang, Jiaxin and Zhu, Qiyuan and Xie, Zecheng and Li, Jing and Ding, Kai and Jin, Lianwen



Research question: Most public document layout analysis datasets contain only PDF documents and lack realistic documents, so trained models may not generalize well to real-world scenarios.
Motivation: To address this, the paper introduces M^6-Doc, a large and diverse document layout analysis dataset, and designs TransDLANet, a transformer-based document layout analysis method.
Method: M^6-Doc has six properties: multi-format, multi-type, multi-layout, multi-language, multi-annotation category, and modern documents. TransDLANet uses an adaptive element matching mechanism to optimize query embeddings and improve recall, and builds a segmentation branch for more precise document image instance segmentation.
Results: A comprehensive evaluation of various layout analysis methods on the M^6-Doc dataset shows that TransDLANet achieves state-of-the-art performance with 64.5% mAP.

Document layout analysis is a crucial prerequisite for document understanding, including document retrieval and conversion. Most public datasets currently contain only PDF documents and lack realistic documents. Models trained on these datasets may not generalize well to real-world scenarios. Therefore, this paper introduces a large and diverse document layout analysis dataset called M^6-Doc. The M^6 designation represents six properties: (1) Multi-Format (including scanned, photographed, and PDF documents); (2) Multi-Type (such as scientific articles, textbooks, books, test papers, magazines, newspapers, and notes); (3) Multi-Layout (rectangular, Manhattan, non-Manhattan, and multi-column Manhattan); (4) Multi-Language (Chinese and English); (5) Multi-Annotation Category (74 types of annotation labels with 237,116 annotation instances in 9,080 manually annotated pages); and (6) Modern documents. Additionally, we propose a transformer-based document layout analysis method called TransDLANet, which leverages an adaptive element matching mechanism that enables query embedding to better match ground truth to improve recall, and constructs a segmentation branch for more precise document image instance segmentation. We conduct a comprehensive evaluation of M^6-Doc with various layout analysis methods and demonstrate its effectiveness. TransDLANet achieves state-of-the-art performance on M^6-Doc with 64.5% mAP. The M^6-Doc dataset will be available at https://github.com/HCIILAB/M6Doc.

Object-Goal Visual Navigation via Effective Exploration of Relations Among Historical Navigation States
Du, Heming and Li, Lincheng and Huang, Zi and Yu, Xin



Research question: This paper addresses how the correlation among navigation states affects the efficiency and success rate of existing object-goal visual navigation methods.
Motivation: Existing object-goal visual navigation methods mainly focus on learning informative visual representations but overlook the impact of navigation states on navigation effectiveness and efficiency.
Method: This paper presents a History-inspired Navigation policy Learning (HiNL) framework that estimates navigation states effectively by exploring relationships among historical navigation states. Within HiNL, a History-aware State Estimation (HaSE) module alleviates the impact of dominant historical states on current state estimation and encourages the agent to stay alert to changes in the current observation, enabling valid actions. In addition, a History-based State Regularization (HbSR) explicitly suppresses the correlation among navigation states during training.
Results: Experiments on the artificial platform AI2-THOR show that HiNL significantly outperforms state-of-the-art methods on both Success Rate and SPL in unseen testing environments.

Object-goal visual navigation aims at steering an agent toward an object via a series of moving steps. Previous works mainly focus on learning informative visual representations for navigation, but overlook the impacts of navigation states on the effectiveness and efficiency of navigation. We observe that high relevance among navigation states will cause navigation inefficiency or failure for existing methods. In this paper, we present a History-inspired Navigation Policy Learning (HiNL) framework to estimate navigation states effectively by exploring relationships among historical navigation states. In HiNL, we propose a History-aware State Estimation (HaSE) module to alleviate the impacts of dominant historical states on the current state estimation. Meanwhile, HaSE also encourages an agent to be alert to the current observation changes, thus enabling the agent to make valid actions. Furthermore, we design a History-based State Regularization (HbSR) to explicitly suppress the correlation among navigation states in training. As a result, our agent can update states more effectively while reducing the correlations among navigation states. Experiments on the artificial platform AI2-THOR (i.e., iTHOR and RoboTHOR) demonstrate that HiNL significantly outperforms state-of-the-art methods on both Success Rate and SPL in unseen testing environments.
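
Suppressing correlation among states, as HbSR does, could in spirit be sketched as a penalty on pairwise correlations between state vectors along a trajectory. This is a toy illustration using squared Pearson correlation; the paper's exact regularizer may differ:

```python
def correlation_penalty(states):
    """states: list of navigation-state vectors collected along a trajectory.
    Average squared Pearson correlation over all state pairs; minimizing it
    pushes the states toward being mutually decorrelated."""
    def pearson(u, v):
        n = len(u)
        mu, mv = sum(u) / n, sum(v) / n
        cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
        su = sum((a - mu) ** 2 for a in u) ** 0.5
        sv = sum((b - mv) ** 2 for b in v) ** 0.5
        return cov / (su * sv) if su and sv else 0.0
    penalty, pairs = 0.0, 0
    for i in range(len(states)):
        for j in range(i + 1, len(states)):
            penalty += pearson(states[i], states[j]) ** 2
            pairs += 1
    return penalty / pairs if pairs else 0.0
```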

Detecting and Grounding Multi-Modal Media Manipulation
Shao, Rui and Wu, Tianxing and Liu, Ziwei



Research question: This paper poses a new research problem for multi-modal fake media: Detecting and Grounding Multi-Modal Media Manipulation (DGM^4).
Motivation: Misinformation has become a pressing issue, and existing deepfake and text fake-news detection methods only handle single-modality forgery with binary classification and cannot analyze subtle forgery traces across modalities.
Method: Construct the first DGM^4 dataset and design HAMMER, a novel HierArchical Multi-modal Manipulation rEasoning tRansformer, which performs shallow manipulation reasoning via contrastive learning and deep manipulation reasoning via a multi-modal aggregator.
Results: Experiments show the superiority of the HAMMER model on detecting and grounding multi-modal fake media, and reveal several valuable observations for future research on multi-modal media manipulation.

Misinformation has become a pressing issue. Fake media, in both visual and textual forms, is widespread on the web. While various deepfake detection and text fake news detection methods have been proposed, they are only designed for single-modality forgery based on binary classification, let alone analyzing and reasoning subtle forgery traces across different modalities. In this paper, we highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM^4). DGM^4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content (i.e., image bounding boxes and text tokens), which requires deeper reasoning of multi-modal media manipulation. To support a large-scale investigation, we construct the first DGM^4 dataset, where image-text pairs are manipulated by various approaches, with rich annotation of diverse manipulations. Moreover, we propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities. HAMMER performs 1) manipulation-aware contrastive learning between two uni-modal encoders as shallow manipulation reasoning, and 2) modality-aware cross-attention by multi-modal aggregator as deep manipulation reasoning. Dedicated manipulation detection and grounding heads are integrated from shallow to deep levels based on the interacted multi-modal information. Finally, we build an extensive benchmark and set up rigorous evaluation metrics for this new research problem. Comprehensive experiments demonstrate the superiority of our model; several valuable observations are also revealed to facilitate future research in multi-modal media manipulation.
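
The shallow reasoning stage is contrastive learning between the two uni-modal encoders. A generic sketch of the image-to-text direction of InfoNCE over matched image/text embeddings; HAMMER's manipulation-aware variant modifies this objective, so this is only the textbook baseline it builds on:

```python
import math

def info_nce(image_embs, text_embs, temperature=0.1):
    """Image-to-text InfoNCE: the i-th image should score highest against the
    i-th caption among all captions in the batch (cosine similarity logits)."""
    def sim(u, v):
        nu = math.sqrt(sum(x * x for x in u))
        nv = math.sqrt(sum(x * x for x in v))
        return sum(a * b for a, b in zip(u, v)) / (nu * nv)
    n = len(image_embs)
    loss = 0.0
    for i in range(n):
        logits = [sim(image_embs[i], t) / temperature for t in text_embs]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]   # negative log-softmax of the matched pair
    return loss / n
```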

Boosting Detection in Crowd Analysis via Underutilized Output Features
Wu, Shaokai and Yang, Fengyu



Research question: Although detection-based methods perform poorly in dense crowds, we argue their potential is underestimated, as they provide crucial, often-overlooked information for crowd analysis.
Motivation: We argue that the region sizes and confidence scores of output proposals and bounding boxes provide insight into crowd scale and density. To exploit these underutilized features, we propose Crowd Hat, a plug-and-play module that can be easily integrated into existing detection models.
Method: The module uses a mixed 2D-1D compression technique to refine the output features and obtain the spatial and numerical distributions of crowd-specific information. Based on these features, we further propose region-adaptive NMS thresholds and a decouple-then-align paradigm to address the major limitations of detection-based methods.
Results: Extensive evaluations on various crowd analysis tasks, including crowd counting, localization, and detection, demonstrate the effectiveness of exploiting output features and the potential of detection-based methods in crowd analysis. Our code is available at https://github.com/wskingdom/Crowd-Hat.

Detection-based methods have been viewed unfavorably in crowd analysis due to their poor performance in dense crowds. However, we argue that the potential of these methods has been underestimated, as they offer crucial information for crowd analysis that is often ignored. Specifically, the area size and confidence score of output proposals and bounding boxes provide insight into the scale and density of the crowd. To leverage these underutilized features, we propose Crowd Hat, a plug-and-play module that can be easily integrated with existing detection models. This module uses a mixed 2D-1D compression technique to refine the output features and obtain the spatial and numerical distribution of crowd-specific information. Based on these features, we further propose region-adaptive NMS thresholds and a decouple-then-align paradigm that address the major limitations of detection-based methods. Our extensive evaluations on various crowd analysis tasks, including crowd counting, localization, and detection, demonstrate the effectiveness of utilizing output features and the potential of detection-based methods in crowd analysis. Our code is available at https://github.com/wskingdom/Crowd-Hat.
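
Region-adaptive NMS replaces the single global IoU threshold with a per-region one, e.g. a looser threshold in dense regions so heavily overlapping people are not merged away. A minimal greedy sketch; how the `thresholds` are predicted from the 2D-1D compressed features is the paper's contribution and is not modeled here:

```python
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def region_adaptive_nms(boxes, scores, thresholds):
    """Greedy NMS where each kept box i suppresses neighbors using its own
    IoU threshold thresholds[i] (e.g. looser in dense crowd regions)."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep, suppressed = [], set()
    for i in order:
        if i in suppressed:
            continue
        keep.append(i)
        for j in order:
            if j != i and j not in suppressed and iou(boxes[i], boxes[j]) > thresholds[i]:
                suppressed.add(j)
    return keep
```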

MixSim: A Hierarchical Framework for Mixed Reality Traffic Simulation
Suo, Simon and Wong, Kelvin and Xu, Justin and Tu, James and Cui, Alexander and Casas, Sergio and Urtasun, Raquel



Research question: How can self-driving vehicles be safely deployed to the real world?
Motivation: Self-driving testing is currently conducted mainly in simulation, but closed-loop evaluation is needed to ensure safe operation in the real world.
Method: Propose MixSim, a mixed reality traffic simulation framework that learns a reactive route-conditional policy, enabling simulated scenarios to react to and be controlled around real-world situations.
Results: Experiments show that MixSim can serve as a realistic, reactive, and controllable digital twin of real-world scenarios, enabling closed-loop evaluation of self-driving vehicles.

The prevailing way to test a self-driving vehicle (SDV) in simulation involves non-reactive open-loop replay of real world scenarios. However, in order to safely deploy SDVs to the real world, we need to evaluate them in closed-loop. Towards this goal, we propose to leverage the wealth of interesting scenarios captured in the real world and make them reactive and controllable to enable closed-loop SDV evaluation in what-if situations. In particular, we present MixSim, a hierarchical framework for mixed reality traffic simulation. MixSim explicitly models agent goals as routes along the road network and learns a reactive route-conditional policy. By inferring each agent's route from the original scenario, MixSim can reactively re-simulate the scenario and enable testing different autonomy systems under the same conditions. Furthermore, by varying each agent's route, we can expand the scope of testing to what-if situations with realistic variations in agent behaviors or even safety-critical interactions. Our experiments show that MixSim can serve as a realistic, reactive, and controllable digital twin of real world scenarios. For more information, please visit the project website: https://waabi.ai/research/mixsim/

The ObjectFolder Benchmark: Multisensory Learning With Neural and Real Objects
Gao, Ruohan and Dou, Yiming and Li, Hao and Agarwal, Tanmay and Bohg, Jeannette and Li, Yunzhu and Fei-Fei, Li and Wu, Jiajun



Research question: This paper develops an object-centric multisensory learning benchmark suite and creates ObjectFolder Real, a dataset of multisensory measurements for 100 real-world household objects.
Motivation: Existing work mostly studies learning from a single sense, while real-life object recognition, reconstruction, and manipulation require combining visual, auditory, and tactile information.
Method: The authors design a new data collection pipeline for the 3D meshes, videos, impact sounds, and tactile readings of real-world objects, creating the ObjectFolder Real dataset of 100 household objects. They also develop the ObjectFolder Benchmark, a suite of 10 tasks for multisensory object-centric learning.
Results: Systematic benchmarking on the 1,000 multisensory neural objects from ObjectFolder and the real multisensory data from ObjectFolder Real demonstrates the importance of multisensory perception and reveals the respective roles of vision, audio, and touch in different object-centric learning tasks.

We introduce the ObjectFolder Benchmark, a benchmark suite of 10 tasks for multisensory object-centric learning, centered around object recognition, reconstruction, and manipulation with sight, sound, and touch. We also introduce the ObjectFolder Real dataset, including the multisensory measurements for 100 real-world household objects, building upon a newly designed pipeline for collecting the 3D meshes, videos, impact sounds, and tactile readings of real-world objects. For each task in the ObjectFolder Benchmark, we conduct systematic benchmarking on both the 1,000 multisensory neural objects from ObjectFolder, and the real multisensory data from ObjectFolder Real. Our results demonstrate the importance of multisensory perception and reveal the respective roles of vision, audio, and touch for different object-centric learning tasks. By publicly releasing our dataset and benchmark suite, we hope to catalyze and enable new research in multisensory object-centric learning in computer vision, robotics, and beyond. Project page: https://objectfolder.stanford.edu

NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis
Zhou, Allan and Kim, Moo Jin and Wang, Lirui and Florence, Pete and Finn, Chelsea



Research question: How to effectively train visual robotic manipulation policies from expert demonstrations while reducing the reliance on large numbers of demonstrations or expensive online expert supervision.
Motivation: Current imitation learning methods typically require either many demonstrations or costly online expert supervision to learn reactive closed-loop behaviors.
Method: The paper proposes SPARTN (Synthetic Perturbations for Augmenting Robot Trajectories via NeRF), a fully offline data augmentation scheme for improving robot policies that use eye-in-hand cameras. It leverages neural radiance fields (NeRFs) to synthetically inject corrective noise into visual demonstrations: NeRFs generate perturbed viewpoints while the corresponding corrective actions are computed.
Results: On a simulated 6-DoF visual grasping benchmark, SPARTN improves offline success rates by 2.8x over imitation learning without corrective augmentation, and even outperforms some methods that use online supervision. It also closes the gap between RGB-only and RGB-D success rates, removing the previous need for depth sensors. In real-world 6-DoF robotic grasping experiments, the method improves absolute success rates by 22.5% on average, including on objects that are traditionally challenging for depth-based methods.

Expert demonstrations are a rich source of supervision for training visual robotic manipulation policies, but imitation learning methods often require either a large number of demonstrations or expensive online expert supervision to learn reactive closed-loop behaviors. In this work, we introduce SPARTN (Synthetic Perturbations for Augmenting Robot Trajectories via NeRF): a fully-offline data augmentation scheme for improving robot policies that use eye-in-hand cameras. Our approach leverages neural radiance fields (NeRFs) to synthetically inject corrective noise into visual demonstrations: using NeRFs to generate perturbed viewpoints while simultaneously calculating the corrective actions. This requires no additional expert supervision or environment interaction, and distills the geometric information in NeRFs into a real-time reactive RGB-only policy. In a simulated 6-DoF visual grasping benchmark, SPARTN improves offline success rates by 2.8x over imitation learning without the corrective augmentations and even outperforms some methods that use online supervision. It additionally closes the gap between RGB-only and RGB-D success rates, eliminating the previous need for depth sensors. In real-world 6-DoF robotic grasping experiments from limited human demonstrations, our method improves absolute success rates by 22.5% on average, including objects that are traditionally challenging for depth-based methods.
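The core augmentation idea, pairing a perturbed eye-in-hand view with the action that undoes the perturbation, can be sketched roughly as follows. This is a minimal translation-only sketch, not the paper's implementation: in the full method a NeRF trained on the demonstration renders the RGB observation at the perturbed pose, and poses live in SE(3) rather than R^3.

```python
import numpy as np

def spartn_style_augment(demo_pose, demo_action, noise_scale=0.01, rng=None):
    """Sketch of SPARTN-style corrective augmentation (positions only).

    demo_pose: (3,) camera/gripper position at a demonstration step.
    demo_action: (3,) expert delta-position action at that step.
    Returns a perturbed pose and a corrective action that both undoes the
    perturbation and continues the expert motion.
    """
    rng = rng or np.random.default_rng(0)
    perturbation = rng.normal(scale=noise_scale, size=3)
    perturbed_pose = demo_pose + perturbation
    # Corrective action: return toward the demo trajectory, then follow it.
    corrective_action = demo_action - perturbation
    return perturbed_pose, corrective_action

pose = np.array([0.5, 0.0, 0.3])
action = np.array([0.0, 0.0, -0.05])
new_pose, new_action = spartn_style_augment(pose, action)
# Executing the corrective action from the perturbed pose lands where the
# original action would have from the original pose.
assert np.allclose(new_pose + new_action, pose + action)
```

Because each augmented pair is derived purely from the recorded demonstration and the NeRF, no extra expert supervision or environment interaction is needed, which is what makes the scheme fully offline.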

Multi-Granularity Archaeological Dating of Chinese Bronze Dings Based on a Knowledge-Guided Relation Graph
Zhou, Rixin and Wei, Jiafu and Zhang, Qian and Qi, Ruihua and Yang, Xi and Li, Chuntao



Research question: How to apply deep learning to the archaeological dating of Chinese bronze dings.
Motivation: Current archaeological dating of bronze dings relies on trained experts and is time-consuming and labor-intensive.
Method: Collect a large-scale image dataset of bronze dings, and introduce a multihead classifier and a knowledge-guided relation graph to mine the relationship between attributes and the era of a ding.
Results: Experiments show that the proposed method achieves state-of-the-art performance on the archaeological dating of bronze dings.

The archaeological dating of bronze dings has played a critical role in the study of ancient Chinese history. Current archaeology depends on trained experts to carry out bronze dating, which is time-consuming and labor-intensive. For such dating, in this study, we propose a learning-based approach to integrate advanced deep learning techniques and archaeological knowledge. To achieve this, we first collect a large-scale image dataset of bronze dings, which contains richer attribute information than other existing fine-grained datasets. Second, we introduce a multihead classifier and a knowledge-guided relation graph to mine the relationship between attributes and the ding era. Third, we conduct comparison experiments with various existing methods, the results of which show that our dating method achieves a state-of-the-art performance. We hope that our data and applied networks will enrich fine-grained classification research relevant to other interdisciplinary areas of expertise. The dataset and source code used are included in our supplementary materials, and will be open after submission owing to the anonymity policy. Source codes and data are available at: https://github.com/zhourixin/bronze-Ding.

What Happened 3 Seconds Ago? Inferring the Past With Thermal Imaging
Tang, Zitian and Ye, Wenjie and Ma, Wei-Chiu and Zhao, Hang



Research question: How to infer past human motion from RGB images?
Motivation: Inferring past human motion from RGB images is challenging due to the inherent uncertainty of the prediction problem, whereas thermal imaging records traces of past human-object interactions in the environment via thermal radiation measurement.
Method: The authors collect Thermal-IM, the first RGB-Thermal dataset for human motion analysis, and develop a three-stage neural network model for accurate past human pose estimation.
Results: Experiments show that thermal cues significantly reduce the ambiguity of this task and that the proposed model achieves remarkable performance. The dataset is available at https://github.com/ZitianTang/Thermal-IM.

Inferring past human motion from RGB images is challenging due to the inherent uncertainty of the prediction problem. Thermal images, on the other hand, encode traces of past human-object interactions left in the environment via thermal radiation measurement. Based on this observation, we collect the first RGB-Thermal dataset for human motion analysis, dubbed Thermal-IM. Then we develop a three-stage neural network model for accurate past human pose estimation. Comprehensive experiments show that thermal cues significantly reduce the ambiguities of this task, and the proposed model achieves remarkable performance. The dataset is available at https://github.com/ZitianTang/Thermal-IM.

MIME: Human-Aware 3D Scene Generation
Yi, Hongwei and Huang, Chun-Hao P. and Tripathi, Shashank and Hering, Lea and Thies, Justus and Black, Michael J.



Research question: How to effectively generate 3D indoor scenes that account for human motion and interaction?
Motivation: Existing 3D scene generation methods are expensive and labor-intensive, whereas human motion and interaction can be used to generate 3D indoor scenes more effectively.
Method: Propose MIME, a model using an auto-regressive transformer architecture that takes the already generated scene objects and the human motion as input and outputs the next plausible object.
Results: Experiments show that MIME generates more diverse and plausible 3D scenes than existing scene generation methods that do not account for human movement.

Generating realistic 3D worlds occupied by moving humans has many applications in games, architecture, and synthetic data creation. But generating such scenes is expensive and labor intensive. Recent work generates human poses and motions given a 3D scene. Here, we take the opposite approach and generate 3D indoor scenes given 3D human motion. Such motions can come from archival motion capture or from IMU sensors worn on the body, effectively turning human movement in a "scanner" of the 3D world. Intuitively, human movement indicates the free-space in a room and human contact indicates surfaces or objects that support activities such as sitting, lying or touching. We propose MIME (Mining Interaction and Movement to infer 3D Environments), which is a generative model of indoor scenes that produces furniture layouts that are consistent with the human movement. MIME uses an auto-regressive transformer architecture that takes the already generated objects in the scene as well as the human motion as input, and outputs the next plausible object. To train MIME, we build a dataset by populating the 3D FRONT scene dataset with 3D humans. Our experiments show that MIME produces more diverse and plausible 3D scenes than a recent generative scene method that does not know about human movement. Code and data will be available for research.
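The auto-regressive loop described above, conditioning each placement on the objects placed so far plus the human motion, has a simple generic shape. The sketch below is a hypothetical stand-in for the trained transformer, not MIME's actual model; `predict_next` and `toy_model` are illustrative names.

```python
def generate_scene(human_motion, predict_next, max_objects=10):
    """Sketch of MIME-style auto-regressive scene generation.

    predict_next: hypothetical stand-in for the trained transformer; it
    takes (objects placed so far, human motion) and returns the next
    plausible object, or None when the scene is complete.
    """
    objects = []
    for _ in range(max_objects):
        nxt = predict_next(objects, human_motion)
        if nxt is None:
            break
        objects.append(nxt)
    return objects

# Toy stand-in model: a sitting motion implies a chair, then a table.
def toy_model(objects, motion):
    plan = {"sitting": ["chair", "table"]}.get(motion, [])
    return plan[len(objects)] if len(objects) < len(plan) else None

scene = generate_scene("sitting", toy_model)  # -> ["chair", "table"]
```

The point of the loop structure is that contact and free-space constraints from the motion can influence every placement, since the motion is re-fed at each step.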

A New Path: Scaling Vision-and-Language Navigation With Synthetic Instructions and Imitation Learning
Kamath, Aishwarya and Anderson, Peter and Wang, Su and Koh, Jing Yu and Ku, Alexander and Waters, Austin and Yang, Yinfei and Baldridge, Jason and Parekh, Zarana



Research question: Existing Vision-and-Language Navigation (VLN) research struggles with complex language grounding and spatial language understanding.
Motivation: Given the scarcity of human instruction data and the limited diversity of training environments, existing VLN agents cannot reliably execute natural-language navigation instructions.
Method: Pre-train on large-scale web text and image-text datasets, and use Marky, a high-quality multilingual navigation instruction generator, to produce visually grounded instructions; additionally, an image-to-image GAN synthesizes image observations from novel viewpoints.
Results: This yields a dataset two orders of magnitude larger than existing human-annotated datasets, covering a wider variety of environments and viewpoints. A simple transformer agent outperforms all existing RL agents on the challenging RxR dataset, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen environments and from 64.6 to 66.8 in unseen test environments.

Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding. Pre-training on large text and image-text datasets from the web has been extensively explored but the improvements are limited. We investigate large-scale augmentation with synthetic instructions. We take 500+ indoor environments captured in densely-sampled 360 degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory using Marky, a high-quality multilingual navigation instruction generator. We also synthesize image observations from novel viewpoints using an image-to-image GAN. The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets, and contains a wider variety of environments and viewpoints. To efficiently leverage data at this scale, we train a simple transformer agent with imitation learning. On the challenging RxR dataset, our approach outperforms all existing RL agents, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen environments, and from 64.6 to 66.8 in unseen test environments. Our work points to a new path to improving instruction-following agents, emphasizing large-scale training on near-human quality synthetic instructions.

Towards Building Self-Aware Object Detectors via Reliable Uncertainty Quantification and Calibration
Oksuz, Kemal and Joy, Tom and Dokania, Puneet K.



Research question: Existing methods for testing the robustness of object detectors are flawed, e.g., improper procedures for out-of-distribution detection and calibration metrics that ignore localisation and classification quality.
Motivation: To address these issues, the authors propose the Self-Aware Object Detection (SAOD) task, a unified testing framework that respects and adheres to the challenges object detectors face in safety-critical settings such as autonomous driving.
Method: The SAOD task requires an object detector to be robust to domain shift, obtain reliable uncertainty estimates for the entire scene, and provide calibrated confidence scores for its detections. The framework, which introduces novel metrics and large-scale test datasets, is used extensively to test numerous object detectors in two different use cases, yielding critical insights into their robustness.
Results: Finally, a simple baseline for the SAOD task is introduced, enabling researchers to benchmark future methods and move towards robust, fit-for-purpose object detectors. Code is available at: https://github.com/fiveai/saod

The current approach for testing the robustness of object detectors suffers from serious deficiencies such as improper methods of performing out-of-distribution detection and using calibration metrics which do not consider both localisation and classification quality. In this work, we address these issues, and introduce the Self Aware Object Detection (SAOD) task, a unified testing framework which respects and adheres to the challenges that object detectors face in safety-critical environments such as autonomous driving. Specifically, the SAOD task requires an object detector to be: robust to domain shift; obtain reliable uncertainty estimates for the entire scene; and provide calibrated confidence scores for the detections. We extensively use our framework, which introduces novel metrics and large scale test datasets, to test numerous object detectors in two different use-cases, allowing us to highlight critical insights into their robustness performance. Finally, we introduce a simple baseline for the SAOD task, enabling researchers to benchmark future proposed methods and move towards robust object detectors which are fit for purpose. Code is available at: https://github.com/fiveai/saod

CIRCLE: Capture in Rich Contextual Environments
Araújo, João Pedro and Li, Jiaman and Vetrivel, Karthik and Agarwal, Rishi and Wu, Jiajun and Gopinath, Deepak and Clegg, Alexander William and Liu, Karen



Research question: How to effectively synthesize 3D human motion in contextual, ecological environments, in order to simulate realistic activities people perform in the real world.
Motivation: Conventional optics-based motion capture systems cannot simultaneously capture human motion and complex scenes, and the lack of rich contextual 3D human motion datasets is a roadblock to creating high-quality generative human motion models.
Method: The authors propose a novel motion acquisition system in which the actor perceives and operates in a highly contextual virtual world while being motion-captured in the real world. The system enables rapid collection of high-quality human motion in highly diverse scenes, without concerns about occlusion or the need for physical scene construction in the real world.
Results: The paper presents CIRCLE, a dataset containing 10 hours of full-body reaching motion from 5 subjects across 9 scenes, paired with ego-centric information about the environment represented in various forms, such as RGBD videos. A model trained on this dataset generates human motion conditioned on scene information; leveraging the dataset, it learns to use ego-centric scene information to achieve nontrivial reaching tasks in complex 3D scenes.

Synthesizing 3D human motion in a contextual, ecological environment is important for simulating realistic activities people perform in the real world. However, conventional optics-based motion capture systems are not suited for simultaneously capturing human movements and complex scenes. The lack of rich contextual 3D human motion datasets presents a roadblock to creating high-quality generative human motion models. We propose a novel motion acquisition system in which the actor perceives and operates in a highly contextual virtual world while being motion captured in the real world. Our system enables rapid collection of high-quality human motion in highly diverse scenes, without the concern of occlusion or the need for physical scene construction in the real world. We present CIRCLE, a dataset containing 10 hours of full-body reaching motion from 5 subjects across nine scenes, paired with ego-centric information of the environment represented in various forms, such as RGBD videos. We use this dataset to train a model that generates human motion conditioned on scene information. Leveraging our dataset, the model learns to use ego-centric scene information to achieve nontrivial reaching tasks in the context of complex 3D scenes. To download the data please visit our website (https://stanford-tml.github.io/circle_dataset/).

PyPose: A Library for Robot Learning With Physics-Based Optimization
Wang, Chen and Gao, Dasong and Xu, Kuan and Geng, Junyi and Hu, Yaoyu and Qiu, Yuheng and Li, Bowen and Yang, Fan and Moon, Brady and Pandey, Abhinav and Aryan and Xu, Jiahe and Wu, Tianhao and He, Haonan and Huang, Daning and Ren, Zhongqiang and Zhao, Shibo and Fu, Taimeng and Reddy, Pranay and Lin, Xiao and Wang, Wenshan and Shi, Jingnan and Talak, Rajat and Cao, Kun and Du, Yi and Wang, Han and Yu, Huai and Wang, Shanzhao and Chen, Siyu and Kashyap, Ananth and Bandaru, Rohan and Dantu, Karthik and Wu, Jiajun and Xie, Lihua and Carlone, Luca and Hutter, Marco and Scherer, Sebastian



Research question: How to combine deep learning with physics-based optimization, so as to adapt to ever-changing environments and handle complex tasks?
Motivation: Deep learning has had remarkable success in robotic perception but struggles in ever-changing environments; physics-based optimization generalizes better, but underperforms on complex tasks and requires manual parameter tuning.
Method: Propose PyPose, a robotics-oriented, PyTorch-based library that combines deep perceptual models with physics-based optimization.
Results: Experiments show that PyPose is more than 10x faster than existing state-of-the-art libraries, and it provides concrete examples for future research in fields including SLAM, planning, control, and inertial navigation.

Deep learning has had remarkable success in robotic perception, but its data-centric nature suffers when it comes to generalizing to ever-changing environments. By contrast, physics-based optimization generalizes better, but it does not perform as well in complicated tasks due to the lack of high-level semantic information and reliance on manual parametric tuning. To take advantage of these two complementary worlds, we present PyPose: a robotics-oriented, PyTorch-based library that combines deep perceptual models with physics-based optimization. PyPose's architecture is tidy and well-organized, it has an imperative style interface and is efficient and user-friendly, making it easy to integrate into real-world robotic applications. Besides, it supports parallel computing of any order gradients of Lie groups and Lie algebras and 2nd-order optimizers, such as trust region methods. Experiments show that PyPose achieves more than 10x speedup in computation compared to the state-of-the-art libraries. To boost future research, we provide concrete examples for several fields of robot learning, including SLAM, planning, control, and inertial navigation.

Multi-Sensor Large-Scale Dataset for Multi-View 3D Reconstruction
Voynov, Oleg and Bobrovskikh, Gleb and Karpyshev, Pavel and Galochkin, Saveliy and Ardelean, Andrei-Timotei and Bozhenko, Arseniy and Karmanova, Ekaterina and Kopanev, Pavel and Labutin-Rymsho, Yaroslav and Rakhimov, Ruslan and Safin, Aleksandr and Serpiva, Valerii and Artemov, Alexey and Burnaev, Evgeny and Tsetserukou, Dzmitry and Zorin, Denis



Research question: Developing a new multi-sensor dataset for multi-view 3D surface reconstruction.
Motivation: Existing algorithms perform poorly on challenging material properties, and a new, diverse dataset is needed for evaluation and training.
Method: Collect registered RGB and depth data from sensors of different resolutions and modalities (smartphones, Intel RealSense, Microsoft Kinect, industrial cameras, and a structured-light scanner), acquiring around 1.4 million images from 100 viewing directions under 14 lighting conditions.
Results: The result is a dataset of 107 scenes and 1.4 million images that can be used to evaluate and train 3D reconstruction algorithms and related tasks.

We present a new multi-sensor dataset for multi-view 3D surface reconstruction. It includes registered RGB and depth data from sensors of different resolutions and modalities: smartphones, Intel RealSense, Microsoft Kinect, industrial cameras, and structured-light scanner. The scenes are selected to emphasize a diverse set of material properties challenging for existing algorithms. We provide around 1.4 million images of 107 different scenes acquired from 100 viewing directions under 14 lighting conditions. We expect our dataset will be useful for evaluation and training of 3D reconstruction algorithms and for related tasks. The dataset is available at skoltech3d.appliedai.tech.

Privacy-Preserving Representations Are Not Enough: Recovering Scene Content From Camera Poses
Chelani, Kunal and Sattler, Torsten and Kahl, Fredrik and Kukelova, Zuzana



Research question: How to protect privacy during visual localization and prevent an attacker from learning scene details by querying a localization service.
Motivation: With the growing popularity of AR/VR/MR devices and cloud-based applications, privacy is becoming an increasingly important aspect of the localization process.
Method: This paper presents an attack in which an adversary learns details of a scene, without any access, simply by querying a localization service. The attack exploits the robustness of modern visual localization algorithms to variations in appearance and geometry.
Results: The paper develops a proof-of-concept version of the attack and demonstrates its practical feasibility. The attack places no requirements on the localization algorithm used, and thus also applies to privacy-preserving representations.

Visual localization is the task of estimating the camera pose from which a given image was taken and is central to several 3D computer vision applications. With the rapid growth in the popularity of AR/VR/MR devices and cloud-based applications, privacy issues are becoming a very important aspect of the localization process. Existing work on privacy-preserving localization aims to defend against an attacker who has access to a cloud-based service. In this paper, we show that an attacker can learn about details of a scene without any access by simply querying a localization service. The attack is based on the observation that modern visual localization algorithms are robust to variations in appearance and geometry. While this is in general a desired property, it also leads to algorithms localizing objects that are similar enough to those present in a scene. An attacker can thus query a server with a large enough set of images of objects, e.g., obtained from the Internet, and some of them will be localized. The attacker can thus learn about object placements from the camera poses returned by the service (which is the minimal information returned by such a service). In this paper, we develop a proof-of-concept version of this attack and demonstrate its practical feasibility. The attack does not place any requirements on the localization algorithm used, and thus also applies to privacy-preserving representations. Current work on privacy-preserving representations alone is thus insufficient.

A New Dataset Based on Images Taken by Blind People for Testing the Robustness of Image Classification Models Trained for ImageNet Categories
Bafghi, Reza Akbarian and Gurari, Danna



Research question: How to improve the performance of image classification models trained in one domain on images from another domain.
Motivation: Existing image classification models degrade when applied across domains, and no public dataset targets this problem.
Method: Build VizWiz-Classification, a new dataset of 8,900 images taken by people who are blind, each annotated with the presence or absence of 200 ImageNet object categories.
Results: An analysis of 100 ImageNet classification models on this dataset shows that these models struggle on images with quality issues.

Our goal is to improve upon the status quo for designing image classification models trained in one domain that perform well on images from another domain. Complementing existing work in robustness testing, we introduce the first dataset for this purpose which comes from an authentic use case where photographers wanted to learn about the content in their images. We built a new test set using 8,900 images taken by people who are blind for which we collected metadata to indicate the presence versus absence of 200 ImageNet object categories. We call this dataset VizWiz-Classification. We characterize this dataset and how it compares to the mainstream datasets for evaluating how well ImageNet-trained classification models generalize. Finally, we analyze the performance of 100 ImageNet classification models on our new test dataset. Our fine-grained analysis demonstrates that these models struggle on images with quality issues. To enable future extensions to this work, we share our new dataset with evaluation server at: https://vizwiz.org/tasks-and-datasets/image-classification

Renderable Neural Radiance Map for Visual Navigation
Kwon, Obin and Park, Jeongho and Oh, Songhwai



Research question: How to design a new type of map for visual navigation that contains the overall visual information of a 3D environment?
Motivation: Maps used in existing visual navigation carry incomplete information; a new kind of map is needed to better describe the scene and guide visual localization and navigation.
Method: Propose a renderable neural radiance map (RNR-Map), a grid of per-pixel latent codes embedded from image observations, which can be converted into a neural radiance field that enables image rendering given a camera pose. The recorded latent codes implicitly contain visual information about the environment, making the RNR-Map visually descriptive. Localization and navigation frameworks that effectively utilize the RNR-Map are developed.
Results: Experiments show that the RNR-Map-based localization framework finds target locations quickly and accurately from a single query image and is robust to environmental changes. The proposed navigation framework outperforms existing image-goal navigation methods in difficult scenarios, achieving a 65.7% success rate in curved scenarios of the NRNS dataset, an 18.6% improvement over the current state of the art.

We propose a novel type of map for visual navigation, a renderable neural radiance map (RNR-Map), which is designed to contain the overall visual information of a 3D environment. The RNR-Map has a grid form and consists of latent codes at each pixel. These latent codes are embedded from image observations, and can be converted to the neural radiance field which enables image rendering given a camera pose. The recorded latent codes implicitly contain visual information about the environment, which makes the RNR-Map visually descriptive. This visual information in RNR-Map can be a useful guideline for visual localization and navigation. We develop localization and navigation frameworks that can effectively utilize the RNR-Map. We evaluate the proposed frameworks on camera tracking, visual localization, and image-goal navigation. Experimental results show that the RNR-Map-based localization framework can find the target location based on a single query image with fast speed and competitive accuracy compared to other baselines. Also, this localization framework is robust to environmental changes, and even finds the most visually similar places when a query image from a different environment is given. The proposed navigation framework outperforms the existing image-goal navigation methods in difficult scenarios, under odometry and actuation noises. The navigation framework shows 65.7% success rate in curved scenarios of the NRNS dataset, which is an improvement of 18.6% over the current state-of-the-art. Project page: https://rllab-snu.github.io/projects/RNR-Map/

Diffusion-Based Generation, Optimization, and Planning in 3D Scenes
Huang, Siyuan and Wang, Zan and Li, Puhao and Jia, Baoxiong and Liu, Tengyu and Zhu, Yixin and Liang, Wei and Zhu, Song-Chun



Research question: This paper proposes SceneDiffuser, a conditional generative model for 3D scene understanding.
Motivation: Prior scene-conditioned models suffer from discrepancies between modules and posterior collapse; a unified scene-aware, physics-based, goal-oriented model is needed.
Method: SceneDiffuser uses an iterative sampling strategy that jointly performs scene-aware generation, physics-based optimization, and goal-oriented planning via a fully differentiable diffusion-based denoising process.
Results: On tasks including human pose and motion generation, dexterous grasp generation, path planning for 3D navigation, and motion planning for robot arms, SceneDiffuser shows significant improvements over previous models, demonstrating great application potential.

We introduce SceneDiffuser, a conditional generative model for 3D scene understanding. SceneDiffuser provides a unified model for solving scene-conditioned generation, optimization, and planning. In contrast to prior works, SceneDiffuser is intrinsically scene-aware, physics-based, and goal-oriented. With an iterative sampling strategy, SceneDiffuser jointly formulates the scene-aware generation, physics-based optimization, and goal-oriented planning via a diffusion-based denoising process in a fully differentiable fashion. Such a design alleviates the discrepancies among different modules and the posterior collapse of previous scene-conditioned generative models. We evaluate SceneDiffuser with various 3D scene understanding tasks, including human pose and motion generation, dexterous grasp generation, path planning for 3D navigation, and motion planning for robot arms. The results show significant improvements compared with previous models, demonstrating the tremendous potential of SceneDiffuser for the broad community of 3D scene understanding.
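The way a single denoising loop can "jointly formulate" generation, optimization, and planning is easiest to see in a stripped-down sketch: each reverse-diffusion step applies the learned denoiser, then nudges the sample along the gradient of a differentiable physics/goal objective. This is a generic guided-sampling sketch under stated assumptions, not SceneDiffuser's actual architecture; `denoise_step`, `objective_grad`, and the toy instantiation are all hypothetical.

```python
def guided_denoise(x, denoise_step, objective_grad, num_steps=50, guide_scale=0.1):
    """Sketch of diffusion sampling with physics/goal guidance.

    x: initial noisy sample (e.g., a pose or trajectory parameter).
    denoise_step: learned reverse-diffusion step (x, t) -> denoised x.
    objective_grad: gradient of a differentiable objective (physics or
    goal term) that steers samples toward feasible, goal-reaching ones.
    """
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)                   # generation term
        x = x - guide_scale * objective_grad(x)  # optimization/planning term
    return x

# Toy 1-D instantiation: the "denoiser" shrinks the sample toward 0 while
# the objective gradient (of (x - target)^2) pulls it toward a goal at 1.0.
target = 1.0
final = guided_denoise(
    x=5.0,
    denoise_step=lambda x, t: 0.9 * x,
    objective_grad=lambda x: 2.0 * (x - target),
)
assert abs(final - target) < abs(5.0 - target)
```

Because every step is differentiable, the same loop can be trained or guided end-to-end, which is the property the abstract emphasizes.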

Planning-Oriented Autonomous Driving
Hu, Yihan and Yang, Jiazhi and Chen, Li and Li, Keyu and Sima, Chonghao and Zhu, Xizhou and Chai, Siqi and Du, Senyao and Lin, Tianwei and Wang, Wenhai and Lu, Lewei and Jia, Xiaosong and Liu, Qiang and Dai, Jifeng and Qiao, Yu and Li, Hongyang



Research question: How to design a framework that is optimized for the planning of a self-driving car.
Motivation: Existing autonomous driving systems either deploy standalone models for individual tasks or design a multi-task paradigm with separate heads, but these approaches may accumulate errors or lack task coordination.
Method: The paper proposes Unified Autonomous Driving (UniAD), a framework that integrates the full-stack driving tasks of perception, prediction, and planning into one network, with unified query interfaces through which the tasks communicate and cooperate toward planning.
Results: UniAD is instantiated on the challenging nuScenes benchmark; extensive ablations show that this philosophy substantially outperforms previous state-of-the-art methods in all aspects.

Modern autonomous driving system is characterized as modular tasks in sequential order, i.e., perception, prediction, and planning. In order to perform a wide diversity of tasks and achieve advanced-level intelligence, contemporary approaches either deploy standalone models for individual tasks, or design a multi-task paradigm with separate heads. However, they might suffer from accumulative errors or deficient task coordination. Instead, we argue that a favorable framework should be devised and optimized in pursuit of the ultimate goal, i.e., planning of the self-driving car. Oriented at this, we revisit the key components within perception and prediction, and prioritize the tasks such that all these tasks contribute to planning. We introduce Unified Autonomous Driving (UniAD), a comprehensive framework up-to-date that incorporates full-stack driving tasks in one network. It is exquisitely devised to leverage advantages of each module, and provide complementary feature abstractions for agent interaction from a global perspective. Tasks are communicated with unified query interfaces to facilitate each other toward planning. We instantiate UniAD on the challenging nuScenes benchmark. With extensive ablations, the effectiveness of using such a philosophy is proven by substantially outperforming previous state-of-the-arts in all aspects. Code and models are public.

LEGO-Net: Learning Regular Rearrangements of Objects in Rooms
Wei, Qiuhong Anna and Ding, Sijie and Park, Jeong Joon and Sajnani, Rahul and Poulenard, Adrien and Sridhar, Srinath and Guibas, Leonidas



Research question: How can machines understand and help humans tidy messy rooms, in accordance with human criteria for spatial arrangement and aesthetics.
Motivation: Humans universally dislike cleaning up messy rooms; if machines are to help with this task, they must understand human criteria for spatial arrangement and aesthetics.
Method: The paper proposes LEGO-Net, a data-driven iterative method inspired by diffusion models: it iteratively "de-noises" the positions and orientations of objects, gradually rearranging a messy room into a regular state that satisfies human criteria.
Results: Experiments show that LEGO-Net reliably rearranges room scenes and outperforms other methods on metrics that evaluate the regularity of room arrangements.

Humans universally dislike the task of cleaning up a messy room. If machines were to help us with this task, they must understand human criteria for regular arrangements, such as several types of symmetry, co-linearity or co-circularity, spacing uniformity in linear or circular patterns, and further inter-object relationships that relate to style and functionality. Previous approaches for this task relied on human input to explicitly specify goal state, or synthesized scenes from scratch--but such methods do not address the rearrangement of existing messy scenes without providing a goal state. In this paper, we present LEGO-Net, a data-driven transformer-based iterative method for LEarning reGular rearrangement of Objects in messy rooms. LEGO-Net is partly inspired by diffusion models--it starts with an initial messy state and iteratively "de-noises" the position and orientation of objects to a regular state while reducing distance traveled. Given randomly perturbed object positions and orientations in an existing dataset of professionally-arranged scenes, our method is trained to recover a regular re-arrangement. Results demonstrate that our method is able to reliably rearrange room scenes and outperform other methods. We additionally propose a metric for evaluating regularity in room arrangements using number-theoretic machinery.
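The iterative "de-noising" described above amounts to repeatedly applying a learned displacement field in small steps, which also keeps the distance traveled low. The sketch below is a toy illustration, not LEGO-Net itself: `denoiser` stands in for the trained transformer, and the co-linearity rule is an invented example of a regularity criterion.

```python
import numpy as np

def iterative_rearrange(positions, denoiser, num_iters=100, step=0.05):
    """Sketch of LEGO-Net-style iterative 'de-noising' of object poses.

    positions: (N, 2) messy 2D object positions.
    denoiser: learned map from current positions to a displacement toward
    a regular arrangement (hypothetical stand-in for the network).
    Small steps keep the total distance traveled low.
    """
    for _ in range(num_iters):
        positions = positions + step * denoiser(positions)
    return positions

# Toy "regularity" rule standing in for the learned network: push objects
# toward co-linearity along the x-axis (drive every y-coordinate to 0).
messy = np.array([[0.0, 0.8], [1.0, -0.5], [2.0, 1.0]])
tidy = iterative_rearrange(
    messy, lambda p: np.stack([np.zeros(len(p)), -p[:, 1]], axis=1))
assert np.abs(tidy[:, 1]).max() < 0.1  # rows are now nearly co-linear
```

In the real method the displacement field is learned from randomly perturbed professionally-arranged scenes, so the recovered "regular" state reflects human arrangement criteria rather than a hand-coded rule like this one.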

Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking
Cao, Jinkun and Pang, Jiangmiao and Weng, Xinshuo and Khirodkar, Rawal and Kitani, Kris



Research question: In multi-object tracking (MOT), the Kalman filter's (KF) linear motion assumption can produce highly inaccurate estimates under prolonged occlusion.
Motivation: When no measurements are available to update the Kalman filter parameters, the standard convention is to rely on the priori state estimate for the posteriori update, which accumulates errors during occlusion.
Method: We propose Observation-Centric SORT (OC-SORT), which uses object observations (i.e., the measurements from an object detector) to compute a virtual trajectory over the occlusion period and repair the error accumulated in the filter parameters.
Results: OC-SORT is simple, online, and real-time, and improves robustness to occlusion and non-linear motion. It achieves state-of-the-art performance on multiple datasets, including MOT17, MOT20, KITTI, head tracking, and DanceTrack, where object motion is highly non-linear.

Kalman filter (KF) based methods for multi-object tracking (MOT) make an assumption that objects move linearly. While this assumption is acceptable for very short periods of occlusion, linear estimates of motion for prolonged time can be highly inaccurate. Moreover, when there is no measurement available to update Kalman filter parameters, the standard convention is to trust the priori state estimations for posteriori update. This leads to the accumulation of errors during a period of occlusion. The error causes significant motion direction variance in practice. In this work, we show that a basic Kalman filter can still obtain state-of-the-art tracking performance if proper care is taken to fix the noise accumulated during occlusion. Instead of relying only on the linear state estimate (i.e., estimation-centric approach), we use object observations (i.e., the measurements by object detector) to compute a virtual trajectory over the occlusion period to fix the error accumulation of filter parameters. This allows more time steps to correct errors accumulated during occlusion. We name our method Observation-Centric SORT (OC-SORT). It remains Simple, Online, and Real-Time but improves robustness during occlusion and non-linear motion. Given off-the-shelf detections as input, OC-SORT runs at 700+ FPS on a single CPU. It achieves state-of-the-art on multiple datasets, including MOT17, MOT20, KITTI, head tracking, and especially DanceTrack where the object motion is highly non-linear. The code and models are available at https://github.com/noahcao/OC_SORT.
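The virtual trajectory mentioned above can be illustrated with a minimal sketch: when a track is re-associated after a gap, interpolate observations across the gap and replay them as Kalman-filter updates. This is a simplified positions-only, straight-line illustration of the idea, not the paper's implementation (which operates on full detection boxes).

```python
import numpy as np

def virtual_trajectory(last_obs, last_t, new_obs, new_t):
    """Sketch of OC-SORT's observation-centric re-update idea.

    When a lost track is re-associated at frame new_t, build virtual
    observations for the occluded frames by interpolating between the
    last observation before the gap and the first one after it. These
    are then replayed as Kalman-filter updates to undo the error that
    accumulated while the filter ran on predictions alone.
    """
    steps = new_t - last_t
    return [last_obs + (new_obs - last_obs) * k / steps
            for k in range(1, steps)]

# e.g. a target last seen at (0, 0) at frame 10 and re-detected at (6, 3)
# at frame 13 gets virtual observations for frames 11 and 12:
vt = virtual_trajectory(np.array([0.0, 0.0]), 10, np.array([6.0, 3.0]), 13)
```

Replaying these virtual measurements gives the filter several extra update steps with which to correct its occlusion-era drift, which is why the method stays online yet robust.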

UniDexGrasp: Universal Robotic Dexterous Grasping via Learning Diverse Proposal Generation and Goal-Conditioned Policy
Xu, Yinzhen and Wan, Weikang and Zhang, Jialiang and Liu, Haoran and Shan, Zikang and Shen, Hao and Wang, Ruicheng and Geng, Haoran and Weng, Yijia and Chen, Jiayi and Liu, Tengyu and Yi, Li and Wang, He



Research question: This paper tackles learning universal robotic dexterous grasping from point cloud observations in a table-top setting.
Motivation: The goal is to grasp and lift objects of various categories in high-quality, diverse ways, generalizing across hundreds of categories and even to unseen objects.
Method: Inspired by successful parallel-gripper grasping pipelines, the task is split into two stages: 1) grasp pose proposal generation and 2) goal-conditioned grasp execution. For the first stage, a novel probabilistic model generates grasp poses conditioned on the point cloud observation, factorizing rotation from translation and articulation; trained on a large-scale dexterous grasp dataset, it samples diverse, high-quality dexterous grasp poses for an object point cloud. For the second stage, given the complexity of dexterous grasp execution, the motion planning used in parallel-gripper grasping is replaced with a goal-conditioned grasp policy.
Results: Integrating the two stages, the final pipeline is the first to achieve universal generalization for dexterous grasping, with an average success rate of over 60% on thousands of object instances, significantly outperforming all baselines while showing only a minimal generalization gap.

In this work, we tackle the problem of learning universal robotic dexterous grasping from a point cloud observation under a table-top setting. The goal is to grasp and lift up objects in high-quality and diverse ways and generalize across hundreds of categories and even the unseen. Inspired by successful pipelines used in parallel gripper grasping, we split the task into two stages: 1) grasp proposal (pose) generation and 2) goal-conditioned grasp execution. For the first stage, we propose a novel probabilistic model of grasp pose conditioned on the point cloud observation that factorizes rotation from translation and articulation. Trained on our synthesized large-scale dexterous grasp dataset, this model enables us to sample diverse and high-quality dexterous grasp poses for the object point cloud. For the second stage, we propose to replace the motion planning used in parallel gripper grasping with a goal-conditioned grasp policy, due to the complexity involved in dexterous grasping execution. Note that it is very challenging to learn this highly generalizable grasp policy that only takes realistic inputs without oracle states. We thus propose several important innovations, including state canonicalization, object curriculum, and teacher-student distillation. Integrating the two stages, our final pipeline becomes the first to achieve universal generalization for dexterous grasping, demonstrating an average success rate of more than 60% on thousands of object instances, which significantly outperforms all baselines, meanwhile showing only a minimal generalization gap.

Crowd3D: Towards Hundreds of People Reconstruction From a Single Image
Wen, Hao and Huang, Jing and Cui, Huili and Lin, Haozhe and Lai, Yu-Kun and Fang, Lu and Li, Kun



Research question: How to reconstruct the 3D poses, shapes, and locations of hundreds of people from a large-scene image, for crowd analysis and security alerting.
Motivation: Existing methods cannot handle large scenes containing hundreds of people, which pose the challenges of a large number of people, large variations in human scale, and complex spatial distribution.
Method: Propose Crowd3D, the first framework to reconstruct the 3D poses, shapes, and locations of hundreds of people from a single large-scene image with global consistency. With the newly defined Human-scene Virtual Interaction Point (HVIP), the complex crowd localization problem is converted into pixel localization; an HVIP-based progressive reconstruction network and an adaptive human-centric cropping scheme handle the varying human sizes.
Results: Experimental results demonstrate the effectiveness of the method. A benchmark dataset for crowd reconstruction in large scenes, LargeCrowd, is established; the code and dataset are available at http://cic.tju.edu.cn/faculty/likun/projects/Crowd3D.

Image-based multi-person reconstruction in wide-field large scenes is critical for crowd analysis and security alert. However, existing methods cannot deal with large scenes containing hundreds of people, which encounter the challenges of large number of people, large variations in human scale, and complex spatial distribution. In this paper, we propose Crowd3D, the first framework to reconstruct the 3D poses, shapes and locations of hundreds of people with global consistency from a single large-scene image. The core of our approach is to convert the problem of complex crowd localization into pixel localization with the help of our newly defined concept, Human-scene Virtual Interaction Point (HVIP). To reconstruct the crowd with global consistency, we propose a progressive reconstruction network based on HVIP by pre-estimating a scene-level camera and a ground plane. To deal with a large number of persons and various human sizes, we also design an adaptive human-centric cropping scheme. Besides, we contribute a benchmark dataset, LargeCrowd, for crowd reconstruction in a large scene. Experimental results demonstrate the effectiveness of the proposed method. The code and the dataset are available at http://cic.tju.edu.cn/faculty/likun/projects/Crowd3D.

Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction
Cai, Shaofei and Wang, Zihao and Ma, Xiaojian and Liu, Anji and Liang, Yitao



Research question: Learning goal-conditioned policies in Minecraft.
Motivation: Two main challenges make this difficult: tasks are indistinguishable from the state distribution due to the vast scene diversity, and environment dynamics are non-stationary due to partial observability.
Method: Proposes the Goal-Sensitive Backbone (GSB) to encourage the emergence of goal-relevant visual state representations, and an adaptive horizon prediction module to address the non-stationary dynamics.
Results: Experiments on 20 Minecraft tasks show the method significantly outperforms the best baselines; on many tasks it doubles their performance. The method also exhibits zero-shot generalization to new scenes.

We study the problem of learning goal-conditioned policies in Minecraft, a popular, widely accessible yet challenging open-ended environment for developing human-level multi-task agents. We first identify two main challenges of learning such policies: 1) the indistinguishability of tasks from the state distribution, due to the vast scene diversity, and 2) the non-stationary nature of environment dynamics caused by the partial observability. To tackle the first challenge, we propose Goal-Sensitive Backbone (GSB) for the policy to encourage the emergence of goal-relevant visual state representations. To tackle the second challenge, the policy is further fueled by an adaptive horizon prediction module that helps alleviate the learning uncertainty brought by the non-stationary dynamics. Experiments on 20 Minecraft tasks show that our method significantly outperforms the best baseline so far; in many of them, we double the performance. Our ablation and exploratory studies then explain how our approach beat the counterparts and also unveil the surprising bonus of zero-shot generalization to new scenes (biomes). We hope our agent could help shed some light on learning goal-conditioned, multi-task agents in challenging, open-ended environments like Minecraft.

Visual Localization Using Imperfect 3D Models From the Internet
Panek, Vojtech and Kukelova, Zuzana and Sattler, Torsten



Research question: How to use 3D models from the Internet for visual localization, and how the imperfections of these models affect localization accuracy.
Motivation: Current visual localization algorithms require capturing large amounts of data and running complex reconstruction pipelines; 3D models from the Internet can serve as ready-made scene representations that avoid these steps, but they are often imperfect, which impacts localization accuracy.
Method: Builds a new benchmark and provides a detailed experimental evaluation based on multiple 3D models per scene to study how model imperfections affect localization accuracy.
Results: 3D models from the Internet show promise as easy-to-obtain scene representations, while visual localization pipelines still have significant room for improvement.

Visual localization is a core component in many applications, including augmented reality (AR). Localization algorithms compute the camera pose of a query image w.r.t. a scene representation, which is typically built from images. This often requires capturing and storing large amounts of data, followed by running Structure-from-Motion (SfM) algorithms. An interesting, and underexplored, source of data for building scene representations are 3D models that are readily available on the Internet, e.g., hand-drawn CAD models, 3D models generated from building footprints, or from aerial images. These models allow to perform visual localization right away without the time-consuming scene capturing and model building steps. Yet, it also comes with challenges as the available 3D models are often imperfect reflections of reality. E.g., the models might only have generic or no textures at all, might only provide a simple approximation of the scene geometry, or might be stretched. This paper studies how the imperfections of these models affect localization accuracy. We create a new benchmark for this task and provide a detailed experimental evaluation based on multiple 3D models per scene. We show that 3D models from the Internet show promise as an easy-to-obtain scene representation. At the same time, there is significant room for improvement for visual localization pipelines. To foster research on this interesting and challenging task, we release our benchmark at v-pnk.github.io/cadloc.

Benchmarking Robustness of 3D Object Detection to Common Corruptions
Dong, Yinpeng and Kang, Caixin and Zhang, Jinlai and Zhu, Zijian and Wang, Yikai and Yang, Xiao and Su, Hang and Wei, Xingxing and Zhu, Jun



Research question: Existing 3D object detectors lack robustness to real-world corruptions such as adverse weather and sensor noise, raising concerns about the safety and reliability of autonomous driving systems.
Motivation: To comprehensively and rigorously evaluate the corruption robustness of 3D detectors, 27 common corruption types for LiDAR and camera inputs are designed to reflect real driving scenarios.
Method: By synthesizing these corruptions on public datasets, three corruption robustness benchmarks are established: KITTI-C, nuScenes-C, and Waymo-C. Large-scale experiments then evaluate the corruption robustness of 24 diverse 3D object detection models.
Results: Motion-level corruptions are the most threatening, causing large performance drops for all models; LiDAR-camera fusion models are more robust; camera-only models are extremely vulnerable to image corruptions, showing the indispensability of LiDAR point clouds.

3D object detection is an important task in autonomous driving to perceive the surroundings. Despite the excellent performance, the existing 3D detectors lack the robustness to real-world corruptions caused by adverse weathers, sensor noises, etc., provoking concerns about the safety and reliability of autonomous driving systems. To comprehensively and rigorously benchmark the corruption robustness of 3D detectors, in this paper we design 27 types of common corruptions for both LiDAR and camera inputs considering real-world driving scenarios. By synthesizing these corruptions on public datasets, we establish three corruption robustness benchmarks---KITTI-C, nuScenes-C, and Waymo-C. Then, we conduct large-scale experiments on 24 diverse 3D object detection models to evaluate their corruption robustness. Based on the evaluation results, we draw several important findings, including: 1) motion-level corruptions are the most threatening ones that lead to significant performance drop of all models; 2) LiDAR-camera fusion models demonstrate better robustness; 3) camera-only models are extremely vulnerable to image corruptions, showing the indispensability of LiDAR point clouds. We release the benchmarks and codes at https://github.com/thu-ml/3D_Corruptions_AD to be helpful for future studies.
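As a toy illustration of the kind of corruption the paper synthesizes (not the authors' implementation — function name and parameters are hypothetical), the sketch below applies two simple LiDAR-style corruptions to a point cloud: random point dropout and Gaussian range jitter.

```python
import numpy as np

def corrupt_point_cloud(points, drop_ratio=0.3, jitter_std=0.02, seed=0):
    """Apply two illustrative corruptions to an (N, 3) LiDAR point cloud:
    random point dropout (e.g., occlusion/beam loss) and Gaussian jitter
    (e.g., range measurement noise)."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(points)) >= drop_ratio          # drop ~drop_ratio of points
    corrupted = points[keep] + rng.normal(0.0, jitter_std, size=(int(keep.sum()), 3))
    return corrupted

cloud = np.random.default_rng(1).uniform(-50, 50, size=(10000, 3))
noisy = corrupt_point_cloud(cloud)
print(cloud.shape, "->", noisy.shape)  # fewer points after dropout
```

A robustness benchmark would then re-evaluate a trained detector on such corrupted copies of the validation set and report the performance drop per corruption type and severity.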

OrienterNet: Visual Localization in 2D Public Maps With Neural Matching
Sarlin, Paul-Edouard and DeTone, Daniel and Yang, Tsun-Yi and Avetisyan, Armen and Straub, Julian and Malisiewicz, Tomasz and Bul\`o, Samuel Rota and Newcombe, Richard and Kontschieder, Peter and Balntas, Vasileios



Research question: How can algorithms localize in 3D environments using 2D maps, the way humans do?
Motivation: Most current visual localization algorithms rely on expensive 3D point clouds; the goal is to localize with the same 2D semantic maps humans use.
Method: Proposes OrienterNet, a deep neural network that localizes with 2D semantic maps, estimating the location and orientation of a query image by matching a neural Bird's-Eye View against globally available OpenStreetMap maps.
Results: Trained on images captured across 12 cities from diverse viewpoints, OrienterNet generalizes to new datasets and pushes the state of the art in both robotics and augmented-reality scenarios.

Humans can orient themselves in their 3D environments using simple 2D maps. Differently, algorithms for visual localization mostly rely on complex 3D point clouds that are expensive to build, store, and maintain over time. We bridge this gap by introducing OrienterNet, the first deep neural network that can localize an image with sub-meter accuracy using the same 2D semantic maps that humans use. OrienterNet estimates the location and orientation of a query image by matching a neural Bird's-Eye View with open and globally available maps from OpenStreetMap, enabling anyone to localize anywhere such maps are available. OrienterNet is supervised only by camera poses but learns to perform semantic matching with a wide range of map elements in an end-to-end manner. To enable this, we introduce a large crowd-sourced dataset of images captured across 12 cities from the diverse viewpoints of cars, bikes, and pedestrians. OrienterNet generalizes to new datasets and pushes the state of the art in both robotics and AR scenarios. The code is available at https://github.com/facebookresearch/OrienterNet

Pix2map: Cross-Modal Retrieval for Inferring Street Maps From Images
Wu, Xindi and Lau, KwunFung and Ferroni, Francesco and O\v{s



Research question: How to infer urban street-map topology directly from ego-view images, so as to continually update and expand existing maps.
Motivation: Existing maps need continuous updating and expansion, and inferring complex urban road topology directly from raw image data is challenging.
Method: Poses the problem as cross-modal retrieval by learning a joint, cross-modal embedding space for images and existing maps, represented as discrete graphs encoding the layout of the visual surroundings.
Results: Experiments show that street maps corresponding to both seen and unseen roads can be accurately retrieved from image data alone; the retrieved maps can be used to update or expand existing maps, and proof-of-concept results are shown for visual localization and image retrieval from spatial graphs.

Self-driving vehicles rely on urban street maps for autonomous navigation. In this paper, we introduce Pix2Map, a method for inferring urban street map topology directly from ego-view images, as needed to continually update and expand existing maps. This is a challenging task, as we need to infer a complex urban road topology directly from raw image data. The main insight of this paper is that this problem can be posed as cross-modal retrieval by learning a joint, cross-modal embedding space for images and existing maps, represented as discrete graphs that encode the topological layout of the visual surroundings. We conduct our experimental evaluation using the Argoverse dataset and show that it is indeed possible to accurately retrieve street maps corresponding to both seen and unseen roads solely from image data. Moreover, we show that our retrieved maps can be used to update or expand existing maps and even show proof-of-concept results for visual localization and image retrieval from spatial graphs.
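Once images and map graphs live in a joint embedding space, retrieval reduces to a nearest-neighbor lookup. A minimal sketch of that final step (the embedding networks themselves are omitted; all names are hypothetical, not the paper's API):

```python
import numpy as np

def retrieve(image_emb, map_embs):
    """Return the index of the map-graph embedding closest (by cosine
    similarity) to the query image embedding in the shared space."""
    img = image_emb / np.linalg.norm(image_emb)
    maps = map_embs / np.linalg.norm(map_embs, axis=1, keepdims=True)
    return int(np.argmax(maps @ img))

# toy joint space: four map graphs embedded as one-hot vectors;
# the query image is, by construction, closest to map 2
map_embs = np.eye(4)
image_emb = np.array([0.1, 0.0, 0.9, 0.1])
print(retrieve(image_emb, map_embs))  # → 2
```

In practice the gallery of map embeddings can be precomputed offline, so updating a map for a new drive only requires embedding the incoming images.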

Affordances From Human Videos as a Versatile Representation for Robotics
Bahl, Shikhar and Mendonca, Russell and Chen, Lili and Jain, Unnat and Pathak, Deepak



Research question: How to use existing models directly on a robot, so that the robot can understand and learn to interact by watching humans.
Motivation: Despite some successful results on static datasets, it remains unclear how to leverage Internet videos in an environment-centric manner to train a visual affordance model.
Method: Trains a visual affordance model on Internet videos of human behavior, estimating where and how in a scene a human is likely to interact.
Results: The approach, called Vision-Robotics Bridge (VRB), demonstrates its efficacy across 4 real-world environments, over 10 different tasks, and 2 robot platforms operating in the wild.

Building a robot that can understand and learn to interact by watching humans has inspired several vision problems. However, despite some successful results on static datasets, it remains unclear how current models can be used on a robot directly. In this paper, we aim to bridge this gap by leveraging videos of human interactions in an environment centric manner. Utilizing internet videos of human behavior, we train a visual affordance model that estimates where and how in the scene a human is likely to interact. The structure of these behavioral affordances directly enables the robot to perform many complex tasks. We show how to seamlessly integrate our affordance model with four robot learning paradigms including offline imitation learning, exploration, goal-conditioned learning, and action parameterization for reinforcement learning. We show the efficacy of our approach, which we call Vision-Robotics Bridge (VRB) across 4 real world environments, over 10 different tasks, and 2 robotic platforms operating in the wild.

Toward RAW Object Detection: A New Benchmark and a New Model
Xu, Ruikang and Chen, Chang and Peng, Jingyang and Li, Cheng and Huang, Yibin and Song, Fenglong and Yan, Youliang and Xiong, Zhiwei



Research question: How to let object detection algorithms handle a variety of lighting conditions, such as strong glare, without extra equipment costs.
Motivation: In many computer vision applications (e.g., robotics and autonomous driving), high dynamic range (HDR) data is necessary for object detection under varied lighting conditions.
Method: Builds a novel RAW sensor dataset, named ROD, for applying DNN-based object detection algorithms to HDR data. The dataset contains a large number of annotated instances of day and night driving scenes with 24-bit dynamic range.
Results: Experiments show that detection on RAW sensor data significantly outperforms standard dynamic range (SDR) data in different situations. The influence of texture information and pixel distribution of the input data on DNN-based detectors is also analyzed.

In many computer vision applications (e.g., robotics and autonomous driving), high dynamic range (HDR) data is necessary for object detection algorithms to handle a variety of lighting conditions, such as strong glare. In this paper, we aim to achieve object detection on RAW sensor data, which naturally saves the HDR information from image sensors without extra equipment costs. We build a novel RAW sensor dataset, named ROD, for Deep Neural Networks (DNNs)-based object detection algorithms to be applied to HDR data. The ROD dataset contains a large amount of annotated instances of day and night driving scenes in 24-bit dynamic range. Based on the dataset, we first investigate the impact of dynamic range for DNNs-based detectors and demonstrate the importance of dynamic range adjustment for detection on RAW sensor data. Then, we propose a simple and effective adjustment method for object detection on HDR RAW sensor data, which is image adaptive and jointly optimized with the downstream detector in an end-to-end scheme. Extensive experiments demonstrate that the performance of detection on RAW sensor data is significantly superior to standard dynamic range (SDR) data in different situations. Moreover, we analyze the influence of texture information and pixel distribution of input data on the performance of the DNNs-based detector.

Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving
Liang, Xiwen and Niu, Minzhe and Han, Jianhua and Xu, Hang and Xu, Chunjing and Liang, Xiaodan



Research question: Current multi-task learning algorithms in autonomous driving still show a large performance gap relative to single-task baselines.
Motivation: To address this, an effective multi-task framework, VE-Prompt, is proposed, which introduces visual exemplars to guide the model toward learning high-quality task-specific representations.
Method: VE-Prompt is implemented on the BDD100K dataset, which covers four common perception tasks: object detection, semantic segmentation, drivable-area segmentation, and lane detection.
Results: Experiments show that VE-Prompt not only improves the multi-task baseline but also surpasses single-task models, demonstrating its effectiveness in autonomous driving.

Multi-task learning has emerged as a powerful paradigm to solve a range of tasks simultaneously with good efficiency in both computation resources and inference time. However, these algorithms are designed for different tasks mostly not within the scope of autonomous driving, thus making it hard to compare multi-task methods in autonomous driving. Aiming to enable the comprehensive evaluation of present multi-task learning methods in autonomous driving, we extensively investigate the performance of popular multi-task methods on the large-scale driving dataset, which covers four common perception tasks, i.e., object detection, semantic segmentation, drivable area segmentation, and lane detection. We provide an in-depth analysis of current multi-task learning methods under different common settings and find out that the existing methods make progress but there is still a large performance gap compared with single-task baselines. To alleviate this dilemma in autonomous driving, we present an effective multi-task framework, VE-Prompt, which introduces visual exemplars via task-specific prompting to guide the model toward learning high-quality task-specific representations. Specifically, we generate visual exemplars based on bounding boxes and color-based markers, which provide accurate visual appearances of target categories and further mitigate the performance gap. Furthermore, we bridge transformer-based encoders and convolutional layers for efficient and accurate unified perception in autonomous driving. Comprehensive experimental results on the diverse self-driving dataset BDD100K show that the VE-Prompt improves the multi-task baseline and further surpasses single-task models.

Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation
Otani, Mayu and Togashi, Riku and Sawai, Yu and Ishigami, Ryosuke and Nakashima, Yuta and Rahtu, Esa and Heikkil\"a, Janne and Satoh, Shin{\textquoteright



Research question: How to conduct effective human evaluation of text-to-image generative models to validate their performance.
Motivation: Many current works rely solely on automatic measures or on poorly described, unreliable human evaluations, leading to results that cannot be verified or reproduced.
Method: Proposes a standardized and well-defined human evaluation protocol to facilitate verifiable and reproducible human evaluation in future work.
Results: Experiments show that current automatic measures are incompatible with human perception when evaluating text-to-image generation results. Insights are also provided for designing reliable and conclusive human evaluation experiments, and several resources are released to the community for easy and fast implementation.

Human evaluation is critical for validating the performance of text-to-image generative models, as this highly cognitive process requires deep comprehension of text and images. However, our survey of 37 recent papers reveals that many works rely solely on automatic measures (e.g., FID) or perform poorly described human evaluations that are not reliable or repeatable. This paper proposes a standardized and well-defined human evaluation protocol to facilitate verifiable and reproducible human evaluation in future works. In our pilot data collection, we experimentally show that the current automatic measures are incompatible with human perception in evaluating the performance of the text-to-image generation results. Furthermore, we provide insights for designing human evaluation experiments reliably and conclusively. Finally, we make several resources publicly available to the community to facilitate easy and fast implementations.

BAEFormer: Bi-Directional and Early Interaction Transformers for Bird's Eye View Semantic Segmentation
Pan, Cong and He, Yonghao and Peng, Junran and Zhang, Qian and Sui, Wei and Zhang, Zhaoxiang



Research question: How to transform perspective views into bird's-eye-view semantic segmentation, a critical task in autonomous driving.
Motivation: Existing Transformer-based methods struggle with the perspective-to-BEV transformation because of their unidirectional and posterior interaction mechanisms.
Method: Proposes BAEFormer, a novel bi-directional and early-interaction Transformer framework, consisting of (i) an early-interaction PV-BEV pipeline and (ii) a bi-directional cross-attention mechanism.
Results: On the nuScenes dataset, the proposed method achieves state-of-the-art performance at real-time inference speed, i.e., 38.9 mIoU at 45 FPS on a single A100 GPU.

Bird's Eye View (BEV) semantic segmentation is a critical task in autonomous driving. However, existing Transformer-based methods confront difficulties in transforming Perspective View (PV) to BEV due to their unidirectional and posterior interaction mechanisms. To address this issue, we propose a novel Bi-directional and Early Interaction Transformers framework named BAEFormer, consisting of (i) an early-interaction PV-BEV pipeline and (ii) a bi-directional cross-attention mechanism. Moreover, we find that the image feature maps' resolution in the cross-attention module has a limited effect on the final performance. Under this critical observation, we propose to enlarge the size of input images and downsample the multi-view image features for cross-interaction, further improving the accuracy while keeping the amount of computation controllable. Our proposed method for BEV semantic segmentation achieves state-of-the-art performance in real-time inference speed on the nuScenes dataset, i.e., 38.9 mIoU at 45 FPS on a single A100 GPU.
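The bi-directional interaction can be pictured as plain scaled dot-product cross-attention applied in both directions, PV→BEV and BEV→PV. The sketch below is a minimal single-head numpy version with hypothetical token counts and feature sizes, not the BAEFormer architecture itself:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head scaled dot-product cross-attention:
    each query token gathers a weighted mix of the key/value tokens."""
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d))
    return attn @ keys_values

rng = np.random.default_rng(0)
pv = rng.normal(size=(16, 8))   # perspective-view tokens (sizes are illustrative)
bev = rng.normal(size=(9, 8))   # BEV grid tokens
bev_updated = cross_attend(bev, pv)  # BEV queries attend to PV features ...
pv_updated = cross_attend(pv, bev)   # ... and, bi-directionally, PV attends to BEV
print(bev_updated.shape, pv_updated.shape)
```

The unidirectional baselines the abstract contrasts against would perform only the first of the two calls.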

3D-POP - An Automated Annotation Approach to Facilitate Markerless 2D-3D Tracking of Freely Moving Birds With Marker-Based Motion Capture
Naik, Hemal and Chan, Alex Hoi Hang and Yang, Junran and Delacoux, Mathilde and Couzin, Iain D. and Kano, Fumihiro and Nagy, M\'at\'e



Research question: Existing animal-behavior tracking techniques require markers, and large annotated datasets are lacking.
Motivation: Leverage advances in machine learning and computer vision to develop a way to track animal poses and locations without markers.
Method: Proposes a semi-automatic method that uses a motion-capture system to annotate large amounts of animal movement and posture data (2D and 3D), by extracting the 3D positions of morphological keypoints relative to markers attached to the animals.
Results: Creates a new dataset, 3D-POP, with approximately 300k annotated frames (4 million instances) in the form of videos of one to ten freely moving birds recorded from four camera views in a 3.6 m x 4.2 m area. It is the first flocking-bird dataset with accurate 2D and 3D keypoint annotations plus bounding boxes and individual identities, and will facilitate solutions to 2D-to-3D markerless pose, trajectory tracking, and identification in birds.

Recent advances in machine learning and computer vision are revolutionizing the field of animal behavior by enabling researchers to track the poses and locations of freely moving animals without any marker attachment. However, large datasets of annotated images of animals for markerless pose tracking, especially high-resolution images taken from multiple angles with accurate 3D annotations, are still scant. Here, we propose a method that uses a motion capture (mo-cap) system to obtain a large amount of annotated data on animal movement and posture (2D and 3D) in a semi-automatic manner. Our method is novel in that it extracts the 3D positions of morphological keypoints (e.g eyes, beak, tail) in reference to the positions of markers attached to the animals. Using this method, we obtained, and offer here, a new dataset - 3D-POP with approximately 300k annotated frames (4 million instances) in the form of videos having groups of one to ten freely moving birds from 4 different camera views in a 3.6m x 4.2m area. 3D-POP is the first dataset of flocking birds with accurate keypoint annotations in 2D and 3D along with bounding box and individual identities and will facilitate the development of solutions for problems of 2D to 3D markerless pose, trajectory tracking, and identification in birds.

Policy Adaptation From Foundation Model Feedback
Ge, Yuying and Macaluso, Annabella and Li, Li Erran and Luo, Ping and Wang, Xiaolong



Research question: How to improve the performance of vision-language foundation models on unseen tasks or environments.
Motivation: Although existing pre-trained models generalize across different objects and tasks, they often fail when facing an unseen task or environment.
Method: Proposes Policy Adaptation from Foundation model Feedback (PAFF). When deploying a trained policy to a new task or environment, the policy first plays with randomly generated instructions to record demonstrations; a pre-trained foundation model then provides feedback to relabel these demonstrations, automatically producing new demonstration-instruction pairs for policy fine-tuning.
Results: Evaluated on a broad range of experiments focusing on generalization to unseen objects, tasks, environments, and sim-to-real transfer, PAFF improves the baselines by a large margin in all cases.

Recent progress on vision-language foundation models have brought significant advancement to building general-purpose robots. By using the pre-trained models to encode the scene and instructions as inputs for decision making, the instruction-conditioned policy can generalize across different objects and tasks. While this is encouraging, the policy still fails in most cases given an unseen task or environment. In this work, we propose Policy Adaptation from Foundation model Feedback (PAFF). When deploying the trained policy to a new task or a new environment, we first let the policy play with randomly generated instructions to record the demonstrations. While the execution could be wrong, we can use the pre-trained foundation models to provide feedback to relabel the demonstrations. This automatically provides new pairs of demonstration-instruction data for policy fine-tuning. We evaluate our method on a broad range of experiments with the focus on generalization on unseen objects, unseen tasks, unseen environments, and sim-to-real transfer. We show PAFF improves baselines by a large margin in all cases.

Infinite Photorealistic Worlds Using Procedural Generation
Raistrick, Alexander and Lipson, Lahav and Ma, Zeyu and Mei, Lingjie and Wang, Mingzhe and Zuo, Yiming and Kayan, Karhan and Wen, Hongyu and Han, Beining and Wang, Yihan and Newell, Alejandro and Law, Hei and Goyal, Ankit and Yang, Kaiyu and Deng, Jia



Research question: How to generate photorealistic 3D scenes of the natural world with a procedural generator.
Motivation: To provide unlimited, diverse training data for a wide range of computer vision tasks.
Method: Develops Infinigen, a procedural generator in which every asset, from shape to texture, is generated from scratch via randomized mathematical rules, using no external source and allowing infinite variation and composition.
Results: Infinigen covers a broad range of natural-world objects and scenes, such as plants, animals, terrains, and phenomena like fire, cloud, rain, and snow. It can generate unlimited, diverse training data for tasks including object detection, semantic segmentation, optical flow, and 3D reconstruction.

We introduce Infinigen, a procedural generator of photorealistic 3D scenes of the natural world. Infinigen is entirely procedural: every asset, from shape to texture, is generated from scratch via randomized mathematical rules, using no external source and allowing infinite variation and composition. Infinigen offers broad coverage of objects and scenes in the natural world including plants, animals, terrains, and natural phenomena such as fire, cloud, rain, and snow. Infinigen can be used to generate unlimited, diverse training data for a wide range of computer vision tasks including object detection, semantic segmentation, optical flow, and 3D reconstruction. We expect Infinigen to be a useful resource for computer vision research and beyond. Please visit https://infinigen.org for videos, code and pre-generated data.

KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation
Li, Xiangyang and Wang, Zihan and Yang, Jiahao and Wang, Yaowei and Jiang, Shuqiang



Research question: How to leverage knowledge to improve agent navigation ability in vision-and-language navigation.
Motivation: Existing methods mainly use entire features or object-centric features to represent navigable candidates, but these representations are not efficient enough for an agent to perform actions that reach the target location.
Method: Proposes a Knowledge Enhanced Reasoning Model (KERM), which improves navigation ability by retrieving facts (i.e., knowledge described by language) relevant to the navigation views from a constructed knowledge base.
Results: Experiments show that KERM performs well on the REVERIE, R2R, and SOON datasets, automatically selecting and gathering crucial and relevant cues for more accurate action prediction.

Vision-and-language navigation (VLN) is the task to enable an embodied agent to navigate to a remote location following the natural language instruction in real scenes. Most of the previous approaches utilize the entire features or object-centric features to represent navigable candidates. However, these representations are not efficient enough for an agent to perform actions to arrive the target location. As knowledge provides crucial information which is complementary to visible content, in this paper, we propose a Knowledge Enhanced Reasoning Model (KERM) to leverage knowledge to improve agent navigation ability. Specifically, we first retrieve facts (i.e., knowledge described by language descriptions) for the navigation views based on local regions from the constructed knowledge base. The retrieved facts range from properties of a single object (e.g., color, shape) to relationships between objects (e.g., action, spatial position), providing crucial information for VLN. We further present the KERM which contains the purification, fact-aware interaction, and instruction-guided aggregation modules to integrate visual, history, instruction, and fact features. The proposed KERM can automatically select and gather crucial and relevant cues, obtaining more accurate action prediction. Experimental results on the REVERIE, R2R, and SOON datasets demonstrate the effectiveness of the proposed method. The source code is available at https://github.com/XiangyangLi20/KERM.

LiDAR-in-the-Loop Hyperparameter Optimization
Goudreault, F\'elix and Scheuble, Dominik and Bijelic, Mario and Robidoux, Nicolas and Heide, Felix



Research question: How to optimize LiDAR system parameters to improve the performance of 3D vision tasks.
Motivation: Existing LiDAR systems are typically treated as black boxes whose parameters are tuned by expert experience, lacking a systematic optimization method.
Method: Proposes directly optimizing LiDAR sensing and digital signal processing (DSP) parameters, by solving a nonlinear multi-objective optimization problem with a 0th-order stochastic algorithm.
Results: For automotive 3D object detection models, the method outperforms manual expert tuning by 39.5% mean Average Precision (mAP).

LiDAR has become a cornerstone sensing modality for 3D vision. LiDAR systems emit pulses of light into the scene, take measurements of the returned signal, and rely on hardware digital signal processing (DSP) pipelines to construct 3D point clouds from these measurements. The resulting point clouds output by these DSPs are input to downstream 3D vision models -- both, in the form of training datasets or as input at inference time. Existing LiDAR DSPs are composed of cascades of parameterized operations; modifying configuration parameters results in significant changes in the point clouds and consequently the output of downstream methods. Existing methods treat LiDAR systems as fixed black boxes and construct downstream task networks more robust with respect to measurement fluctuations. Departing from this approach, the proposed method directly optimizes LiDAR sensing and DSP parameters for downstream tasks. To investigate the optimization of LiDAR system parameters, we devise a realistic LiDAR simulation method that generates raw waveforms as input to a LiDAR DSP pipeline. We optimize LiDAR parameters for both 3D object detection IoU losses and depth error metrics by solving a nonlinear multi-objective optimization problem with a 0th-order stochastic algorithm. For automotive 3D object detection models, the proposed method outperforms manual expert tuning by 39.5% mean Average Precision (mAP).
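A 0th-order stochastic algorithm needs only loss evaluations, no gradients — which is what makes it usable when the "loss" runs through a non-differentiable DSP pipeline and detector. A minimal SPSA-style sketch on a toy black-box objective (this is not the paper's optimizer; all names and the quadratic stand-in loss are illustrative):

```python
import numpy as np

def spsa_minimize(loss, theta, iters=200, a=0.1, c=0.1, seed=0):
    """Minimal SPSA-style zeroth-order optimizer: estimate a descent direction
    from two loss evaluations under a random +/-1 perturbation, then step."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        delta = rng.choice([-1.0, 1.0], size=theta.shape)   # random perturbation
        g = (loss(theta + c * delta) - loss(theta - c * delta)) / (2 * c) * delta
        theta = theta - a * g                               # gradient-free step
    return theta

# toy stand-in for "DSP parameters -> downstream detection loss":
# a quadratic minimized at theta = [1, -2]
target = np.array([1.0, -2.0])
loss = lambda t: float(np.sum((t - target) ** 2))
theta = spsa_minimize(loss, np.zeros(2))
print(np.round(theta, 2))
```

In the real setting, each `loss` call would mean simulating raw waveforms, running the parameterized DSP, and scoring the downstream detector, which is why an evaluation-frugal 0th-order method is attractive.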

BEVHeight: A Robust Framework for Vision-Based Roadside 3D Object Detection
Yang, Lei and Yu, Kaicheng and Tang, Tao and Li, Jun and Yuan, Kun and Wang, Li and Zhang, Xinyu and Chen, Peng



Research question: How to leverage intelligent roadside cameras to extend perception beyond the visual range.
Motivation: Current autonomous driving systems mainly focus on perception from ego-vehicle sensors, overlooking intelligent roadside cameras as an alternative sensing modality.
Method: Proposes a simple yet effective method, BEVHeight, which regresses the height to the ground instead of pixel-wise depth. This distance-agnostic formulation addresses the problem that, for existing vision-centric methods, the depth difference between car and ground quickly shrinks as distance increases.
Results: On popular roadside-camera 3D detection benchmarks, the method surpasses all previous vision-centric methods by a significant margin.

While most recent autonomous driving system focuses on developing perception methods on ego-vehicle sensors, people tend to overlook an alternative approach to leverage intelligent roadside cameras to extend the perception ability beyond the visual range. We discover that the state-of-the-art vision-centric bird's eye view detection methods have inferior performances on roadside cameras. This is because these methods mainly focus on recovering the depth regarding the camera center, where the depth difference between the car and the ground quickly shrinks while the distance increases. In this paper, we propose a simple yet effective approach, dubbed BEVHeight, to address this issue. In essence, instead of predicting the pixel-wise depth, we regress the height to the ground to achieve a distance-agnostic formulation to ease the optimization process of camera-only perception methods. On popular 3D detection benchmarks of roadside cameras, our method surpasses all previous vision-centric methods by a significant margin. The code is available at https://github.com/ADLab-AutoDrive/BEVHeight.
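The geometric intuition behind regressing height rather than depth can be shown in a few lines: given a predicted height above the ground, the 3D point is recovered by intersecting the pixel's viewing ray with the horizontal plane at that height, with no per-pixel depth needed. The sketch below is a hedged illustration under simplifying assumptions (pinhole camera, y-down camera frame aligned with the ground normal, hypothetical intrinsics), not the BEVHeight pipeline:

```python
import numpy as np

def lift_with_height(u, v, h, K, cam_height):
    """Intersect the viewing ray of pixel (u, v) with the horizontal plane
    h meters above the ground. Camera frame: x right, y down, z forward;
    the ground plane lies at y = cam_height. Returns the 3D point in
    camera coordinates."""
    d = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray direction for the pixel
    t = (cam_height - h) / d[1]                   # ray-plane intersection
    return t * d

K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])                   # hypothetical intrinsics
# a point 0.5 m above the ground, seen from a 5 m-high roadside camera
P = lift_with_height(u=960.0, v=700.0, h=0.5, K=K, cam_height=5.0)
print(np.round(P, 2))
```

Because `h` stays in a small, roughly constant range regardless of how far away the object is, it is an easier regression target than depth, whose useful signal collapses at long range.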

MVImgNet: A Large-Scale Dataset of Multi-View Images
Yu, Xianggang and Xu, Mutian and Zhang, Yidan and Liu, Haolin and Ye, Chongjie and Wu, Yushuang and Yan, Zizheng and Zhu, Chenming and Xiong, Zhangyang and Liang, Tianyou and Chen, Guanying and Cui, Shuguang and Han, Xiaoguang



Research question: Because real-world 3D data is laborious to collect, there is still no generic dataset serving as a counterpart of ImageNet in 3D vision.
Motivation: To remedy this, MVImgNet is introduced, a large-scale multi-view image dataset that is easy to obtain by shooting videos of real-world objects.
Method: MVImgNet contains 6.5 million frames from 219,188 videos covering objects from 238 classes, with rich annotations of object masks, camera parameters, and point clouds. Its multi-view attribute endows the dataset with 3D-aware signals, making it a soft bridge between 2D and 3D vision.
Results: Pilot studies on a variety of 3D and 2D visual tasks, including radiance-field reconstruction, multi-view stereo, and view-consistent image understanding, show that MVImgNet delivers promising performance and leaves many possibilities for future exploration. In addition, via dense reconstruction on MVImgNet, a 3D object point-cloud dataset called MVPNet is derived, covering 87,200 samples from 150 categories with a class label on each point cloud. Experiments show that MVPNet benefits real-world 3D object classification while posing new challenges to point-cloud understanding.

Being data-driven is one of the most iconic properties of deep learning algorithms. The birth of ImageNet drives a remarkable trend of "learning from large-scale data" in computer vision. Pretraining on ImageNet to obtain rich universal representations has been manifested to benefit various 2D visual tasks, and becomes a standard in 2D vision. However, due to the laborious collection of real-world 3D data, there is yet no generic dataset serving as a counterpart of ImageNet in 3D vision, thus how such a dataset can impact the 3D community is unraveled. To remedy this defect, we introduce MVImgNet, a large-scale dataset of multi-view images, which is highly convenient to gain by shooting videos of real-world objects in human daily life. It contains 6.5 million frames from 219,188 videos crossing objects from 238 classes, with rich annotations of object masks, camera parameters, and point clouds. The multi-view attribute endows our dataset with 3D-aware signals, making it a soft bridge between 2D and 3D vision. We conduct pilot studies for probing the potential of MVImgNet on a variety of 3D and 2D visual tasks, including radiance field reconstruction, multi-view stereo, and view-consistent image understanding, where MVImgNet demonstrates promising performance, remaining lots of possibilities for future explorations. Besides, via dense reconstruction on MVImgNet, a 3D object point cloud dataset is derived, called MVPNet, covering 87,200 samples from 150 categories, with the class label on each point cloud. Experiments show that MVPNet can benefit the real-world 3D object classification while posing new challenges to point cloud understanding. MVImgNet and MVPNet will be publicly available, hoping to inspire the broader vision community.

A New Benchmark: On the Utility of Synthetic Data With Blender for Bare Supervised Learning and Downstream Domain Adaptation
Tang, Hui and Jia, Kui



Research question: Addresses obstacles to deep learning in computer vision, such as the difficulty of obtaining large-scale labeled training data and the non-IID nature of collected data.
Motivation: Exhaustive data annotation is impractical for every domain of interest because of high labor costs and unguaranteed labeling accuracy. Moreover, uncontrollable data collection can produce non-IID training and test data, possibly with undesired duplication. All of these issues can hinder the verification of typical theories and the emergence of new findings.
Method: Circumvents these problems by generating synthetic data via 3D rendering with domain randomization. Under the well-controlled IID data setting enabled by 3D rendering, typical and important learning insights, such as shortcut learning, are systematically verified, and new laws of various data regimes and network architectures in generalization are discovered.
Results: The effect of image formation factors on generalization is studied, including object scale, material texture, illumination, camera viewpoint, and background in a 3D scene. Simulation-to-reality adaptation is further used as a downstream task to compare the transferability of synthetic versus real data for pre-training, showing that synthetic-data pre-training is also promising for improving real test results. Finally, to promote future research, a new large-scale synthetic-to-real benchmark for image classification, S2RDA, is developed, providing greater challenges for transfer from simulation to reality.

Deep learning in computer vision has achieved great success with the price of large-scale labeled training data. However, exhaustive data annotation is impracticable for each task of all domains of interest, due to high labor costs and unguaranteed labeling accuracy. Besides, the uncontrollable data collection process produces non-IID training and test data, where undesired duplication may exist. All these nuisances may hinder the verification of typical theories and exposure to new findings. To circumvent them, an alternative is to generate synthetic data via 3D rendering with domain randomization. We in this work push forward along this line by doing profound and extensive research on bare supervised learning and downstream domain adaptation. Specifically, under the well-controlled, IID data setting enabled by 3D rendering, we systematically verify the typical, important learning insights, e.g., shortcut learning, and discover the new laws of various data regimes and network architectures in generalization. We further investigate the effect of image formation factors on generalization, e.g., object scale, material texture, illumination, camera viewpoint, and background in a 3D scene. Moreover, we use the simulation-to-reality adaptation as a downstream task for comparing the transferability between synthetic and real data when used for pre-training, which demonstrates that synthetic data pre-training is also promising to improve real test results. Lastly, to promote future research, we develop a new large-scale synthetic-to-real benchmark for image classification, termed S2RDA, which provides more significant challenges for transfer from simulation to reality. The code and datasets are available at https://github.com/huitangtang/On_the_Utility_of_Synthetic_Data.

Temporal Consistent 3D LiDAR Representation Learning for Semantic Perception in Autonomous Driving
Nunes, Lucas and Wiesmann, Louis and Marcuzzi, Rodrigo and Chen, Xieyuanli and Behley, Jens and Stachniss, Cyrill



Research question: How to learn representations for 3D LiDAR data without costly labels, for semantic perception in autonomous driving.
Motivation: Learning-based perception requires large amounts of diverse labeled training data, and annotating 3D LiDAR point clouds demands even more effort than images, especially for dense prediction tasks such as semantic or panoptic segmentation, since point clouds are sparse and object appearance depends on sensor distance.
Method: Proposes a self-supervised representation learning method for 3D LiDAR data that exploits vehicle motion to associate objects across scans over time, then trains a model to maximize the point-wise feature similarities between points of the associated object in different scans, learning a representation that is consistent across time.
Results: The approach outperforms previous state-of-the-art self-supervised representation learning methods when fine-tuned on downstream tasks; with only 10% of labeled data, a network pre-trained with the approach surpasses the same network trained from scratch with all labels for semantic segmentation on SemanticKITTI.

Semantic perception is a core building block in autonomous driving, since it provides information about the drivable space and location of other traffic participants. For learning-based perception, often a large amount of diverse training data is necessary to achieve high performance. Data labeling is usually a bottleneck for developing such methods, especially for dense prediction tasks, e.g., semantic segmentation or panoptic segmentation. For 3D LiDAR data, the annotation process demands even more effort than for images. Especially in autonomous driving, point clouds are sparse, and objects appearance depends on its distance from the sensor, making it harder to acquire large amounts of labeled training data. This paper aims at taking an alternative path proposing a self-supervised representation learning method for 3D LiDAR data. Our approach exploits the vehicle motion to match objects across time viewed in different scans. We then train a model to maximize the point-wise feature similarities from points of the associated object in different scans, which enables to learn a consistent representation across time. The experimental results show that our approach performs better than previous state-of-the-art self-supervised representation learning methods when fine-tuning to different downstream tasks. We furthermore show that with only 10% of labeled data, a network pre-trained with our approach can achieve better performance than the same network trained from scratch with all labels for semantic segmentation on SemanticKITTI.
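The core training signal — maximizing point-wise feature similarity between motion-associated points across scans — can be sketched as a simple negative mean cosine similarity. This is an illustrative stand-in (function name and shapes are hypothetical), not the paper's full objective:

```python
import numpy as np

def temporal_consistency_loss(feat_t, feat_t1):
    """Negative mean cosine similarity between point-wise features of the same
    (motion-associated) object observed in two scans; minimizing it pulls the
    representations of corresponding points together across time."""
    a = feat_t / np.linalg.norm(feat_t, axis=1, keepdims=True)
    b = feat_t1 / np.linalg.norm(feat_t1, axis=1, keepdims=True)
    return -float(np.mean(np.sum(a * b, axis=1)))

rng = np.random.default_rng(0)
f = rng.normal(size=(128, 16))                     # 128 points, 16-dim features
print(temporal_consistency_loss(f, f))             # identical features: ~ -1.0
print(temporal_consistency_loss(f, f + 1.0))       # shifted copy: larger loss
```

The vehicle's known ego-motion is what supplies the correspondence between `feat_t` and `feat_t1` without any human labels.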

Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention
Mondal, Sounak and Yang, Zhibo and Ahn, Seoyoung and Samaras, Dimitris and Zelinsky, Gregory and Hoai, Minh



Research question: Predicting human gaze is important for human-computer interaction, but existing models require trained target detectors and large amounts of human gaze data, and remain slow and insufficiently accurate.
Motivation: To address these issues, a new task, ZeroGaze, is posed, and a new model, Gazeformer, is developed to solve it.
Method: Gazeformer encodes the target with a natural language model, leveraging semantic similarity for scanpath prediction, and adopts a transformer-based encoder-decoder architecture.
Results: Experiments show that Gazeformer surpasses other models by a large margin (19%-70%) in the ZeroGaze setting and outperforms existing target-detection models on standard gaze prediction for both target-present and target-absent search tasks. It is also more than five times faster than the state-of-the-art target-present visual search model.

Predicting human gaze is important in Human-Computer Interaction (HCI). However, to practically serve HCI applications, gaze prediction models must be scalable, fast, and accurate in their spatial and temporal gaze predictions. Recent scanpath prediction models focus on goal-directed attention (search). Such models are limited in their application due to a common approach relying on trained target detectors for all possible objects, and the availability of human gaze data for their training (both not scalable). In response, we pose a new task called ZeroGaze, a new variant of zero-shot learning where gaze is predicted for never-before-searched objects, and we develop a novel model, Gazeformer, to solve the ZeroGaze problem. In contrast to existing methods using object detector modules, Gazeformer encodes the target using a natural language model, thus leveraging semantic similarities in scanpath prediction. We use a transformer-based encoder-decoder architecture because transformers are particularly useful for generating contextual representations. Gazeformer surpasses other models by a large margin (19% - 70%) on the ZeroGaze setting. It also outperforms existing target-detection models on standard gaze prediction for both target-present and target-absent search tasks. In addition to its improved performance, Gazeformer is more than five times faster than the state-of-the-art target-present visual search model.

MammalNet: A Large-Scale Video Benchmark for Mammal Recognition and Behavior Understanding
Chen, Jun and Hu, Ming and Coker, Darren J. and Berumen, Michael L. and Costelloe, Blair and Beery, Sara and Rohrbach, Anna and Elhoseiny, Mohamed



Research question: How can animal behavior be monitored effectively to support conservation efforts?
Motivation: Existing video datasets cannot support large-scale animal behavior recognition, as they lack appropriate labels and taxonomic classification.
Method: A new large-scale animal behavior dataset, MammalNet, is built, containing over 18K videos that cover 17 orders, 69 families, and 173 mammal categories, annotated with 12 common high-level animal behaviors.
Results: MammalNet is ten times larger than any existing animal behavior dataset and enables large-scale animal behavior recognition research.

Monitoring animal behavior can facilitate conservation efforts by providing key insights into wildlife health, population status, and ecosystem function. Automatic recognition of animals and their behaviors is critical for capitalizing on the large unlabeled datasets generated by modern video devices and for accelerating monitoring efforts at scale. However, the development of automated recognition systems is currently hindered by a lack of appropriately labeled datasets. Existing video datasets 1) do not classify animals according to established biological taxonomies; 2) are too small to facilitate large-scale behavioral studies and are often limited to a single species; and 3) do not feature temporally localized annotations and therefore do not facilitate localization of targeted behaviors within longer video sequences. Thus, we propose MammalNet, a new large-scale animal behavior dataset with taxonomy-guided annotations of mammals and their common behaviors. MammalNet contains over 18K videos totaling 539 hours, which is 10 times larger than the largest existing animal behavior dataset. It covers 17 orders, 69 families, and 173 mammal categories for animal categorization and captures 12 high-level animal behaviors that received focus in previous animal behavior studies. We establish three benchmarks on MammalNet: standard animal and behavior recognition, compositional low-shot animal and behavior recognition, and behavior detection. Our dataset and code have been made available at: https://mammal-net.github.io.

ReasonNet: End-to-End Driving With Temporal and Global Reasoning
Shao, Hao and Wang, Letian and Chen, Ruobing and Waslander, Steven L. and Li, Hongsheng and Liu, Yu



Research question: How to predict the future evolution of dense urban traffic scenes and the behavior of objects in them, and how to handle rare adverse events such as the sudden appearance of occluded objects.
Motivation: Large-scale deployment of autonomous vehicles has yet to arrive, and one of the major remaining challenges lies in dense urban traffic, where predicting the future evolution of the scene and the behaviors of objects, and handling rare adverse events like suddenly appearing occluded objects, remains difficult.
Method: This paper presents ReasonNet, a novel end-to-end driving framework that extensively exploits the temporal and global information of the driving scene. By reasoning about the temporal behavior of objects, it effectively processes the interactions and relationships among features across frames; reasoning about global scene information also improves overall perception and benefits the detection of adverse events, especially anticipating potential danger from occluded objects.
Results: Extensive experiments on multiple CARLA benchmarks show the model outperforms all prior methods, ranking first on the sensor track of the public CARLA Leaderboard.

The large-scale deployment of autonomous vehicles is yet to come, and one of the major remaining challenges lies in urban dense traffic scenarios. In such cases, it remains challenging to predict the future evolution of the scene and future behaviors of objects, and to deal with rare adverse events such as the sudden appearance of occluded objects. In this paper, we present ReasonNet, a novel end-to-end driving framework that extensively exploits both temporal and global information of the driving scene. By reasoning on the temporal behavior of objects, our method can effectively process the interactions and relationships among features in different frames. Reasoning about the global information of the scene can also improve overall perception performance and benefit the detection of adverse events, especially the anticipation of potential danger from occluded objects. For comprehensive evaluation on occlusion events, we also release publicly a driving simulation benchmark DriveOcclusionSim consisting of diverse occlusion events. We conduct extensive experiments on multiple CARLA benchmarks, where our model outperforms all prior methods, ranking first on the sensor track of the public CARLA Leaderboard.

SLOPER4D: A Scene-Aware Dataset for Global 4D Human Pose Estimation in Urban Environments
Dai, Yudi and Lin, Yitai and Lin, Xiping and Wen, Chenglu and Xu, Lan and Yi, Hongwei and Shen, Siqi and Ma, Yuexin and Wang, Cheng



Research question: This paper develops SLOPER4D, a new scene-aware dataset, to facilitate research on global human pose estimation (GHPE) and human-scene interaction in the wild.
Motivation: Using a head-mounted device integrating a LiDAR and a camera, the activities of 12 subjects are recorded across 10 diverse urban scenes from an egocentric view, providing rich dynamic scene data for research.
Method: A joint optimization method fits local SMPL meshes to the scene and fine-tunes the camera calibration frame by frame during dynamic motion, yielding plausible, scene-consistent 3D human poses.
Results: SLOPER4D contains 15 human motion sequences, each with a trajectory longer than 200 meters (up to 1,300 meters) and covering an area of more than 200 square meters (up to 30,000 square meters). The dataset poses significant challenges to existing methods and opens up substantial research opportunities.

We present SLOPER4D, a novel scene-aware dataset collected in large urban environments to facilitate the research of global human pose estimation (GHPE) with human-scene interaction in the wild. Employing a head-mounted device integrated with a LiDAR and camera, we record 12 human subjects' activities over 10 diverse urban scenes from an egocentric view. Frame-wise annotations for 2D key points, 3D pose parameters, and global translations are provided, together with reconstructed scene point clouds. To obtain accurate 3D ground truth in such large dynamic scenes, we propose a joint optimization method to fit local SMPL meshes to the scene and fine-tune the camera calibration during dynamic motions frame by frame, resulting in plausible and scene-natural 3D human poses. Eventually, SLOPER4D consists of 15 sequences of human motions, each of which has a trajectory length of more than 200 meters (up to 1,300 meters) and covers an area of more than 200 square meters (up to 30,000 square meters), including more than 100k LiDAR frames, 300k video frames, and 500K IMU-based motion frames. With SLOPER4D, we provide a detailed and thorough analysis of two critical tasks, including camera-based 3D HPE and LiDAR-based 3D HPE in urban environments, and benchmark a new task, GHPE. The in-depth analysis demonstrates SLOPER4D poses significant challenges to existing methods and produces great research opportunities. The dataset and code are released at http://www.lidarhumanmotion.net/sloper4d/.

SketchXAI: A First Look at Explainability for Human Sketches
Qu, Zhiyu and Gryaditskaya, Yulia and Li, Ke and Pang, Kaiyue and Xiang, Tao and Song, Yi-Zhe



Research question: This paper introduces human sketches to explainable AI (XAI) for the first time, exploring how sketches, as a "human-centred" data form, can be used to study explainability.
Motivation: Sketches offer flexibility in constructing and manipulating objects, making them a natural interface for studying explainability.
Method: Strokes are first identified as a unique building block; a simple, explainability-friendly sketch encoder is then designed to accommodate the intrinsic stroke properties of shape, location, and order. Next, the first sketch-specific XAI task, stroke location inversion (SLI), is defined: a network is asked to recover the stroke locations of an unseen sketch.
Results: Experiments show that, thanks to its sketch-specific design, the sketch encoder achieves the best sketch recognition accuracy to date while having the fewest parameters.

This paper, for the very first time, introduces human sketches to the landscape of XAI (Explainable Artificial Intelligence). We argue that sketch as a "human-centred" data form, represents a natural interface to study explainability. We focus on cultivating sketch-specific explainability designs. This starts by identifying strokes as a unique building block that offers a degree of flexibility in object construction and manipulation impossible in photos. Following this, we design a simple explainability-friendly sketch encoder that accommodates the intrinsic properties of strokes: shape, location, and order. We then move on to define the first ever XAI task for sketch, that of stroke location inversion (SLI). Just as we have heat maps for photos, and correlation matrices for text, SLI offers an explainability angle to sketch in terms of asking a network how well it can recover stroke locations of an unseen sketch. We offer qualitative results for readers to interpret as snapshots of the SLI process in the paper, and as GIFs on the project page. A minor but interesting note is that thanks to its sketch-specific design, our sketch encoder also yields the best sketch recognition accuracy to date while having the smallest number of parameters. The code is available at https://sketchxai.github.io.
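An encoder that accommodates shape, location, and order as separable stroke properties can be illustrated with a toy numpy sketch. Everything here is hypothetical (the random projection, the embedding size, the way the three terms are combined); the point is only that centroid-relative offsets isolate shape, the centroid carries location, and a sinusoidal code carries order.

```python
import numpy as np

def encode_stroke(points, stroke_index, d_model=16):
    """Toy stroke embedding: a shape term (centroid-relative offsets,
    hence translation-invariant), a location term (the stroke centroid),
    and an order term (sinusoidal encoding of the stroke's index)."""
    rng = np.random.default_rng(42)                # fixed hypothetical weights
    centroid = points.mean(axis=0)                 # (2,) stroke location
    offsets = (points - centroid).ravel()          # shape with location removed
    shape_emb = offsets @ rng.normal(size=(offsets.size, d_model))
    loc_emb = np.tile(centroid, d_model // 2)      # spread (x, y) over d_model
    order_emb = np.sin(stroke_index / 10000 ** (np.arange(d_model) / d_model))
    return shape_emb + loc_emb + order_emb
```

Because the three terms are additive, translating a stroke changes only the location term, and reordering strokes changes only the order term, which is exactly the kind of disentanglement a task like SLI can probe.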

Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild
Brazil, Garrick and Kumar, Abhinav and Straub, Julian and Ravi, Nikhila and Johnson, Justin and Gkioxari, Georgia



Research question: How to recognize 3D scenes and objects from a single image, enabling computer vision applications in robotics and AR/VR.
Motivation: Despite remarkable progress in 2D recognition, 3D recognition has advanced slowly due to small datasets and methods that specialize in few object categories and specific domains.
Method: By re-purposing and combining existing datasets, a large-scale 3D object detection benchmark, Omni3D, is built, with 234k images annotated with over 3 million instances and 98 categories. Cube R-CNN, a model that handles different cameras and scene types with a unified approach, is also proposed.
Results: Experiments show Cube R-CNN outperforms prior methods on the larger Omni3D; Omni3D is a powerful dataset for 3D object recognition that improves single-dataset performance and can accelerate learning on new, smaller datasets via pre-training.

Recognizing scenes and objects in 3D from a single image is a longstanding goal of computer vision with applications in robotics and AR/VR. For 2D recognition, large datasets and scalable solutions have led to unprecedented advances. In 3D, existing benchmarks are small in size and approaches specialize in few object categories and specific domains, e.g. urban driving scenes. Motivated by the success of 2D recognition, we revisit the task of 3D object detection by introducing a large benchmark, called Omni3D. Omni3D re-purposes and combines existing datasets resulting in 234k images annotated with more than 3 million instances and 98 categories. 3D detection at such scale is challenging due to variations in camera intrinsics and the rich diversity of scene and object types. We propose a model, called Cube R-CNN, designed to generalize across camera and scene types with a unified approach. We show that Cube R-CNN outperforms prior works on the larger Omni3D and existing benchmarks. Finally, we prove that Omni3D is a powerful dataset for 3D object recognition and show that it improves single-dataset performance and can accelerate learning on new smaller datasets via pre-training.

UniDistill: A Universal Cross-Modality Knowledge Distillation Framework for 3D Object Detection in Bird's-Eye View
Zhou, Shengchao and Liu, Weizhou and Hu, Chen and Zhou, Shuchang and Ma, Chao



Research question: In 3D object detection for autonomous driving, the sensor portfolio spanning multi-modality and single-modality is diverse and complex; since multi-modal methods incur high system complexity while single-modal ones are relatively inaccurate, trading off between them is difficult.
Motivation: To improve the performance of single-modality detectors, a universal cross-modality knowledge distillation framework (UniDistill) is proposed.
Method: During training, UniDistill projects the features of both the teacher and the student detector into bird's-eye view (BEV), a representation friendly to different modalities. Three distillation losses are then computed to sparsely align the foreground features, helping the student learn from the teacher without introducing additional inference cost.
Results: Exploiting the similar detection paradigm of different detectors in BEV, UniDistill readily supports LiDAR-to-camera, camera-to-LiDAR, fusion-to-LiDAR, and fusion-to-camera distillation paths. Moreover, the three distillation losses filter out the effect of misaligned background information and balance objects of different sizes, improving distillation effectiveness. Extensive experiments on nuScenes show UniDistill improves the mAP and NDS of student detectors by 2.0%-3.2%.

In the field of 3D object detection for autonomous driving, the sensor portfolio, spanning multi-modality and single-modality, is diverse and complex. Since multi-modal methods incur high system complexity while the accuracy of single-modal ones is relatively low, making a tradeoff between them is difficult. In this work, we propose a universal cross-modality knowledge distillation framework (UniDistill) to improve the performance of single-modality detectors. Specifically, during training, UniDistill projects the features of both the teacher and the student detector into Bird's-Eye-View (BEV), which is a friendly representation for different modalities. Then, three distillation losses are calculated to sparsely align the foreground features, helping the student learn from the teacher without introducing additional cost during inference. Taking advantage of the similar detection paradigm of different detectors in BEV, UniDistill easily supports LiDAR-to-camera, camera-to-LiDAR, fusion-to-LiDAR and fusion-to-camera distillation paths. Furthermore, the three distillation losses can filter the effect of misaligned background information and balance between objects of different sizes, improving the distillation effectiveness. Extensive experiments on nuScenes demonstrate that UniDistill effectively improves the mAP and NDS of student detectors by 2.0%-3.2%.
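The idea of sparsely aligning foreground BEV features, so that misaligned background does not pollute the distillation signal, can be sketched in a few lines. This is an assumed minimal form (a masked mean-squared feature distance), not the paper's actual three losses.

```python
import numpy as np

def foreground_distill_loss(student_bev, teacher_bev, fg_mask):
    """Mean squared feature distance between student and teacher BEV maps
    (H, W, C), accumulated only over foreground cells of fg_mask (H, W),
    so background misalignment contributes nothing to the loss."""
    per_cell = ((student_bev - teacher_bev) ** 2).mean(axis=-1)  # (H, W)
    n_fg = max(int(fg_mask.sum()), 1)                            # avoid /0
    return float((per_cell * fg_mask).sum() / n_fg)
```

Normalizing by the number of foreground cells rather than the map size is what keeps small objects from being drowned out by large ones, one way to realize the "balance between objects of different sizes" mentioned above.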

JRDB-Pose: A Large-Scale Dataset for Multi-Person Pose Estimation and Tracking
Vendrow, Edward and Le, Duy Tho and Cai, Jianfei and Rezatofighi, Hamid



Research question: Autonomous robotic systems operating in human environments must understand their surroundings to make accurate and safe decisions.
Motivation: Existing datasets captured from robot platforms either do not provide pose annotations or do not reflect the scene distribution of social robots.
Method: JRDB-Pose, a large-scale dataset and benchmark for multi-person pose estimation and tracking, is introduced. It extends the existing JRDB, which includes videos captured by a social navigation robot in a university campus environment, with challenging scenes such as crowded indoor and outdoor locations and a diverse range of scales and occlusion types.
Results: A thorough experimental study of state-of-the-art multi-person pose estimation and tracking methods on JRDB-Pose shows that the dataset imposes new challenges on existing methods.

Autonomous robotic systems operating in human environments must understand their surroundings to make accurate and safe decisions. In crowded human scenes with close-up human-robot interaction and robot navigation, a deep understanding of surrounding people requires reasoning about human motion and body dynamics over time with human body pose estimation and tracking. However, existing datasets captured from robot platforms either do not provide pose annotations or do not reflect the scene distribution of social robots. In this paper, we introduce JRDB-Pose, a large-scale dataset and benchmark for multi-person pose estimation and tracking. JRDB-Pose extends the existing JRDB which includes videos captured from a social navigation robot in a university campus environment, containing challenging scenes with crowded indoor and outdoor locations and a diverse range of scales and occlusion types. JRDB-Pose provides human pose annotations with per-keypoint occlusion labels and track IDs consistent across the scene and with existing annotations in JRDB. We conduct a thorough experimental study of state-of-the-art multi-person pose estimation and tracking methods on JRDB-Pose, showing that our dataset imposes new challenges for the existing methods. JRDB-Pose is available at https://jrdb.erc.monash.edu/.

Galactic: Scaling End-to-End Reinforcement Learning for Rearrangement at 100k Steps-per-Second
Berges, Vincent-Pierre and Szot, Andrew and Chaplot, Devendra Singh and Gokaslan, Aaron and Mottaghi, Roozbeh and Batra, Dhruv and Undersander, Eric



Research question: How to improve robots' mobile manipulation capability in indoor environments?
Motivation: Existing simulation and reinforcement learning frameworks are limited in speed and efficiency and cannot support large-scale experiments.
Method: Galactic, a large-scale simulation and reinforcement learning framework for robotic mobile manipulation in indoor environments, is developed. By optimizing the interplay among rendering, physics, and reinforcement learning, it achieves extremely high simulation and learning speeds.
Results: Galactic greatly surpasses the existing framework Habitat 2.0 in both simulation speed and simulation-plus-learning speed. With Galactic, a high-accuracy mobile pick skill can be trained in minutes, and a large-scale experiment achieves an 85% success rate.

We present Galactic, a large-scale simulation and reinforcement-learning (RL) framework for robotic mobile manipulation in indoor environments. Specifically, a Fetch robot (equipped with a mobile base, 7DoF arm, RGBD camera, egomotion, and onboard sensing) is spawned in a home environment and asked to rearrange objects -- by navigating to an object, picking it up, navigating to a target location, and then placing the object at the target location. Galactic is fast. In terms of simulation speed (rendering + physics), Galactic achieves over 421,000 steps-per-second (SPS) on an 8-GPU node, which is 54x faster than Habitat 2.0 (7699 SPS). More importantly, Galactic was designed to optimize the entire rendering+physics+RL interplay since any bottleneck in the interplay slows down training. In terms of simulation+RL speed (rendering + physics + inference + learning), Galactic achieves over 108,000 SPS, which is 88x faster than Habitat 2.0 (1243 SPS). These massive speed-ups not only drastically cut the wall-clock training time of existing experiments, but also unlock an unprecedented scale of new experiments. First, Galactic can train a mobile pick skill to >80% accuracy in under 16 minutes, a 100x speedup compared to the over 24 hours it takes to train the same skill in Habitat 2.0. Second, we use Galactic to perform the largest-scale experiment to date for rearrangement using 5B steps of experience in 46 hours, which is equivalent to 20 years of robot experience. This scaling results in a single neural network composed of task-agnostic components achieving 85% success in GeometricGoal rearrangement, compared to 0% success reported in Habitat 2.0 for the same approach. The code is available at github.com/facebookresearch/galactic.
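The reported speedups follow from the throughput figures quoted in the abstract; a quick check (note the abstract's SPS values are themselves rounded lower bounds, so the ratios come out slightly under the headline 54x/88x numbers, which presumably use the unrounded measurements):

```python
# Steps-per-second figures quoted above (8-GPU node)
galactic_sim, habitat_sim = 421_000, 7_699   # rendering + physics
galactic_rl,  habitat_rl  = 108_000, 1_243   # + inference + learning

sim_speedup = galactic_sim / habitat_sim     # ~54.7x simulation speedup
rl_speedup = galactic_rl / habitat_rl        # ~86.9x simulation+RL speedup
print(f"{sim_speedup:.1f}x simulation, {rl_speedup:.1f}x simulation+RL")
```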

CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation
Gadre, Samir Yitzhak and Wortsman, Mitchell and Ilharco, Gabriel and Schmidt, Ludwig and Song, Shuran



Research question: How can a robot find arbitrary objects described in human language without expensive navigation training (i.e., via zero-shot inference)?
Motivation: Inspired by the recent success of open-vocabulary models for image classification, a straightforward framework, CLIP on Wheels (CoW), is investigated to adapt such models to this task without fine-tuning.
Method: The Pasture benchmark is introduced to better evaluate language-driven zero-shot object navigation (L-ZSON), and 22 CoW baselines are deployed directly across Habitat, RoboTHOR, and Pasture, covering over 90k navigation episodes in total.
Results: CoW baselines often struggle to leverage language descriptions but are surprisingly proficient at finding uncommon objects; a simple CoW with CLIP-based object localization and classical exploration, with no additional training, matches the navigation efficiency of a state-of-the-art ZSON method trained for 500M steps and improves success by 15.6 percentage points over a state-of-the-art RoboTHOR ZSON model.

For robots to be generally useful, they must be able to find arbitrary objects described by people (i.e., be language-driven) even without expensive navigation training on in-domain data (i.e., perform zero-shot inference). We explore these capabilities in a unified setting: language-driven zero-shot object navigation (L-ZSON). Inspired by the recent success of open-vocabulary models for image classification, we investigate a straightforward framework, CLIP on Wheels (CoW), to adapt open-vocabulary models to this task without fine-tuning. To better evaluate L-ZSON, we introduce the Pasture benchmark, which considers finding uncommon objects, objects described by spatial and appearance attributes, and hidden objects described relative to visible objects. We conduct an in-depth empirical study by directly deploying 22 CoW baselines across Habitat, RoboTHOR, and Pasture. In total we evaluate over 90k navigation episodes and find that (1) CoW baselines often struggle to leverage language descriptions, but are surprisingly proficient at finding uncommon objects. (2) A simple CoW, with CLIP-based object localization and classical exploration---and no additional training---matches the navigation efficiency of a state-of-the-art ZSON method trained for 500M steps on Habitat MP3D data. This same CoW provides a 15.6 percentage point improvement in success over a state-of-the-art RoboTHOR ZSON model.

GINA-3D: Learning To Generate Implicit Neural Assets in the Wild
Shen, Bokui and Yan, Xinchen and Qi, Charles R. and Najibi, Mahyar and Deng, Boyang and Guibas, Leonidas and Zhou, Yin and Anguelov, Dragomir



Research question: How to model the 3D world from sensor data to build testing and validation environments for robot learning problems such as autonomous driving.
Motivation: Manually creating or re-creating real-world-like 3D environments is difficult, expensive, and not scalable. Existing generative techniques learn 3D assets from plentiful 2D images alone but remain limited, relying either on human-curated image datasets or on renderings of manually created synthetic 3D environments.
Method: GINA-3D, a generative model that uses real-world driving data from camera and LiDAR sensors, is introduced to create photo-realistic 3D implicit neural assets of diverse vehicles and pedestrians.
Results: Compared with existing image datasets, the real-world driving setting poses new challenges due to occlusions, lighting variations, and long-tail distributions. GINA-3D tackles these by decoupling representation learning and generative modeling into two stages with a learned tri-plane latent structure, inspired by recent advances in image generation. On a large-scale object-centric dataset with over 520K vehicle and pedestrian images, plus a new set of 80K images of long-tail instances such as construction equipment, garbage trucks, and cable cars, the method achieves state-of-the-art quality and diversity for both generated images and geometries.

Modeling the 3D world from sensor data for simulation is a scalable way of developing testing and validation environments for robotic learning problems such as autonomous driving. However, manually creating or re-creating real-world-like environments is difficult, expensive, and not scalable. Recent generative model techniques have shown promising progress to address such challenges by learning 3D assets using only plentiful 2D images -- but still suffer limitations as they leverage either human-curated image datasets or renderings from manually-created synthetic 3D environments. In this paper, we introduce GINA-3D, a generative model that uses real-world driving data from camera and LiDAR sensors to create photo-realistic 3D implicit neural assets of diverse vehicles and pedestrians. Compared to the existing image datasets, the real-world driving setting poses new challenges due to occlusions, lighting-variations and long-tail distributions. GINA-3D tackles these challenges by decoupling representation learning and generative modeling into two stages with a learned tri-plane latent structure, inspired by recent advances in generative modeling of images. To evaluate our approach, we construct a large-scale object-centric dataset containing over 520K images of vehicles and pedestrians from the Waymo Open Dataset, and a new set of 80K images of long-tail instances such as construction equipment, garbage trucks, and cable cars. We compare our model with existing approaches and demonstrate that it achieves state-of-the-art performance in quality and diversity for both generated images and geometries.

Consistent Direct Time-of-Flight Video Depth Super-Resolution
Sun, Zhanghao and Ye, Wei and Xiong, Jinhui and Choe, Gyeongmin and Wang, Jialiang and Su, Shuochen and Ranjan, Rakesh



Research question: How to increase the spatial resolution of direct time-of-flight (dToF) sensors.
Motivation: Limited by manufacturing capabilities, dToF data has low spatial resolution and requires a super-resolution step before it can be used in downstream tasks.
Method: The low-resolution dToF data is fused with corresponding high-resolution RGB guidance; the first multi-frame fusion scheme is proposed to mitigate the spatial ambiguity caused by low-resolution dToF imaging. The unique per-patch depth histograms provided by dToF sensors are further exploited to alleviate spatial ambiguity.
Results: Model performance is evaluated in complex dynamic indoor environments, and DyDToF, the first synthetic RGB-dToF video dataset featuring dynamic objects and a realistic dToF simulator, is introduced. The methods and dataset should benefit a broad community as dToF depth sensing becomes mainstream on mobile devices.

Direct time-of-flight (dToF) sensors are promising for next-generation on-device 3D sensing. However, limited by manufacturing capabilities in a compact module, the dToF data has low spatial resolution (e.g., 20x30 for iPhone dToF), and it requires a super-resolution step before being passed to downstream tasks. In this paper, we solve this super-resolution problem by fusing the low-resolution dToF data with the corresponding high-resolution RGB guidance. Unlike the conventional RGB-guided depth enhancement approaches which perform the fusion in a per-frame manner, we propose the first multi-frame fusion scheme to mitigate the spatial ambiguity resulting from the low-resolution dToF imaging. In addition, dToF sensors provide unique depth histogram information for each local patch, and we incorporate this dToF-specific feature in our network design to further alleviate spatial ambiguity. To evaluate our models on complex dynamic indoor environments and to provide a large-scale dToF sensor dataset, we introduce DyDToF, the first synthetic RGB-dToF video dataset that features dynamic objects and a realistic dToF simulator following the physical imaging process. We believe the methods and dataset are beneficial to a broad community as dToF depth sensing is becoming mainstream on mobile devices. Our code and data are publicly available. https://github.com/facebookresearch/DVSR/
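The per-patch depth histogram cue mentioned above can be modeled very simply: each low-resolution dToF pixel reports not a single depth but a histogram of depths inside its patch. The sketch below is a toy numpy model (patch size, bin count, and range are illustrative, not sensor specifications).

```python
import numpy as np

def patch_depth_histograms(depth, patch=4, bins=8, d_max=8.0):
    """Toy dToF-style measurement: for each (patch x patch) region of a
    high-resolution depth map (meters), return a histogram of depths
    instead of one value -- the per-patch cue that helps resolve the
    spatial ambiguity of low-resolution dToF imaging."""
    h, w = depth.shape
    out = np.zeros((h // patch, w // patch, bins))
    edges = np.linspace(0.0, d_max, bins + 1)   # uniform depth bins
    for i in range(h // patch):
        for j in range(w // patch):
            block = depth[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
            out[i, j], _ = np.histogram(block, bins=edges)
    return out
```

A patch straddling a depth discontinuity yields a bimodal histogram, so a network consuming these histograms can tell that two surfaces are present even though the patch has only one spatial sample.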

Understanding the Robustness of 3D Object Detection With Bird's-Eye-View Representations in Autonomous Driving
Zhu, Zijian and Zhang, Yichi and Chen, Hai and Dong, Yinpeng and Zhao, Shu and Ding, Wenbo and Zhong, Jiachen and Zheng, Shibao



Research question: This paper evaluates the natural and adversarial robustness of various representative vision-dependent BEV models under extensive settings, to fully understand how explicit BEV features influence their behavior compared with models without BEV.
Motivation: Although bird's-eye-view (BEV) representations have significantly improved the performance of camera-based 3D detectors on popular benchmarks, a systematic understanding of the robustness of these vision-dependent BEV models, which is closely related to the safety of autonomous driving systems, is still lacking.
Method: Representative models are evaluated for natural and adversarial robustness under extensive settings. In addition, a 3D consistent patch attack is proposed that applies adversarial patches in 3D space to guarantee spatiotemporal consistency, which is more realistic for autonomous driving scenarios.
Results: Experiments show that 1) BEV models are more stable than previous methods under different natural conditions and common corruptions, thanks to their expressive spatial representations; 2) BEV models are more vulnerable to adversarial noise, mainly due to redundant BEV features; 3) camera-LiDAR fusion models perform best across settings with multi-modal inputs, but the BEV fusion model remains vulnerable to adversarial noise on both point clouds and images. These findings highlight safety issues in deploying BEV detectors and can help develop more robust models.

3D object detection is an essential perception task in autonomous driving to understand the environments. The Bird's-Eye-View (BEV) representations have significantly improved the performance of 3D detectors with camera inputs on popular benchmarks. However, there still lacks a systematic understanding of the robustness of these vision-dependent BEV models, which is closely related to the safety of autonomous driving systems. In this paper, we evaluate the natural and adversarial robustness of various representative models under extensive settings, to fully understand their behaviors influenced by explicit BEV features compared with those without BEV. In addition to the classic settings, we propose a 3D consistent patch attack by applying adversarial patches in the 3D space to guarantee the spatiotemporal consistency, which is more realistic for the scenario of autonomous driving. With substantial experiments, we draw several findings: 1) BEV models tend to be more stable than previous methods under different natural conditions and common corruptions due to the expressive spatial representations; 2) BEV models are more vulnerable to adversarial noises, mainly caused by the redundant BEV features; 3) Camera-LiDAR fusion models have superior performance under different settings with multi-modal inputs, but BEV fusion model is still vulnerable to adversarial noises of both point cloud and image. These findings alert the safety issue in the applications of BEV detectors and could facilitate the development of more robust models.

Anchor3DLane: Learning To Regress 3D Anchors for Monocular 3D Lane Detection
Huang, Shaofei and Shen, Zhenwei and Huang, Zehao and Ding, Zi-han and Dai, Jiao and Han, Jizhong and Wang, Naiyan and Liu, Si



Research question: Monocular 3D lane detection is a challenging task due to the lack of depth information.
Motivation: Current methods that transform front-view images or features into bird's-eye-view space rely on the flat-ground assumption and lose context information, making it inaccurate to restore 3D information from BEV representations.
Method: 3D lane anchors are defined, and a BEV-free method named Anchor3DLane is proposed to predict 3D lanes directly from front-view representations. The 3D lane anchors are projected onto front-view features to extract features containing both good structural and context information for accurate prediction. In addition, a global optimization method exploits the equal-width property between lanes to reduce the lateral error of predictions.
Results: Extensive experiments on three popular 3D lane detection benchmarks show that Anchor3DLane outperforms previous BEV-based methods and achieves state-of-the-art performance.

Monocular 3D lane detection is a challenging task due to its lack of depth information. A popular solution is to first transform the front-viewed (FV) images or features into the bird's-eye-view (BEV) space with inverse perspective mapping (IPM) and detect lanes from BEV features. However, the reliance of IPM on the flat ground assumption and the loss of context information make it inaccurate to restore 3D information from BEV representations. An attempt has been made to get rid of BEV and predict 3D lanes from FV representations directly, but it still underperforms BEV-based methods given its lack of structured representation for 3D lanes. In this paper, we define 3D lane anchors in the 3D space and propose a BEV-free method named Anchor3DLane to predict 3D lanes directly from FV representations. 3D lane anchors are projected to the FV features to extract features that contain both good structural and context information to make accurate predictions. In addition, we also develop a global optimization method that makes use of the equal-width property between lanes to reduce the lateral error of predictions. Extensive experiments on three popular 3D lane detection benchmarks show that our Anchor3DLane outperforms previous BEV-based methods and achieves state-of-the-art performance. The code is available at: https://github.com/tusen-ai/Anchor3DLane.
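Projecting 3D anchor points onto front-view features is, at its core, a standard pinhole projection. A minimal sketch, assuming anchor points already expressed in camera coordinates and known intrinsics K (the paper's actual coordinate conventions may differ):

```python
import numpy as np

def project_anchors(points_3d, K):
    """Project 3D anchor points (N, 3), given in camera coordinates,
    onto the image plane with pinhole intrinsics K (3x3). Returns (N, 2)
    pixel locations at which front-view features would be sampled."""
    uvw = points_3d @ K.T              # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide by depth
```

Sampling FV features at these pixel locations is what lets the method regress 3D lanes without ever constructing a BEV map or assuming flat ground.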

V2V4Real: A Real-World Large-Scale Dataset for Vehicle-to-Vehicle Cooperative Perception
Xu, Runsheng and Xia, Xin and Li, Jinlong and Li, Hanzhao and Zhang, Shuo and Tu, Zhengzhong and Meng, Zonglin and Xiang, Hao and Dong, Xiaoyu and Song, Rui and Yu, Hongkai and Zhou, Bolei and Ma, Jiaqi



Research question: Modern perception systems of autonomous vehicles are sensitive to occlusion and lack long-range perceiving capability, which is one of the key bottlenecks preventing Level 5 autonomy.
Motivation: Recent research shows that vehicle-to-vehicle (V2V) cooperative perception has the potential to transform the autonomous driving industry, but the lack of real-world datasets hinders progress in this field.
Method: V2V4Real, the first large-scale real-world multi-modal dataset for V2V perception, is presented. The data is collected by two vehicles equipped with multi-modal sensors driving together through diverse scenarios. The dataset covers a driving area of 410 km and includes 20K LiDAR frames, 40K RGB frames, 240K annotated 3D bounding boxes across 5 classes, and high-definition maps covering all driving routes.
Results: Comprehensive benchmarks of recent cooperative perception algorithms are provided on three tasks: cooperative 3D object detection, cooperative 3D object tracking, and Sim2Real domain adaptation for cooperative perception. The V2V4Real dataset is available at research.seas.ucla.edu/mobility-lab/v2v4real/.

Modern perception systems of autonomous vehicles are known to be sensitive to occlusions and lack the capability of long perceiving range. It has been one of the key bottlenecks that prevents Level 5 autonomy. Recent research has demonstrated that the Vehicle-to-Vehicle (V2V) cooperative perception system has great potential to revolutionize the autonomous driving industry. However, the lack of a real-world dataset hinders the progress of this field. To facilitate the development of cooperative perception, we present V2V4Real, the first large-scale real-world multi-modal dataset for V2V perception. The data is collected by two vehicles equipped with multi-modal sensors driving together through diverse scenarios. Our V2V4Real dataset covers a driving area of 410 km, comprising 20K LiDAR frames, 40K RGB frames, 240K annotated 3D bounding boxes for 5 classes, and HDMaps that cover all the driving routes. V2V4Real introduces three perception tasks, including cooperative 3D object detection, cooperative 3D object tracking, and Sim2Real domain adaptation for cooperative perception. We provide comprehensive benchmarks of recent cooperative perception algorithms on three tasks. The V2V4Real dataset can be found at research.seas.ucla.edu/mobility-lab/v2v4real/.

ViP3D: End-to-End Visual Trajectory Prediction via 3D Agent Queries
Gu, Junru and Hu, Chenxu and Zhang, Tianyuan and Chen, Xuanyao and Wang, Yilun and Wang, Yue and Zhao, Hang



Research question: In existing autonomous driving systems, perception and prediction are two separate modules that interact via hand-picked features such as agent bounding boxes and trajectories. Due to this separation, prediction, as a downstream module, receives only limited information from the perception module, and perception errors can propagate and accumulate, adversely affecting prediction results.
Motivation: To address these problems, this paper proposes ViP3D, a query-based visual trajectory prediction pipeline that exploits rich information in raw videos to directly predict the future trajectories of agents in a scene.
Method: ViP3D uses sparse agent queries for detection, tracking, and prediction throughout the pipeline, making it the first fully differentiable vision-based trajectory prediction approach. Instead of using historical feature maps and trajectories, useful information from previous timestamps is encoded in the agent queries, making ViP3D a concise streaming prediction method.
Results: Extensive experiments on the nuScenes dataset show ViP3D's strong vision-based prediction performance over traditional pipelines and previous end-to-end models.

Perception and prediction are two separate modules in the existing autonomous driving systems. They interact with each other via hand-picked features such as agent bounding boxes and trajectories. Due to this separation, prediction, as a downstream module, only receives limited information from the perception module. To make matters worse, errors from the perception modules can propagate and accumulate, adversely affecting the prediction results. In this work, we propose ViP3D, a query-based visual trajectory prediction pipeline that exploits rich information from raw videos to directly predict future trajectories of agents in a scene. ViP3D employs sparse agent queries to detect, track, and predict throughout the pipeline, making it the first fully differentiable vision-based trajectory prediction approach. Instead of using historical feature maps and trajectories, useful information from previous timestamps is encoded in agent queries, which makes ViP3D a concise streaming prediction method. Furthermore, extensive experimental results on the nuScenes dataset show the strong vision-based prediction performance of ViP3D over traditional pipelines and previous end-to-end models.

Command-Driven Articulated Object Understanding and Manipulation
Chu, Ruihang and Liu, Zhengzhe and Ye, Xiaoqing and Tan, Xiao and Qi, Xiaojuan and Fu, Chi-Wing and Jia, Jiaya



Research question: How to manipulate articulated objects through human commands.
Motivation: Existing work mainly focuses on inferring articulation structures; this work further supports manipulating articulated shapes according to simple command templates.
Method: A new approach, Cart, is proposed, which uses predictions of object structure to connect visual observations with user commands for effective manipulation.
Results: Across a rich variety of object categories, Cart accurately manipulates object shapes and outperforms state-of-the-art approaches in understanding inherent articulation structures. It also generalizes well to unseen object categories and real-world objects, and may open new directions for instructing machines to operate articulated objects.

We present Cart, a new approach towards articulated-object manipulations by human commands. Beyond the existing work that focuses on inferring articulation structures, we further support manipulating articulated shapes to align them subject to simple command templates. The key of Cart is to utilize the prediction of object structures to connect visual observations with user commands for effective manipulations. It is achieved by encoding command messages for motion prediction and a test-time adaptation to adjust the amount of movement from only command supervision. For a rich variety of object categories, Cart can accurately manipulate object shapes and outperform the state-of-the-art approaches in understanding the inherent articulation structures. Also, it can well generalize to unseen object categories and real-world objects. We hope Cart could open new directions for instructing machines to operate articulated objects.

Unicode Analogies: An Anti-Objectivist Visual Reasoning Challenge
Spratley, StevenandEhinger, KristaA.andMiller, Tim



Research question: Existing PMP methods for evaluating analogical reasoning in computer vision struggle to expose solvers' lack of meaningful generalisation, and they reinforce an objectivist stance that objects can only be seen one way.
Motivation: This paper introduces the Unicode Analogies challenge, which uses polysemic, character-based PMPs to benchmark fluid conceptualisation ability in vision systems.
Method: We design a framework that challenges models with tasks that are much harder to complete without robust feature extraction, yet remain largely solvable by human participants.
Results: We argue that Unicode Analogies elegantly captures and tests a facet of human visual reasoning that is severely lacking in current-generation AI.

Analogical reasoning enables agents to extract relevant information from scenes, and efficiently navigate them in familiar ways. While progressive-matrix problems (PMPs) are becoming popular for the development and evaluation of analogical reasoning in computer vision, we argue that the dominant methodology in this area struggles to expose the lack of meaningful generalisation in solvers, and reinforces an objectivist stance on perception -- that objects can only be seen one way -- which we believe to be counter-productive. In this paper, we introduce the Unicode Analogies challenge, consisting of polysemic, character-based PMPs to benchmark fluid conceptualisation ability in vision systems. Writing systems have evolved characters at multiple levels of abstraction, from iconic through to symbolic representations, producing both visually interrelated yet exceptionally diverse images when compared to those exhibited by existing PMP datasets. Our framework has been designed to challenge models by presenting tasks much harder to complete without robust feature extraction, while remaining largely solvable by human participants. We therefore argue that Unicode Analogies elegantly captures and tests for a facet of human visual reasoning that is severely lacking in current-generation AI.
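To make the "character-based PMP" idea concrete, here is a toy construction in the spirit of the benchmark. The generation rule (fixed code-point offsets per row and column) is a hypothetical illustration of my own, not the dataset's actual generator: a 3x3 grid of characters follows a shared transformation, and the bottom-right cell is held out for the solver to predict.

```python
# Hypothetical toy generator for a character-based progressive-matrix
# problem (illustrative only, not the Unicode Analogies pipeline):
# cells are related by fixed code-point offsets across rows and columns.

def build_matrix(start, row_step=1, col_step=16):
    """3x3 grid of characters related by fixed code-point offsets."""
    return [[chr(start + r * row_step + c * col_step) for c in range(3)]
            for r in range(3)]

grid = build_matrix(ord("A"))      # top row: 'A', 'Q', 'a'
answer = grid[2][2]                # held-out cell a solver must predict
puzzle = [row[:] for row in grid]
puzzle[2][2] = "?"                 # present the incomplete matrix
```

A real instance would render the characters as images and rely on their visual (often polysemic) structure rather than code-point arithmetic; the sketch only shows the held-out-cell format shared with other PMP datasets.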

MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors
Zhang, YuangandWang, TiancaiandZhang, Xiangyu



Research question: This paper proposes a simple yet effective way to bootstrap end-to-end multi-object tracking with a pretrained object detector.
Motivation: Existing end-to-end methods such as MOTR and TrackFormer are inferior to tracking-by-detection counterparts, mainly because of their poor detection performance.
Method: We first adopt an anchor-based formulation of queries, then use an extra object detector to generate proposals as anchors, providing a detection prior for MOTR. This simple modification greatly eases the conflict between jointly learning the detection and association tasks in MOTR.
Results: Experiments show that MOTRv2 achieves the top performance (73.4% HOTA) among all existing methods on the DanceTrack dataset, and reaches state-of-the-art performance on BDD100K. We hope this simple and effective pipeline can offer new insights to the end-to-end MOT community. The code will be released in the near future.

In this paper, we propose MOTRv2, a simple yet effective pipeline to bootstrap end-to-end multi-object tracking with a pretrained object detector. Existing end-to-end methods, e.g., MOTR and TrackFormer, are inferior to their tracking-by-detection counterparts mainly due to their poor detection performance. We aim to improve MOTR by elegantly incorporating an extra object detector. We first adopt the anchor formulation of queries and then use an extra object detector to generate proposals as anchors, providing a detection prior to MOTR. This simple modification greatly eases the conflict between the jointly learned detection and association tasks in MOTR. MOTRv2 keeps the end-to-end feature and scales well on large-scale benchmarks. MOTRv2 achieves the top performance (73.4% HOTA) among all existing methods on the DanceTrack dataset. Moreover, MOTRv2 reaches state-of-the-art performance on the BDD100K dataset. We hope this simple and effective pipeline can provide some new insights to the end-to-end MOT community. The code will be released in the near future.
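The anchor-query mechanism the abstract describes can be sketched in a few lines. This is a deliberately simplified illustration (the names and the offset decoder are my assumptions, not the released code): detector proposals become anchors, and each track query predicts only a small offset relative to its anchor rather than an absolute position, which is how the detection prior is injected.

```python
# Toy sketch of proposals-as-anchors (illustrative assumption, not the
# MOTRv2 implementation): queries refine detector anchors by offsets.

def boxes_to_anchors(boxes):
    """Turn detector proposals (x1, y1, x2, y2) into anchor centers."""
    return [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in boxes]

def refine(anchors, offsets):
    """Track queries predict offsets on top of the detection prior."""
    return [(ax + dx, ay + dy) for (ax, ay), (dx, dy) in zip(anchors, offsets)]

proposals = [(0, 0, 10, 10), (20, 20, 30, 40)]   # from the extra detector
anchors = boxes_to_anchors(proposals)
centers = refine(anchors, [(0.5, -0.5), (-1.0, 2.0)])
```

Because localization starts from a strong detection prior, the queries mainly have to learn association across frames, which is the conflict-easing effect the abstract points to.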

Improving Vision-and-Language Navigation by Generating Future-View Image Semantics
Li, JialuandBansal, Mohit



Research question: This paper explores whether generating potential future views during navigation can improve performance on vision-and-language navigation tasks.
Motivation: Based on natural language instructions and the surrounding environment, humans form an expectation of the future environment, which aids correct navigation; the authors propose equipping the agent with this same ability.
Method: The authors first propose three proxy tasks for the agent's in-domain pre-training: Masked Panorama Modeling (MPM), Masked Trajectory Modeling (MTM), and Action Prediction with Image Generation (APIG). They then fine-tune the agent on the vision-and-language navigation task with an auxiliary loss that minimizes the difference between the generated view semantics and the ground-truth view semantics of the next step.
Results: Experiments show that the method achieves new state-of-the-art results on both the Room-to-Room and CVDN datasets. The agent can also fill in missing patches of future views, which makes its predicted actions more interpretable, and it performs better on longer paths.

Vision-and-Language Navigation (VLN) is the task that requires an agent to navigate through the environment based on natural language instructions. At each step, the agent takes the next action by selecting from a set of navigable locations. In this paper, we aim to take one step further and explore whether the agent can benefit from generating the potential future view during navigation. Intuitively, humans will have an expectation of what the future environment will look like, based on the natural language instructions and surrounding views, which will aid correct navigation. Hence, to equip the agent with this ability to generate the semantics of future navigation views, we first propose three proxy tasks during the agent's in-domain pre-training: Masked Panorama Modeling (MPM), Masked Trajectory Modeling (MTM), and Action Prediction with Image Generation (APIG). These three objectives teach the model to predict missing views in a panorama (MPM), predict missing steps in the full trajectory (MTM), and generate the next view based on the full instruction and navigation history (APIG), respectively. We then fine-tune the agent on the VLN task with an auxiliary loss that minimizes the difference between the view semantics generated by the agent and the ground truth view semantics of the next step. Empirically, our VLN-SIG achieves the new state-of-the-art on both the Room-to-Room dataset and the CVDN dataset. We further show that our agent learns to fill in missing patches in future views qualitatively, which brings more interpretability over agents' predicted actions. Lastly, we demonstrate that learning to predict future view semantics also enables the agent to have better performance on longer paths.
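The auxiliary loss described above — minimizing the difference between generated and ground-truth next-step view semantics alongside the navigation objective — can be sketched as follows. The squared-error form and the weighting scheme here are illustrative assumptions of mine, not the paper's exact formulation:

```python
# Hedged sketch of a view-semantics auxiliary loss (the MSE form and the
# 0.1 weight are assumptions, not the VLN-SIG objective): the generated
# semantics are pulled toward the ground truth of the next step.

def mse(pred, target):
    """Mean squared error between two semantic vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def total_loss(nav_loss, generated_semantics, gt_semantics, weight=0.1):
    """Main navigation loss plus the weighted auxiliary term."""
    return nav_loss + weight * mse(generated_semantics, gt_semantics)

generated = [0.2, 0.8, 0.5]        # agent's predicted next-view semantics
ground_truth = [0.0, 1.0, 0.5]     # semantics of the actual next view
loss = total_loss(2.0, generated, ground_truth)
```

During fine-tuning, gradients from the auxiliary term flow into the same encoder that selects actions, which is how future-view prediction can shape navigation behavior.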

CIMI4D: A Large Multimodal Climbing Motion Dataset Under Human-Scene Interactions
Yan, MingandWang, XinandDai, YudiandShen, SiqiandWen, ChengluandXu, LanandMa, YuexinandWang, Cheng



Research question: Motion capture is a long-standing problem, yet off-ground actions such as climbing remain largely unstudied because of their complex back poses, intricate human-scene interactions, and difficult global localization.
Motivation: Climbing is an important type of action in sports and firefighting, but the research community lacks a deep understanding of it due to the absence of specific datasets. To address this, we collect the CIMI4D dataset.
Method: We collect extensive data from recordings of 12 people climbing 13 different climbing walls, comprising around 180,000 frames of pose inertial measurements, LiDAR point clouds, RGB videos, high-precision static point cloud scenes, and reconstructed scene meshes. We also annotate, frame by frame, the positions of the rock holds being touched, to facilitate detailed exploration of human-scene interaction.
Results: Experiments on four tasks (human pose estimation with and without scene constraints, pose prediction, and pose generation) demonstrate that CIMI4D poses great challenges to existing methods and opens up extensive research opportunities.

Motion capture is a long-standing research problem. Although it has been studied for decades, the majority of research focuses on ground-based movements such as walking, sitting, dancing, etc. Off-ground actions such as climbing are largely overlooked. As an important type of action in sports and the firefighting field, climbing movements are challenging to capture because of their complex back poses, intricate human-scene interactions, and difficult global localization. The research community does not have an in-depth understanding of the climbing action due to the lack of specific datasets. To address this limitation, we collect CIMI4D, a large rock ClImbing MotIon dataset from 12 persons climbing 13 different climbing walls. The dataset consists of around 180,000 frames of pose inertial measurements, LiDAR point clouds, RGB videos, high-precision static point cloud scenes, and reconstructed scene meshes. Moreover, we annotate touched rock holds frame-wise to facilitate a detailed exploration of human-scene interaction. The core of this dataset is a blending optimization process, which corrects the pose as it drifts and is affected by the magnetic conditions. To evaluate the merit of CIMI4D, we perform four tasks which include human pose estimation (with/without scene constraints), pose prediction, and pose generation. The experimental results demonstrate that CIMI4D presents great challenges to existing methods and enables extensive research opportunities. We share the dataset with the research community at http://www.lidarhumanmotion.net/cimi4d/.